July 22, 2025 · 12 min read · Yash Chouriya

One Interface, Many Models: Integrating Anthropic, OpenAI, Gemini, and Open-Source LLMs

Over the last few years I've shipped production systems on Anthropic's Claude, OpenAI's GPT, Google's Gemini, and self-hosted open-source models. The single most valuable architectural decision across all of them was the same: never let a vendor's SDK leak past one file.

Models change monthly. Pricing changes quarterly. The system you build around them should survive both.


The Adapter Layer

Every provider speaks a slightly different dialect: message formats differ, system prompts attach differently, tool calls have different shapes, streaming chunks arrive differently. The fix is boring and effective — define your interface and adapt each vendor to it:

```typescript
interface ChatModel {
  generate(req: ChatRequest): Promise<ChatResponse>;
  stream(req: ChatRequest): AsyncIterable<Delta>;
}

interface ChatRequest {
  system?: string;
  messages: Message[]; // your canonical format
  tools?: ToolSpec[]; // your canonical tool schema
  maxOutputTokens?: number;
  temperature?: number;
}
```

Each adapter is 100–200 lines of translation code. Tedious to write, trivial to test, and it turns "migrate to the new model" from a rewrite into a config change.
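
To make the shape of an adapter concrete, here is a minimal sketch for a made-up provider that expects the system prompt as the first message. `VendorClient`, `VendorPayload`, `toVendorPayload`, and `InlineSystemAdapter` are hypothetical names rather than any real SDK, and the canonical interface is trimmed to `generate` to keep the example short.

```typescript
// Canonical types: a trimmed version of the interface described above.
type Role = "user" | "assistant";
interface Message { role: Role; content: string; }
interface ChatRequest {
  system?: string;
  messages: Message[];
  maxOutputTokens?: number;
  temperature?: number;
}
interface ChatResponse { text: string; }
interface ChatModel { generate(req: ChatRequest): Promise<ChatResponse>; }

// Hypothetical vendor wire format: the system prompt travels as the first message.
interface VendorPayload {
  messages: { role: "system" | Role; content: string }[];
  max_tokens?: number;
  temperature?: number;
}
// Hypothetical vendor client; a real adapter would wrap the vendor's SDK here.
interface VendorClient {
  complete(payload: VendorPayload): Promise<{ output: string }>;
}

// Pure translation step, testable without network access.
function toVendorPayload(req: ChatRequest): VendorPayload {
  const system = req.system
    ? [{ role: "system" as const, content: req.system }]
    : [];
  return {
    messages: [...system, ...req.messages],
    max_tokens: req.maxOutputTokens,
    temperature: req.temperature,
  };
}

class InlineSystemAdapter implements ChatModel {
  private readonly client: VendorClient;
  constructor(client: VendorClient) {
    this.client = client;
  }
  async generate(req: ChatRequest): Promise<ChatResponse> {
    const raw = await this.client.complete(toVendorPayload(req));
    return { text: raw.output };
  }
}
```

Keeping the translation in a pure function is what makes the adapter trivial to test: the only part that needs a network is the single `complete` call.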

(If you'd rather not maintain adapters yourself, libraries like Vercel's AI SDK do this normalization for you — this site's own chat endpoint switched providers in one line during an upgrade. The principle is the same: depend on the abstraction, not the vendor.)


Where the Dialects Bite

A few differences that cost me real debugging hours:

  • System prompts. Some APIs take a dedicated system parameter; others want it as the first message. Your adapter should own this, not your application code.
  • Tool-call shapes. JSON schema dialects and argument encodings differ subtly. Normalize before validation so your tool layer sees one format.
  • Streaming. Token deltas, role headers, and tool-call fragments arrive in provider-specific framings. Convert to your own delta type at the edge.
  • Token accounting. Tokenizers differ. If you bill or budget by tokens, count with the provider's own usage numbers from the response, never your local estimate.
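
The tool-call point above can be sketched as follows. The two raw shapes are illustrative stand-ins (one encodes arguments as a JSON string, the other delivers them already parsed), and `ProviderAToolCall`, `ProviderBToolCall`, and the converter functions are hypothetical names, not real SDK types.

```typescript
// Canonical tool call: the only shape the tool layer ever validates.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

// Hypothetical shape A: arguments arrive as a JSON-encoded string.
interface ProviderAToolCall {
  id: string;
  function: { name: string; arguments: string };
}

// Hypothetical shape B: arguments arrive as an already-parsed object.
interface ProviderBToolCall {
  id: string;
  name: string;
  input: Record<string, unknown>;
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function fromProviderA(raw: ProviderAToolCall): ToolCall {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.function.arguments);
  } catch {
    throw new Error(`tool call ${raw.id}: arguments are not valid JSON`);
  }
  if (!isPlainObject(parsed)) {
    throw new Error(`tool call ${raw.id}: arguments must be a JSON object`);
  }
  return { id: raw.id, name: raw.function.name, args: parsed };
}

function fromProviderB(raw: ProviderBToolCall): ToolCall {
  return { id: raw.id, name: raw.name, args: raw.input };
}
```

Both converters run inside the adapter, so schema validation downstream only ever sees `ToolCall`.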

Routing: the Right Model for the Job

Once models are swappable, you stop asking "which model is best?" and start asking "which model is best for this call?"

A routing table that has served me well:

| Task | Good fit | Why |
| --- | --- | --- |
| Complex reasoning, long documents | Frontier models (Claude, GPT, Gemini Pro) | Quality dominates cost |
| High-volume classification/extraction | Small fast models (Flash/Mini tier) | 10–50× cheaper, plenty accurate |
| Privacy-sensitive or offline workloads | Self-hosted open-source (Llama family) | Data never leaves your infra |
| Latency-critical UX (autocomplete, hints) | Small models, often local | Round-trip time is the feature |

Two rules of thumb:

  1. Route by task, not by loyalty. The cheapest adequate model wins each call.
  2. Re-evaluate quarterly. The frontier moves; last year's premium capability is this year's commodity tier.
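
One way to keep routing a config concern is a plain lookup table. This is a minimal sketch: the task names mirror the routing table above, while the model ids and the `routeFor` helper are placeholders of my own, not real model names or an existing API.

```typescript
type Task =
  | "complex-reasoning"
  | "bulk-extraction"
  | "private-offline"
  | "latency-critical";

interface Route {
  model: string;
  reason: string;
}

// Placeholder model ids; in practice these would be loaded from config.
const routingTable: Record<Task, Route> = {
  "complex-reasoning": { model: "frontier-model", reason: "quality dominates cost" },
  "bulk-extraction": { model: "small-fast-model", reason: "cheap and accurate enough" },
  "private-offline": { model: "self-hosted-model", reason: "data stays on your infrastructure" },
  "latency-critical": { model: "small-local-model", reason: "round-trip time is the feature" },
};

// Overrides let a quarterly re-evaluation change one route without a code change.
function routeFor(task: Task, overrides: Partial<Record<Task, string>> = {}): string {
  return overrides[task] ?? routingTable[task].model;
}
```

Because the table is data, the quarterly review becomes an edit to one object (or one config file) followed by a run of the eval suite.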

Fallbacks and Resilience

Providers have incidents. Rate limits bite at the worst time. A production system needs a failure story better than a 500 page:

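```typescript
// Ordered by preference. primary, secondary, lastResort, withTimeout,
// isRetryable, log, and AllProvidersFailedError are defined elsewhere.
const chain = [primary, secondary, lastResort];

async function generateWithFallback(req: ChatRequest) {
  for (const model of chain) {
    try {
      return await withTimeout(model.generate(req), 30_000);
    } catch (err) {
      // A non-retryable error is rethrown immediately instead of trying the next model.
      if (!isRetryable(err)) throw err;
      log.warn("falling back", { from: model.id, err });
    }
  }
  throw new AllProvidersFailedError();
}
```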
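
The fallback loop leans on two helpers it does not show. Here is one possible sketch of them. `TimeoutError` is a name I am introducing, and `isRetryable` assumes your adapters attach an HTTP-like numeric `status` to the errors they throw; adjust both to whatever your adapters actually produce.

```typescript
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `promise` has not settled within `ms`.
// Note: this only stops waiting; it does not cancel the underlying request.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Assumption: errors carry a numeric `status` (429 = rate limited, 5xx = provider fault).
function isRetryable(err: unknown): boolean {
  if (err instanceof TimeoutError) return true;
  const status = (err as { status?: unknown } | null)?.status;
  if (typeof status !== "number") return false;
  return status === 429 || status >= 500;
}
```

Treating unknown errors as non-retryable is a deliberate choice in this sketch: a malformed request will fail on every provider, so falling back would only multiply the cost.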

Caveats from production:

  • Fallback changes behavior. A prompt tuned for one model can underperform on another. Keep per-model prompt overrides for your most important flows, and run your eval suite against every model in the chain — not just the primary.
  • Degrade visibly. If you served a weaker model, mark it internally. It explains quality dips in your metrics later.
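
Both caveats fit in a few lines. The sketch below assumes prompts are keyed by a flow name and a model id; `promptOverrides`, `promptFor`, `tagServed`, and the ids in the example are hypothetical.

```typescript
interface Served<T> {
  value: T;
  servedBy: string; // id of the model that actually answered
  degraded: boolean; // true when a fallback model served the request
}

// Per-flow, per-model system prompt overrides. Flow names and model ids are placeholders.
const promptOverrides: Record<string, Record<string, string>> = {
  "support-triage": {
    "secondary-model": "Answer in at most three bullet points.",
  },
};

// Use the override tuned for this model if one exists, else the flow's default prompt.
function promptFor(flow: string, modelId: string, defaultPrompt: string): string {
  return promptOverrides[flow]?.[modelId] ?? defaultPrompt;
}

// Wrap a response with the metadata needed to explain quality dips later.
function tagServed<T>(value: T, servedBy: string, primaryId: string): Served<T> {
  return { value, servedBy, degraded: servedBy !== primaryId };
}
```

Logging `servedBy` and `degraded` alongside your quality metrics lets you split dashboards by the model that actually answered.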

Open-Source Models Are a Different Contract

APIs sell you tokens; self-hosting sells you control and a pager. With Llama-class models you gain data locality, fixed costs at scale, and fine-tuning freedom, and in exchange you take on GPU capacity planning, quantization tradeoffs, inference servers, and upgrades. My rule: start on APIs, then move specific high-volume or sensitive workloads in-house once the economics are proven with real traffic numbers.
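
A rough way to frame "the economics" is a break-even volume: the monthly token count at which a fixed self-hosting bill equals what the API would have charged. The function below is a back-of-the-envelope sketch with a hypothetical name, and it deliberately ignores engineering time, utilization, and quality differences.

```typescript
// Monthly token volume at which a fixed-cost self-hosted deployment costs the
// same as a per-token API. Above this volume, self-hosting is cheaper on
// raw spend alone. Both inputs must use the same currency.
function breakEvenTokensPerMonth(
  fixedMonthlyCost: number,
  apiPricePerMillionTokens: number,
): number {
  if (fixedMonthlyCost < 0) {
    throw new RangeError("fixedMonthlyCost must not be negative");
  }
  if (apiPricePerMillionTokens <= 0) {
    throw new RangeError("apiPricePerMillionTokens must be positive");
  }
  return (fixedMonthlyCost / apiPricePerMillionTokens) * 1_000_000;
}
```

Feed it your own measured traffic and quoted prices rather than estimates; the point of the rule above is that the comparison only means something once the inputs are real.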


Closing Thought

Multi-model isn't a buzzword, it's insurance. The adapter layer costs you a week once. Vendor lock-in costs you a quarter every time the landscape shifts — and in this market, it shifts every few months.