July 22, 2025 · 12 min read · Yash Chouriya

One Interface, Many Models: Integrating Anthropic, OpenAI, Gemini, and Open-Source LLMs

Over the last few years I've shipped production systems on Anthropic's Claude, OpenAI's GPT, Google's Gemini, and self-hosted open-source models. The single most valuable architectural decision across all of them was the same: never let a vendor's SDK leak past one file.

Models change monthly. Pricing changes quarterly. The system you build around them should survive both.

The Adapter Layer

Every provider speaks a slightly different dialect: message formats differ, system prompts attach differently, tool calls have different shapes, streaming chunks arrive differently. The fix is boring and effective — define your interface and adapt each vendor to it:

```typescript
interface ChatModel {
  generate(req: ChatRequest): Promise<ChatResponse>;
  stream(req: ChatRequest): AsyncIterable<Delta>;
}

interface ChatRequest {
  system?: string;
  messages: Message[];        // your canonical format
  tools?: ToolSpec[];         // your canonical tool schema
  maxOutputTokens?: number;
  temperature?: number;
}
```

Each adapter is 100–200 lines of translation code. Tedious to write, trivial to test, and it turns "migrate to the new model" from a rewrite into a config change.
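To make that concrete, here is a minimal sketch of one adapter's `generate` path. The vendor wire format (`VendorPayload`, `VendorReply`, and the `max_tokens` and `usage` field names) is hypothetical and not any real provider's API. `stream` is omitted for brevity. The transport is injected so the translation can be unit-tested without a network call.

```typescript
// Minimal canonical types, standing in for the ones in the interface above.
type Role = "user" | "assistant";
interface Message { role: Role; content: string }
interface ChatRequest {
  system?: string;
  messages: Message[];
  maxOutputTokens?: number;
  temperature?: number;
}
interface ChatResponse { text: string; inputTokens: number; outputTokens: number }

// Hypothetical vendor wire format: the system prompt travels as the first
// message, and field names differ from the canonical ones.
interface VendorPayload {
  messages: { role: "system" | Role; content: string }[];
  max_tokens?: number;
  temperature?: number;
}
interface VendorReply {
  output: string;
  usage: { prompt_tokens: number; completion_tokens: number };
}

// The HTTP call is injected so the translation logic can be tested offline.
type Transport = (payload: VendorPayload) => Promise<VendorReply>;

class HypotheticalVendorAdapter {
  private readonly send: Transport;

  constructor(send: Transport) {
    this.send = send;
  }

  // Canonical request -> vendor payload. The system prompt placement lives here,
  // not in application code.
  toPayload(req: ChatRequest): VendorPayload {
    const messages: VendorPayload["messages"] = [];
    if (req.system) messages.push({ role: "system", content: req.system });
    for (const m of req.messages) messages.push({ role: m.role, content: m.content });
    return { messages, max_tokens: req.maxOutputTokens, temperature: req.temperature };
  }

  // Vendor reply -> canonical response, using the provider's own token counts.
  async generate(req: ChatRequest): Promise<ChatResponse> {
    const reply = await this.send(this.toPayload(req));
    return {
      text: reply.output,
      inputTokens: reply.usage.prompt_tokens,
      outputTokens: reply.usage.completion_tokens,
    };
  }
}
```

Because `toPayload` is a pure function, most of an adapter's tests can be table-driven: canonical request in, expected vendor payload out.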

(If you'd rather not maintain adapters yourself, libraries like Vercel's AI SDK do this normalization for you — this site's own chat endpoint switched providers in one line during an upgrade. The principle is the same: depend on the abstraction, not the vendor.)

Where the Dialects Bite

A few differences that cost me real debugging hours:

—System prompts. Some APIs take a dedicated system parameter; others want it as the first message. Your adapter should own this, not your application code.
—Tool-call shapes. JSON schema dialects and argument encodings differ subtly. Normalize before validation so your tool layer sees one format.
—Streaming. Token deltas, role headers, and tool-call fragments arrive in provider-specific framings. Convert to your own delta type at the edge.
—Token accounting. Tokenizers differ. If you bill or budget by tokens, count with the provider's own usage numbers from the response, never your local estimate.
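One concrete instance of the tool-call and streaming differences: depending on the provider, tool arguments can come back as a JSON-encoded string or as an already-parsed object. The sketch below normalizes both at the edge. It also assumes, hypothetically, that streamed argument text arrives as string fragments tagged with a call id; check each provider's actual framing. `normalizeToolArgs`, `toToolCall`, and `ToolArgBuffer` are illustrative names, not a library API.

```typescript
// Canonical tool call: arguments are always a parsed JSON object.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

// Accept either a JSON string or an already-parsed value and return a plain
// object, so schema validation downstream only ever sees one shape.
function normalizeToolArgs(raw: unknown): Record<string, unknown> {
  // Assumption: an empty string means "no arguments" for a zero-arg tool.
  const value = typeof raw === "string" ? (raw.trim() === "" ? {} : JSON.parse(raw)) : raw;
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    throw new Error(`tool arguments must be a JSON object, got: ${String(value)}`);
  }
  return value as Record<string, unknown>;
}

function toToolCall(id: string, name: string, rawArgs: unknown): ToolCall {
  return { id, name, args: normalizeToolArgs(rawArgs) };
}

// Streaming: argument text may arrive in pieces. Buffer per call id and parse
// only once the provider signals that the call is complete.
class ToolArgBuffer {
  private readonly parts = new Map<string, string>();

  append(callId: string, fragment: string): void {
    this.parts.set(callId, (this.parts.get(callId) ?? "") + fragment);
  }

  finish(callId: string): Record<string, unknown> {
    const text = this.parts.get(callId) ?? "";
    this.parts.delete(callId);
    return normalizeToolArgs(text);
  }
}
```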

Routing: the Right Model for the Job

Once models are swappable, you stop asking "which model is best?" and start asking "which model is best for this call?"

A routing table that has served me well:

Task	Good fit	Why
Complex reasoning, long documents	Frontier models (Claude, GPT, Gemini Pro)	Quality dominates cost
High-volume classification/extraction	Small fast models (Flash/Mini tier)	10–50× cheaper, plenty accurate
Privacy-sensitive or offline workloads	Self-hosted open-source (Llama family)	Data never leaves your infra
Latency-critical UX (autocomplete, hints)	Small models, often local	Round-trip time is the feature
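In code, the table can live as plain configuration. A minimal sketch with hypothetical task names and placeholder model ids; each route is an ordered list, so the first entry is the preferred model and the rest double as its fallback chain.

```typescript
// Task categories mirroring the routing table (names are illustrative).
type Task = "reasoning" | "extraction" | "private" | "autocomplete";

// Placeholder model ids; substitute whatever ids your adapters register.
// Order matters: the first entry is preferred, the rest are fallbacks.
const routes: Record<Task, readonly string[]> = {
  reasoning: ["frontier-a", "frontier-b"],
  extraction: ["small-fast-a", "small-fast-b"],
  private: ["self-hosted-llama"], // deliberately no external API fallback
  autocomplete: ["local-small"],
};

function modelsFor(task: Task): readonly string[] {
  return routes[task];
}
```

Typing the table as `Record<Task, ...>` makes the compiler reject a new task category that has no route. The privacy-sensitive route intentionally has no external fallback, so resilience logic can never quietly override a data-locality requirement.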

Two rules of thumb:

—Route by task, not by loyalty. The cheapest adequate model wins each call.
—Re-evaluate quarterly. The frontier moves; last year's premium capability is this year's commodity tier.
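The first rule becomes mechanical once you have per-task eval scores. A sketch, assuming a hypothetical `Candidate` record with a blended price and a 0 to 1 score from your own eval suite; the figures in the example are invented for illustration.

```typescript
interface Candidate {
  id: string;
  usdPerMillionTokens: number; // blended input+output price
  evalScore: number;           // 0..1, from your own eval suite for this task
}

// "Cheapest adequate model": keep only models that clear the quality bar for
// this task, then take the lowest price. Returns undefined if none qualify.
function cheapestAdequate(candidates: Candidate[], minScore: number): Candidate | undefined {
  return candidates
    .filter((c) => c.evalScore >= minScore)
    .reduce<Candidate | undefined>(
      (best, c) => (best === undefined || c.usdPerMillionTokens < best.usdPerMillionTokens ? c : best),
      undefined,
    );
}

// Invented example pool.
const pool: Candidate[] = [
  { id: "frontier", usdPerMillionTokens: 10, evalScore: 0.95 },
  { id: "mini", usdPerMillionTokens: 0.5, evalScore: 0.91 },
  { id: "tiny", usdPerMillionTokens: 0.1, evalScore: 0.78 },
];
console.log(cheapestAdequate(pool, 0.9)?.id); // prints: mini
```

Re-evaluating quarterly then amounts to refreshing the prices and `evalScore` values and rerunning the selection.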

Fallbacks and Resilience

Providers have incidents. Rate limits bite at the worst time. A production system needs a failure story better than a 500 page:

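```typescript
// Ordered from preferred to last resort. Assumes each adapter also exposes
// an `id` for logging (ChatModel itself doesn't declare one).
const chain: Array<ChatModel & { id: string }> = [primary, secondary, lastResort];

async function generateWithFallback(req: ChatRequest): Promise<ChatResponse> {
  let lastError: unknown;
  for (const model of chain) {
    try {
      // Bound each attempt so one hung provider can't stall the whole chain.
      return await withTimeout(model.generate(req), 30_000);
    } catch (err) {
      // Non-retryable errors (e.g. a malformed request) surface immediately.
      if (!isRetryable(err)) throw err;
      lastError = err;
      log.warn("falling back", { from: model.id, err });
    }
  }
  // Keep the last provider error so the final failure is debuggable.
  throw new AllProvidersFailedError(lastError);
}
```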
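The loop above leans on two helpers it doesn't define. One possible sketch of each follows. The `status` field on errors is an assumption about how your adapters report provider failures, and which statuses count as retryable is a policy choice, not a standard.

```typescript
// Reject if `work` hasn't settled within `ms` milliseconds. This stops waiting;
// it does not cancel the underlying request (an AbortSignal is needed for that).
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const err = new Error(`timed out after ${ms} ms`);
      err.name = "TimeoutError";
      reject(err);
    }, ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Assumed error shape: adapters attach the provider's HTTP status when known.
interface ProviderErrorLike {
  name?: string;
  status?: number;
}

// Fall back on timeouts, rate limits (429), and server-side failures (5xx).
// Anything else, such as a 400, usually points at a bug in our request or
// adapter rather than a transient outage, so surface it immediately.
function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { name, status } = err as ProviderErrorLike;
  if (name === "TimeoutError") return true;
  return status === 429 || (status !== undefined && status >= 500);
}
```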

Caveats from production:

—Fallback changes behavior. A prompt tuned for one model can underperform on another. Keep per-model prompt overrides for your most important flows, and run your eval suite against every model in the chain — not just the primary.
—Degrade visibly. If you served a weaker model, mark it internally. It explains quality dips in your metrics later.
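For the second caveat, one lightweight option is to stamp every response with the model that actually answered. The `ServedResponse` shape and `markServed` helper are hypothetical; the point is simply to carry `servedBy` and a `degraded` flag into logs and metrics.

```typescript
interface ChatResponse {
  text: string;
}

// Hypothetical annotation carried alongside every response.
interface ServedResponse extends ChatResponse {
  servedBy: string;  // id of the model that actually produced the answer
  degraded: boolean; // true when it wasn't the first (preferred) model in the chain
}

function markServed(res: ChatResponse, servedBy: string, chainIds: readonly string[]): ServedResponse {
  return { ...res, servedBy, degraded: servedBy !== chainIds[0] };
}
```

Calling this where the fallback loop returns, and emitting `degraded` as a metric dimension, lets a later quality dip be lined up against how often the chain fell through.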

Open-Source Models Are a Different Contract

APIs sell you tokens; self-hosting sells you control and a pager. With Llama-class models you gain data locality, fixed costs at scale, and fine-tuning freedom; in exchange you take on GPU capacity planning, quantization tradeoffs, inference servers, and upgrades. My rule: start on APIs, and move specific high-volume or sensitive workloads in-house once the economics are proven with real traffic numbers.
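Proving the economics mostly comes down to one break-even calculation. A sketch with invented inputs: a blended API price per million tokens against an all-in monthly cost for the self-hosted deployment (GPUs, hosting, and the engineering time behind that pager).

```typescript
interface Economics {
  apiUsdPerMillionTokens: number; // blended input+output API price
  selfHostedUsdPerMonth: number;  // all-in fixed monthly cost of self-hosting
}

// Monthly token volume at which both options cost the same. Above it the
// fixed-cost deployment is cheaper, assuming it actually has the capacity to
// serve that volume at acceptable quality and latency.
function breakEvenTokensPerMonth(e: Economics): number {
  return (e.selfHostedUsdPerMonth / e.apiUsdPerMillionTokens) * 1_000_000;
}

// Invented figures: $2 per million tokens vs. $8,000/month all-in.
const tokens = breakEvenTokensPerMonth({ apiUsdPerMillionTokens: 2, selfHostedUsdPerMonth: 8_000 });
console.log(tokens / 1e9); // prints: 4 (billion tokens per month)
```

Real traffic matters because the answer is sensitive to every input: halve the API price and the break-even volume doubles.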

Closing Thought

Multi-model isn't a buzzword; it's insurance. The adapter layer costs you a week once. Vendor lock-in costs you a quarter every time the landscape shifts, and in this market it shifts every few months.