One Interface, Many Models: Integrating Anthropic, OpenAI, Gemini, and Open-Source LLMs
Over the last few years I've shipped production systems on Anthropic's Claude, OpenAI's GPT, Google's Gemini, and self-hosted open-source models. The single most valuable architectural decision across all of them was the same: never let a vendor's SDK leak past one file.
Models change monthly. Pricing changes quarterly. The system you build around them should survive both.
The Adapter Layer
Every provider speaks a slightly different dialect: message formats differ, system prompts attach differently, tool calls have different shapes, streaming chunks arrive differently. The fix is boring and effective — define your interface and adapt each vendor to it:
```typescript
interface ChatModel {
  generate(req: ChatRequest): Promise<ChatResponse>;
  stream(req: ChatRequest): AsyncIterable<Delta>;
}

interface ChatRequest {
  system?: string;
  messages: Message[]; // your canonical format
  tools?: ToolSpec[]; // your canonical tool schema
  maxOutputTokens?: number;
  temperature?: number;
}
```
Each adapter is 100–200 lines of translation code. Tedious to write, trivial to test, and it turns "migrate to the new model" from a rewrite into a config change.
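To make "translation code" concrete, here is a minimal sketch of the request-mapping half of one adapter. The wire format (`WireRequest`, `max_tokens`, system prompt as the first message) is invented for illustration and is not any specific vendor's API; a real adapter would target the provider's documented request shape and then call its SDK or HTTP endpoint.

```typescript
// Trimmed copies of the canonical types, so this block stands alone.
type Role = "user" | "assistant";
interface Message { role: Role; content: string; }
interface ChatRequest {
  system?: string;
  messages: Message[];
  maxOutputTokens?: number;
  temperature?: number;
}

// HYPOTHETICAL wire format: a provider that expects the system prompt as the
// first message and calls its output limit `max_tokens`.
interface WireMessage { role: "system" | Role; content: string; }
interface WireRequest {
  model: string;
  messages: WireMessage[];
  max_tokens?: number;
  temperature?: number;
}

// Pure translation: canonical request in, provider payload out.
function toWireRequest(model: string, req: ChatRequest): WireRequest {
  const messages: WireMessage[] = [];
  if (req.system !== undefined) {
    messages.push({ role: "system", content: req.system });
  }
  messages.push(...req.messages);

  const wire: WireRequest = { model, messages };
  // Only set optional fields the caller actually provided.
  if (req.maxOutputTokens !== undefined) wire.max_tokens = req.maxOutputTokens;
  if (req.temperature !== undefined) wire.temperature = req.temperature;
  return wire;
}
```

Keeping the mapping a pure function is what makes adapters trivial to test: no network, just input and expected payload.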
(If you'd rather not maintain adapters yourself, libraries like Vercel's AI SDK do this normalization for you — this site's own chat endpoint switched providers in one line during an upgrade. The principle is the same: depend on the abstraction, not the vendor.)
Where the Dialects Bite
A few differences that cost me real debugging hours:
- System prompts. Some APIs take a dedicated system parameter; others want it as the first message. Your adapter should own this, not your application code.
- Tool-call shapes. JSON schema dialects and argument encodings differ subtly. Normalize before validation so your tool layer sees one format.
- Streaming. Token deltas, role headers, and tool-call fragments arrive in provider-specific framings. Convert to your own delta type at the edge.
- Token accounting. Tokenizers differ. If you bill or budget by tokens, count with the provider's own usage numbers from the response, never your local estimate.
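As one way to picture the tool-call and streaming points, the sketch below assumes a canonical `Delta` type of my own invention and shows two normalizations: accepting tool arguments either as a JSON-encoded string or as an already-parsed object, and reassembling argument fragments that arrive across several stream chunks. The type and function names are hypothetical, not taken from any SDK.

```typescript
// Hypothetical canonical delta type; adapters would emit these at the edge.
type Delta =
  | { type: "text"; text: string }
  | { type: "tool_call"; id: string; name?: string; argsFragment: string };

interface ToolCall { id: string; name: string; args: Record<string, unknown>; }

// Accepts arguments as a JSON string or as a parsed object; always returns an object.
function normalizeToolArgs(raw: unknown): Record<string, unknown> {
  const value: unknown = typeof raw === "string" ? JSON.parse(raw) : raw;
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new TypeError("tool arguments must be a JSON object");
  }
  return value as Record<string, unknown>;
}

// Folds a finished stream of deltas into full text plus complete tool calls.
function assemble(deltas: Delta[]): { text: string; toolCalls: ToolCall[] } {
  let text = "";
  const pending = new Map<string, { name: string; json: string }>();
  for (const d of deltas) {
    if (d.type === "text") {
      text += d.text;
      continue;
    }
    const entry = pending.get(d.id) ?? { name: "", json: "" };
    if (d.name !== undefined) entry.name = d.name;
    entry.json += d.argsFragment;
    pending.set(d.id, entry);
  }
  const toolCalls: ToolCall[] = [];
  pending.forEach((entry, id) => {
    // A call with no argument fragments is treated as having empty arguments.
    const args = entry.json === "" ? {} : normalizeToolArgs(entry.json);
    toolCalls.push({ id, name: entry.name, args });
  });
  return { text, toolCalls };
}
```

Parsing only after the last fragment has arrived matters: a partial fragment is usually not valid JSON on its own.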
Routing: the Right Model for the Job
Once models are swappable, you stop asking "which model is best?" and start asking "which model is best for this call?"
A routing table that has served me well:
| Task | Good fit | Why |
|---|---|---|
| Complex reasoning, long documents | Frontier models (Claude, GPT, Gemini Pro) | Quality dominates cost |
| High-volume classification/extraction | Small fast models (Flash/Mini tier) | 10–50× cheaper, plenty accurate |
| Privacy-sensitive or offline workloads | Self-hosted open-source (Llama family) | Data never leaves your infra |
| Latency-critical UX (autocomplete, hints) | Small models, often local | Round-trip time is the feature |
Two rules of thumb:
- Route by task, not by loyalty. The cheapest adequate model wins each call.
- Re-evaluate quarterly. The frontier moves; last year's premium capability is this year's commodity tier.
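"Cheapest adequate model wins" can be read as a small selection function. The sketch below assumes you keep a per-task eval score and a price for each candidate; the `Candidate` shape, the model ids, and every number in the usage example are placeholders, not measurements.

```typescript
// Hypothetical candidate record: price plus eval scores keyed by task name.
interface Candidate {
  id: string;
  usdPerMillionTokens: number;
  evalScore: Record<string, number>; // 0..1, from your own eval suite
}

// Returns the cheapest candidate whose score for the task meets the bar,
// or undefined if no candidate is adequate.
function cheapestAdequate(
  candidates: Candidate[],
  task: string,
  minScore: number,
): Candidate | undefined {
  return candidates
    .filter((c) => (c.evalScore[task] ?? 0) >= minScore) // filter copies, so sort is safe
    .sort((a, b) => a.usdPerMillionTokens - b.usdPerMillionTokens)[0];
}

// Placeholder data for illustration only.
const candidates: Candidate[] = [
  { id: "frontier", usdPerMillionTokens: 10, evalScore: { classify: 0.97, reason: 0.92 } },
  { id: "small-fast", usdPerMillionTokens: 0.5, evalScore: { classify: 0.95, reason: 0.6 } },
];

const forClassification = cheapestAdequate(candidates, "classify", 0.9);
const forReasoning = cheapestAdequate(candidates, "reason", 0.9);
```

With this placeholder data, classification routes to `small-fast` and reasoning to `frontier`. The quarterly re-evaluation then amounts to refreshing the scores and prices, not touching call sites.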
Fallbacks and Resilience
Providers have incidents. Rate limits bite at the worst time. A production system needs a failure story better than a 500 page:
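```typescript
// Ordered by preference; each entry is a ChatModel adapter that also exposes an `id`.
const chain = [primary, secondary, lastResort];

// withTimeout, isRetryable, log, and AllProvidersFailedError are application
// helpers assumed to be defined elsewhere.
async function generateWithFallback(req: ChatRequest) {
  for (const model of chain) {
    try {
      return await withTimeout(model.generate(req), 30_000);
    } catch (err) {
      // A non-retryable error (e.g. a malformed request) would fail everywhere, so surface it.
      if (!isRetryable(err)) throw err;
      log.warn("falling back", { from: model.id, err });
    }
  }
  throw new AllProvidersFailedError();
}
```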
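The fallback loop leans on two helpers it does not define, `withTimeout` and `isRetryable`. One possible shape for them is sketched below. The assumption that adapter errors carry an HTTP-like numeric `status` is mine, as is the choice to treat 429 and 5xx as retryable; adjust both to whatever your adapters actually throw.

```typescript
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `promise` has not settled within `ms`.
// Note: this stops waiting but does not cancel the underlying request;
// cancelling would need something like an AbortSignal passed into the adapter.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// ASSUMPTION: adapters attach a numeric `status` to errors from HTTP failures.
function isRetryable(err: unknown): boolean {
  if (err instanceof Error && err.name === "TimeoutError") return true;
  const status = (err as { status?: unknown } | null | undefined)?.status;
  return typeof status === "number" && (status === 429 || status >= 500);
}
```

Under these assumptions a 400-class error other than 429 propagates immediately, which is usually what you want: a request one provider rejects as malformed is unlikely to succeed on the next.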
Caveats from production:
- Fallback changes behavior. A prompt tuned for one model can underperform on another. Keep per-model prompt overrides for your most important flows, and run your eval suite against every model in the chain, not just the primary.
- Degrade visibly. If you served a weaker model, mark it internally. It explains quality dips in your metrics later.
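Both caveats can be folded into the fallback path. The sketch below is one hypothetical arrangement: a prompt lookup that prefers a per-model override, and a result wrapper that records which model answered and whether it was a fallback. All names, model ids, and prompt strings are illustrative, and retry classification is left out to keep the example short.

```typescript
// Placeholder prompts: one default per flow, plus per-model overrides.
const defaultPrompts: Record<string, string> = {
  summarize: "Summarize the text in three sentences.",
};
const promptOverrides: Record<string, Record<string, string>> = {
  "secondary-model": {
    summarize: "Summarize the text in exactly three short sentences. No preamble.",
  },
};

function systemPromptFor(flow: string, modelId: string): string {
  const prompt = promptOverrides[modelId]?.[flow] ?? defaultPrompts[flow];
  if (prompt === undefined) throw new Error(`no prompt for flow "${flow}"`);
  return prompt;
}

interface Served<T> {
  value: T;
  servedBy: string;    // id of the model that produced the value
  degraded: boolean;   // true if anything other than the first model answered
  attempted: string[]; // every model id tried, in order
}

interface ChainEntry<T> {
  id: string;
  run: (systemPrompt: string) => Promise<T>;
}

async function serve<T>(flow: string, chain: ChainEntry<T>[]): Promise<Served<T>> {
  const attempted: string[] = [];
  for (const model of chain) {
    const degraded = attempted.length > 0;
    attempted.push(model.id);
    const prompt = systemPromptFor(flow, model.id); // config errors are not swallowed
    try {
      const value = await model.run(prompt);
      return { value, servedBy: model.id, degraded, attempted };
    } catch {
      // Simplification: a real loop would rethrow non-retryable errors here.
    }
  }
  throw new Error(`all providers failed: ${attempted.join(", ")}`);
}
```

Logging `servedBy` and `degraded` next to your quality metrics is what lets you separate "the model got worse" from "we were on the backup that hour".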
Open-Source Models Are a Different Contract
APIs sell you tokens; self-hosting sells you control and a pager. With Llama-class models you gain data locality, fixed costs at scale, and fine-tuning freedom, and you take on GPU capacity planning, quantization tradeoffs, inference servers, and upgrades. My rule: start on APIs, then move specific high-volume or sensitive workloads in-house once the economics are proven with real traffic numbers.
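"Proven with real traffic numbers" comes down to a break-even calculation. The sketch below is deliberately crude: it treats self-hosting as a single fixed monthly cost and the API as a flat per-token price, ignoring engineering time and any difference in quality. The dollar figures in the usage lines are made up; substitute your own.

```typescript
// Monthly token volume at which a fixed self-hosting bill equals the API bill.
function breakEvenTokensPerMonth(
  fixedMonthlyUsd: number,        // GPUs, hosting, on-call: whatever you count as fixed
  apiUsdPerMillionTokens: number, // blended API price for the workload
): number {
  if (apiUsdPerMillionTokens <= 0) {
    throw new RangeError("API price must be positive");
  }
  return (fixedMonthlyUsd / apiUsdPerMillionTokens) * 1_000_000;
}

// Hypothetical inputs: $6,000/month fixed cost vs. $2 per million tokens.
const breakEven = breakEvenTokensPerMonth(6_000, 2); // 3,000,000,000 tokens/month

// Compare against measured traffic, not a forecast.
const measuredTokensPerMonth = 1_200_000_000; // placeholder
const selfHostingPaysOff = measuredTokensPerMonth > breakEven; // false for these inputs
```

If measured volume sits well below the break-even point, as in this made-up case, the workload stays on the API unless privacy requirements decide the question on their own.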
Closing Thought
Multi-model isn't a buzzword; it's insurance. The adapter layer costs you a week once. Vendor lock-in costs you a quarter every time the landscape shifts, and in this market it shifts every few months.