Shipping LLMs at Scale: Inference Economics and Feedback Loops
Two questions decide whether an LLM feature survives contact with production:
- Can you afford to run it? (inference economics)
- Does it get better or worse over time? (feedback loops)
Most teams discover both questions late — usually via an invoice and an angry user. Here's how I approach them from day one.
Part 1: Inference Economics
Latency is a UX budget
Users tolerate about a second of dead air. Everything past that needs either speed or perceived speed:
- Stream everything. Time-to-first-token is the metric that matters for chat UX, not total generation time. A response that starts in 400ms feels fast even if it takes 8 seconds to finish.
- Spend your budget where it shows. Reasoning-heavy calls can afford big models; an autocomplete hint cannot. Match model size to interaction speed.
- Cut output before input. Generation is the slow part: output tokens cost more time (and usually more money) than input tokens. Tight max-token limits and "answer concisely" instructions are real latency optimizations.
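To make time-to-first-token something you can put on a dashboard, it helps to measure it separately from total generation time. Here is a minimal sketch, assuming the provider's streaming response can be treated as a plain Python iterable of text chunks; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only stands in for a real streaming client.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_seconds, total_seconds, text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for tok in tokens:
        if ttft is None:
            # First chunk arrived: this is what the user perceives as "it started".
            ttft = time.perf_counter() - start
        parts.append(tok)
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to the total time.
    return (ttft if ttft is not None else total), total, "".join(parts)


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for tok in ["Hello", ", ", "world"]:
        time.sleep(0.01)  # simulated per-chunk generation delay
        yield tok


if __name__ == "__main__":
    ttft, total, text = measure_stream(fake_stream())
    print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```

Tracking both numbers per request shows whether a slowdown is in startup (queueing, long prompts) or in generation (long outputs).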
The cost levers, in order of impact
- Model selection. The gap between frontier and small-tier pricing is 10–50×. Most calls in a real product (classification, extraction, routing, summarization) don't need the frontier.
- Prompt caching. If your prompts share a long static prefix (system prompt, schema, examples), provider-side caching can cut input costs dramatically. Structure prompts so the static part comes first and stays byte-identical.
- Response caching. Identical question, identical answer? Hash the normalized input and skip the model entirely. FAQ-shaped traffic has shocking hit rates.
- Output discipline. JSON mode with a minimal schema beats "explain your reasoning then answer" by hundreds of tokens per call. If you need reasoning, keep it internal and short.
- Batching. For offline pipelines, such as our bulk document processing, batch APIs trade latency for a meaningful discount.
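The response-caching lever above can be sketched in a few lines. This is an illustration under assumptions, not a production cache: the normalization rule (lowercase, collapse whitespace), the in-memory dict, and the names `normalize`, `cache_key`, and `ResponseCache` are all hypothetical, and `call_model` stands in for whatever function actually calls your provider.

```python
import hashlib
import re
from typing import Callable, Dict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs share a key."""
    return re.sub(r"\s+", " ", text.strip().lower())


def cache_key(prompt_version: str, user_input: str) -> str:
    """Hash the prompt version together with the normalized input."""
    payload = f"{prompt_version}\n{normalize(user_input)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache that skips the model call on a repeated question."""

    def __init__(self, call_model: Callable[[str], str], prompt_version: str) -> None:
        self._call_model = call_model  # assumed: takes the user input, returns text
        self._prompt_version = prompt_version
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def answer(self, user_input: str) -> str:
        key = cache_key(self._prompt_version, user_input)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(user_input)
        self._store[key] = result
        return result
```

Including the prompt version in the key means a prompt change invalidates old answers automatically, and the hit and miss counters give you the hit rate to report.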
Watch unit economics, not the invoice
Track cost per user action — per document processed, per conversation, per generated report. A growing invoice with flat unit costs is success; a flat invoice with growing unit costs is a leak. We caught a 3× cost regression once only because the per-document metric spiked while total spend looked "normal for growth."
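A sketch of that check, assuming you already have total spend and an action count per reporting period. The `Period` type, the `is_leak` helper, the 20% tolerance, and the figures in the example are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Period:
    total_cost_usd: float
    actions: int  # e.g. documents processed in the period


def unit_cost(p: Period) -> float:
    """Cost per user action for one reporting period."""
    if p.actions <= 0:
        raise ValueError("unit cost is undefined without any actions")
    return p.total_cost_usd / p.actions


def is_leak(prev: Period, cur: Period, tolerance: float = 0.2) -> bool:
    """True when cost per action grew by more than `tolerance` (a fraction)."""
    return unit_cost(cur) > unit_cost(prev) * (1 + tolerance)


if __name__ == "__main__":
    baseline = Period(total_cost_usd=1000.0, actions=10_000)
    growth = Period(total_cost_usd=2000.0, actions=20_000)  # invoice doubled, unit cost flat
    leak = Period(total_cost_usd=1000.0, actions=5_000)     # invoice flat, unit cost doubled
    print(is_leak(baseline, growth), is_leak(baseline, leak))
```

The invoice alone cannot tell the second and third periods apart from "fine"; the per-action number can.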
Part 2: Feedback Loops
Models don't improve on their own — and prompts silently rot as inputs drift. A production LLM system needs a loop: capture → evaluate → improve → verify.
Capture signals
- Explicit: thumbs up/down, "report this answer", user edits to generated drafts. Low volume, high signal.
- Implicit: Did the user copy the answer? Retry the question? Abandon the flow? Accept the suggestion? High volume and noisier, but it's the signal you'll actually have at scale.
- Operational: retrieval scores, tool-call failures, refusals, truncations. These catch system failures users never report.
Log every interaction with its full context — prompt version, model, retrieved chunks, parameters. An unreproducible bad answer is an unfixable bad answer.
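One possible shape for such a log record, written as a JSON line. The field names here are assumptions chosen to mirror the list above, not a schema from any particular logging tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class InteractionRecord:
    request_id: str
    prompt_version: str
    model: str
    parameters: Dict[str, Any]       # e.g. temperature, max tokens
    retrieved_chunk_ids: List[str]   # which chunks the model actually saw
    user_input: str
    output: str
    signals: Dict[str, Any] = field(default_factory=dict)  # thumbs, retries, copies


def to_log_line(record: InteractionRecord) -> str:
    """Serialize one interaction as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    rec = InteractionRecord(
        request_id="req-1",
        prompt_version="v12",
        model="small-tier",  # placeholder model name
        parameters={"temperature": 0.0, "max_tokens": 256},
        retrieved_chunk_ids=["doc-7#3", "doc-9#1"],
        user_input="What is the refund window?",
        output="30 days.",
        signals={"thumbs": "down"},
    )
    print(to_log_line(rec))
```

With the prompt version, parameters, and chunk IDs stored next to the output, a flagged answer can be replayed against the exact configuration that produced it.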
Evaluate continuously
The core asset is a golden dataset: real inputs with verified-correct outputs, grown continuously from production failures. Every bad answer a user flags becomes a test case. Then:
- Run the suite on every prompt change, model upgrade, and pipeline tweak, in CI, like any other regression test.
- Use LLM-as-judge for scale, humans for calibration. A judge model grading "is this answer supported by the provided context?" catches most regressions; a weekly human review of a sample keeps the judge honest.
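A golden-dataset run can be as small as the sketch below. Everything named here is hypothetical: `GoldenCase`, `run_suite`, and `check_regression` are illustrative helpers, `generate` stands in for your pipeline, and `grade` is where either an exact-match check or a judge-model call would plug in. The 2-point allowed drop is an arbitrary example threshold.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class GoldenCase:
    case_id: str
    input_text: str
    expected: str  # verified-correct output


def run_suite(
    cases: Sequence[GoldenCase],
    generate: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> float:
    """Return the fraction of golden cases whose output passes `grade`."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(1 for c in cases if grade(generate(c.input_text), c.expected))
    return passed / len(cases)


def check_regression(pass_rate: float, baseline: float, max_drop: float = 0.02) -> None:
    """Fail (as a CI test would) if the pass rate fell more than `max_drop`."""
    if pass_rate < baseline - max_drop:
        raise AssertionError(
            f"pass rate {pass_rate:.2%} is below baseline {baseline:.2%} "
            f"by more than {max_drop:.2%}"
        )


def exact_match(actual: str, expected: str) -> bool:
    """Simplest possible grader; a judge model could replace this."""
    return actual.strip() == expected.strip()
```

In CI, `check_regression` would run against the pass rate stored from the last merged prompt version, so a change that drops accuracy fails the build like any other broken test.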
Improve deliberately
With evals in place, improvements become safe and boring (the good kind of boring):
- Prompt edits ship like code: branch, eval, review, merge, all versioned and easy to roll back.
- Failure clusters tell you where to work: if 80% of bad answers trace to retrieval misses, no prompt edit will save you.
- Model upgrades become an afternoon: run the suite, compare, switch. This is the payoff for all the discipline above.
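Finding failure clusters does not need anything fancy once each bad answer carries a root-cause label. A minimal sketch, assuming someone (a reviewer or a triage script) has already tagged each failure; the label names and the `failure_breakdown` helper are illustrative only.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def failure_breakdown(labels: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (label, share_of_failures) pairs, largest cluster first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(label, n / total) for label, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical triage labels for ten flagged answers.
    labels = ["retrieval_miss"] * 8 + ["prompt_ambiguity", "truncation"]
    for label, share in failure_breakdown(labels):
        print(f"{label}: {share:.0%}")
```

If the top line of that output is a retrieval problem, the next week of work belongs to the retriever, not the prompt.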
In our medical-coding system, this loop — human corrections feeding the eval set feeding weekly improvements — is what carried accuracy to the level where automation was actually trustworthy. Not a smarter model. The loop.
Closing Thought
Inference economics keep your product alive this quarter; feedback loops keep it alive next year. Neither is glamorous, both are just engineering — and that's exactly why they're a durable advantage while everyone else is chasing the next model release.