March 14, 2025 · 13 min read · Yash Chouriya

Shipping LLMs at Scale: Inference Economics and Feedback Loops

Two questions decide whether an LLM feature survives contact with production:

  1. Can you afford to run it? (inference economics)
  2. Does it get better or worse over time? (feedback loops)

Most teams confront both late, usually via an invoice and an angry user. Here's how I approach them from day one.


Part 1: Inference Economics

Latency is a UX budget

Users tolerate about a second of dead air. Everything past that needs either speed or perceived speed:

  • Stream everything. Time-to-first-token is the metric that matters for chat UX, not total generation time. A response that starts in 400ms feels fast even if it takes 8 seconds to finish.
  • Spend your budget where it shows. Reasoning-heavy calls can afford big models; an autocomplete hint cannot. Match model size to interaction speed.
  • Cut output before input. Generation is the slow part — output tokens cost more time (and usually more money) than input tokens. Tight max-token limits and "answer concisely" instructions are real latency optimizations.
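To make the time-to-first-token point concrete, here is a minimal sketch of how the two latencies could be measured separately. It assumes the provider's streaming response can be treated as a plain iterable of text chunks; `measure_stream` and the injectable `clock` are hypothetical names for illustration, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float, float]:
    """Consume a stream of text chunks.

    Returns (full_text, time_to_first_token, total_time), both times in seconds.
    If the stream yields no non-empty chunk, time_to_first_token equals total_time.
    """
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        # Record the moment the first non-empty chunk arrives.
        if first_token_at is None and chunk:
            first_token_at = clock() - start
        parts.append(chunk)
    total = clock() - start
    ttft = first_token_at if first_token_at is not None else total
    return "".join(parts), ttft, total


if __name__ == "__main__":
    # Stand-in for a real streaming response (assumption: chunks arrive as strings).
    text, ttft, total = measure_stream(iter(["Hello", ", ", "world"]))
    print(f"{text!r} ttft={ttft:.6f}s total={total:.6f}s")
```

Tracking the two numbers separately is what lets you see whether a change helped perceived speed, actual speed, or neither.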

The cost levers, in order of impact

  1. Model selection. The gap between frontier and small-tier pricing is 10–50×. Most calls in a real product (classification, extraction, routing, summarization) don't need the frontier.
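  2. Prompt caching. If your prompts share a long static prefix (system prompt, schema, examples), provider-side caching bills that prefix at a reduced rate, which can cut input costs substantially. Structure prompts so the static part comes first and stays byte-identical: caches match on exact prefixes, so any change early in the prompt invalidates everything after it.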
  3. Response caching. Identical question, identical answer? Hash the normalized input and skip the model entirely. FAQ-shaped traffic tends to have surprisingly high hit rates.
  4. Output discipline. JSON-mode with a minimal schema beats "explain your reasoning then answer" by hundreds of tokens per call. If you need reasoning, keep it internal and short.
  5. Batching. For offline pipelines, such as our bulk document processing, batch APIs trade latency for a meaningful discount.
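Lever 3 is the easiest one to sketch. The following is one possible shape for a response cache, assuming an in-memory dict and a `call_model` function you supply. `ResponseCache`, `normalize`, and `cache_key` are illustrative names, and the normalization rules (lowercase, collapse whitespace) are an assumption that would need tuning for real traffic.

```python
import hashlib
import re
from typing import Callable, Dict


def normalize(question: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivially different inputs share a key."""
    return re.sub(r"\s+", " ", question.strip().lower())


def cache_key(model: str, prompt_version: str, question: str) -> str:
    # Model and prompt version are part of the key so that a prompt change
    # or model switch never serves answers produced by the old setup.
    payload = "\x1f".join([model, prompt_version, normalize(question)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache in front of a model call (hypothetical helper)."""

    def __init__(self, call_model: Callable[[str], str], model: str, prompt_version: str) -> None:
        self._call_model = call_model
        self._model = model
        self._prompt_version = prompt_version
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def answer(self, question: str) -> str:
        key = cache_key(self._model, self._prompt_version, question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(question)
        self._store[key] = result
        return result
```

In production the dict would likely be a shared store with a TTL, and this only makes sense for answers that do not depend on the user or on fresh data.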

Watch unit economics, not the invoice

Track cost per user action — per document processed, per conversation, per generated report. A growing invoice with flat unit costs is success; a flat invoice with growing unit costs is a leak. We caught a 3× cost regression once only because the per-document metric spiked while total spend looked "normal for growth."
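A rough sketch of what that check could look like, assuming you already aggregate spend and document counts per day. `DailyUsage`, `unit_cost_regression`, and the 1.5× threshold are hypothetical, and the numbers in the usage note are invented for illustration, not taken from our system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DailyUsage:
    day: str
    spend_usd: float
    documents: int


def cost_per_document(usage: DailyUsage) -> float:
    if usage.documents <= 0:
        raise ValueError(f"no documents processed on {usage.day}")
    return usage.spend_usd / usage.documents


def unit_cost_regression(
    history: List[DailyUsage],
    today: DailyUsage,
    threshold: float = 1.5,  # assumption: alert when unit cost is 1.5x the baseline
) -> Optional[float]:
    """Return today's unit cost as a multiple of the trailing mean if it
    meets the threshold, otherwise None."""
    if not history:
        return None
    baseline = sum(cost_per_document(u) for u in history) / len(history)
    ratio = cost_per_document(today) / baseline
    return ratio if ratio >= threshold else None
```

With three baseline days at $0.10 per document, a day with $180 of spend over 600 documents comes back as roughly 3.0, even though $180 on its own would pass for ordinary growth. A day with $200 over 2,000 documents returns `None`.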


Part 2: Feedback Loops

Models don't improve on their own — and prompts silently rot as inputs drift. A production LLM system needs a loop: capture → evaluate → improve → verify.

Capture signals

  • Explicit: thumbs up/down, "report this answer", user edits to generated drafts. Low volume, high signal.
  • Implicit: Did the user copy the answer? Retry the question? Abandon the flow? Accept the suggestion? High volume, noisier — but it's the signal you'll actually have at scale.
  • Operational: retrieval scores, tool-call failures, refusals, truncations. These catch system failures users never report.

Log every interaction with its full context — prompt version, model, retrieved chunks, parameters. An unreproducible bad answer is an unfixable bad answer.
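As a sketch of what "full context" might mean in practice, the record below captures the fields named above as one JSON line per interaction. The field names and the `InteractionRecord` class are assumptions for illustration; a real schema would add whatever else your pipeline needs to replay a call.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class InteractionRecord:
    prompt_version: str             # e.g. a git SHA or a version tag for the prompt
    model: str
    parameters: Dict[str, Any]      # temperature, max tokens, and so on
    retrieved_chunk_ids: List[str]  # which chunks the retriever returned
    user_input: str
    output: str
    interaction_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

    def to_json_line(self) -> str:
        """Serialize to one line, suitable for an append-only log file."""
        return json.dumps(asdict(self), sort_keys=True)


def from_json_line(line: str) -> InteractionRecord:
    """Rebuild a record from the log so a bad answer can be replayed."""
    return InteractionRecord(**json.loads(line))
```

Storing chunk IDs rather than chunk text keeps records small, but it only works if the chunk store is itself versioned. Otherwise log the text.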

Evaluate continuously

The core asset is a golden dataset: real inputs with verified-correct outputs, grown continuously from production failures. Every bad answer a user flags becomes a test case. Then:

  • Run the suite on every prompt change, model upgrade, and pipeline tweak — in CI, like any other regression test.
  • Use LLM-as-judge for scale, humans for calibration. A judge model grading "is this answer supported by the provided context?" catches most regressions; a weekly human review of a sample keeps the judge honest.
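A minimal sketch of the regression suite, assuming each golden case is an input paired with a verified output and that correctness can be decided by a pluggable `is_correct` function. `GoldenCase`, `run_suite`, and `exact_match` are illustrative names. An LLM judge would replace `exact_match` behind the same interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    expected: str


@dataclass(frozen=True)
class EvalReport:
    total: int
    failed_ids: List[str]

    @property
    def pass_rate(self) -> float:
        if self.total == 0:
            return 1.0
        return (self.total - len(self.failed_ids)) / self.total


def exact_match(actual: str, expected: str) -> bool:
    """Simplest possible grader: case- and whitespace-insensitive equality."""
    return actual.strip().lower() == expected.strip().lower()


def run_suite(
    cases: List[GoldenCase],
    generate: Callable[[str], str],                         # the pipeline under test
    is_correct: Callable[[str, str], bool] = exact_match,   # grader; could be an LLM judge
) -> EvalReport:
    failed = [c.case_id for c in cases if not is_correct(generate(c.input_text), c.expected)]
    return EvalReport(total=len(cases), failed_ids=failed)
```

In CI this reduces to a single gate, for example `assert report.pass_rate >= 0.95`, with `failed_ids` printed so the failing cases can be inspected. The 0.95 is a placeholder, not a recommendation.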

Improve deliberately

With evals in place, improvements become safe and boring (the good kind of boring):

  • Prompt edits ship like code: branch, eval, review, merge — versioned and rollback-able.
  • Failure clusters tell you where to work: if 80% of bad answers trace to retrieval misses, no prompt edit will save you.
  • Model upgrades become an afternoon: run the suite, compare, switch. This is the payoff for all the discipline above.
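The failure-cluster point can be as simple as counting labels. The sketch below assumes each flagged answer has already been tagged with a root-cause label during review. The label names and the `failure_breakdown` helper are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def failure_breakdown(labels: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (label, share_of_failures) pairs, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(label, n / total) for label, n in counts.most_common()]


if __name__ == "__main__":
    # Invented labels for illustration only.
    reviewed = ["retrieval_miss"] * 8 + ["prompt_ambiguity", "truncation"]
    for label, share in failure_breakdown(reviewed):
        print(f"{label}: {share:.0%}")
```

If the top cluster is a retrieval problem, that is where the next week of work goes, whatever the prompt looks like.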

In our medical-coding system, this loop — human corrections feeding the eval set feeding weekly improvements — is what carried accuracy to the level where automation was actually trustworthy. Not a smarter model. The loop.


Closing Thought

Inference economics keep your product alive this quarter; feedback loops keep it alive next year. Neither is glamorous and both are just engineering, which is exactly why they're a durable advantage while everyone else is chasing the next model release.