February 10, 2026 · 12 min read · Yash Chouriya

Building Production RAG Pipelines: Lessons from the Trenches

A RAG demo takes an afternoon. A RAG system that users trust takes months. I learned this the hard way building document-heavy AI products, where a wrong answer isn't a quirky screenshot for social media — it's a real cost for a real user.

This post is a collection of things I wish someone had told me before my first production RAG deployment.


The Demo-to-Production Gap

The naive pipeline everyone starts with looks like this:

  1. Split documents into chunks
  2. Embed the chunks, store vectors
  3. Embed the query, fetch top-k by cosine similarity
  4. Stuff the chunks into the prompt and generate
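
The four steps above fit in a few dozen lines. This is a minimal sketch, not the code behind this post: `embed` is a toy bag-of-words stand-in for a real embedding model, the index is an in-memory list rather than a vector store, and every function name here is hypothetical. Generation is left out, so step 4 stops at building the prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_fixed(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Step 1: fixed-size character chunks with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def build_index(documents: list[str], size: int = 1000, overlap: int = 200):
    """Step 2: embed every chunk and keep (vector, chunk) pairs in memory."""
    return [(embed(c), c) for doc in documents for c in split_fixed(doc, size, overlap)]


def retrieve(index, query: str, k: int = 3) -> list[str]:
    """Step 3: rank chunks by cosine similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 4: stuff the retrieved chunks into the prompt."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Nothing in this sketch knows about document structure, versions, or tenants, which is where the problems described next come from.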

It works shockingly well on the five documents you test with. Then you load ten thousand real documents and the cracks appear: retrieval pulls the wrong sections, answers cite stale versions, and the model confidently summarizes context that has nothing to do with the question.

Every fix below addresses one of those cracks.


Chunking Is a Product Decision, Not a Preprocessing Step

The most common mistake I see is treating chunking as a fixed constant — chunk_size=1000, overlap=200 — copied from a tutorial.

Chunk boundaries decide what the model can see together. If a medical document has a diagnosis on page 2 and the supporting lab values on page 4, fixed-size chunking puts them in separate chunks with nothing linking them, so the model rarely sees both at once.

What works better in practice:

  • Structure-aware splitting. Split on headings, sections, and tables first; only fall back to fixed sizes inside long sections.
  • Parent-child chunks. Embed small chunks (precise retrieval), but hand the model the parent section (full context). This single change fixed more bad answers for me than any model upgrade.
  • Metadata on every chunk. Source, section title, page, document date. You will need it for filtering and for citations.
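
    def build_chunks(document):
        # split_by_structure, store_parent, sliding_window, and Chunk are
        # helpers defined elsewhere in the pipeline.
        for section in split_by_structure(document):
            # Store the full section once; children point back to it.
            parent_id = store_parent(section)
            # Small overlapping children are what gets embedded.
            for child in sliding_window(section.text, size=400, overlap=80):
                yield Chunk(
                    text=child,
                    parent_id=parent_id,
                    metadata={
                        "source": document.source,
                        "section": section.title,
                        "updated_at": document.updated_at,
                    },
                )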
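
The retrieval side of parent-child chunking is the mirror image: search over children, then swap each hit for its parent section before building the prompt. The sketch below is an assumption about how that could look, not the code behind this post. `Chunk` is a simplified version without metadata, `parents` is a hypothetical in-memory mapping from parent id to section text, and `sliding_window` assumes sizes are measured in characters (they could just as well be tokens).

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    parent_id: str


def sliding_window(text: str, size: int = 400, overlap: int = 80):
    """Yield overlapping character windows; the last window may be shorter."""
    step = size - overlap
    for start in range(0, len(text), step):
        yield text[start:start + size]
        if start + size >= len(text):
            break  # this window already reached the end of the text


def expand_to_parents(hits: list[Chunk], parents: dict[str, str], limit: int = 3) -> list[str]:
    """Map ranked child hits to unique parent sections, best-ranked first."""
    seen: set[str] = set()
    sections: list[str] = []
    for hit in hits:
        if hit.parent_id in seen:
            continue  # several children of one section count once
        seen.add(hit.parent_id)
        sections.append(parents[hit.parent_id])
        if len(sections) == limit:
            break
    return sections
```

Deduplicating by parent matters because neighbouring children from the same section often rank together, and sending the same section to the model twice only wastes context.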

Retrieval: Cosine Similarity Is Not Enough

Pure vector search fails in two predictable ways: it misses exact identifiers (codes, names, SKUs — embeddings blur them), and it happily returns similar-looking but wrong sections.

The production answer is hybrid retrieval + reranking:

  • Combine vector search with keyword search (BM25 or Postgres full-text) and merge results. With pgvector, both live in the same Postgres database, one query away.
  • Rerank the merged candidates with a cross-encoder or an LLM scoring pass, then keep the top handful.

    -- pgvector + full-text in one shot, merged in the application layer
    -- (vector half shown; $1 = query embedding, $2 = array of allowed sources)
    SELECT id, text, 1 - (embedding <=> $1) AS vec_score
    FROM chunks
    WHERE metadata->>'source' = ANY($2)
    ORDER BY embedding <=> $1
    LIMIT 25;
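
The merge step in the application layer is not spelled out here. One common choice is reciprocal rank fusion (RRF), which combines ranked lists using positions only, so vector scores and keyword scores never need to share a scale. The sketch below assumes each search returns a list of chunk ids ordered best-first; the function name is hypothetical, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first id lists into one list of (id, score) pairs.

    Each list contributes 1 / (k + rank) per id, with rank starting at 1.
    Ties are broken by id so the output order is deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Usage would be along the lines of `reciprocal_rank_fusion([vector_ids, keyword_ids])`, with the fused list then handed to the reranker.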

Two practical notes:

  • Filter before you search. Tenant, document set, date range. Most "hallucinations" I debugged were actually retrieval pulling from the wrong corpus.
  • Top-k is a latency and cost dial. Retrieve wide (20–50), rerank hard, pass 3–6 chunks to the model. Stuffing 20 chunks into the prompt makes answers worse, not better — the model anchors on irrelevant context.
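
The "retrieve wide, rerank hard, pass few" shape can be sketched as a single function. Everything here is illustrative: `token_overlap` is a toy scorer standing in for a cross-encoder or LLM scoring pass, candidates are assumed to be dicts with `id` and `text` keys, and `keep=5` is just one value inside the 3–6 range mentioned above.

```python
import re
from typing import Callable


def token_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def select_context(
    query: str,
    candidates: list[dict],
    score: Callable[[str, str], float] = token_overlap,
    keep: int = 5,
) -> list[dict]:
    """Rescore a wide candidate set and keep only the best few for the prompt."""
    ranked = sorted(candidates, key=lambda c: score(query, c["text"]), reverse=True)
    return ranked[:keep]
```

Keeping the scorer as a parameter means a real reranker can be swapped in later without touching the calling code.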

The Part Everyone Skips: Evaluation

You cannot improve what you don't measure, and "the answers feel better" is not a measurement.

The minimum viable eval setup:

  1. A golden set. 50–200 real questions with known-correct answers and the chunks that support them. Build it from actual user queries, not invented ones.
  2. Retrieval metrics. Did the right chunk land in the top-k (Recall@k)? This isolates retrieval failures from generation failures, which need different fixes.
  3. Answer grading. An LLM-as-judge comparing generated answers against the golden answers works well enough to catch regressions, as long as a human spot-checks the judge.

Run this on every change — new embedding model, new chunking, new prompt. RAG systems are full of coupled parts; evals are the only way to change one without silently breaking another.
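
Recall@k over a golden set is small enough to write by hand. A minimal sketch, assuming each golden item is a dict with a `question` and a list of `relevant_ids` (the ids of the chunks that support the answer), and that `retrieve` is whatever function returns ranked chunk ids for a question. These field and function names are hypothetical.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the supporting chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden question needs at least one supporting chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate_retrieval(
    golden_set: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Mean Recall@k across all golden questions."""
    scores = [
        recall_at_k(retrieve(item["question"]), set(item["relevant_ids"]), k)
        for item in golden_set
    ]
    return sum(scores) / len(scores)
```

Tracking this number per change makes it obvious whether a regression came from retrieval or from generation: if Recall@k held steady and answers got worse, the prompt or the model is the place to look.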


Operational Lessons

  • Embed once, cache forever. Hash the chunk text; re-embed only what changed. Re-embedding an entire corpus because one document updated is a surprisingly common money fire.
  • Version your index. When you change chunking or embedding models, build the new index alongside the old one and cut over after evals pass.
  • Show citations. Clickable sources turn "I don't trust this" into "I can verify this" — and they make your own debugging dramatically faster.
  • Log the full retrieval trace. Query, candidates, scores, final context. When a user reports a bad answer, this trace is the difference between a five-minute fix and a shrug.
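
The "embed once, cache forever" rule comes down to keying embeddings by a hash of the chunk text. The sketch below is one way to do it under stated assumptions: `cache` is any dict-like store, `embed_fn` is a placeholder for the real embedding call, and the model name is folded into the key so that switching embedding models cannot silently reuse old vectors. That last part is a design choice of this sketch, not something the list above prescribes.

```python
import hashlib
from typing import Callable


def content_key(text: str, model: str) -> str:
    """Stable cache key derived from the embedding model name and chunk text."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def embed_with_cache(
    texts: list[str],
    embed_fn: Callable[[str], list[float]],
    cache: dict[str, list[float]],
    model: str,
) -> list[list[float]]:
    """Return one vector per text, calling embed_fn only on cache misses."""
    vectors = []
    for text in texts:
        key = content_key(text, model)
        if key not in cache:
            cache[key] = embed_fn(text)  # only new or changed chunks reach the API
        vectors.append(cache[key])
    return vectors
```

When one document changes, only the chunks whose text actually differs produce new keys, so the rest of the corpus is served from the cache.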

Closing Thought

RAG isn't a model feature — it's an information-retrieval system with a language model at the end. Treat the retrieval half with the same engineering seriousness as the generation half, measure everything, and the "AI magic" gets a lot more reliable.