Most RAG tutorials walk you through five lines of LangChain code and call it a pipeline.
That works for a demo. It does not survive contact with production traffic, a real corpus, or anyone who expects the answers to be reliably correct.
This is a walkthrough of how to actually assemble a RAG pipeline — the components, the decisions inside each component, and the parts of the system that exist outside the happy path but determine whether the whole thing holds up.
The audience here is engineers who are going to build and operate the thing. The framing is opinionated; pick the parts that match your constraints.
The pipeline at a glance
A production RAG system is really two pipelines plus a harness around them.
The indexing pipeline runs offline and on a schedule. It ingests source documents, parses them, splits them into retrievable units, generates embeddings, and writes the result to a vector store and a lexical index.
The query pipeline runs online per request. It takes a user query, rewrites or decomposes it, retrieves candidate passages from both indices, reranks them, assembles a prompt, generates a response, and emits telemetry.
The evaluation harness wraps both. It is a golden set, a regression runner, an LLM-as-judge with calibration, and the observability layer that tells you which queries are failing in production.
Skip any of the three and the system will appear to work until it does not.
INDEXING PIPELINE
Documents → Parsing → Chunking → Embeddings → Vector + Lexical Indices
QUERY PIPELINE
User Query → Rewrite / Decompose → Hybrid Retrieval → Reranking → Generation
MEASUREMENT HARNESS
Evaluation ↔ Observability ↔ Feedback Loop
Part 1: The indexing pipeline
1. Ingestion and parsing
Ingestion is the most underestimated stage. Whatever quality problems exist in your source corpus get baked into the index, and "garbage in, garbage out" is not a metaphor here — it is a precise description of how retrieval will behave.
Concretely, you need a parser per source type. PDFs are the hardest case and deserve serious attention:
- Text-native PDFs: a layout-aware parser (Unstructured, PyMuPDF with layout heuristics, or a commercial extraction service) that preserves reading order, headings, lists, and tables.
- Scanned PDFs: OCR with a layout model — Tesseract is rarely good enough for production. Use a vision-language model or a dedicated document understanding service for anything where accuracy matters.
- HTML: strip navigation, headers, footers, and sidebars before extracting body content. Readability-style extractors get you most of the way; site-specific selectors handle the rest.
- Office documents: use the structured representation directly (docx, xlsx) rather than converting through PDF and re-parsing.
- Tables: preserve them as structured data where possible. Flattening a table into prose loses the column-row relationships that make the data useful.
Attach metadata at ingestion time: source URI, document title, section path, author or owner, published date, last-modified date, version, status (current, draft, superseded), and any authority signals available. Every one of these is a filter or a ranking input later. You cannot retrofit metadata you did not capture.
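As a minimal sketch of what "capture it at ingestion" can look like, the record below carries the fields listed above. The class name, field names, and the `is_retrievable` filter are illustrative assumptions, not a required schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DocumentMetadata:
    """Metadata captured once at ingestion. Field names are illustrative."""
    source_uri: str
    title: str
    section_path: tuple[str, ...] = ()
    owner: Optional[str] = None
    published: Optional[date] = None
    last_modified: Optional[date] = None
    version: Optional[str] = None
    status: str = "current"  # one of: current, draft, superseded

    def __post_init__(self) -> None:
        # Reject unknown statuses early so bad values never reach the index.
        allowed = {"current", "draft", "superseded"}
        if self.status not in allowed:
            raise ValueError(f"status must be one of {sorted(allowed)}")


def is_retrievable(meta: DocumentMetadata) -> bool:
    """Example filter: only current documents are eligible for retrieval."""
    return meta.status == "current"


# Hypothetical document, for illustration only.
meta = DocumentMetadata(
    source_uri="https://example.com/handbook/leave-policy",
    title="Leave Policy",
    section_path=("HR", "Policies", "Leave"),
    last_modified=date(2025, 3, 1),
    version="v4",
)
```

Validating the status value at construction time is one way to keep later filters trustworthy; a real system would likely persist this record alongside every chunk derived from the document.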
2. Chunking
Chunking is where a surprising number of production accuracy issues are quietly decided.
The defaults you see in tutorials — 1000-token fixed-size chunks with 200-token overlap — are a starting point, not an answer. The right strategy depends on document structure:
- Structured documents with clear hierarchy (manuals, policies, wikis): split on headings first, then sub-split sections that exceed a max token budget. Carry the heading path as metadata so the retrieved chunk knows where it came from.
- Long-form prose (articles, reports): semantic chunking — group sentences that cluster together in embedding space — outperforms fixed-size splits, at the cost of more compute at index time.
- Code and config: split on syntactic boundaries (functions, classes, top-level blocks), not lines or tokens.
- Conversations and transcripts: split on speaker turns or topic shifts, with a window of preceding context attached.
- Tables: represent each row as a chunk with the column headers prepended, plus a separate chunk for the table-level summary.
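The heading-first strategy for structured documents can be sketched as follows. This assumes the parser emits Markdown-style `#` headings and approximates token counts by whitespace-separated words; both are simplifications, and `chunk_by_headings` is a hypothetical helper rather than a library function.

```python
import re


def chunk_by_headings(text: str, max_tokens: int = 400) -> list[dict]:
    """Split on '#'-style headings, then sub-split oversized sections.

    Token counts are approximated by whitespace-separated words.
    """
    chunks: list[dict] = []
    path: list[str] = []  # current heading path, e.g. ["Guide", "Install"]
    body: list[str] = []  # words accumulated under the current heading

    def flush() -> None:
        # Emit the accumulated section in windows of at most max_tokens words.
        for start in range(0, len(body), max_tokens):
            chunks.append({
                "heading_path": " > ".join(path),
                "text": " ".join(body[start:start + max_tokens]),
            })
        body.clear()

    for line in text.splitlines():
        match = re.match(r"^(#+)\s+(.*)$", line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.extend(line.split())
    flush()
    return chunks
```

Each chunk carries its heading path as metadata, so a retrieved passage can report where in the document it came from.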
Two patterns that consistently help, regardless of strategy:
Parent-child chunking. Index small chunks (200–400 tokens) for retrieval precision, but return the larger parent chunk (1500–3000 tokens) to the model for generation context. The model sees enough surrounding text to interpret the passage correctly.
Contextual chunk headers. Prepend each chunk with a short generated description of what document and section it is from, so the embedding captures topical context the bare chunk would not. This is essentially the technique Anthropic published as "contextual retrieval" — the gains are real.
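A rough sketch of the parent-child pattern: small child chunks are what gets indexed, and ranked child hits are mapped back to their parents before generation. The function names, the word-based splitting, and the `parent_id#n` ID scheme are assumptions made for illustration.

```python
def build_parent_child(
    parents: dict[str, str], child_words: int = 50
) -> tuple[dict[str, str], dict[str, str]]:
    """Split each parent into small children; return (children, child_to_parent)."""
    children: dict[str, str] = {}
    child_to_parent: dict[str, str] = {}
    for parent_id, text in parents.items():
        words = text.split()
        for i, start in enumerate(range(0, len(words), child_words)):
            child_id = f"{parent_id}#{i}"
            children[child_id] = " ".join(words[start:start + child_words])
            child_to_parent[child_id] = parent_id
    return children, child_to_parent


def expand_to_parents(
    ranked_child_ids: list[str],
    child_to_parent: dict[str, str],
    parents: dict[str, str],
    limit: int = 3,
) -> list[str]:
    """Map ranked child hits to unique parent texts, preserving rank order."""
    seen: list[str] = []
    for child_id in ranked_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.append(parent_id)
        if len(seen) == limit:
            break
    return [parents[parent_id] for parent_id in seen]
```

Deduplicating parents matters here: several children of the same parent often rank together, and sending the same parent twice wastes context budget.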
3. Embeddings
Embedding model selection matters less than people think, up to a point — then it matters a lot.
For most enterprise corpora, the top embedding models from the major providers (OpenAI text-embedding-3-large, Voyage voyage-3, Cohere embed-v3, or a strong open model like bge-large) are interchangeable to within a few points of retrieval accuracy. Pick one based on cost, latency, and whether you need on-prem.
The decisions that actually move the needle:
- Dimensionality. Higher-dimensional embeddings tend to be more accurate but cost more in storage and search time. Some models (text-embedding-3-large, voyage-3-large) support Matryoshka-style truncation: you can store a shorter vector, for example 1024 dimensions instead of the full size, with a small accuracy hit. Use it.
- Domain. For specialized domains — legal, biomedical, code — a domain-tuned embedding model can outperform a generalist by a meaningful margin. Evaluate before assuming.
- Symmetric vs. asymmetric. Some embedding models distinguish between query-side and passage-side encoding. Use the right side at the right stage; mixing them quietly degrades retrieval.
- Versioning. Pin the embedding model version and treat upgrades as a re-indexing event. Mixing embedding versions in the same vector store is a quiet correctness problem.
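For the dimensionality point, client-side truncation can be sketched as below. This only makes sense for models trained so that a prefix of the vector is itself a usable embedding; for other models the truncated vector is not meaningful. `truncate_embedding` is a hypothetical helper, and some provider APIs do the equivalent server-side.

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]


# Toy 3-dimensional vector, truncated to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Re-normalizing after truncation keeps cosine and dot-product scores comparable across stored vectors. Queries must be truncated to the same length as the indexed passages.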
4. Vector store and lexical index
Run two indices, not one.
A vector index for dense semantic retrieval. Options worth considering:
- pgvector on Postgres — best choice when your team already runs Postgres and the corpus fits comfortably (millions of vectors, not hundreds of millions). Transactional, joinable, operationally familiar.
- Qdrant, Weaviate, Milvus — purpose-built vector databases. Use when scale or feature requirements (named vectors, payload filtering, hybrid search built in) exceed what pgvector handles cleanly.
- OpenSearch / Elasticsearch with vector support — strong choice when you already run one and want a single system for lexical and vector retrieval.
- FAISS — embed it as a library when you need maximum control and minimal operational surface. Not a database; you handle persistence and updates yourself.
A lexical index for BM25 or similar. OpenSearch, Elasticsearch, or a Postgres full-text index will all do. Skipping this is the single most common architectural mistake in stalled RAG systems.
Index design choices that matter in production:
- Metadata filters as first-class. Most queries should be filtered before similarity search — by tenant, document type, date range, status. Pre-filtering is faster and more accurate than post-filtering a similarity search result set.
- Index parameters. HNSW with sensible ef_construction and M values is the default. Tune ef_search per query for the recall/latency tradeoff you want.
- Reindex strategy. Have one. Either a blue-green index swap on full reindex, or an incremental update path with deletion handling. Discovering you cannot reindex without downtime is unpleasant.
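To make the pre-filtering idea concrete, here is a brute-force sketch that applies metadata filters before scoring. A real vector store does this inside the index rather than in a Python loop; the record layout (`id`, `vec`, `meta`) and function names are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: list[float],
    records: list[dict],
    filters: dict[str, str],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Apply metadata filters first, then rank the survivors by cosine similarity."""
    eligible = [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in filters.items())
    ]
    scored = [(r["id"], cosine(query_vec, r["vec"])) for r in eligible]
    return sorted(scored, key=lambda pair: -pair[1])[:top_k]


# Toy records for two tenants.
records = [
    {"id": "a1", "vec": [1.0, 0.0], "meta": {"tenant": "a"}},
    {"id": "b1", "vec": [1.0, 0.0], "meta": {"tenant": "b"}},
    {"id": "a2", "vec": [0.0, 1.0], "meta": {"tenant": "a"}},
]
hits = filtered_search([1.0, 0.0], records, {"tenant": "a"}, top_k=2)
```

With post-filtering, tenant b's equally similar vector would occupy a slot in the top-k and then be discarded, which is how filtered result sets end up short.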
Part 2: The query pipeline
5. Query processing
Do not pass the raw user query to retrieval. It is almost always the wrong query.
Three transformations are usually worth running, often in combination:
Query rewriting. A small, fast model (Haiku-class) expands the query to be more retrieval-friendly — resolving pronouns, expanding acronyms, adding implicit context.
system: Rewrite the user query so it is self-contained and retrieval-ready.
Preserve entities and intent. Output the rewritten query only.
user: what about Q3?
[conversation context: ... discussing 2025 revenue ...]
output: What were the Q3 2025 revenue figures and key drivers?
Multi-query generation. Generate three to five reformulations of the query, retrieve against each, and union the results before reranking. Improves recall for queries that can be phrased many ways.
Decomposition. For multi-hop questions, generate sub-queries, retrieve for each, and pass all results into reranking. A question like "How does our Q3 revenue compare to the guidance we gave in Q2?" decomposes into "Q3 revenue actuals" and "Q2 forward guidance."
These transformations cost a small-model call per query — typically under 200ms and a fraction of a cent. The accuracy gains on real enterprise traffic are large.
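The retrieval side of multi-query generation and decomposition can be sketched as a union over reformulations. In practice the reformulations would come from the small rewriting model; here they are hard-coded, and `toy_retrieve` with its `TOY_INDEX` is a stand-in for a real retriever.

```python
from typing import Callable


def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
    per_query_k: int = 50,
) -> list[str]:
    """Retrieve for each reformulation and union the results in first-seen order."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for query in queries:
        for doc_id in retrieve(query)[:per_query_k]:
            seen.setdefault(doc_id, None)
    return list(seen)


# Toy lookup table standing in for a real retriever.
TOY_INDEX = {
    "q3 revenue actuals": ["doc-earnings-q3", "doc-board-deck"],
    "q2 forward guidance": ["doc-guidance-q2", "doc-board-deck"],
}


def toy_retrieve(query: str) -> list[str]:
    return TOY_INDEX.get(query.lower(), [])


candidates = multi_query_retrieve(
    ["Q3 revenue actuals", "Q2 forward guidance"], toy_retrieve
)
```

The unioned candidates then go to reranking against the original user question, so a passage that only matched one sub-query still competes on overall relevance.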
6. Hybrid retrieval
Run dense and lexical retrieval in parallel, then fuse the results.
The standard fusion technique is Reciprocal Rank Fusion (RRF):
    def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
        # Each ranking is a list of doc IDs ordered best-first.
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                # rank is 0-based, so the top hit contributes 1 / (k + 1).
                scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
        # Highest fused score first.
        return sorted(scores.items(), key=lambda x: -x[1])

RRF is rank-based, not score-based, which sidesteps the calibration problem between dense and lexical scores. It is hard to beat without learning a fusion model on your own data.
Retrieve broadly at this stage — 50 to 100 candidates per retriever — and let the reranker do the precision work.
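Putting the two retrievers and the fusion step together might look like the sketch below. The `dense` and `lexical` callables are placeholders for your actual vector and BM25 clients, and the thread pool assumes those clients are I/O-bound and thread-safe.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, k) -> doc IDs, best first


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion over several best-first rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])


def hybrid_retrieve(
    query: str, dense: Retriever, lexical: Retriever, per_retriever_k: int = 100
) -> list[tuple[str, float]]:
    """Run both retrievers concurrently, then fuse their rankings with RRF."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense, query, per_retriever_k)
        lexical_future = pool.submit(lexical, query, per_retriever_k)
        rankings = [dense_future.result(), lexical_future.result()]
    return rrf(rankings)


# Toy retrievers with fixed results, for illustration only.
def toy_dense(query: str, k: int) -> list[str]:
    return ["a", "b", "c"][:k]


def toy_lexical(query: str, k: int) -> list[str]:
    return ["b", "d"][:k]


fused = hybrid_retrieve("example query", toy_dense, toy_lexical)
```

In the toy data, "b" appears in both rankings and therefore fuses to the top, even though neither retriever's raw scores were ever compared.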
7. Reranking
Rerankers are where most stalled RAG systems find their biggest single accuracy gain.
A cross-encoder reranker (Cohere Rerank, Voyage Rerank, bge-reranker-v2) takes the query and each candidate passage together and produces a relevance score based on actual content understanding, not vector proximity. The compute cost is real — typically 50–200ms for a few dozen candidates — but the precision improvement is usually decisive.
Practical guidance:
- Retrieve 50–100 candidates from hybrid search; rerank all of them; keep the top 5–10 for the generation context.
- Apply a relevance score threshold — if no candidate clears it, the right answer is probably "I do not have information on that," not whatever the model would confabulate from low-relevance context.
- Reranker latency stacks. Budget it deliberately and consider running it asynchronously where the UX allows.
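The threshold-then-top-k guidance can be sketched as follows. The `score` callable stands in for a cross-encoder reranker call; `toy_score` is a crude word-overlap substitute used only so the example runs, and the 0.3 threshold is an arbitrary placeholder you would calibrate against your own eval set.

```python
from typing import Callable, Optional


def rerank_with_threshold(
    query: str,
    candidates: dict[str, str],
    score: Callable[[str, str], float],
    top_k: int = 5,
    min_score: float = 0.3,
) -> Optional[list[tuple[str, float]]]:
    """Score every candidate, keep the top_k above min_score, or None if none clear it."""
    scored = [(doc_id, score(query, text)) for doc_id, text in candidates.items()]
    kept = sorted(
        (pair for pair in scored if pair[1] >= min_score),
        key=lambda pair: -pair[1],
    )[:top_k]
    return kept or None  # None signals "refuse rather than guess"


def toy_score(query: str, passage: str) -> float:
    """Fraction of query words present in the passage. NOT a real reranker."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


passages = {
    "d1": "the refund policy allows a 30 day window",
    "d2": "office parking rules",
}
kept = rerank_with_threshold("refund policy window", passages, toy_score)
missed = rerank_with_threshold("quantum computing", passages, toy_score)
```

Returning `None` instead of an empty list makes the refusal path explicit for the caller, which can then skip generation entirely.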
8. Prompt assembly and generation
The prompt you send to the generation model is the contract between the retrieval pipeline and the user-visible answer. Treat it as code.
A defensible structure:
system: You answer questions strictly from the provided sources.
Cite sources inline as [1], [2], etc.
If the sources do not contain the answer, say so explicitly.
Do not use prior knowledge beyond the sources.
user: Question: {user_query}
Sources:
[1] {source_1_title} ({source_1_date})
{source_1_passage}
[2] {source_2_title} ({source_2_date})
{source_2_passage}
...
Things worth getting right:
- Citation as a first-class output. Require the model to cite sources by index, then map back to URIs in the response. Without citations, you have no grounding signal for users or for eval.
- Source metadata in the prompt. Title, date, section — gives the model a way to prefer current sources and acknowledge conflicts.
- Refusal as a valid output. The system prompt should explicitly authorize "I do not have information on that." A model that always answers is a model that hallucinates when retrieval misses.
- Context budget management. Even with large context windows, more is not better. Past a certain point, additional context degrades answer quality (the "lost in the middle" effect). Five well-ranked passages usually beat twenty mediocre ones.
- Prompt caching. The system prompt and any static instructions should be cached. On Anthropic and OpenAI APIs, this is a meaningful cost reduction at production scale.
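A minimal sketch of prompt assembly with citation mapping, following the structure above. The source dictionary keys (`title`, `date`, `passage`, `uri`) and both function names are assumptions; the `[n]` citation format matches the system prompt shown earlier.

```python
import re


def build_user_message(question: str, sources: list[dict]) -> str:
    """Render the question and numbered sources in the layout shown above."""
    lines = [f"Question: {question}", "Sources:"]
    for index, source in enumerate(sources, start=1):
        lines.append(f"[{index}] {source['title']} ({source['date']})")
        lines.append(source["passage"])
    return "\n".join(lines)


def resolve_citations(answer: str, sources: list[dict]) -> list[str]:
    """Return URIs for the [n] markers found in the answer, in first-use order."""
    uris: list[str] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        index = int(match.group(1))
        # Ignore markers that do not correspond to a provided source.
        if 1 <= index <= len(sources):
            uri = sources[index - 1]["uri"]
            if uri not in uris:
                uris.append(uri)
    return uris


# Hypothetical sources and model answer, for illustration only.
sources = [
    {"title": "Q3 Earnings", "date": "2025-10-20",
     "passage": "Revenue grew year over year.", "uri": "https://example.com/q3"},
    {"title": "Q2 Guidance", "date": "2025-07-15",
     "passage": "Guidance was issued for Q3.", "uri": "https://example.com/q2"},
]
message = build_user_message("How did Q3 compare to guidance?", sources)
cited = resolve_citations("Revenue grew [1], in line with guidance [2][1].", sources)
```

An out-of-range marker such as `[7]` is silently dropped here; in production it is worth logging, since it usually indicates a fabricated citation.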
Part 3: The harness around the pipeline
9. Evaluation
Without evaluation, every change is a guess. The eval system has three layers.
Golden set. A versioned collection of queries with expected answers and expected source documents. Cover the long tail intentionally: ambiguous queries, multi-hop questions, queries with no good answer, queries where the corpus contains conflicts. A hundred examples chosen to stress the system are more useful than a thousand picked because the system already handles them.
Separated metrics. Measure retrieval and generation independently:
- Retrieval: recall@k, mean reciprocal rank, and a hit-rate on the expected source documents. If retrieval fails, generation cannot succeed.
- Generation: faithfulness (does the answer follow from the cited sources), correctness (is the answer right), and citation accuracy (do the citations actually support the claims). LLM-as-judge with periodic human calibration is the standard approach.
Regression runs on every change. The eval harness should run on every prompt change, model swap, chunking adjustment, or reranker update. CI-style integration is the right pattern. A change that improves the overall score but regresses a specific query class is worth knowing about before deployment.
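The retrieval metrics and the per-class breakdown can be sketched in a few lines. The golden-set layout (`query`, `expected_sources`, `query_class`) is an assumed schema, and `toy_retrieve` stands in for the real query pipeline.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable


def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected sources that appear in the top k results."""
    if not expected:
        return 0.0
    return len(set(retrieved[:k]) & expected) / len(expected)


def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    """1 / rank of the first expected source, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    golden: list[dict], retrieve: Callable[[str], list[str]], k: int = 5
) -> dict[str, dict[str, float]]:
    """Mean recall@k and MRR, overall and per query class."""
    rows = defaultdict(list)
    for case in golden:
        retrieved = retrieve(case["query"])
        row = (
            recall_at_k(retrieved, case["expected_sources"], k),
            reciprocal_rank(retrieved, case["expected_sources"]),
        )
        rows["overall"].append(row)
        rows[case["query_class"]].append(row)
    return {
        name: {"recall@k": mean(r for r, _ in vals), "mrr": mean(m for _, m in vals)}
        for name, vals in rows.items()
    }


# Toy golden set and retriever, for illustration only.
golden = [
    {"query": "a", "expected_sources": {"d1"}, "query_class": "lookup"},
    {"query": "b", "expected_sources": {"d3", "d4"}, "query_class": "multi-hop"},
]
toy_results = {"a": ["d2", "d1"], "b": ["d3", "d9"]}


def toy_retrieve(query: str) -> list[str]:
    return toy_results[query]


report = evaluate_retrieval(golden, toy_retrieve)
```

Reporting per class alongside the overall mean is what surfaces the case described above: an overall improvement that hides a regression on, say, multi-hop queries.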
10. Observability and feedback
Production telemetry is the part of the system that tells you which questions you failed to anticipate.
The minimum useful instrumentation per query:
- The raw query, the rewritten query, and any sub-queries
- The retrieved candidates with scores from each retriever
- The reranked top-k passed to generation
- The final prompt, the generated answer, and the citations
- Latency per stage
- User feedback signals — thumbs, follow-up queries, abandonment
Sample and review systematically. The queries that produce low-confidence answers, low reranker scores, or thumbs-down feedback are where the next round of golden-set additions and corpus improvements come from. A RAG system without this feedback loop drifts.
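One way to capture the per-query instrumentation listed above is a single trace record that each stage fills in. The `QueryTrace` class and its field names are illustrative assumptions; a real deployment would likely ship the JSON to a logging or tracing backend.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class QueryTrace:
    """One record per request, filled in as the query moves through the pipeline."""
    raw_query: str
    rewritten_query: str = ""
    sub_queries: list[str] = field(default_factory=list)
    # Retriever name -> [(doc_id, score), ...]
    candidates: dict[str, list[tuple[str, float]]] = field(default_factory=dict)
    reranked: list[tuple[str, float]] = field(default_factory=list)
    prompt: str = ""
    answer: str = ""
    citations: list[str] = field(default_factory=list)
    latency_ms: dict[str, float] = field(default_factory=dict)
    feedback: Optional[str] = None  # e.g. "thumbs_up", "thumbs_down"

    @contextmanager
    def stage(self, name: str):
        """Time a pipeline stage and record its latency in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latency_ms[name] = (time.perf_counter() - start) * 1000.0

    def to_json(self) -> str:
        return json.dumps(asdict(self))


trace = QueryTrace(raw_query="what about Q3?")
with trace.stage("rewrite"):
    trace.rewritten_query = "What were the Q3 2025 revenue figures and key drivers?"
trace.candidates["dense"] = [("doc-earnings-q3", 0.82)]
record = json.loads(trace.to_json())
```

Recording latency in a `finally` block means a stage that raises still gets a timing entry, which helps when diagnosing timeouts.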
11. Cost, latency, and caching
A few patterns that consistently pay off:
- Embedding cache. Cache embeddings by content hash. Re-embedding identical chunks during reindex is pure waste.
- Query embedding cache. For high-traffic applications, cache query embeddings keyed by normalized query string. The hit rate on real traffic is higher than you would guess.
- Result cache. For deterministic queries (FAQs, common lookups), cache the full response with a short TTL. Invalidate on corpus updates.
- Model tiering. Use a smaller model for query rewriting and decomposition; reserve the largest model for generation. The cost difference compounds at scale.
- Streaming. Stream tokens to the user as they are generated. Perceived latency tracks time to first token rather than total generation time, so the response feels much faster even when total latency is unchanged.
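The embedding cache is the simplest of these to sketch. The version below keys on a hash of the model version plus the chunk text, so a model upgrade naturally misses the cache. The in-memory dictionary and the `fake_embed` function are stand-ins; a real system would use a persistent store and an actual embedding client.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings by content hash, scoped to one embedding model version."""

    def __init__(self, embed: Callable[[str], list[float]], model_version: str):
        self._embed = embed
        self._model_version = model_version
        self._store: dict[str, list[float]] = {}  # in-memory for illustration
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model version so vectors from different models never mix.
        payload = f"{self._model_version}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Placeholder for a real embedding call."""
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, model_version="example-model-v1")
first = cache.get("unchanged chunk")
second = cache.get("unchanged chunk")  # served from cache, no second embed call
```

The same structure works for the query embedding cache if the key is computed over a normalized query string instead of the raw chunk text.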
What to build vs. what to buy
Frameworks like LangChain and LlamaIndex are useful for prototyping and not always the right primitive for production. The abstractions that make a demo fast often hide the parameters you most need to tune.
A defensible split:
- Use a library for: document parsers (especially PDFs), embedding model SDKs, vector store clients, reranker APIs. These are well-defined integrations where reinvention is wasteful.
- Own the code for: chunking, query rewriting, retrieval orchestration, prompt assembly, and the eval harness. These are where your application's behavior actually lives. Burying them inside a framework abstraction makes them hard to evolve and harder to debug.
The same logic applies to managed RAG services. They are reasonable for a low-stakes internal tool. For a system you intend to operate seriously, the abstraction usually costs you more than it saves once you need to tune anything non-trivial.
Closing
A RAG pipeline that works in production is not a single clever model call. It is a dozen unremarkable components, each tuned to its job, with a measurement harness that tells you when one of them regresses.
The teams that ship reliable RAG systems do so by treating retrieval as a first-class engineering surface, owning the components that determine answer quality, and refusing to operate without evaluation. There are no clever shortcuts to that posture, and the systems that try to skip it tend to plateau in the same place.
Build the pipeline deliberately. Measure it. Iterate. The accuracy gains are there for teams that do the work.
For the strategic framing of why these systems plateau: Why Most Enterprise RAG Projects Stall at 70% Accuracy — And What Actually Fixes It.