All Posts
June 2, 2026·14 min read

How to Build a Production RAG Pipeline — A Technical Walkthrough

A component-by-component guide to assembling a RAG pipeline that holds up under real production load — ingestion, chunking, embeddings, retrieval, reranking, generation, and the eval harness around it.

Most RAG tutorials walk you through five lines of LangChain code and call it a pipeline.

That works for a demo. It does not survive contact with production traffic, a real corpus, or anyone who expects the answers to be reliably correct.

This is a walkthrough of how to actually assemble a RAG pipeline — the components, the decisions inside each component, and the parts of the system that exist outside the happy path but determine whether the whole thing holds up.

The audience here is engineers who are going to build and operate the thing. The framing is opinionated; pick the parts that match your constraints.


The pipeline at a glance

A production RAG system is really two pipelines plus a harness around them.

The indexing pipeline runs offline and on a schedule. It ingests source documents, parses them, splits them into retrievable units, generates embeddings, and writes the result to a vector store and a lexical index.

The query pipeline runs online per request. It takes a user query, rewrites or decomposes it, retrieves candidate passages from both indices, reranks them, assembles a prompt, generates a response, and emits telemetry.

The evaluation harness wraps both. It is a golden set, a regression runner, an LLM-as-judge with calibration, and the observability layer that tells you which queries are failing in production.

Skip any of the three and the system will appear to work until it does not.

INDEXING PIPELINE
Documents → Parsing → Chunking → Embeddings → Vector + Lexical Indices

QUERY PIPELINE
User Query → Rewrite / Decompose → Hybrid Retrieval → Reranking → Generation

MEASUREMENT HARNESS
Evaluation ↔ Observability ↔ Feedback Loop

Part 1: The indexing pipeline

1. Ingestion and parsing

Ingestion is the most underestimated stage. Whatever quality problems exist in your source corpus get baked into the index, and "garbage in, garbage out" is not a metaphor here — it is a precise description of how retrieval will behave.

Concretely, you need a parser per source type. PDFs are the hardest case and deserve serious attention:

Attach metadata at ingestion time: source URI, document title, section path, author or owner, published date, last-modified date, version, status (current, draft, superseded), and any authority signals available. Every one of these is a filter or a ranking input later. You cannot retrofit metadata you did not capture.

2. Chunking

Chunking is where a surprising number of production accuracy issues are quietly decided.

The defaults you see in tutorials — 1000-token fixed-size chunks with 200-token overlap — are a starting point, not an answer. The right strategy depends on document structure:

Two patterns that consistently help, regardless of strategy:

Parent-child chunking. Index small chunks (200–400 tokens) for retrieval precision, but return the larger parent chunk (1500–3000 tokens) to the model for generation context. The model sees enough surrounding text to interpret the passage correctly.

Contextual chunk headers. Prepend each chunk with a short generated description of what document and section it is from, so the embedding captures topical context the bare chunk would not. This is essentially the technique Anthropic published as "contextual retrieval" — the gains are real.

3. Embeddings

Embedding model selection matters less than people think, up to a point — then it matters a lot.

For most enterprise corpora, the top embedding models from the major providers (OpenAI text-embedding-3-large, Voyage voyage-3, Cohere embed-v3, or a strong open model like bge-large) are interchangeable to within a few points of retrieval accuracy. Pick one based on cost, latency, and whether you need on-prem.

The decisions that actually move the needle:

4. Vector store and lexical index

Run two indices, not one.

A vector index for dense semantic retrieval. Options worth considering:

A lexical index for BM25 or similar. OpenSearch, Elasticsearch, or a Postgres full-text index will all do. Skipping this is the single most common architectural mistake in stalled RAG systems.

Index design choices that matter in production:


Part 2: The query pipeline

5. Query processing

Do not pass the raw user query to retrieval. It is almost always the wrong query.

Three transformations are usually worth running, often in combination:

Query rewriting. A small, fast model (Haiku-class) expands the query to be more retrieval-friendly — resolving pronouns, expanding acronyms, adding implicit context.

system: Rewrite the user query so it is self-contained and retrieval-ready.
       Preserve entities and intent. Output the rewritten query only.
user:  what about Q3?
       [conversation context: ... discussing 2025 revenue ...]
output: What were the Q3 2025 revenue figures and key drivers?

Multi-query generation. Generate three to five reformulations of the query, retrieve against each, and union the results before reranking. Improves recall for queries that can be phrased many ways.

Decomposition. For multi-hop questions, generate sub-queries, retrieve for each, and pass all results into reranking. A question like "How does our Q3 revenue compare to the guidance we gave in Q2?" decomposes into "Q3 revenue actuals" and "Q2 forward guidance."

These transformations cost a small-model call per query — typically under 200ms and a fraction of a cent. The accuracy gains on real enterprise traffic are large.

6. Hybrid retrieval

Run dense and lexical retrieval in parallel, then fuse the results.

The standard fusion technique is Reciprocal Rank Fusion (RRF):

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])

RRF is rank-based, not score-based, which sidesteps the calibration problem between dense and lexical scores. It is hard to beat without learning a fusion model on your own data.

Retrieve broadly at this stage — 50 to 100 candidates per retriever — and let the reranker do the precision work.

7. Reranking

Rerankers are where most stalled RAG systems find their biggest single accuracy gain.

A cross-encoder reranker (Cohere Rerank, Voyage Rerank, bge-reranker-v2) takes the query and each candidate passage together and produces a relevance score based on actual content understanding, not vector proximity. The compute cost is real — typically 50–200ms for a few dozen candidates — but the precision improvement is usually decisive.

Practical guidance:

8. Prompt assembly and generation

The prompt you send to the generation model is the contract between the retrieval pipeline and the user-visible answer. Treat it as code.

A defensible structure:

system: You answer questions strictly from the provided sources.
        Cite sources inline as [1], [2], etc.
        If the sources do not contain the answer, say so explicitly.
        Do not use prior knowledge beyond the sources.

user:   Question: {user_query}

        Sources:
        [1] {source_1_title} ({source_1_date})
        {source_1_passage}

        [2] {source_2_title} ({source_2_date})
        {source_2_passage}
        ...

Things worth getting right:


Part 3: The harness around the pipeline

9. Evaluation

Without evaluation, every change is a guess. The eval system has three layers.

Golden set. A versioned collection of queries with expected answers and expected source documents. Cover the long tail intentionally: ambiguous queries, multi-hop questions, queries with no good answer, queries where the corpus contains conflicts. A hundred well-chosen examples is more useful than a thousand cherry-picked ones.

Separated metrics. Measure retrieval and generation independently:

Regression runs on every change. The eval harness should run on every prompt change, model swap, chunking adjustment, or reranker update. CI-style integration is the right pattern. A change that improves the overall score but regresses a specific query class is worth knowing about before deployment.

10. Observability and feedback

Production telemetry is the part of the system that tells you which questions you failed to anticipate.

The minimum useful instrumentation per query:

Sample and review systematically. The queries that produce low-confidence answers, low reranker scores, or thumbs-down feedback are where the next round of golden-set additions and corpus improvements come from. A RAG system without this feedback loop drifts.

11. Cost, latency, and caching

A few patterns that consistently pay off:


What to build vs. what to buy

Frameworks like LangChain and LlamaIndex are useful for prototyping and not always the right primitive for production. The abstractions that make a demo fast often hide the parameters you most need to tune.

A defensible split:

The same logic applies to managed RAG services. They are reasonable for a low-stakes internal tool. For a system you intend to operate seriously, the abstraction usually costs you more than it saves once you need to tune anything non-trivial.


Closing

A RAG pipeline that works in production is not a single clever model call. It is a dozen unremarkable components, each tuned to its job, with a measurement harness that tells you when one of them regresses.

The teams that ship reliable RAG systems do so by treating retrieval as a first-class engineering surface, owning the components that determine answer quality, and refusing to operate without evaluation. There are no clever shortcuts to that posture, and the systems that try to skip it tend to plateau in the same place.

Build the pipeline deliberately. Measure it. Iterate. The accuracy gains are there for teams that do the work.

For the strategic framing of why these systems plateau: Why Most Enterprise RAG Projects Stall at 70% Accuracy — And What Actually Fixes It.

If you are building or operating a production RAG pipeline and want a technical review or hands-on help with retrieval, evaluation, or the surrounding infrastructure, get in touch.

Subscribe for more

Get posts on LLMOps, RAG, agentic AI, and production AI delivered to your inbox.

Subscribe on Substack