Why Most Enterprise RAG Projects Stall at 70% Accuracy — And What Actually Fixes It

Almost every enterprise RAG project hits the same wall.

The first prototype answers seven out of ten questions correctly. Stakeholders are impressed. A roadmap gets drawn up. Six months later, the system still answers seven out of ten questions correctly — but now leadership wants nine, the model has been swapped twice, and no one can explain why the remaining three failures persist.

The instinct at this point is to assume the model is the limiting factor. It rarely is.

The vast majority of enterprise RAG systems are not constrained by the language model. They are constrained by the retrieval layer, the corpus, and the query path, and by the absence of a measurement discipline that would reveal which of those three is actually failing.

What follows is a practical view of why these systems stall — and what actually moves them past the plateau.

1. The diagnostic mistake: blaming the model

When a RAG system gives a wrong answer, the visible artifact is the model output. The natural conclusion is that the model is the problem. So teams swap models, upgrade context windows, fine-tune, or wait for the next generation of a frontier model.

In production RAG systems, the failure almost always occurs upstream of the model. Either the right document was not retrieved, the right passage within the document was not surfaced, or the retrieved context was noisy enough that the model had no reliable basis for an answer.

A useful test: take the failing queries and manually inject the correct source passages into the prompt. If the model now answers correctly, the model is fine and the retrieval pipeline is the bottleneck. In most engagements I have seen, the majority of failing queries pass this test without any change to the model.

Until a team runs this test, every model swap is guessing.
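
A minimal sketch of that test in Python, under stated assumptions: `generate` stands in for whatever function calls your model, `FailingCase` is a hypothetical record type, and correctness is approximated by checking for an expected substring. A real harness would likely use a stricter grader.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FailingCase:
    query: str
    gold_passages: List[str]  # passages a human confirmed contain the answer
    expected: str             # substring a correct answer must contain (assumption)


def diagnose(cases: List[FailingCase],
             generate: Callable[[str], str]) -> Dict[str, List[str]]:
    """Re-run each failing query with the gold passages injected into the prompt.

    If the answer becomes correct, the original failure is attributed to
    retrieval; if it is still wrong, it is attributed to generation.
    """
    buckets: Dict[str, List[str]] = {"retrieval": [], "generation": []}
    for case in cases:
        context = "\n\n".join(case.gold_passages)
        prompt = (
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {case.query}"
        )
        answer = generate(prompt)
        if case.expected.lower() in answer.lower():
            buckets["retrieval"].append(case.query)
        else:
            buckets["generation"].append(case.query)
    return buckets
```

The ratio between the two buckets is the useful output: a large `retrieval` bucket says the next model swap is unlikely to help.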

2. Chunking is upstream of everything

Chunking decisions made in the first week of a RAG project quietly determine the ceiling of the system months later. Teams tend to pick a chunk size, ship it, and never revisit the decision — even when it is the dominant source of retrieval failure.

The problems chunking creates are not subtle:

Fixed-size chunks cut across semantic boundaries. A definition lives in chunk 14 and the qualifier that inverts its meaning lives in chunk 15. The model sees only the definition and confidently gives the wrong answer.
Too-small chunks retrieve cleanly but lack the surrounding context the model needs to interpret them. The embeddings match, but the answers are still wrong.
Too-large chunks dilute the embedding signal. A relevant sentence buried in 2,000 tokens of unrelated material is unlikely to be retrieved.
Tables, lists, and structured content get mangled by naive splitters. A pricing table broken across two chunks is worse than no pricing table at all.
Headings and document hierarchy are usually stripped, which removes the strongest signal a human reviewer would have used to judge relevance.

The fix is not a single chunk size. It is a chunking strategy that respects document structure — splitting on semantic boundaries, preserving hierarchy as metadata, handling tables and code blocks distinctly, and attaching neighboring context to retrieved chunks at query time so the model sees the passage and what surrounds it.
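
As a rough illustration of two of those ideas (splitting on structure and attaching neighbors at query time), here is a sketch that assumes the source documents use markdown-style `#` headings. The `Chunk` type and both function names are hypothetical, and table and code-block handling is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    index: int
    heading_path: List[str]  # document hierarchy preserved as metadata
    text: str


def chunk_by_headings(doc: str) -> List[Chunk]:
    """Split on headings so each chunk is one section body plus its heading path."""
    chunks: List[Chunk] = []
    path: List[str] = []
    buffer: List[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(len(chunks), list(path), body))
        buffer.clear()

    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks


def with_neighbors(chunks: List[Chunk], hit_index: int, window: int = 1) -> str:
    """Return the retrieved chunk together with `window` chunks on each side."""
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return "\n\n".join(
        f"[{' > '.join(c.heading_path)}]\n{c.text}" for c in chunks[lo:hi]
    )
```

With neighbor expansion, a hit on the definition in one section also carries the qualifier from the next section into the prompt, which is the failure case described above.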

This is unglamorous work. It is also where the largest accuracy gains in stalled RAG systems usually come from.

3. Embeddings alone are not enough

Dense vector search is the default retrieval mechanism in most RAG implementations, and it has a well-known failure mode: it retrieves documents that are semantically similar to the query, which is not always the same thing as documents that answer the query.

Three patterns reliably fail pure embedding-based retrieval:

Exact-match queries. A user asks about "error code E-4471." Embedding similarity may surface documents about error handling generally, while the document that names E-4471 ranks tenth.
Acronyms and proper nouns. Embedding models have often seen few or no examples of a domain-specific acronym, so it contributes little to the query vector. The document containing the acronym is not retrieved; the document discussing the surrounding topic is.
Negation and constraints. "Which products are not covered under the standard warranty" tends to retrieve documents about coverage, not exclusions — because the embedding is dominated by the topical content, not the constraint.

Two changes consistently address this:

Hybrid search. Combine dense vector retrieval with lexical search (BM25 or similar) and merge results with reciprocal rank fusion or a learned ranker. This catches both semantic and exact-match cases. The cost is modest; the accuracy gain on enterprise corpora is consistently meaningful.

Reranking. Retrieve a wider initial candidate set (50 to 100 documents instead of 5) and apply a cross-encoder reranker to score each candidate against the query. Rerankers are slower than embedding similarity because a cross-encoder reads the query and each passage together, but that joint reading lets it score how well the passage answers the query rather than how close two independently computed vectors are. For many enterprise RAG systems, adding a reranker delivers a larger accuracy gain than any change to the generation model.
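
The two changes above can be sketched together. Reciprocal rank fusion is shown with the commonly used constant k = 60; the reranking step takes a `score_pair` callable, which is a hypothetical stand-in for a cross-encoder and not any particular library's API. Document IDs are illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str,
           candidates: List[str],
           texts: Dict[str, str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> List[str]:
    """Re-order a wide candidate set with a query-passage scorer, keep the best."""
    ordered = sorted(candidates,
                     key=lambda d: score_pair(query, texts[d]),
                     reverse=True)
    return ordered[:top_n]


# Illustrative rankings: dense search buries the exact-match document,
# lexical search puts it first.
dense = ["err-general", "err-retry", "e4471"]
lexical = ["e4471", "err-general"]
fused = reciprocal_rank_fusion([dense, lexical])
```

In this toy case fusion lifts the exact-match document from third to second; a scorer that reads the query and passage together is what would move it to first.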

4. Query understanding: what the user typed vs. what they meant

Enterprise users do not write queries the way search engines expect them to.

They write fragments. They write multi-part questions. They write follow-ups that reference prior context implicitly. They write queries that assume domain knowledge the retrieval system does not have access to. And they write queries that are answerable only by combining information from multiple documents — which a single retrieval pass cannot satisfy.

The interventions that help:

Query rewriting. Use the model itself to expand, clarify, or reformulate the user's query before retrieval. A vague "what about Q3?" becomes "What were the Q3 2025 revenue figures and key variance drivers?" — which has a much higher chance of retrieving the right document.
Multi-query retrieval. Generate several reformulations of the query, retrieve against each, and union the results. This improves recall for queries that could be phrased many ways.
Decomposition for multi-hop questions. Some queries require two or three pieces of information that live in different documents. Detect these, decompose into sub-queries, retrieve for each, and synthesize. Single-pass retrieval will never answer them reliably.
Conversation-aware retrieval. In multi-turn applications, resolve references against the prior turns before retrieval. "Show me their renewal terms" is unanswerable without knowing who they refers to.
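
A sketch of the multi-query pattern from the list above. Both callables are assumptions: `rewrite` stands in for a model call that returns reformulations of the query, and `search` for whatever retrieval function the system already has.

```python
from typing import Callable, List


def multi_query_retrieve(query: str,
                         rewrite: Callable[[str], List[str]],
                         search: Callable[[str], List[str]],
                         per_query_k: int = 5) -> List[str]:
    """Retrieve for the original query and each rewrite, then union the results.

    Order of first appearance is preserved so a downstream reranker
    receives a stable candidate list.
    """
    seen = set()
    merged: List[str] = []
    for variant in [query, *rewrite(query)]:
        for doc_id in search(variant)[:per_query_k]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

The same shape covers decomposition: if `rewrite` returns sub-questions instead of paraphrases, the union holds the evidence for each hop. Keeping the original query in the variant list is a deliberate choice, so a bad rewrite cannot make recall worse than the verbatim baseline.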

Most stalled RAG systems treat the user query as the retrieval query verbatim. That is a default, not a design decision — and it is the one that fails most often on real enterprise traffic.

5. The corpus is dirtier than you think

Every RAG project begins with an assumption that the source corpus is a reliable ground truth. Every RAG project that reaches production discovers it is not.

Real enterprise corpora contain:

Multiple versions of the same document, with no clear signal which is current
Drafts that look authoritative but never got approved
Contradictory information across departments — sales collateral that conflicts with policy documents that conflict with engineering specs
Stale content that is technically still in the source system but has been superseded
Documents written for an audience the user is not part of, with assumptions baked in that lead the model astray

A RAG system retrieves the most relevant document, not the most correct one. If the corpus contains three versions of a policy, two of them are out of date, and all three look equally relevant to the retriever, the model has roughly a two-in-three chance of grounding its answer in stale information. It will sound completely confident doing it.

The discipline that addresses this is not glamorous. It is corpus curation: an explicit pipeline that decides what is ingested, deduplicates near-duplicates, attaches authority signals (date, owner, status), filters or down-weights drafts and superseded versions, and gives the retrieval layer something to do besides treat every document as equally valid.
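
One way such a pipeline might look at its simplest, assuming each document already carries authority metadata. The `Doc` fields, the status values, and `policy_key` (an identifier shared by all versions of the same document) are hypothetical; near-duplicate detection is not covered here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Doc:
    doc_id: str
    policy_key: str  # hypothetical: shared by every version of one document
    status: str      # assumed values: "approved", "draft", "superseded"
    updated: date
    owner: str
    text: str


def curate(docs: List[Doc]) -> List[Doc]:
    """Keep only the most recently updated approved version per policy_key."""
    latest: Dict[str, Doc] = {}
    for doc in docs:
        if doc.status != "approved":
            continue  # drafts and superseded versions never reach the index
        current = latest.get(doc.policy_key)
        if current is None or doc.updated > current.updated:
            latest[doc.policy_key] = doc
    return sorted(latest.values(), key=lambda d: d.doc_id)
```

This sketch filters outright. A softer variant would keep older versions in the index and pass `status` and `updated` to the ranker as down-weighting signals.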

Teams that skip this step end up with a system that answers fluently from a polluted ground truth — which is worse than a system that admits it does not know.

6. You cannot fix what you do not measure

The single most predictive signal of whether a RAG project will reach production reliability is whether the team has a real evaluation system. Not "we spot-check the outputs." Not "the demo questions still work." A measurement discipline.

The components of one that actually works:

A golden set. A curated collection of representative queries with known correct answers and known relevant source documents. Built deliberately to cover the long tail of question types, not just the obvious ones. Versioned and reviewed.

Retrieval metrics separated from generation metrics. Measure whether the right documents were retrieved (recall at k, mean reciprocal rank) independently from whether the model's answer was correct. Without this separation, you cannot tell whether a failure is a retrieval problem or a generation problem — which means you cannot fix it.

LLM-as-judge with calibration. Use a strong model to grade outputs against the golden set, but calibrate it against human judgment regularly. Uncalibrated LLM judges drift, and a high score against a drifted judge is not the same thing as a high-quality system.

Production telemetry, not just offline eval. What questions are users actually asking? What is the distribution of query types? Which queries result in low-confidence answers, retries, or thumbs-down feedback? Offline eval tells you how the system performs on questions you anticipated. Production telemetry tells you which questions you failed to anticipate.

Regression testing on every change. Prompt changes, model swaps, chunking adjustments, reranker updates — every one of them should run against the golden set before deployment. Without this, improvements in one area silently break another, and no one notices until a user does.
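
The retrieval metrics and the regression check described above fit in a few lines. This is a sketch under assumptions: the golden set is a mapping from query to the set of relevant document IDs, a run is a mapping from query to the ranked IDs the system returned, and the `tolerance` default is illustrative.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: Dict[str, Set[str]],
             run: Dict[str, List[str]],
             k: int = 5) -> Dict[str, float]:
    """Average both retrieval metrics over every query in the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    n = len(golden)
    return {
        "recall_at_k": sum(recall_at_k(run.get(q, []), rel, k)
                           for q, rel in golden.items()) / n,
        "mrr": sum(reciprocal_rank(run.get(q, []), rel)
                   for q, rel in golden.items()) / n,
    }


def regressions(baseline: Dict[str, float],
                candidate: Dict[str, float],
                tolerance: float = 0.01) -> List[str]:
    """Names of metrics that dropped by more than `tolerance` versus baseline."""
    return [m for m in baseline if candidate[m] < baseline[m] - tolerance]
```

Wired into a deployment check, a non-empty result from `regressions` blocks the change. Generation quality would be scored separately against the same golden set, which is what keeps the two failure types distinguishable.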

Teams without this measurement layer are not really iterating. They are guessing in a loop. Some guesses help; some hurt; over time, the system drifts in a direction no one can characterize.
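
Calibrating an LLM judge against human judgment can start as simply as comparing pass/fail labels on a shared sample. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function name is hypothetical, and any threshold for "calibrated enough" is left to the team.

```python
from typing import List, Tuple


def judge_agreement(human: List[bool], judge: List[bool]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for paired pass/fail labels."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n  # share of items the humans marked as passing
    p_judge = sum(judge) / n  # share of items the judge marked as passing
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Kappa is undefined when both raters use a single label throughout;
        # this sketch reports 0.0 in that case.
        return observed, 0.0
    return observed, (observed - expected) / (1 - expected)
```

Re-running this on a fresh human-labeled sample at a regular interval is one way to notice judge drift before it distorts the scores.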

7. The ownership problem

RAG systems sit at an awkward organizational seam. The retrieval layer looks like a search problem and belongs naturally to a data or platform team. The generation layer looks like an ML problem and belongs to the AI team. The corpus belongs to whichever business function owns the source documents — usually several of them, none of whom signed up to maintain a retrieval-quality dataset.

The predictable result is that no one owns end-to-end accuracy. The AI team improves the model and is surprised that accuracy does not move. The platform team optimizes retrieval latency and considers their job done. The document owners keep writing for human readers, which is what they have always done.

Closing this gap requires explicit ownership of the system as a whole — someone accountable for the user-visible answer quality, with authority to make changes across retrieval, generation, and corpus curation. Without that role, every team optimizes its own surface and the system stays stuck at the plateau.

What moving past the plateau actually looks like

RAG systems that reach production reliability share a profile.

They treat retrieval as the primary engineering surface, not an afterthought. They use hybrid search and rerankers as a baseline, not a stretch goal. They invest in chunking and corpus curation before they invest in fine-tuning. They rewrite or decompose queries before retrieval rather than hoping the user phrased things well. They measure retrieval and generation separately. They run regression tests on every change. And they have a named owner accountable for end-to-end answer quality.

None of this is novel. None of it requires frontier model capability. It is engineering discipline applied to a system whose failure modes are well understood but rarely addressed in order.

The teams stuck at 70% are not stuck because the technology is not ready. They are stuck because they are still treating the model as the variable to tune, when the model is one of the few things in the pipeline that is already working.

Moving past the plateau is mostly a matter of looking in the right place.

For engineers building the system: How to Build a Production RAG Pipeline — A Technical Walkthrough.

If your RAG system has plateaued and you need an honest read on where the bottleneck actually is, I work with teams to diagnose retrieval and corpus issues and build the measurement discipline that makes iteration possible. Get in touch.