The Demo is One File. Production RAG is Nine Layers.
There’s a version of RAG you see in every tutorial: load a PDF, chunk it, embed it, query it. Fifty lines. Works great in the demo.
Then there’s the version you need when real users are hitting it at 3am, when queries are ambiguous, when your context window is getting stuffed with irrelevant chunks, when someone asks something that could get the system in legal trouble.
Those are not the same problem.
A template from Shivani Virdi has been circulating recently, showing what production RAG actually looks like. Not one file. Nine layers. Let’s break them down — and be honest about what each one is solving.
The Nine Layers
1. services/ — The Core Pipeline (Five Components, Not One)
Most tutorials put everything in a single retriever function. Production splits it into at least five distinct services:
- RAG pipeline — the core orchestration: query → retrieve → augment → generate
- Semantic cache — before you call an LLM, check if you’ve answered a semantically equivalent question before. Cuts latency and cost significantly for repeated patterns.
- Memory — conversation state across turns. Without this, every message is a fresh start and the system feels broken.
- Query rewriter — the raw user query is often not the best retrieval query. Rewrite it before you embed it.
- Router — not every query should go to RAG. Some should go to a database lookup, some to a calculator, some straight to the LLM.
The single-file version collapses all of this into one function. That’s fine until the cache has stale hits, or the router routes everything to RAG when 30% of queries don’t need it.
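To make the split concrete, here is a minimal sketch of how these services might be wired together. Every name in it (SemanticCache, route, rewrite, answer) is illustrative rather than taken from the template, and the LLM and vector-store calls are stubbed out; the point is the order of operations: check the cache, route, rewrite, then retrieve and generate.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticCache:
    """Toy cache keyed on normalized query text. A real semantic cache
    compares query embeddings against a similarity threshold instead."""
    store: dict = field(default_factory=dict)

    def get(self, query: str):
        return self.store.get(query.strip().lower())

    def put(self, query: str, answer: str) -> None:
        self.store[query.strip().lower()] = answer


def route(query: str) -> str:
    """Router: decide which backend should handle the query.
    A production router is usually a classifier or a cheap LLM call."""
    if query.lower().startswith(("how many", "sum of", "average")):
        return "database"
    return "rag"


def rewrite(query: str, history: list) -> str:
    """Query rewriter: fold conversation memory into the retrieval query
    so a follow-up like 'what about Q3?' becomes self-contained."""
    return f"{history[-1]} {query}" if history else query


def answer(query: str, history: list, cache: SemanticCache) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached                                  # semantic cache hit: skip the LLM
    if route(query) != "rag":
        return "[handled by a non-RAG backend]"        # router: not everything goes to RAG
    retrieval_query = rewrite(query, history)          # rewriter: better query before embedding
    response = f"[RAG answer for: {retrieval_query}]"  # retrieve + augment + generate here
    cache.put(query, response)
    return response
```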
2. agents/ — Self-Correcting by Design
Three components here that most systems skip:
- Document grader — after retrieval, score the chunks. Are they actually relevant to the query? If the top-k results are garbage, don’t send them to the LLM. Return “I don’t know” or reroute.
- Query decomposer — complex multi-part questions fail at retrieval. Break them into sub-questions, retrieve for each, synthesize.
- Adaptive router — this is a second routing layer at the agent level, distinct from the service-level router. The service router decides how to answer; the agent router decides whether to keep trying or give up.
Self-correcting doesn’t mean magic. It means: retrieve → grade → if bad, rewrite and retry → if still bad, escalate.
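A rough sketch of that loop, with the grader reduced to keyword overlap purely for illustration (a real grader would be an LLM or cross-encoder call), might look like this:

```python
MAX_RETRIES = 2
RELEVANCE_THRESHOLD = 0.5


def grade(query: str, chunks: list) -> float:
    """Document grader: score how relevant the retrieved chunks are.
    In practice this is a small LLM or cross-encoder call, not keyword overlap."""
    if not chunks:
        return 0.0
    terms = set(query.lower().split())
    hits = sum(any(term in chunk.lower() for term in terms) for chunk in chunks)
    return hits / len(chunks)


def self_correcting_answer(query: str, retrieve, rewrite, generate) -> str:
    """retrieve -> grade -> if bad, rewrite and retry -> if still bad, escalate."""
    retrieval_query = query
    for _ in range(MAX_RETRIES + 1):
        chunks = retrieve(retrieval_query)
        if grade(query, chunks) >= RELEVANCE_THRESHOLD:
            return generate(query, chunks)
        retrieval_query = rewrite(retrieval_query)   # try a better retrieval query
    return "I don't know."                           # escalate instead of answering badly
```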
3. prompts/ — Versioned, Typed, Registered
Hardcoded prompt strings are a maintenance disaster. When you have twenty prompts scattered across your codebase with no versioning, you have no idea which prompt caused a regression.
A proper prompts layer has:
- Version control — prompt v1 vs v2, with the ability to roll back
- Types — strongly typed inputs so you can’t accidentally pass the wrong variable
- Registry — a central catalog so you know every prompt in the system
This sounds over-engineered until the day you push a bad prompt to production and need to know exactly what changed.
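A minimal version of such a registry might look like the sketch below; the class names and the idea of keying on (name, version) are assumptions for illustration, not the template's actual API.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptSpec:
    name: str
    version: int
    template: Template
    required_vars: frozenset

    def render(self, **values: str) -> str:
        # Typed-ish inputs: fail loudly if a required variable is missing.
        missing = self.required_vars - values.keys()
        if missing:
            raise ValueError(f"{self.name} v{self.version} is missing variables: {missing}")
        return self.template.substitute(**values)


class PromptRegistry:
    """Central catalog: every prompt in the system, every version, one place."""

    def __init__(self) -> None:
        self._prompts: dict = {}

    def register(self, spec: PromptSpec) -> None:
        self._prompts[(spec.name, spec.version)] = spec

    def get(self, name: str, version: int) -> PromptSpec:
        return self._prompts[(name, version)]


registry = PromptRegistry()
registry.register(PromptSpec(
    name="rag_answer",
    version=2,   # v1 stays in the registry, so a bad v2 can be rolled back
    template=Template("Answer using only this context:\n$context\n\nQuestion: $question"),
    required_vars=frozenset({"context", "question"}),
))
prompt = registry.get("rag_answer", 2).render(context="...", question="...")
```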
4. security/ — Three Guards, Not One
Single-layer security (usually “check the output for bad words”) is not security. Production systems need three:
- Input guard — before the query hits retrieval. Catch prompt injection, jailbreak attempts, malformed input.
- Content guard — during retrieval. Filter documents that shouldn’t be surfaced given the user’s context or permissions.
- Output guard — before the response leaves the system. PII detection, hallucination flags, policy violations.
Each layer catches different failure modes. Skip one and you have a gap.
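A sketch of the three-guard flow, with deliberately naive checks standing in for real injection, permission, and PII detectors:

```python
import re

INJECTION_MARKERS = ("ignore previous instructions", "reveal your system prompt")


def input_guard(query: str) -> str:
    """Before retrieval: reject obvious prompt-injection and jailbreak attempts."""
    if any(marker in query.lower() for marker in INJECTION_MARKERS):
        raise ValueError("query rejected by input guard")
    return query


def content_guard(chunks: list, user_groups: set) -> list:
    """During retrieval: drop documents the user's permissions don't cover."""
    allowed = user_groups | {"public"}
    return [chunk for chunk in chunks if chunk.get("acl", "public") in allowed]


def output_guard(response: str) -> str:
    """Before the response leaves: crude PII redaction as a stand-in for real detectors."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", response)
```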
5. evaluation/ — The Layer Most Teams Skip
This is the most important layer and the one most commonly absent. Production without evaluation is shipping blind.
Three components:
- Golden dataset — a hand-curated set of query/expected-answer pairs that represents real user intent. This is your ground truth.
- Offline eval — run every change against the golden dataset before it ships. Did retrieval quality go up or down? Did answer faithfulness change?
- Online monitor — production traffic analysis. Are users thumbs-downing certain query types? Are there queries with unusually low confidence scores?
If you ship RAG without an evaluation layer, you don’t know if it’s working. You find out from user complaints.
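A minimal offline-eval harness might look like the sketch below, assuming a JSONL golden dataset whose records carry query, expected_answer, and expected_source fields (the field names and metrics are illustrative, not from the template):

```python
import json
from statistics import mean


def retrieval_hit(expected_source: str, retrieved_ids: list) -> float:
    """1.0 if the document known to contain the answer was retrieved, else 0.0."""
    return 1.0 if expected_source in retrieved_ids else 0.0


def run_offline_eval(golden_path: str, pipeline) -> dict:
    """Run every golden query through the pipeline and report aggregate scores.
    `pipeline(query)` is assumed to return (answer, retrieved_ids)."""
    with open(golden_path) as f:
        golden = [json.loads(line) for line in f]   # one golden record per line

    hit_scores, exact_scores = [], []
    for record in golden:
        answer, retrieved_ids = pipeline(record["query"])
        hit_scores.append(retrieval_hit(record["expected_source"], retrieved_ids))
        exact_scores.append(float(answer.strip() == record["expected_answer"].strip()))

    return {
        "retrieval_hit_rate": mean(hit_scores),
        "exact_match_rate": mean(exact_scores),
        "n_queries": len(golden),
    }
```

The gate is then mechanical: if retrieval_hit_rate or exact_match_rate drops against the previous run, the change doesn't ship.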
6. observability/ — Per-Stage Tracing
“The AI gave a bad answer” is not a useful bug report. You need to know where in the pipeline things went wrong.
- Per-stage tracing — did the query rewriter change the query in a way that hurt retrieval? Did the document grader wrongly filter a good chunk?
- Feedback linked to traces — when a user says “this answer was wrong,” you need to find the exact trace for that query: what was retrieved, what prompt was used, what the model saw.
- Cost per query — not optional for production. LLM calls are expensive. You need to know which query patterns are burning money.
Without per-stage tracing, debugging is archaeology. You’re digging through logs hoping to find the failure.
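One way to get per-stage traces without adopting a full observability stack is a decorator that records latency, an output preview, and cost for each stage under a shared trace_id. This is a hand-rolled sketch, not the template's tooling; in production you would more likely reach for OpenTelemetry or a dedicated LLM tracing product.

```python
import time
import uuid
from functools import wraps

TRACES: dict = {}   # trace_id -> list of per-stage records


def traced(stage: str):
    """Record each pipeline stage against a trace_id, so one bad answer can be
    replayed end to end: what was rewritten, what was retrieved, what was prompted."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(trace_id: str, *args, **kwargs):
            start = time.perf_counter()
            result = fn(trace_id, *args, **kwargs)
            TRACES.setdefault(trace_id, []).append({
                "stage": stage,
                "latency_s": round(time.perf_counter() - start, 4),
                "output_preview": str(result)[:200],
                "cost_usd": 0.0,   # fill in from token counts for LLM-calling stages
            })
            return result
        return wrapper
    return decorator


@traced("rewrite")
def rewrite(trace_id: str, query: str) -> str:
    return query                      # real query rewriter goes here


@traced("retrieve")
def retrieve(trace_id: str, query: str) -> list:
    return ["chunk-1", "chunk-2"]     # real vector search goes here


trace_id = str(uuid.uuid4())
retrieve(trace_id, rewrite(trace_id, "example query"))
print(TRACES[trace_id])               # the full per-stage record for this query
```

Linking a user's thumbs-down to that same trace_id is what turns "this answer was wrong" into a reproducible bug report.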
7. .claude/ — Agent Context for Your AI Coding Assistant
This one is underrated. When you’re using AI coding tools (Claude, Copilot, Cursor), they don’t know your codebase. .claude/ is where you put context: what each module does, architectural decisions, conventions, constraints.
Without this, your AI assistant refactors the query rewriter without knowing it’s supposed to be idempotent, or adds a new retrieval path that bypasses the security layer.
Agent context is documentation that your tools can actually use.
What the Tutorial Version Gets Right
To be fair: the single-file version is the right starting point. You need to understand the happy path before you can design the error paths. The tutorial version teaches you what RAG is.
The nine-layer version teaches you what it takes.
The Honest Gap
Here’s what even the nine-layer template doesn’t show:
- Chunking strategy — how you split documents is as important as how you retrieve them. A chunking/ module should be a first-class citizen, not hardcoded in a seed script.
- Index versioning — when you re-embed with a new model, your evaluation results from the old index are invalid. You need version-locked eval (sketched after this list).
- Data lineage — which version of the raw data produced which index? Without this, you can’t reproduce a retrieval result from six months ago.
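To make the index-versioning and lineage points concrete, one minimal approach is to stamp every index with the embedding model, data snapshot, and chunking config that produced it, and to refuse to compare eval runs across different stamps. The class and field names below are illustrative.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class IndexVersion:
    embedding_model: str     # model name and revision used to embed
    data_snapshot: str       # identifier of the raw-data export that was indexed
    chunking_config: str     # how documents were split (size, overlap, strategy)

    def fingerprint(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


def record_eval_run(results: dict, index: IndexVersion) -> dict:
    """Pin eval results to the exact index they were computed against."""
    return {"index_fingerprint": index.fingerprint(), **results}


def comparable(run_a: dict, run_b: dict) -> bool:
    """Eval numbers only mean something when they came from the same index version."""
    return run_a["index_fingerprint"] == run_b["index_fingerprint"]
```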
These aren’t criticisms of the template. They’re the next layer — the things you discover after you’ve been running production RAG for six months.
The Takeaway
The demo works because the demo controls everything: the data, the queries, the expectations. Production works because the architecture handles the things you didn’t control for.
The difference between a RAG demo and a RAG system isn’t the retrieval algorithm. It’s the nine layers of infrastructure around it.
Credit to Shivani Virdi and Tech with Mak for the template that sparked this breakdown.