When are embeddings overkill for RAG?

Under 10K documents, BM25 keyword matching with good preprocessing can be surprisingly effective. The precision gains from semantic search often don't justify the infrastructure, cost, and latency overhead.

What is LLM reranking for retrieval?

Retrieve more candidates than needed via keyword search, then ask the LLM to score which are actually relevant. This catches cases where keywords match but the context doesn't, or where semantic similarity is high but the chunks don't answer the question.

What problems does the embedding orthodoxy have?

Cost (embedding APIs at scale), latency (multiple round-trips), infrastructure (another database to manage), and semantic blurring (similar ≠ relevant — cosine similarity can miss context).

What's the hybrid retrieval pattern?

Stage 1: BM25 + Embeddings for recall (wide net). Stage 2: LLM scoring for reranking. Stage 3: Dynamic snippet extraction. Stage 4: LLM for final answer. Each stage uses the right tool.

Can you build RAG without embeddings?

Yes. One approach combines tag-based SQL retrieval (an LLM generates searchable tags), multi-tag aggregation scoring, LLM reranking, and dynamic snippet extraction. It needs only SQLite and clever LLM use, with no vector database.

When should I still use embeddings?

Use them for large-scale semantic search over millions of documents where precision matters. But start simple: try BM25 first, add reranking, and measure answer quality, not just retrieval recall. The simple thing often works.

The Embeddings Backlash: When Simpler Retrieval Works Better

By Prahlad Menon | Published 2026-03-04 | 3 min read

There’s a quiet rebellion happening in the RAG community.

After two years of “just embed everything and throw it in a vector database,” developers are discovering that simpler approaches often work better—especially for smaller document collections.

The Embedding Orthodoxy

The standard RAG playbook looks like this:

1. Chunk your documents
2. Embed each chunk with OpenAI/Cohere/etc.
3. Store embeddings in Pinecone/Weaviate/Qdrant
4. At query time, embed the question
5. Find similar chunks via cosine similarity
6. Pass chunks to LLM for answer
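
To make the similarity step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The three-dimensional vectors and chunk names are made-up toy values, not real embeddings; a production system would get vectors from an embedding model and search them in a vector store.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks whose vectors are closest to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings" for illustration only.
CHUNKS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "careers": [0.0, 0.1, 0.9],
}
QUERY = [0.8, 0.2, 0.0]

print(top_k(QUERY, CHUNKS, k=2))
```

Note that the ranking only says which vectors point in a similar direction. It says nothing about whether the chunk actually answers the question.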

This works. But it comes with baggage:

Cost: Embedding APIs aren’t free, especially at scale
Latency: Multiple round-trips (embed query → vector search → LLM)
Infrastructure: Yet another database to manage
Semantic blurring: Similar ≠ relevant (cosine similarity can miss context)

The Rebels

A PHP developer recently shared a RAG system that skips embeddings entirely:

“I’ve been experimenting with building a RAG system that completely skips embeddings and vector databases.”

Their approach:

Tag-based SQL retrieval — LLM generates searchable tags for each document
Multi-tag aggregation scoring — SQL query ranks by tag overlap
LLM reranking — Before generating, the LLM scores and filters results
Dynamic snippet extraction — Pull context windows around keyword matches

No embeddings. No vector database. Just SQLite, PHP, and clever use of the LLM.
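
The original project is written in PHP and its code isn't reproduced here. As a rough illustration of the first two steps, this Python sketch stores tags in SQLite and ranks documents by how many query tags they share. The table names, column names, and sample documents are assumptions made for this example, and the tags are hard-coded where a real system would have an LLM generate them.

```python
import sqlite3

def build_index(docs):
    """Create an in-memory SQLite index from (title, tags) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE document_tags (doc_id INTEGER, tag TEXT, UNIQUE (doc_id, tag))"
    )
    for doc_id, (title, tags) in enumerate(docs, start=1):
        conn.execute(
            "INSERT INTO documents (id, title) VALUES (?, ?)", (doc_id, title)
        )
        conn.executemany(
            "INSERT OR IGNORE INTO document_tags (doc_id, tag) VALUES (?, ?)",
            [(doc_id, tag.lower()) for tag in tags],
        )
    return conn

def search_by_tags(conn, query_tags, limit=5):
    """Rank documents by the number of query tags they match."""
    tags = sorted({tag.lower() for tag in query_tags})
    if not tags:
        return []
    placeholders = ", ".join("?" for _ in tags)
    sql = f"""
        SELECT d.title, COUNT(*) AS overlap
        FROM document_tags AS t
        JOIN documents AS d ON d.id = t.doc_id
        WHERE t.tag IN ({placeholders})
        GROUP BY d.id
        ORDER BY overlap DESC, d.id ASC
        LIMIT ?
    """
    return conn.execute(sql, (*tags, limit)).fetchall()

# Hypothetical documents; in the described approach an LLM would produce the tags.
DOCS = [
    ("Refund policy", ["refund", "billing", "returns"]),
    ("Shipping times", ["shipping", "delivery", "returns"]),
    ("Invoice FAQ", ["billing", "invoice"]),
]

conn = build_index(DOCS)
print(search_by_tags(conn, ["refund", "billing"]))
```

The query tags would also come from the LLM at question time. Only the placeholders are interpolated into the SQL string; the tag values themselves are passed as bound parameters.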

When Simpler Wins

This isn’t a contrarian take for its own sake. There are real scenarios where embeddings add complexity without value:

1. Small to Medium Collections

If you have 1,000 documents instead of 1 million, the precision gains from semantic search often don’t justify the infrastructure. BM25 (keyword matching) with good preprocessing can be surprisingly effective.
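
For a sense of how little machinery BM25 needs, here is a minimal pure-Python sketch of Okapi BM25 scoring. The tokenizer is deliberately naive, the `k1` and `b` values are common defaults rather than tuned settings, and the IDF uses the smoothed "+1 inside the log" variant so scores stay non-negative. For real workloads an established search library or a database full-text index is likely the better choice.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (naive preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Length normalization: long documents are penalized via b.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample documents.
DOCS = [
    "How to request a refund for an order",
    "Shipping times and delivery options",
    "Refund refund policy details",
]

for idx, score in bm25_rank("refund policy", DOCS):
    print(f"{score:.3f}  {DOCS[idx]}")
```

"Good preprocessing" in practice usually means improving `tokenize`: stemming, stop-word removal, and handling domain-specific terms.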

2. Structured Data

For database schema discovery (like soul-schema), embeddings don’t help. You need the LLM to understand column names like cust_ltv and generate descriptions. The “retrieval” is just querying metadata—SQL does that fine.

3. When You Control the Query

If users are searching your docs through an LLM interface, you can:

Have the LLM expand/reformulate the query
Use the LLM to rerank results
Let the LLM request more context if needed

The LLM becomes part of the retrieval loop, not just the generation step.
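
As one example of putting the LLM in the loop, here is a sketch of query expansion. The `llm` argument is a hypothetical callable that takes a prompt string and returns the model's text reply; it stands in for whichever client you use. The prompt wording and the `fake_llm` stub are assumptions so the example runs offline.

```python
from typing import Callable, List

def expand_query(question: str, llm: Callable[[str], str], max_terms: int = 8) -> List[str]:
    """Ask the LLM for search keywords and return them cleaned and deduplicated."""
    prompt = (
        f"List up to {max_terms} short search keywords or synonyms for the "
        "question below, one per line, with no numbering.\n\n"
        f"Question: {question}"
    )
    reply = llm(prompt)

    terms: List[str] = []
    for line in reply.splitlines():
        # Models often add bullets or stray whitespace; strip them defensively.
        term = line.strip().strip("-* ").lower()
        if term and term not in terms:
            terms.append(term)
    return terms[:max_terms]

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical canned reply)."""
    return "refund\n- money back\nRefund\nreturn policy\n"

print(expand_query("How do I get my money back?", fake_llm))
```

The expanded terms can then feed a keyword search, which helps when the user's wording differs from the wording in the documents.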

The Reranking Pattern

The most interesting technique from the PHP project is LLM reranking:

Query → Keyword Search → Top 20 results → LLM Reranks → Top 5 → Generate Answer

Instead of trusting vector similarity to find the best matches, you retrieve more candidates than you need, then ask the LLM: “Which of these are actually relevant to the question?”

This catches cases where:

Keywords match but context doesn’t
Semantic similarity is high but the chunk doesn’t answer the question
The best answer is in a document that scores only moderately on both keyword and semantic matching
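
A minimal sketch of the rerank step, under stated assumptions: `llm` is a hypothetical callable that takes a prompt and returns text, the 0 to 10 scoring prompt is one possible wording, and `fake_llm` is a stub so the example runs without a model. This is not the PHP project's code, just an illustration of the pattern.

```python
import re
from typing import Callable, List

PROMPT = (
    "Question: {question}\n\n"
    "Passage: {passage}\n\n"
    "On a scale from 0 to 10, how well does the passage answer the question? "
    "Reply with a single number."
)

def parse_score(reply: str) -> float:
    """Pull the first number out of the reply; unparseable replies score 0."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if not match:
        return 0.0
    return max(0.0, min(10.0, float(match.group())))

def llm_rerank(
    question: str,
    candidates: List[str],
    llm: Callable[[str], str],
    keep: int = 5,
    min_score: float = 5.0,
) -> List[str]:
    """Score each candidate with the LLM, drop weak ones, return the best `keep`."""
    scored = []
    for passage in candidates:
        reply = llm(PROMPT.format(question=question, passage=passage))
        scored.append((parse_score(reply), passage))
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]

def fake_llm(prompt: str) -> str:
    """Stand-in scorer (hypothetical): rewards passages that state a duration."""
    passage = prompt.split("Passage: ")[1]
    if "business days" in passage:
        return "Score: 9"
    if "refund" in passage.lower():
        return "3"
    return "1"

CANDIDATES = [
    "Our refund team won an award in 2023.",
    "Shipping takes 3 days.",
    "Refunds are issued within 5 business days.",
]

print(llm_rerank("How long do refunds take?", CANDIDATES, fake_llm))
```

This version makes one LLM call per candidate, which is simple but adds latency and cost. Scoring all 20 candidates in a single prompt is a common alternative, at the price of more fragile output parsing.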

We’re adding this to soul.py’s roadmap. The hybrid RAG+RLM routing already helps, but explicit reranking could improve precision further.

The Hybrid Future

I don’t think embeddings are going away. For large-scale semantic search, they’re still the best tool. But the “embed everything” default is being questioned.

The emerging pattern is hybrid retrieval:

Stage	Method	Purpose
1. Recall	BM25 + Embeddings	Cast a wide net
2. Rerank	LLM scoring	Filter for relevance
3. Extract	Dynamic snippets	Optimize context window
4. Generate	LLM	Final answer

Each stage uses the right tool for the job. Keywords for recall, semantics for similarity, LLM for judgment.
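
Stage 3 in the table above, dynamic snippet extraction, can be sketched as pulling a fixed character window around each keyword match and merging windows that overlap. The window size, the snippet cap, and the character-based (rather than sentence-based) boundaries are assumptions chosen to keep the example short.

```python
import re

def extract_snippets(text, keywords, window=60, max_snippets=3):
    """Return text windows around keyword matches, merging overlapping ones."""
    terms = [re.escape(keyword) for keyword in keywords if keyword]
    if not terms:
        return []
    pattern = re.compile("|".join(terms), re.IGNORECASE)

    spans = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if spans and start <= spans[-1][1]:
            # Overlaps the previous window: extend it instead of duplicating text.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return [text[start:end].strip() for start, end in spans[:max_snippets]]

# Hypothetical sample document.
DOCUMENT = (
    "Orders ship within two days. If an item arrives damaged, you can request "
    "a refund from your account page. " + "Unrelated filler text. " * 10 +
    "Refund requests are reviewed within 5 business days."
)

for snippet in extract_snippets(DOCUMENT, ["refund"], window=40):
    print("...", snippet, "...")
```

Character windows can cut words in half. Snapping the boundaries to sentence or paragraph breaks is a natural refinement if the context quality matters more than the simplicity.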

Practical Takeaways

Start simple: Try BM25 before reaching for embeddings. You might be surprised.
Consider your scale: Under 10K documents? Embeddings might be overkill.
Add reranking: Even with embeddings, an LLM reranking step can significantly improve precision.
Measure what matters: Track answer quality, not just retrieval recall. Sometimes fewer, better results beat more, fuzzier ones.
Use the LLM in the loop: Query expansion, reranking, and snippet extraction are all places where an LLM can help—not just the final generation.
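
On the measurement point, a small sketch of precision@k and recall@k shows why recall alone can mislead: retrieving more results raises recall while diluting precision. The document ids and relevance labels here are invented for illustration, and these metrics cover retrieval only. Answer quality still needs its own evaluation.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Compute (precision@k, recall@k) for a ranked list of document ids."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked results and hand-labeled relevant documents.
RETRIEVED = ["a", "b", "c", "d", "e"]
RELEVANT = {"a", "c", "z"}

for k in (2, 5):
    precision, recall = precision_recall_at_k(RETRIEVED, RELEVANT, k)
    print(f"k={k}: precision={precision:.2f} recall={recall:.2f}")
```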

The embedding orthodoxy served us well for bootstrapping the RAG ecosystem. But like any orthodoxy, it’s worth questioning. Sometimes the simple thing just works.

Building RAG systems? Check out soul.py for persistent agent memory with hybrid RAG+RLM routing.