Semantic vs Keyword vs Hybrid Search: What Every RAG Demo Skips

Every RAG demo shows embeddings and stops there. Real production search almost always mixes keyword and semantic retrieval. Here's what's happening under the hood, why hybrid wins, and a runnable Postgres example in ~40 lines.

Every RAG tutorial starts the same way: chunk your docs, embed them, throw them in a vector store, query with cosine similarity. Done.

It's a great demo. It's also not how serious search systems work.

The moment a user types error code E_1042 or Llama-3.1-70B or a product SKU, pure semantic search starts quietly failing — because an embedding of E_1042 is a vector of noise. Meanwhile, keyword search has the opposite problem: type "how do I cancel my subscription" and it misses the document titled "ending your membership".

So real systems use both. This post is about what each one is actually doing, why hybrid beats either alone, and how to build it in Postgres in ~40 lines.


Keyword Search: BM25 in One Paragraph

Keyword search finds documents containing your query terms. The question is how to rank them.

BM25 (the de facto ranking function since the 1990s) is basically three ideas stacked:

  1. Term frequency — the more a word appears in a doc, the more relevant, but with diminishing returns (the 20th occurrence of "database" doesn't help much).
  2. Inverse document frequency — rare words count more. "the" is useless, "pgvector" is a strong signal.
  3. Length normalization — longer documents would otherwise win by accident, so scores are normalized by length.

That's it. No machine learning, no GPU, no training. Under the hood it's an inverted index (word → list of documents containing it), which makes it blindingly fast even at billions of documents.
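To make those three ideas concrete, here's a minimal Okapi BM25 scorer in plain Python with the usual `k1`/`b` defaults — a sketch for intuition, not a substitute for a real inverted index:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one document against a query with Okapi BM25.

    corpus: list of documents, each a list of tokens.
    k1 controls term-frequency saturation (diminishing returns);
    b controls length normalization.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # docs containing the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # rare terms weigh more
        tf = doc.count(term)                              # saturates via k1 below
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))     # length-normalized TF
    return score
```

With this in hand you can see idea 2 directly: scoring a rare token like "pgvector" contributes far more than scoring "the", even at the same term frequency.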

What it's good at: exact matches, IDs, rare tokens, acronyms, product names, filenames, version numbers, anything literal.

What it fails at: synonyms ("car" → "automobile"), paraphrase, conceptual queries, cross-language.


Semantic Search: Embeddings in One Paragraph

An embedding model turns text into a vector — say, 768 floats — such that texts with similar meaning end up close in that vector space. "How do I cancel?" and "ending your subscription" become near-neighbors.

To search, you embed the query and find the nearest vectors. Naively that's O(N) — check distance to every doc — which doesn't scale past a few million. So we use approximate nearest neighbor (ANN) indexes:

  • HNSW (Hierarchical Navigable Small World) — a graph where each node links to its near neighbors at multiple resolutions. Fast, accurate, memory-hungry.
  • IVF (Inverted File) — cluster all vectors first, only search the nearest clusters at query time. Lighter, slightly less accurate.
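To make "embed the query, find the nearest vectors" concrete before any index enters the picture, here's the naive O(N) version as a NumPy sketch — assuming the embeddings are already L2-normalized, so cosine similarity reduces to a dot product. HNSW and IVF exist to replace exactly this brute-force loop:

```python
import numpy as np

def nearest(query_vec, doc_vecs, top_k=5):
    """Brute-force cosine search over a (num_docs, dim) matrix.

    Assumes query_vec and each row of doc_vecs are L2-normalized,
    so cosine similarity is just a dot product. Fine for small
    corpora; at scale, an ANN index does this approximately.
    """
    sims = doc_vecs @ query_vec            # one dot product per document
    top = np.argsort(-sims)[:top_k]        # highest similarity first
    return [(int(i), float(sims[i])) for i in top]
```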

What it's good at: paraphrase, synonyms, conceptual similarity, cross-language, fuzzy intent.

What it fails at: rare tokens (the embedding has never seen them), acronyms, identifiers, numbers, exact-match requirements. Also expensive — every query means an embedding model call plus a vector search.


Why Hybrid Beats Both

Run the same query through both systems and you get two ranked lists. How do you merge them?

The surprisingly simple answer that keeps winning benchmarks: Reciprocal Rank Fusion (RRF). For each document d that appears in any ranker r, sum 1 / (k + rank_r(d)) across all the rankers:

           ┌── keyword ranker (BM25) ──┐
query ────┤                            ├──► RRF merge ──► final ranking
           └── semantic ranker (vec) ──┘

           score(d) =  Σ   1 / (k + rank_r(d))
                      r∈R

k is a smoothing constant — typically 60. It dampens the gap between rank 1 and rank 2 so the top result doesn't utterly dominate the fused score. Raise k and lower ranks contribute more; lower k and the top ranks dominate. The default works for almost everyone.

def rrf(ranked_lists, k=60):
    """Merge ranked lists of doc IDs via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])    # highest fused score first
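A quick sanity check of the fusion behavior, using hypothetical doc IDs (the `rrf` definition is restated so the snippet runs on its own) — a document ranked in both lists outranks the top result of either single list:

```python
def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion, as defined above."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

keyword_hits  = ["a", "b", "c"]   # hypothetical BM25 ranking
semantic_hits = ["b", "c", "d"]   # hypothetical vector ranking

fused = rrf([keyword_hits, semantic_hits])
print([doc for doc, _ in fused])  # → ['b', 'c', 'a', 'd']
```

Note that "b" wins despite never being rank 1 anywhere: two moderate ranks beat one great one.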

No calibration between score scales, no tuning, no training. A document that ranks well in either list bubbles up; a document that ranks in both lists bubbles up strongly. That's why it works — the two systems have different failure modes, and RRF exploits that.


Build It In Postgres

Postgres does both keyword (via tsvector + GIN index) and semantic (via pgvector + HNSW index) natively. One table, two indexes, one hybrid query.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
  id SERIAL PRIMARY KEY,
  content TEXT,
  content_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
  embedding VECTOR(768)
);

CREATE INDEX docs_tsv_idx ON docs USING GIN (content_tsv);
CREATE INDEX docs_vec_idx ON docs USING HNSW (embedding vector_cosine_ops);

Heads-up: HNSW requires pgvector ≥ 0.5.0. On older versions, swap HNSW for IVFFLAT — slightly less accurate but identical from the query side.

Now a hybrid query with RRF, all in one SQL statement:

WITH kw AS (
  SELECT id, row_number() OVER (ORDER BY ts_rank_cd(content_tsv, query) DESC) AS rnk
  FROM docs, plainto_tsquery('english', 'how to cancel subscription') query
  WHERE content_tsv @@ query
  ORDER BY rnk
  LIMIT 50
),
vec AS (
  SELECT id, row_number() OVER (ORDER BY embedding <=> $1::vector) AS rnk
  FROM docs
  ORDER BY embedding <=> $1::vector
  LIMIT 50
)
SELECT d.id, d.content,
       COALESCE(1.0/(60 + kw.rnk), 0) + COALESCE(1.0/(60 + vec.rnk), 0) AS score
FROM docs d
LEFT JOIN kw  ON kw.id  = d.id
LEFT JOIN vec ON vec.id = d.id
WHERE kw.id IS NOT NULL OR vec.id IS NOT NULL
ORDER BY score DESC
LIMIT 10;

$1 is the query embedding — a VECTOR(768) your application layer (Python, Node, Go) has already computed by calling the embedding model. Postgres doesn't embed on its own; it just stores and searches the result. So your app does: (1) embed the query, (2) send the vector as a parameter to this SQL. That's your hybrid retriever — one database, one query, no extra infrastructure.
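The application side can be a few lines. A hedged sketch: `embed` stands in for whatever embedding-model call you use (it is not provided here), `conn` is any DB-API connection such as psycopg2, and with psycopg2 the two `$1::vector` placeholders in the SQL above become a named `%(qvec)s::vector` parameter. pgvector accepts the vector as a text literal like `[0.1,0.2,...]`:

```python
def to_pgvector(vec):
    """Format a list of floats as a pgvector text literal: '[0.1,0.2,...]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

def hybrid_search(conn, query_text, embed, sql):
    """Run the hybrid RRF query.

    conn:  any DB-API connection (e.g. psycopg2) -- assumed here
    embed: your embedding-model call, returning 768 floats -- assumed here
    sql:   the statement above with %(qvec)s::vector in place of $1::vector
    """
    query_vec = embed(query_text)                      # (1) embed in the app layer
    with conn.cursor() as cur:                         # (2) ship the vector to Postgres
        cur.execute(sql, {"qvec": to_pgvector(query_vec)})
        return cur.fetchall()
```

In production you'd parameterize the keyword query text the same way instead of hardcoding it in the SQL.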


What About Elasticsearch, Qdrant, Weaviate?

Postgres is a fine choice for most teams. The dedicated tools matter when scale, flexibility, or specific features push you past it.

| Tool | Best at | Weak at |
|---|---|---|
| Postgres (pgvector + tsvector) | Teams already on Postgres, moderate scale (< ~10M vectors), transactional data next to embeddings | Billion-scale vector search, complex BM25 tuning, multi-tenant reranking |
| Elasticsearch / OpenSearch | Mature BM25, aggregations, faceting, geo, log search; native hybrid via RRF since 8.x | Embeddings are a second-class citizen; heavier ops |
| Qdrant | Pure vector workloads, clean Rust implementation, fast filters on payload, simple to run | Keyword search is basic (no BM25) — you'll pair it with something else |
| Weaviate | Built-in hybrid (BM25 + vectors) as a first-class feature, strong schema + modules for embedding pipelines | Opinionated architecture; lock-in on their query language |
| Vespa | The only one that plays seriously in all three axes — scale, keyword quality, vector quality — at FAANG scale | Steep learning curve; overkill for most teams |

Rule of thumb: start with Postgres. Move to Qdrant or Weaviate when vector count crosses ~10M or you need low-latency ANN at high QPS. Use Elasticsearch/OpenSearch when keyword quality and faceting are the main product. Reach for Vespa when all three dimensions matter and nothing else scales.


Decision Matrix

| Your query looks like… | Reach for | Why |
|---|---|---|
| Product SKUs, error codes, version numbers, filenames | Keyword | Embeddings have never seen these exact tokens; BM25 treats them as high-IDF signals |
| Conceptual, paraphrased, cross-language | Semantic | Different words, same meaning — keyword can't see the connection, vectors can |
| Real user queries in a product — mixed and messy | Hybrid | Users type both kinds in the same session; hybrid has no downside when one side is weak |
| "Most similar to this other document" | Semantic | Pure vector problem — rank docs by distance in embedding space |
| Log search, structured fields, faceting | Keyword / Elasticsearch | Exact matches + aggregations matter more than meaning |

If you're not sure — and in production you're usually not — just use hybrid. RRF has no downside: if one side is useless for a particular query, it simply contributes near-zero and the other side wins the ranking.


Gotchas Nobody Tells You

  • Chunking matters more than the embedding model. A great model on badly chunked docs loses to a mediocre model on well-chunked docs. Start with ~300-token chunks with overlap.
  • Multilingual content breaks keyword search. to_tsvector('english', ...) silently butchers non-English text. Either detect language and use the right dictionary, or lean more on semantic.
  • Stop words cut both ways. Removing "the" helps keyword search. Removing "not" changes the meaning completely for semantic.
  • Rerankers beat tuning. After your hybrid retrieves 50 candidates, a cross-encoder reranker (e.g., bge-reranker) re-scores them pairwise against the query. It's the single biggest quality lift you can add in an afternoon.
  • Latency budget is a product decision. Pure keyword: ~5ms. Pure semantic with HNSW: ~20ms. Hybrid with a reranker: ~200ms. That last one can't live in a typeahead; it can live in RAG.
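Since chunking is the highest-leverage item on that list, here's a minimal fixed-size chunker with overlap. It uses whitespace-separated words as a rough stand-in for model tokens (real pipelines count tokens with the model's tokenizer), and the 50-word overlap default is an arbitrary starting point, not a recommendation from this post:

```python
def chunk(text, size=300, overlap=50):
    """Split text into ~size-word chunks, with `overlap` words shared
    between consecutive chunks so context isn't stranded at a boundary.

    Words approximate tokens here; swap in a real tokenizer for production.
    """
    words = text.split()
    step = size - overlap                  # how far the window advances each time
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```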

Closing

The "semantic vs keyword" framing is a false choice. They're complementary — each covers where the other is blind. Hybrid retrieval with RRF is almost free to add, needs no training, and works across every vector store that also supports BM25.

If you take one thing away: before you pick a new vector database, check whether Postgres already does what you need. For most teams, it does.

Next time someone shows you a RAG demo with just embeddings, you'll know what's missing.
