RAG Ingestion & Chunking — The Missing Engine Behind Hybrid Search

You tuned the embedding model. You went hybrid. Your RAG still misses. The bug is upstream — in how you split documents. Five chunking strategies, when each wins, and how to actually evaluate them.


You wired up a vector DB. You went hybrid — BM25 plus embeddings plus a reranker. You tuned top-k, you even tried a bigger embedding model.

And the answer still misses.

Before you blame the model or the reranker, look one step earlier in the pipeline. There's a preprocessing decision most teams treat as a throwaway one-liner — text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100) — and it's quietly setting the ceiling on everything that happens after it.

This post is about chunking: the unsexy, under-discussed step that decides whether your retrieval engine sees the right thing at all. Five strategies, when each one wins, the tradeoffs that nobody prints on the README, and how to evaluate chunking without kidding yourself.


Why Chunking Is an Architectural Decision, Not a Preprocessing Step

Retrieval is a search over embedded units. Every decision you make about what counts as a unit shapes the entire downstream experience:

  • Chunk too small → embeddings lose context. "The policy" on its own embeds nothing. The retrieved snippets are precise but fragmentary; the LLM has to reason across five chunks to answer one question.
  • Chunk too large → embeddings smear across topics. A 2000-token chunk covering refunds, shipping, and account setup produces a single vector that matches loosely to all three queries and strongly to none. Your top-5 looks reasonable; your answer is vague.
  • Chunk at the wrong boundaries → the model gets half of a table, a sentence split across two chunks, a code block whose closing brace lives in chunk 17. Every failure here looks like "the LLM hallucinated" but originated in the splitter.

The ceiling on your retrieval quality is set here. A perfectly tuned reranker can't rescue a chunk that was split through the middle of a key clause.


Strategy 1 — Fixed-Size Character / Token Splitting

The baseline everyone starts with.

def fixed_split(text: str, size: int = 1000, overlap: int = 100):
    # Slide a fixed-width window; consecutive chunks share `overlap` characters.
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]
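To see the failure mode concretely, run the same windowing on two short sentences (toy sizes, for illustration) — the boundary lands wherever the character count says, not where the sentence ends:

```python
# Toy demo of the boundary problem: a 40-character window with 10 of
# overlap cuts straight through a word (sizes chosen for illustration).
text = "Refunds are issued within 30 days. Shipping takes 5 business days."
size, overlap = 40, 10
chunks = [text[i : i + size] for i in range(0, len(text), size - overlap)]
# chunks[0] ends with "Shipp" — the second sentence's first word is split
```

Scale the sizes up and the cuts get rarer, but they never go away; they just move to wherever your key clause happens to sit.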

Why it's popular: trivial, fast, deterministic. Works fine for homogeneous corpora (blog posts, news articles) where boundaries don't carry meaning.

Why it fails: it cuts through sentences, tables, code blocks, list items. It assumes every 1000-character window has equal semantic weight, which is almost never true.

Use it when: you're prototyping, you have an hour, and the cost of being wrong is low. Never ship it to production for anything structured.


Strategy 2 — Recursive Character Splitting

The LangChain default, and for good reason.

You provide a list of separators in priority order — ["\n\n", "\n", ". ", " ", ""] — and the splitter tries each one until the resulting chunk fits the target size. Paragraphs first, then lines, then sentences, then words.

from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)

Why it's popular: respects natural document rhythm most of the time. Forgiving across prose, markdown, and lightly structured text.

Why it fails: it's still blind to content. It doesn't know a ## Heading should start a new chunk, or that this paragraph about refunds shouldn't be glued to the next paragraph about shipping just because they both fit in the window. And for code, it's a disaster — variable scopes, function bodies, and JSX trees don't care about your separator list.

Use it when: your corpus is prose, loosely structured markdown, or mixed-format general documents. This is the right default to start with. It's not the right default to stay with.


Strategy 3 — Document-Structure-Aware Splitting

Stop fighting the document. Use its structure.

Markdown has headers. HTML has <section>. PDFs (via pypdf or unstructured) have page breaks, tables, and sometimes outline levels. Source code has functions, classes, blocks. A JSON document has keys.

The rule: chunk along the boundaries the author already drew.

from langchain.text_splitter import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ],
)
chunks = splitter.split_text(markdown_doc)

Each chunk now carries its heading as metadata — which you can prepend back into the embedding text so "Refund policy" appears inside every chunk under that section, not just its own isolated one.
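One way to do the prepending — here modeling the splitter's output as a plain dict rather than LangChain's exact return type, with illustrative data:

```python
# Sketch: prepend the heading path into the text that gets embedded, so
# section context travels with every chunk. The dict shape is a stand-in
# for the splitter's (page_content, metadata) output.
def with_heading_path(chunk_text: str, metadata: dict) -> str:
    path = " > ".join(metadata[k] for k in ("h1", "h2", "h3") if k in metadata)
    return f"{path}\n\n{chunk_text}" if path else chunk_text

chunk = {
    "text": "Items can be returned within 30 days of delivery.",
    "metadata": {"h1": "Policies", "h2": "Returns & Refunds"},
}
embed_text = with_heading_path(chunk["text"], chunk["metadata"])
# embed_text now starts with "Policies > Returns & Refunds"
```

Embed `embed_text`, but store the original chunk text for display — the heading path is retrieval signal, not something the user needs to re-read.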

Why it matters: retrieval now pulls coherent semantic units. When a query hits "refund window," you get the whole refund section, not paragraph 3 of it.

Use it when: your source is structured (docs site, Confluence export, Notion, well-formed PDFs, source code). This is the biggest quality lift for the least effort you'll get in a RAG pipeline. Underused.


Strategy 4 — Semantic Chunking

Let the embedding model decide where to cut.

The idea: walk the document sentence by sentence, embed each sentence, and look for the points where the embedding jumps — where the topic shifts. Cut there.

import numpy as np

def semantic_chunks(sentences, embed_fn, threshold_percentile=85):
    # Embed every sentence, then measure cosine similarity between neighbors.
    vecs = np.array([embed_fn(s) for s in sentences])
    sims = [np.dot(vecs[i], vecs[i + 1]) /
            (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[i + 1]))
            for i in range(len(vecs) - 1)]
    # Cut where similarity falls into the lowest (100 - percentile)% of
    # pairs — i.e. where the topic shifts hardest.
    drop_threshold = np.percentile(sims, 100 - threshold_percentile)
    cut_points = [i + 1 for i, s in enumerate(sims) if s < drop_threshold]
    # Split the sentence list at the cut points.
    bounds = [0, *cut_points, len(sentences)]
    return [sentences[a:b] for a, b in zip(bounds, bounds[1:])]

Why it's elegant: no hardcoded size, no separator list. The cuts happen at real topic boundaries, not arbitrary character counts.

Why it bites: it's expensive (you embed every sentence before retrieval even begins), the threshold is a hyperparameter you have to tune per corpus, and on noisy documents the topic-shift signal is weak. For a 10M-document corpus the embedding cost at ingest alone can be painful.

Use it when: document boundaries are genuinely ambiguous (interview transcripts, long-form articles without clear headers, chat logs). Skip it when you have structure — Strategy 3 will beat it at a fraction of the cost.


Strategy 5 — Late Chunking

The newest idea on this list, and worth knowing.

Traditional pipelines chunk first, then embed each chunk. The embedding model never sees the rest of the document. A chunk about "the refund window" loses any signal from the preceding section titled "Returns & Refunds."

Late chunking inverts the order: embed the whole document with a long-context embedding model first, producing a sequence of token-level embeddings that all saw each other. Then pool those token embeddings into chunks afterwards.

Document ──► long-context embedder ──► token embeddings
                                             │
                                             ▼
                                      chunk via pooling
                                      (mean over spans)

The resulting chunk embeddings inherit context from the full document. A chunk about the refund window now carries the flavor of "Returns & Refunds" even if the word never appears inside it.
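The pooling step itself is simple. A sketch with toy numbers — random vectors standing in for the token embeddings a real long-context model would produce:

```python
import numpy as np

# Late-chunking pooling sketch (toy data): assume the embedder already
# produced one embedding per token, and chunk boundaries are token spans.
rng = np.random.default_rng(0)
token_embs = rng.normal(size=(12, 4))   # 12 tokens, 4-dim embeddings
spans = [(0, 5), (5, 9), (9, 12)]       # token ranges per chunk

# Mean-pool each span into one chunk embedding. Because every token
# vector attended to the whole document, context is already baked in.
chunk_embs = np.stack([token_embs[a:b].mean(axis=0) for a, b in spans])
# chunk_embs.shape == (3, 4): one vector per chunk
```

The spans can still come from any boundary logic you like — structure-aware, fixed-size, semantic — which is what makes the technique composable with the earlier strategies.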

Why it's attractive: you get contextual chunks without the engineering overhead of Strategy 3 (structure-aware) or the hyperparameter tuning of Strategy 4 (semantic).

Why it's not a silver bullet: requires a long-context embedding model (Jina v3, some newer open-weight options). Inference cost per document is higher. Best gains show up on documents where topical context flows — policies, long reports, legal text — and are smaller on documents that are already well-structured.

Use it when: you care about retrieval quality on long, flowing documents and you're willing to pay a higher ingest cost. Don't use it on short atomic docs (FAQ answers, product pages) — you'll see no measurable lift.


Chunk Size and Overlap — The Two Dials That Matter

Independent of strategy, every chunker exposes two knobs.

Chunk size. Roughly speaking:

  • 200–400 tokens — precise retrieval, high recall on narrow queries, but every answer requires stitching multiple chunks together.
  • 500–900 tokens — the practical sweet spot for most RAG apps. Enough context for the LLM to reason within a single chunk; still specific enough to embed cleanly.
  • 1500+ tokens — useful when your documents are question–answer pairs or self-contained sections that benefit from staying whole. Beware: retrieval precision drops because each vector covers more topical ground.

Chunk overlap. The piece everyone forgets to tune. Overlap lets a sentence that sits near a boundary appear in both neighboring chunks, so a query matching it retrieves at least one. Typical values: 10–20% of chunk size. More overlap = more storage, more duplicate hits in top-k; less overlap = more boundary-cut failures.

Rule of thumb: start at 600 tokens with 100-token overlap. Move from there based on the eval (see the next section), not on vibes.
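The storage cost of overlap is easy to quantify. A back-of-envelope with those suggested defaults:

```python
# With chunk size 600 and overlap 100, the stride between chunk starts
# is 500 tokens, so each token is stored 600/500 = 1.2x on average.
size, overlap = 600, 100
stride = size - overlap
storage_multiplier = size / stride            # 1.2

# Chunk count for a 100k-token document (ceiling division).
doc_tokens = 100_000
n_chunks = -(-(doc_tokens - overlap) // stride)   # 200 chunks
```

Doubling the overlap to 200 pushes the multiplier to 1.5x — worth knowing before you crank the dial on a large corpus.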


How to Actually Evaluate Chunking

Most teams tune chunking by running two versions through ChatGPT and asking "which answer looks better?" That's vibe-check evaluation. It drifts, it's subjective, and it tells you nothing about retrieval — only about the end-to-end LLM output.

A real eval separates retrieval from generation:

  1. Build a golden set. 50–200 real questions with the document passages that should be retrieved. Hand-label them once — you'll reuse this forever.
  2. Evaluate retrieval in isolation. For each question, run the retriever (with the chunking strategy you want to test) and measure: was the golden passage in the top-5? top-10? Compute recall@k and MRR (mean reciprocal rank). These are your real chunking metrics.
  3. Only then pipe to the LLM and measure answer correctness. Separating the stages means you know whether a failure is bad retrieval, bad chunking, or bad generation.
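Steps 1 and 2 fit in a few lines. A minimal sketch, with hypothetical passage ids standing in for a real golden set:

```python
# Each row: the hand-labeled gold passage for a question, plus the
# ranked ids the retriever returned (illustrative data).
golden = [
    {"gold": "p7",  "retrieved": ["p2", "p7", "p9", "p1", "p4"]},
    {"gold": "p3",  "retrieved": ["p3", "p8", "p2", "p5", "p6"]},
    {"gold": "p11", "retrieved": ["p4", "p2", "p9", "p1", "p6"]},  # miss
]

def recall_at_k(rows, k):
    # Fraction of questions whose gold passage appears in the top-k.
    return sum(r["gold"] in r["retrieved"][:k] for r in rows) / len(rows)

def mrr(rows):
    # Mean reciprocal rank: 1/rank of the gold passage, 0 on a miss.
    total = 0.0
    for r in rows:
        if r["gold"] in r["retrieved"]:
            total += 1.0 / (r["retrieved"].index(r["gold"]) + 1)
    return total / len(rows)

print(recall_at_k(golden, 5), mrr(golden))  # recall@5 = 2/3, MRR = 0.5
```

Rerun exactly this over the same golden set for each chunking strategy; the chunker that produces the highest recall@k is the one the rest of the pipeline gets to build on.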

When you swap chunking strategies, recall@10 should move. If it doesn't, you're measuring noise and should widen your golden set before believing the comparison.


Decision Tree for Builders

Start here, adjust as evidence comes in:

  1. Is your corpus structured? (Markdown / HTML / code / outlined PDFs) → Strategy 3 (structure-aware). Prepend the heading path into each chunk's embedding text.
  2. Is it prose or mixed documents without clean structure? → Strategy 2 (recursive) as the default. Size 600, overlap 100.
  3. Is it long-form flowing text where topic transitions matter (legal, policy, transcripts)? → Strategy 5 (late chunking) if you have budget; Strategy 4 (semantic) if you don't.
  4. Are you still prototyping? → Strategy 1 (fixed-size) for one weekend. Do not ship.

Then: build the eval set before you iterate. Every chunking change without a recall@k number next to it is wasted tuning.


The Part That Doesn't Fit on a Slide

The most common cause of a "RAG that works on demos and fails in prod" is chunking that was good enough for the first 20 queries and slowly leaks failures as the corpus grows. Nothing in your monitoring will tell you this directly — the pipeline doesn't throw errors, it just quietly retrieves the wrong things more often as documents get more diverse.

Treat chunking as a living part of the system. Re-evaluate when your corpus shifts. Re-chunk when you onboard a new document type. Keep the golden set warm and add to it as users report bad answers.

The embedding model and the reranker get all the attention. The chunker is what decides whether either of them had a chance.
