How to chunk text for RAG the right way (size, overlap and boundaries)

Good retrieval starts with good chunks. Here's how to pick a chunk size and overlap, why sentence boundaries beat fixed windows, and a free browser tool that does it with exact token counts.

Retrieval-augmented generation lives or dies on its chunks. You embed pieces of your documents, store the vectors, and at query time you retrieve the closest pieces and feed them to the model. If the pieces are badly cut, retrieval surfaces the wrong thing and the model answers from noise. Here is how to chunk well, without overthinking it.

Pick a size in tokens, not characters

Embedding models and LLMs both think in tokens, so size your chunks in tokens too. A common sweet spot is 200 to 500 tokens per chunk. Smaller chunks give precise retrieval but lose context; larger chunks carry more context but dilute the embedding, so a query matches a chunk that's only partly relevant. If you're unsure, start around 300 and adjust based on how your answers look.
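To see why character counts are a poor proxy, here is a minimal Python sketch. The `count_tokens` function is a hypothetical stand-in that counts words and punctuation marks; it is not a real model tokenizer, so the numbers are only illustrative. Swap in your model's own tokenizer for real counts.

```python
import re

def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): each word or punctuation mark is one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))

def within_budget(text: str, max_tokens: int, count=count_tokens) -> bool:
    """Check a chunk against a token budget using whichever counter you pass in."""
    return count(text) <= max_tokens

prose = "hello world!"   # 12 characters
dense = "x=1;y=2;z=3;"   # 12 characters

print(len(prose), count_tokens(prose))  # 12 3
print(len(dense), count_tokens(dense))  # 12 12
```

Both strings are the same length in characters, yet under this stand-in one uses four times the token budget of the other. Passing the counter as a parameter means the budget check stays the same when you switch to a real tokenizer.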

Always add overlap

Hard boundaries cut ideas in half. A definition might start at the end of one chunk and finish at the start of the next, so neither chunk contains the whole thought, and neither retrieves well for a question about it. Overlap fixes this: repeat the last 10 to 20% of each chunk at the start of the next. At 300 tokens per chunk, that is roughly 30 to 60 tokens. It costs a little storage and some duplicate text, and in return it can noticeably improve recall for questions whose answer sits near a boundary.
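As a rough illustration of overlap, the sketch below slides a fixed window over a list of tokens. The function name `window_chunks` and its parameters are made up for this example, and the tokens are plain integers so the repeated ones are easy to spot.

```python
def window_chunks(tokens: list, size: int, overlap: int) -> list:
    """Split tokens into windows of `size`, each repeating the last `overlap` tokens of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

print(window_chunks(list(range(10)), size=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Tokens 3 and 6 appear in two chunks each, so a thought that straddles either boundary is still whole in at least one chunk. The window advances by `size - overlap`, which is why the overlap has to stay smaller than the size.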

Respect natural boundaries

Splitting on a fixed token count cuts mid-sentence, sometimes mid-word, and produces ugly chunks that embed poorly. Prefer to pack whole sentences up to your size limit, and for structured documents, keep paragraphs together where they fit. Reserve fixed-window splitting for text with no usable punctuation (some logs, transcripts without sentence breaks). The goal is chunks that read as coherent passages on their own.
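Packing whole sentences up to a limit can be sketched in a few lines of Python. This is a simplified version under two loud assumptions: sentences are split naively on `.`, `!` or `?` followed by whitespace, and `count_tokens` is a whitespace word count standing in for a real tokenizer.

```python
import re

def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())

def split_sentences(text: str) -> list:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(text: str, max_tokens: int) -> list:
    """Greedily pack whole sentences into chunks of at most max_tokens."""
    chunks, current, current_tokens = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would push it over budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five. Six seven eight nine. Ten."
print(pack_sentences(text, max_tokens=5))
# → ['One two three. Four five.', 'Six seven eight nine. Ten.']
```

One limitation to be aware of: a single sentence longer than `max_tokens` still becomes its own oversized chunk here. A fuller version would fall back to fixed-window splitting for that sentence.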

Do it in your browser

The Text Chunker does exactly this: choose a max token size and overlap, pick a sentence-, paragraph- or fixed-window strategy, and it produces chunks that each stay under your token budget, with exact counts (it runs the real GPT tokenizer). You can copy the chunks as a JSON array, ready to embed. Everything happens locally, so your documents are never uploaded.

A sensible default recipe

Get clean text first, with PDF to AI-ready text or Office to Text.
Chunk at ~300 tokens, ~50 tokens overlap, sentence-aware.
Check that a few chunks read sensibly on their own; nudge the size if they don't.
Embed, store, retrieve.
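The chunking step of this recipe can be sketched end to end in Python. This is not how the Text Chunker itself is implemented; it is one possible approach under stated assumptions: `count_tokens` is a whitespace stand-in for a real tokenizer, the sentence splitter is naive, and overlap is done by carrying whole trailing sentences (up to `overlap_tokens`) into the next chunk rather than cutting at an exact token count. Embedding and storage are left out because they depend on your provider.

```python
import json
import re

def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())

def split_sentences(text: str) -> list:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_text(text: str, max_tokens: int = 300, overlap_tokens: int = 50) -> list:
    """Sentence-aware chunks that repeat trailing sentences as overlap."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and sum(map(count_tokens, current)) + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried, carried_tokens = [], 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if carried_tokens + m > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += m
            # Drop the carry if it would not leave room for the new sentence.
            current = carried if carried_tokens + n <= max_tokens else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five. Six seven eight nine. Ten."
chunks = chunk_text(text, max_tokens=6, overlap_tokens=2)
print(json.dumps(chunks, indent=2))  # a JSON array of strings, ready to embed
```

With the tiny limits used here, "Four five." ends the first chunk and opens the second, which is the overlap at work. For real documents, keep the defaults of 300 and 50 and replace `count_tokens` with your model's tokenizer before trusting the sizes.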

Want to understand the token numbers behind all this? See how to count LLM tokens for free.