1000: The Magic Number in the World of LLMs
Posted on March 26, 2025 • 3 min read • 557 words

Working effectively with large language models (LLMs) involves much more than just prompting. When building a RAG (Retrieval-Augmented Generation) pipeline or integrating documents into an LLM-based system, text chunking becomes a critical performance lever.

The default value of 1000 tokens per chunk is not arbitrary: it is large enough to preserve the context of a complete idea or paragraph, yet small enough to keep retrieval precise and each embedding focused on a single topic.

In some cases, other sizes may be more appropriate: smaller chunks for precise question answering over dense text, larger chunks for summarization or for documents with long, self-contained sections.
It’s important to note that chunk sizes are not always strictly exact. Tools often prioritize semantic coherence and split at logical boundaries (paragraphs, sentences, words). As a result, some chunks may slightly exceed the set limit, e.g., reaching 1080 tokens, to avoid cutting off a sentence or idea mid-way. This flexibility leads to more natural and effective chunks for LLMs. These variations are controlled and rarely exceed a few dozen tokens.
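To make this concrete, here is a minimal sketch (not a real library API; the function name is hypothetical) of a splitter that respects paragraph boundaries, so a chunk may exceed the target size rather than cutting a paragraph mid-way:

```python
def split_at_boundaries(text, target_size=1000):
    """Split text at paragraph boundaries, tolerating slight overruns.

    A chunk only exceeds target_size when a single paragraph is itself
    longer than the target -- it is kept whole rather than cut mid-way.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= target_size or not current:
            current = candidate  # keep growing within the target
        else:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Production splitters apply the same idea recursively, falling back to sentence and word boundaries when paragraphs are too coarse.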
Models like GPT-4 Turbo or Gemini 1.5 can accept up to 1 million tokens in input. However, this doesn’t mean you should inject an entire document unfiltered.
Two key reasons: cost and latency grow with the number of input tokens, and answer quality drops when the relevant passages are diluted in a mass of irrelevant text.
Smart chunking helps reduce the load and improve precision.
It’s better to provide less information, but more relevant, by selecting the chunks that are most related to the user’s question.
This usually involves embedding the chunks, computing their similarity to the user's query, and keeping only the best matches (top-k) before injecting into the model's context window. Instead of using all results, we keep only the top k most similar chunks. This is known as top-k retrieval.
Example: k = 5 → Keep the 5 most relevant chunks.

When chunking is done without overlap (chunk_overlap = 0), there’s no direct link between chunks. This can be problematic if critical information sits on the boundary between two chunks.
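The selection step can be sketched in a few lines of plain Python. This is a toy illustration, not a vector store: the embeddings are hypothetical, and real pipelines use an approximate-nearest-neighbor index instead of a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Return the ids of the k chunks most similar to the query."""
    scored = sorted(chunk_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```

With k = 5, only those five chunks reach the prompt, no matter how many the search returned.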
In these cases, an overlap of 200 tokens is often recommended. It allows each chunk to carry the tail end of the previous one, so a sentence or idea that straddles a boundary still appears in full in at least one chunk.
This configuration is supported in modern preprocessing tools such as LangChain, LlamaIndex, and Haystack.
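The mechanics of overlap are easy to see in a sliding-window sketch. Here words stand in for tokens (a simplifying assumption; real tools count actual tokens), and the function name is illustrative:

```python
def chunk_with_overlap(tokens, chunk_size=1000, chunk_overlap=200):
    """Slide a window of chunk_size over the tokens, stepping by
    chunk_size - chunk_overlap so consecutive chunks share the overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

With the defaults, the last 200 tokens of one chunk reappear as the first 200 tokens of the next, which is exactly what keeps boundary-straddling information retrievable.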
| Element | Default Recommendation |
|---|---|
| Chunk size | ~1000 tokens (with controlled tolerance) |
| Pre-injection filtering | Yes, via vector or hybrid search |
| Overlap | 200 tokens |
| Contextual relevance | Prioritized over quantity |
| Context window usage | Only inject relevant parts |
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # splits into 1000-character chunks
    chunk_overlap=200,  # each chunk shares 200 characters with the previous one
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],  # logical separators for better semantic splits
    length_function=len,  # counts characters (use tiktoken for actual token count)
)
```

Best practices serve as a general framework. You’ll need to adapt these guidelines to your specific project and content type.
Chunk smart, filter aggressively, overlap carefully. These are simple yet powerful levers to boost the performance and reliability of your LLM-powered systems.