1000: The Magic Number in the World of LLMs

Posted on March 26, 2025 • 3 min read • 557 words
LLM   OpenAI   Helene  

Working effectively with large language models (LLMs) involves much more than just prompting. When building a RAG (Retrieval-Augmented Generation) pipeline or integrating documents into an LLM-based system, text chunking becomes a critical performance lever.

Photo by Helene Hemmerter

I. Chunk Size: Why ~1000 Tokens?  

The default value of 1000 tokens per chunk is not arbitrary:

  • A chunk of this size generally contains enough information to remain semantically coherent without being too large.
  • It remains compatible with the context window of modern LLMs (4k, 8k, 32k, or even 1M tokens).
  • It helps avoid diluting meaning or breaking semantic units.

In some cases, other sizes may be more appropriate:

  • Highly dense documents: smaller chunks (300–500 tokens).
  • Structured content (code, tables): larger chunks, provided logical blocks are preserved.

Note that chunk sizes are not strictly exact. Splitters typically prioritize semantic coherence and cut at logical boundaries (paragraphs, sentences, words), so a chunk may slightly exceed the configured limit, e.g., reaching 1080 tokens rather than cutting off a sentence or idea mid-way. These controlled variations rarely exceed a few dozen tokens and yield more natural, effective chunks for LLMs.
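To make this concrete, here is a minimal, dependency-free sketch of boundary-respecting chunking. Word counts stand in for tokens (a real pipeline would count tokens with a tokenizer such as tiktoken), and `chunk_by_sentence` and `target_words` are illustrative names, not part of any library:

```python
import re

def chunk_by_sentence(text: str, target_words: int = 50) -> list[str]:
    """Split text into ~target_words chunks, but only at sentence
    boundaries, so a chunk may slightly overshoot the target rather
    than cut a sentence in half."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= target_words:  # boundary reached: close the chunk
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:  # flush any trailing sentences
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five six seven. Eight nine."
for c in chunk_by_sentence(text, target_words=5):
    print(c)
```

With a target of 5 words, the first chunk ends up at 7 words because the splitter waits for the sentence boundary: exactly the controlled overshoot described above.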


II. Why Chunk Even With Long-Context LLMs?  

Models like GPT-4 Turbo or Gemini 1.5 can accept up to 1 million tokens in input. However, this doesn’t mean you should inject an entire document unfiltered.

Two key reasons:

  1. Cost: More tokens = higher usage cost (especially with commercial APIs).
  2. Quality: More noise = degraded output. This follows the classic “garbage in, garbage out” rule.

Smart chunking helps reduce the load and improve precision.
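As a back-of-the-envelope illustration of the cost argument (the per-token price below is hypothetical, not any provider's actual rate):

```python
# Hypothetical pricing for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # assumed: $0.01 per 1,000 input tokens

def input_cost(tokens: int) -> float:
    """Input cost in dollars for a single request."""
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

full_document = 200_000   # injecting an entire document, per query
top_5_chunks = 5 * 1000   # top-5 retrieval of ~1000-token chunks

print(f"full document: ${input_cost(full_document):.2f}")  # $2.00
print(f"top-5 chunks:  ${input_cost(top_5_chunks):.2f}")   # $0.05
```

At these assumed rates, retrieval cuts the input cost of every query by a factor of 40, and the gap compounds over thousands of queries.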


III. Contextual Relevance: Filter Before Injecting  

It’s better to provide less, but more relevant, information: select the chunks most closely related to the user’s question.

This usually involves:

  • Chunk vectorization (via embeddings).
  • A retrieval phase (e.g., FAISS, Weaviate, Qdrant): the query is embedded and compared against all chunk vectors to produce a similarity-ranked list.
  • A final top-k selection before injecting into the model’s context window: instead of using all results, keep only the k most similar chunks. Example: k = 5 → keep the 5 most relevant chunks.
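The retrieval and top-k steps above can be sketched with toy embedding vectors (the vectors and chunk names below are made up for illustration; a real system would use an embedding model and a vector store such as FAISS):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy chunk embeddings (illustrative values only).
chunks = {
    "chunk-A": [0.9, 0.1, 0.0],
    "chunk-B": [0.1, 0.9, 0.1],
    "chunk-C": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # the embedded user question

# Rank all chunks by similarity to the query, then keep the top k.
ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
top_k = ranked[:2]  # k = 2: keep the 2 most similar chunks
print(top_k)  # ['chunk-A', 'chunk-C']
```

Only `top_k` is injected into the prompt; chunk-B, which points in a different direction from the query, is filtered out.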

IV. Overlap: An Underestimated Lever  

When chunking is done without overlap (chunk_overlap = 0), there’s no direct link between chunks. This can be problematic if critical information sits on the boundary between two chunks.

In these cases, an overlap of 200 tokens is often recommended. It allows:

  • Preserving semantic continuity across chunks.
  • Avoiding loss of information at chunk boundaries during splitting.
  • Improving response quality by giving the model access to richer context.

This configuration is supported in modern preprocessing tools such as LangChain, LlamaIndex, and Haystack.
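The boundary problem is easy to see in a minimal sketch of fixed-size chunking with and without overlap, over a list of plain words standing in for tokens (the `chunk` helper is illustrative, not a library function):

```python
def chunk(tokens, size, overlap=0):
    """Split a token list into fixed-size windows; consecutive
    windows share `overlap` tokens."""
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):  # last window reached the end
            break
    return chunks

tokens = ["the", "server", "returns", "HTTP", "503", "when", "overloaded"]

print(chunk(tokens, size=4, overlap=0))
# [['the', 'server', 'returns', 'HTTP'], ['503', 'when', 'overloaded']]
print(chunk(tokens, size=4, overlap=2))
# [['the', 'server', 'returns', 'HTTP'],
#  ['returns', 'HTTP', '503', 'when'],
#  ['503', 'when', 'overloaded']]
```

Without overlap, the fact "HTTP 503" is split across two chunks and no single chunk contains it; with an overlap of 2, the middle chunk keeps it intact.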


V. Best Practice Recap  

Element                    Default Recommendation
Chunk size                 ~1000 tokens (with controlled tolerance)
Pre-injection filtering    Yes, via vector or hybrid search
Overlap                    200 tokens
Contextual relevance       Prioritized over quantity
Context window usage       Only inject relevant parts

VI. Example with LangChain  

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # target chunk size, measured in characters here
    chunk_overlap=200,  # each chunk shares 200 characters with the previous one
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],  # logical separators for better semantic splits
    length_function=len  # counts characters; pass a tokenizer-based function (e.g., tiktoken) to count actual tokens
)

chunks = text_splitter.split_text(document_text)  # document_text: the raw text to split

Conclusion  

Best practices serve as a general framework. You’ll need to adapt these guidelines to your specific project and content type.

Chunk smart, filter aggressively, overlap carefully. These are simple yet powerful levers to boost the performance and reliability of your LLM-powered systems.
