Chunkers - Definable AI

Chunking splits large documents into smaller pieces that fit within embedding model limits and improve retrieval precision. Smaller, focused chunks tend to match queries more accurately than entire documents.

Why Chunk?

Embedding models have token limits — most accept 512-8192 tokens
Smaller chunks are more precise — a paragraph about “authentication” matches better than a full page with mixed topics
Overlap preserves context — overlapping boundaries prevent information loss at chunk edges
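To make the overlap idea concrete, here is a minimal sketch in plain Python. It is not Definable's implementation: `sliding_chunks` is a hypothetical helper that cuts fixed-size character windows, and it only illustrates how `chunk_size` and `chunk_overlap` interact.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows; neighbours share chunk_overlap characters.

    Hypothetical helper for illustration only, not part of the definable package.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window repeats the last three characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.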

TextChunker

Splits text on a single separator (e.g., double newlines for paragraphs):

from definable.knowledge.chunker import TextChunker

chunker = TextChunker(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n\n",
)

chunk_size (int, default: 1000)
Target size for each chunk in characters.

chunk_overlap (int, default: 200)
Number of characters to overlap between adjacent chunks.

separator (str, default: "\n\n")
The separator to split on.

keep_separator (bool, default: False)
Whether to keep the separator in chunk content.
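As a rough illustration of single-separator chunking, the sketch below splits on one separator and greedily packs the pieces into chunks of at most `chunk_size` characters. `split_and_pack` is a hypothetical helper, not the TextChunker source; overlap and `keep_separator` are left out to keep it short.

```python
def split_and_pack(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split on one separator, then pack neighbouring pieces up to chunk_size.

    Hypothetical helper for illustration only; overlap is not modeled.
    """
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate      # piece fits, or it starts a new chunk
        else:
            chunks.append(current)   # close the full chunk
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta\n\ngamma\n\ndelta epsilon zeta\n\neta"
packed = split_and_pack(text, chunk_size=20)
print(packed)  # ['alpha beta\n\ngamma', 'delta epsilon zeta', 'eta']
```

In this sketch a single piece longer than `chunk_size` is emitted whole, because one separator gives no finer place to cut. How TextChunker itself treats that case is not stated here.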

RecursiveChunker

Splits text using a hierarchy of separators, falling back to finer-grained splits when chunks are too large. This is the default chunker and generally produces the best results.

from definable.knowledge.chunker import RecursiveChunker

chunker = RecursiveChunker(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

separators (List[str])
Ordered list of separators to try. The chunker uses the first separator that produces chunks within the size limit, then recurses with finer separators for any chunks that are still too large.

The default separator hierarchy:

"\n\n": paragraph breaks (preferred)
"\n": line breaks
". ": sentence endings
" ": word boundaries
"": character-level split (last resort)
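The fallback through the separator hierarchy can be sketched in plain Python as follows. This is an illustration of the general technique under simplifying assumptions (no overlap, separators are consumed by the split), not the RecursiveChunker source; `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator, recursing with finer ones when needed.

    Hypothetical helper for illustration only; overlap is not modeled.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


DEFAULT_SEPARATORS = ["\n\n", "\n", ". ", " ", ""]
text = (
    "First paragraph is short.\n\n"
    "Second paragraph is a good deal longer. It has two sentences."
)
result = recursive_split(text, chunk_size=40, separators=DEFAULT_SEPARATORS)
print(result)
# ['First paragraph is short.',
#  'Second paragraph is a good deal longer',
#  'It has two sentences.']
```

The first paragraph fits and stays whole. The second exceeds 40 characters, contains no line break, and is therefore split at the sentence boundary. The period after "longer" disappears because this sketch drops the separator it splits on.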

Using with Knowledge

Pass a chunker when creating a knowledge base:

from definable.embedder import OpenAIEmbedder
from definable.knowledge import Knowledge
from definable.knowledge.chunker import RecursiveChunker
from definable.vectordb import InMemoryVectorDB

knowledge = Knowledge(
    vector_db=InMemoryVectorDB(),
    embedder=OpenAIEmbedder(),
    chunker=RecursiveChunker(chunk_size=500, chunk_overlap=100),
)

Disabling Chunking

If your documents are already the right size, skip chunking:

knowledge.add("Short content", chunk=False)

Chunker Interface

Both chunkers implement:

Method	Description
`chunk(document) -> List[Document]`	Chunk a single document
`chunk_many(documents) -> List[Document]`	Chunk multiple documents
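The shape of that interface can be sketched with a typing Protocol. Everything below is an assumption made for illustration: the `Document` stand-in models only a `content` field, and `Chunker` and `FixedWidthChunker` are hypothetical names. Whether Definable accepts custom chunkers written this way is not confirmed by this page.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Document:
    """Stand-in for definable.knowledge.Document; only `content` is modeled."""
    content: str


class Chunker(Protocol):
    """The two methods listed in the table above, as a structural type."""

    def chunk(self, document: Document) -> List[Document]: ...

    def chunk_many(self, documents: List[Document]) -> List[Document]: ...


class FixedWidthChunker:
    """Hypothetical chunker that cuts content into fixed-width slices."""

    def __init__(self, chunk_size: int) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def chunk(self, document: Document) -> List[Document]:
        text = document.content
        return [
            Document(content=text[i:i + self.chunk_size])
            for i in range(0, len(text), self.chunk_size)
        ]

    def chunk_many(self, documents: List[Document]) -> List[Document]:
        # Chunk each document in turn and flatten the results into one list.
        pieces: List[Document] = []
        for document in documents:
            pieces.extend(self.chunk(document))
        return pieces


docs = [Document("abcdefgh"), Document("xyz")]
pieces = FixedWidthChunker(chunk_size=3).chunk_many(docs)
print([p.content for p in pieces])  # ['abc', 'def', 'gh', 'xyz']
```

Here `chunk_many` is simply `chunk` applied to each document with the results concatenated, which matches the one-line descriptions in the table.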

Example: Comparing Chunkers

from definable.knowledge import Document
from definable.knowledge.chunker import TextChunker, RecursiveChunker

doc = Document(content="Long document content here...")

# TextChunker — splits on paragraphs only
text_chunks = TextChunker(chunk_size=200).chunk(doc)
print(f"TextChunker: {len(text_chunks)} chunks")

# RecursiveChunker — smart multi-level splitting
recursive_chunks = RecursiveChunker(chunk_size=200).chunk(doc)
print(f"RecursiveChunker: {len(recursive_chunks)} chunks")

Start with the default RecursiveChunker settings. Adjust chunk_size based on your embedding model’s sweet spot — typically 500-1000 characters for most use cases.
