File readers extract content from files attached to agent messages before they are sent to the model. This lets agents process PDFs, Word documents, presentations, spreadsheets, images, and audio files without manual preprocessing.
Note: file readers (definable.readers) are distinct from Knowledge readers (definable.knowledge.readers). File readers extract content from message attachments for LLM processing; Knowledge readers convert raw sources into Document objects for the RAG pipeline.

Quick Example

from definable.agents import Agent
from definable.media import File
from definable.models import OpenAIChat

agent = Agent(
  model=OpenAIChat(id="gpt-4o"),
  instructions="You are a helpful assistant. Analyze any files the user provides.",
  readers=True,  # Auto-creates a registry with all available parsers
)

file = File(
  content=b"Q3 Revenue: $2.5M\nQ3 Expenses: $1.8M\nQ3 Net Income: $700K",
  filename="financials.txt",
  mime_type="text/plain",
)

output = agent.run("Summarize the key financial metrics.", files=[file])
print(output.content)

Architecture

The readers module uses a layered design:
  • Parsers — stateless format converters: bytes → List[ContentBlock]. Never do I/O.
  • ParserRegistry — priority-based mapping from format to parser.
  • BaseReader — orchestrator: File → bytes → detect format → parse → ReaderOutput.
  • Providers — AI-backed readers (e.g., MistralReader) that handle their own API I/O.
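The layering above can be pictured with a toy model. This is a minimal sketch and not the definable implementation: `ToyBlock`, `ToyRegistry`, `parse_text`, and `toy_read` are hypothetical stand-ins, priorities are left out, and format detection is reduced to a file-extension lookup.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Dict, List, Optional


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a content block."""
    content_type: str
    content: str


# A parser is modelled as a pure function: bytes in, blocks out, no I/O.
ToyParser = Callable[[bytes], List[ToyBlock]]


def parse_text(data: bytes) -> List[ToyBlock]:
    return [ToyBlock("text", data.decode("utf-8"))]


class ToyRegistry:
    """Maps a file extension to a parser (priorities omitted in this sketch)."""

    def __init__(self) -> None:
        self._by_ext: Dict[str, ToyParser] = {}

    def register(self, ext: str, parser: ToyParser) -> None:
        self._by_ext[ext.lower()] = parser

    def get_parser(self, filename: str) -> Optional[ToyParser]:
        return self._by_ext.get(PurePosixPath(filename).suffix.lower())


def toy_read(registry: ToyRegistry, filename: str, data: bytes) -> dict:
    """Orchestrator: detect the format, pick a parser, return an output record."""
    parser = registry.get_parser(filename)
    if parser is None:
        return {"filename": filename, "blocks": [], "error": "no parser for this format"}
    return {"filename": filename, "blocks": parser(data), "error": None}


registry = ToyRegistry()
registry.register(".txt", parse_text)
ok = toy_read(registry, "notes.txt", b"hello")
missing = toy_read(registry, "scan.xyz", b"\x00")
print(ok["blocks"][0].content, "|", missing["error"])
```

Keeping parsers free of I/O means they can be tested with plain byte strings, while the orchestrator alone decides where the bytes come from.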

Built-in Parsers

Parser       Formats                                            Dependency
TextParser   .txt, .md, .csv, .json, .py, .js, +40 more         None
PDFParser    .pdf                                               pypdf>=4.0.0
DocxParser   .docx                                              python-docx>=1.0.0
PptxParser   .pptx                                              python-pptx>=1.0.0
XlsxParser   .xlsx                                              openpyxl>=3.1.0
OdsParser    .ods                                               odfpy>=1.4.0
RtfParser    .rtf                                               striprtf>=0.0.26
HTMLParser   .html, .htm                                        None
ImageParser  .png, .jpg, .gif, .bmp, .tiff, .webp, .svg, +more  None
AudioParser  .mp3, .wav, .ogg, .flac, .m4a, .webm               None
Install all parser dependencies at once:
pip install 'definable[readers]'
Parsers with missing optional dependencies are silently skipped. Install only what you need.
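Because skipped parsers produce no warning, it can help to check which optional dependencies are importable before relying on a format. The sketch below does not call into definable at all; it only uses the standard library to test for the import names of the packages listed in the table (for example, python-docx installs the module `docx` and odfpy installs `odf`). The parser-to-module mapping is taken from the table and is otherwise an assumption.

```python
from importlib.util import find_spec
from typing import Dict

# Parser name -> import name of its optional dependency.
OPTIONAL_DEPS = {
    "PDFParser": "pypdf",
    "DocxParser": "docx",      # installed by python-docx
    "PptxParser": "pptx",      # installed by python-pptx
    "XlsxParser": "openpyxl",
    "OdsParser": "odf",        # installed by odfpy
    "RtfParser": "striprtf",
}


def dependency_status() -> Dict[str, bool]:
    """Report, per parser, whether its optional dependency can be imported."""
    return {
        parser: find_spec(module) is not None
        for parser, module in OPTIONAL_DEPS.items()
    }


status = dependency_status()
for parser, available in status.items():
    print(f"{parser}: {'available' if available else 'missing dependency'}")
```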

ContentBlock

Content extraction produces ContentBlock objects — the multimodal output unit:
Field         Type         Description
content_type  str          "text", "image", "table", "audio", or "raw"
content       str | bytes  Extracted content
mime_type     str | None   MIME type of the content
metadata      dict         Parser-specific metadata
page_number   int | None   Page number (for paginated formats)
Methods:
Method                Description
as_text()             String representation of the content
as_message_content()  OpenAI-format content part for message construction
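To make the shape of a block concrete, here is a hypothetical dataclass that mirrors the fields and methods listed above. `SketchBlock` is not the library's `ContentBlock`; in particular, how the real `as_message_content()` encodes images or binary data is an assumption here. The dictionaries follow the OpenAI chat content-part layout (`text` and `image_url` parts).

```python
import base64
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass
class SketchBlock:
    """Hypothetical mirror of the ContentBlock fields, for illustration only."""
    content_type: str
    content: Union[str, bytes]
    mime_type: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    page_number: Optional[int] = None

    def as_text(self) -> str:
        # Assumption: binary content is summarised rather than decoded.
        if isinstance(self.content, bytes):
            return f"[{self.content_type}: {len(self.content)} bytes]"
        return self.content

    def as_message_content(self) -> Dict[str, Any]:
        # Assumption: image bytes become a base64 data URL; everything else is text.
        if self.content_type == "image" and isinstance(self.content, bytes):
            encoded = base64.b64encode(self.content).decode("ascii")
            mime = self.mime_type or "application/octet-stream"
            return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}
        return {"type": "text", "text": self.as_text()}


text_part = SketchBlock("text", "Q3 Revenue: $2.5M", page_number=1).as_message_content()
image_part = SketchBlock("image", b"\x89PNG", mime_type="image/png").as_message_content()
print(text_part)
print(image_part)
```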

ReaderOutput

Every file read returns a ReaderOutput:
Field       Type                Description
filename    str                 Name of the file
blocks      List[ContentBlock]  Extracted content blocks
mime_type   str | None          Detected MIME type
page_count  int | None          Number of pages (PDF, DOCX, PPTX)
word_count  int | None          Word count of extracted text
truncated   bool                Whether content was truncated
error       str | None          Error message if reading failed
metadata    dict                Additional metadata
Methods:
Method                     Description
as_text(separator="\n\n")  Concatenated text from all blocks
as_messages()              OpenAI-format message content list
content                    Property; backwards-compatible alias for as_text()
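Since a failed read is reported through the `error` field, callers will usually want to check it before using the text. The helper below is a sketch under that assumption. It works on any object exposing the `blocks`, `truncated`, and `error` fields from the table; the `SimpleNamespace` objects are stand-ins, not real `ReaderOutput` instances, and the truncation marker is purely illustrative.

```python
from types import SimpleNamespace
from typing import Optional


def usable_text(result, separator: str = "\n\n") -> Optional[str]:
    """Return the joined block text, or None if the read reported an error."""
    if result.error is not None:
        return None
    text = separator.join(block.as_text() for block in result.blocks)
    if result.truncated:
        text += separator + "[content truncated]"  # illustrative marker only
    return text


def make_block(text: str) -> SimpleNamespace:
    """Stand-in block exposing only as_text()."""
    return SimpleNamespace(as_text=lambda: text)


good = SimpleNamespace(
    filename="report.pdf",
    blocks=[make_block("Page one"), make_block("Page two")],
    truncated=False,
    error=None,
)
failed = SimpleNamespace(filename="broken.pdf", blocks=[], truncated=False, error="could not parse")

print(repr(usable_text(good)))
print(repr(usable_text(failed)))
```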

BaseReader

The main orchestrator that resolves files to parsed content:
from definable.readers import BaseReader

reader = BaseReader()
result = reader.read(file)
print(result.content)

Constructor

  • config (ReaderConfig): Reader configuration (file size limits, encoding, timeout).
  • registry (ParserRegistry): Custom parser registry. When None, a default registry with all available parsers is created.

Methods

Method                          Description
register(parser, priority=100)  Register a parser (returns self for chaining)
get_parser(file)                Get the parser that handles a file, or None
read(file)                      Read a file synchronously
aread(file)                     Read a file asynchronously
aread_all(files)                Read multiple files concurrently
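Concurrent reading of several files is commonly built on `asyncio.gather`. Whether `aread_all` is implemented that way is not stated here, so treat the following as a sketch of the pattern only. `StubReader` is a hypothetical stand-in whose `aread` simulates latency instead of parsing anything.

```python
import asyncio
from typing import List


class StubReader:
    """Hypothetical stand-in with an async read method, for illustration only."""

    async def aread(self, name: str) -> str:
        await asyncio.sleep(0.01)  # simulate I/O latency
        return f"parsed:{name}"


async def read_all(reader: StubReader, names: List[str]) -> List[str]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(reader.aread(name) for name in names))


results = asyncio.run(read_all(StubReader(), ["a.txt", "b.pdf", "c.docx"]))
print(results)
```

With this pattern the total wait is roughly that of the slowest file rather than the sum of all reads.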

Agent Integration

Three ways to enable file readers on an agent:
# 1. Auto-registry with all available parsers
agent = Agent(model=model, readers=True)

# 2. Custom reader instance
from definable.readers import BaseReader, ReaderConfig
reader = BaseReader(config=ReaderConfig(max_file_size=10_000_000))  # size limit in bytes
agent = Agent(model=model, readers=reader)

# 3. Single parser (auto-wrapped in BaseReader)
from definable.readers.parsers.pdf import PDFParser  # any BaseParser subclass works
agent = Agent(model=model, readers=PDFParser())
When the agent receives files via run(..., files=[...]), it automatically extracts content from each file and injects it into the prompt before calling the model.

ReaderConfig

Configure reader behavior:
from definable.readers import ReaderConfig

config = ReaderConfig(
  max_file_size=None,          # Max file size in bytes (None = unlimited)
  max_content_length=None,     # Max extracted content length (None = unlimited)
  encoding="utf-8",            # Text encoding
  timeout=30.0,                # Read timeout in seconds
)
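The two limits act at different stages: `max_file_size` applies to the raw bytes, `max_content_length` to the extracted content. The function below sketches one plausible way such limits could be applied. It is not the library's code: `apply_limits` is a hypothetical helper, and it assumes that oversized files yield an error message, that content length is counted in characters, and that overlong content is cut rather than rejected.

```python
from typing import Optional, Tuple


def apply_limits(
    data: bytes,
    *,
    max_file_size: Optional[int] = None,
    max_content_length: Optional[int] = None,
    encoding: str = "utf-8",
) -> Tuple[Optional[str], bool, Optional[str]]:
    """Return (text, truncated, error) after applying both limits."""
    # Stage 1: size check on the raw bytes, before any parsing.
    if max_file_size is not None and len(data) > max_file_size:
        return None, False, f"file is {len(data)} bytes, limit is {max_file_size}"
    text = data.decode(encoding, errors="replace")
    # Stage 2: length check on the extracted text (characters, by assumption).
    if max_content_length is not None and len(text) > max_content_length:
        return text[:max_content_length], True, None
    return text, False, None


print(apply_limits(b"hello world", max_content_length=5))
print(apply_limits(b"hello", max_file_size=3))
print(apply_limits(b"hello"))
```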

Creating a Custom Parser

Subclass BaseParser and implement three methods:
from typing import List, Set
from definable.readers.parsers.base_parser import BaseParser
from definable.readers.models import ContentBlock, ReaderConfig

class MarkdownParser(BaseParser):
  def supported_mime_types(self) -> List[str]:
    return ["text/markdown"]

  def supported_extensions(self) -> Set[str]:
    return {".md"}

  def parse(
    self,
    data: bytes,
    *,
    mime_type: str | None = None,
    config: ReaderConfig | None = None,
  ) -> List[ContentBlock]:
    text = data.decode(config.encoding if config else "utf-8")
    return [ContentBlock(content_type="text", content=text)]
Register it with a reader:
from definable.readers import BaseReader
from definable.readers.registry import ParserRegistry

registry = ParserRegistry()
registry.register(MarkdownParser(), priority=200)  # Higher priority wins
reader = BaseReader(registry=registry)

MistralReader

AI-backed OCR provider using the Mistral OCR API. Handles PDFs, DOCX, PPTX, and images with high-quality extraction.
from definable.readers.providers.mistral import MistralReader

reader = MistralReader(api_key="your-key")
agent = Agent(model=model, readers=reader)
  • api_key (str): Mistral API key. Falls back to the MISTRAL_API_KEY environment variable.
  • model (str, default "mistral-ocr-latest"): OCR model to use.
  • include_image_base64 (bool, default False): Include base64-encoded images in output blocks.
  • local_fallback (bool, default True): Fall back to local parsers for formats Mistral doesn’t support.
Native formats: .pdf, .docx, .pptx, .png, .jpg, .jpeg, .avif
Requires mistralai: pip install 'definable[mistral-ocr]'
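The interaction between the native formats and `local_fallback` amounts to a routing decision per file. The sketch below illustrates that decision with a hypothetical `choose_backend` helper; the extension set is copied from the list above, while the routing by file extension and the "unsupported" outcome when fallback is disabled are assumptions.

```python
from pathlib import PurePosixPath

# Native formats as listed for MistralReader.
NATIVE_EXTENSIONS = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".jpeg", ".avif"}


def choose_backend(filename: str, local_fallback: bool = True) -> str:
    """Decide which backend would handle a file in this illustrative model."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext in NATIVE_EXTENSIONS:
        return "mistral"
    # Assumption: without fallback, non-native formats are simply not handled.
    return "local" if local_fallback else "unsupported"


print(choose_backend("scan.PDF"))
print(choose_backend("data.xlsx"))
print(choose_backend("data.xlsx", local_fallback=False))
```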

Parser Options

PDFParser

from definable.readers.parsers.pdf import PDFParser
parser = PDFParser(page_separator="\n\n")

DocxParser

from definable.readers.parsers.docx import DocxParser
parser = DocxParser(include_tables=True)

XlsxParser

from definable.readers.parsers.xlsx import XlsxParser
parser = XlsxParser(max_rows=1000)

OdsParser

from definable.readers.parsers.ods import OdsParser
parser = OdsParser(max_rows=1000)

ParserRegistry

The registry maps formats to parsers with priority-based dispatch:
from definable.readers.registry import ParserRegistry

registry = ParserRegistry(include_defaults=True)  # Registers all available parsers
registry.register(MyParser(), priority=200)        # MyParser: any BaseParser subclass; higher priority wins

parser = registry.get_parser(mime_type="application/pdf")
Built-in parsers are registered at priority 0. User-registered parsers default to priority 100. Higher priority wins when multiple parsers handle the same format.
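The dispatch rule can be sketched with a small stand-alone class. `PriorityRegistry` is hypothetical and stores parser names as plain strings instead of parser objects. The defaults (0 for built-ins, 100 for user registrations) follow the text above; how the real registry breaks ties between equal priorities is not stated, so the tie behaviour here is an assumption.

```python
from typing import Dict, List, Optional, Tuple


class PriorityRegistry:
    """Hypothetical priority-based lookup keyed by MIME type."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Tuple[int, str]]] = {}

    def register(self, mime_type: str, parser_name: str, priority: int = 100) -> "PriorityRegistry":
        self._entries.setdefault(mime_type, []).append((priority, parser_name))
        return self  # allows chaining

    def get_parser(self, mime_type: str) -> Optional[str]:
        candidates = self._entries.get(mime_type)
        if not candidates:
            return None
        # max() keeps the first of equally ranked entries, so in this sketch
        # ties go to the earliest registration.
        return max(candidates, key=lambda entry: entry[0])[1]


registry = PriorityRegistry()
registry.register("application/pdf", "builtin-pdf", priority=0)  # built-in level
registry.register("application/pdf", "custom-pdf")               # default priority 100

print(registry.get_parser("application/pdf"))
print(registry.get_parser("text/x-unknown"))
```

Because user registrations default to a higher priority than built-ins, a custom parser overrides the built-in one for the same format without any explicit priority argument.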

ReadersConfig

Configure the readers integration on the agent via AgentConfig:
from definable.agents import Agent, AgentConfig, ReadersConfig

agent = Agent(
  model=model,
  config=AgentConfig(
    readers=ReadersConfig(
      enabled=True,
      registry=None,                     # Auto-create if None
      max_total_content_length=None,     # Limit total injected content
      context_format="xml",              # "xml" or "markdown"
    ),
  ),
)
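The `context_format` option controls how extracted content is wrapped before it is injected into the prompt. The exact layout definable produces is not documented here, so the function below is only an illustration of what an XML-style and a Markdown-style wrapper might look like. `format_file_context`, the `<file>` tag, and the heading level are all assumptions.

```python
from xml.sax.saxutils import escape, quoteattr


def format_file_context(filename: str, text: str, context_format: str = "xml") -> str:
    """Wrap extracted text for prompt injection (illustrative layouts only)."""
    if context_format == "xml":
        # Escape the text so markup characters in the file cannot break the wrapper.
        return f"<file name={quoteattr(filename)}>\n{escape(text)}\n</file>"
    if context_format == "markdown":
        return f"### {filename}\n\n{text}"
    raise ValueError(f"unknown context_format: {context_format!r}")


print(format_file_context("financials.txt", "Revenue < Forecast"))
print(format_file_context("financials.txt", "Revenue < Forecast", "markdown"))
```

XML-style delimiters give the model an unambiguous boundary between files, while Markdown keeps the prompt shorter and easier to read in logs.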

Standalone Usage

Use BaseReader without an agent for file processing pipelines:
from definable.media import File
from definable.readers import BaseReader

reader = BaseReader()

files = [
  File(content=b"Hello, world!", filename="greeting.txt", mime_type="text/plain"),
  File(content=b'{"name": "Alice"}', filename="user.json", mime_type="application/json"),
]

for file in files:
  result = reader.read(file)
  print(f"{result.filename}: {result.content[:100]}")

Stream Events

When using streaming, file reads emit events:
Event              Key Fields                                         Description
FileReadStarted    file_count                                         File reading began
FileReadCompleted  file_count, files_read, files_failed, duration_ms  File reading finished