Document represents a piece of text content along with its metadata, source information, and optional embedding vector. Documents are the fundamental unit that flows through the entire RAG pipeline.
Creating Documents
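A minimal sketch of creating documents. The `Document` dataclass below is a stand-in written for illustration, not the library's actual class: its field names follow the Document Fields table, while the defaults, the optional typing, and the example path are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
from uuid import uuid4


@dataclass
class Document:
    """Illustrative stand-in; the real class may differ in defaults and behavior."""

    content: str
    id: str = field(default_factory=lambda: str(uuid4()))  # auto-generated UUID
    name: Optional[str] = None
    meta_data: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None  # filled in later by an embedder
    source: Optional[str] = None
    source_type: Optional[str] = None
    chunk_index: Optional[int] = None
    chunk_total: Optional[int] = None
    reranking_score: Optional[float] = None


# Only the text is required; every other field falls back to its default.
doc = Document(content="Retrieval-augmented generation pairs search with an LLM.")

# Provenance and metadata can be supplied at construction time.
report = Document(
    content="Quarterly revenue grew year over year.",
    name="q3-report",
    source="reports/q3.pdf",  # hypothetical path
    source_type="pdf",
    meta_data={"department": "finance"},
)
```

In this sketch each document gets its own identifier, and `embedding` stays `None` until an embedder sets it.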
Document Fields
| Field | Type | Description |
|---|---|---|
| content | str | The text content |
| id | str | Unique identifier (auto-generated UUID) |
| name | str | Human-readable name |
| meta_data | dict | Arbitrary metadata for filtering and display |
| embedding | List[float] | Vector embedding (set by embedder) |
| source | str | Where the document came from (file path, URL) |
| source_type | str | Type of source ("text", "pdf", "url") |
| chunk_index | int | Index within a chunked document |
| chunk_total | int | Total number of chunks from the source |
| reranking_score | float | Relevance score from reranking |
Generating Embeddings
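As a rough illustration of manual embedding, the sketch below computes a vector from a document's content and stores it in the `embedding` field. Both `ToyEmbedder` and its `get_embedding` method are hypothetical names, and the hash-based vector is only a placeholder for a real model; the library's embedder interface may look different.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """Stand-in with only the fields this example needs (not the library class)."""

    content: str
    embedding: Optional[List[float]] = None


class ToyEmbedder:
    """Hypothetical embedder: hashes text into a small fixed-size vector."""

    def __init__(self, dimensions: int = 8) -> None:
        self.dimensions = dimensions

    def get_embedding(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dimensions` bytes of the digest into [0, 1].
        return [byte / 255 for byte in digest[: self.dimensions]]


def embed_document(doc: Document, embedder: ToyEmbedder) -> Document:
    """Compute the vector for doc.content and store it on the document."""
    doc.embedding = embedder.get_embedding(doc.content)
    return doc


doc = embed_document(Document(content="Vector search finds similar text."), ToyEmbedder())
```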
Embed a document manually using any embedder.
When using the Knowledge class, embedding is handled automatically during add(). You only need to embed manually if you are working with documents directly.
Serialization
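One way a dictionary round trip could work is sketched below. The method names `to_dict` and `from_dict` are assumptions chosen for illustration and are not confirmed by this page; the stand-in class keeps only a few of the documented fields.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Stand-in with a subset of the documented fields (not the library class)."""

    content: str
    name: Optional[str] = None
    meta_data: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None

    def to_dict(self) -> Dict[str, Any]:
        # asdict copies nested containers, so the result is safe to mutate.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Document":
        return cls(**data)


original = Document(content="hello", name="greeting", meta_data={"lang": "en"})

# Dictionaries of plain values can be passed through JSON for storage or transport.
payload = json.dumps(original.to_dict())
restored = Document.from_dict(json.loads(payload))
```

The round trip in this sketch is lossless because every field holds a JSON-compatible value; metadata containing other types would need custom encoding.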
Convert documents to and from dictionaries for storage or transmission.
Metadata and Filtering
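To make metadata filtering concrete, here is a small sketch that keeps only documents whose `meta_data` matches every key in a filter dictionary. The helper `filter_by_metadata` and its exact-match semantics are assumptions for illustration; the library's search filtering may support other operators.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in with only the fields this example needs (not the library class)."""

    content: str
    meta_data: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[Document], filters: Dict[str, Any]) -> List[Document]:
    """Return documents whose metadata contains every filter key with an equal value."""
    return [
        doc
        for doc in docs
        if all(key in doc.meta_data and doc.meta_data[key] == value
               for key, value in filters.items())
    ]


docs = [
    Document("Budget summary", {"department": "finance", "year": 2024}),
    Document("Hiring plan", {"department": "hr", "year": 2024}),
    Document("Old budget", {"department": "finance", "year": 2022}),
]

matches = filter_by_metadata(docs, {"department": "finance", "year": 2024})
```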
Attach metadata to documents for filtering during search.
Chunked Documents
When a large document is chunked, each chunk is a separate Document whose chunk_index and chunk_total fields track its position within the source.
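A minimal sketch of chunk tracking, assuming fixed-size character chunks and zero-based indices. The `chunk_document` helper is hypothetical, and the library's chunkers may split text differently or number chunks from another base.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """Stand-in with only the fields this example needs (not the library class)."""

    content: str
    source: Optional[str] = None
    chunk_index: Optional[int] = None
    chunk_total: Optional[int] = None


def chunk_document(doc: Document, chunk_size: int) -> List[Document]:
    """Split doc.content into pieces of at most chunk_size characters."""
    pieces = [
        doc.content[start : start + chunk_size]
        for start in range(0, len(doc.content), chunk_size)
    ]
    # Every chunk keeps the parent's source and records its own position.
    return [
        Document(
            content=piece,
            source=doc.source,
            chunk_index=index,  # zero-based in this sketch
            chunk_total=len(pieces),
        )
        for index, piece in enumerate(pieces)
    ]


parent = Document(content="abcdefghij", source="notes.txt")  # hypothetical source
chunks = chunk_document(parent, chunk_size=4)
```

Here the ten-character parent yields three chunks, each carrying the same `source` and `chunk_total` so the original order can be reconstructed later.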