Streaming lets your application display tokens as they are generated instead of waiting for the full response. This reduces perceived latency for end users, because the first words appear while the rest of the response is still being generated.
Basic Streaming
from definable.model.openai import OpenAIChat
from definable.model.message import Message
model = OpenAIChat(id="gpt-4o")
for chunk in model.invoke_stream(
    messages=[Message(role="user", content="Explain how DNS works.")],
    assistant_message=Message(role="assistant", content=""),
):
    if chunk.content:
        print(chunk.content, end="", flush=True)
Each chunk is a ModelResponse object. During streaming, most chunks contain a small piece of the content. The final chunk includes usage metrics.
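The chunk-by-chunk behavior described above can be illustrated without a network call. The following is a minimal sketch, not the library's implementation: `FakeChunk` and `fake_stream` are stand-ins invented for this example, and the `usage` field name is an assumption (check the `ModelResponse` reference for the real attribute that carries usage metrics).

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class FakeChunk:
    """Stand-in for a streamed ModelResponse; not the real Definable class."""
    content: Optional[str] = None
    usage: Optional[dict] = None  # hypothetical field name for usage metrics


def fake_stream() -> Iterator[FakeChunk]:
    """Simulate a stream: content pieces first, then a final chunk with usage."""
    yield FakeChunk(content="DNS maps ")
    yield FakeChunk(content="names to addresses.")
    yield FakeChunk(usage={"input_tokens": 6, "output_tokens": 5})


def consume(stream: Iterator[FakeChunk]) -> Tuple[str, Optional[dict]]:
    """Join the content pieces and keep whatever usage the stream reports."""
    parts: List[str] = []
    usage = None
    for chunk in stream:
        if chunk.content:
            parts.append(chunk.content)
        if chunk.usage is not None:
            usage = chunk.usage  # expected only on the final chunk
    return "".join(parts), usage


text, usage = consume(fake_stream())
print(text)
print(usage)
```

Checking every chunk for usage, rather than only the last one, keeps the loop correct even if a provider reports metrics earlier than expected.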
Streaming with Tool Calls
When the model decides to call a tool during streaming, you’ll receive chunks with tool_calls instead of content:
from definable.model.message import Message
# `model` is the OpenAIChat instance created in the Basic Streaming example.
# `get_weather` is a tool assumed to be defined elsewhere in your application.
for chunk in model.invoke_stream(
    messages=[Message(role="user", content="What's the weather?")],
    assistant_message=Message(role="assistant", content=""),
    tools=[get_weather],
):
    if chunk.content:
        print(chunk.content, end="", flush=True)
    if chunk.tool_calls:
        print(f"\nTool call: {chunk.tool_calls}")
When using agents, tool execution during streaming is handled automatically. You receive high-level events like ToolCallStartedEvent and ToolCallCompletedEvent instead of raw chunks. See Running Agents for details.
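Outside an agent, the calling code presumably has to act on streamed tool calls itself. The sketch below shows one way to do that with a simple name-to-function registry. Everything in it is illustrative: `FakeChunk`, `fake_stream`, the `TOOLS` registry, and the dictionary shape of each tool call (`name` plus `arguments`) are assumptions for this example, not Definable's actual data model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple


@dataclass
class FakeChunk:
    """Stand-in for a streamed chunk; not the real Definable class."""
    content: Optional[str] = None
    tool_calls: Optional[List[Dict[str, Any]]] = None  # hypothetical shape


def get_weather(city: str) -> str:
    """Hypothetical tool that returns a canned answer."""
    return f"Sunny in {city}"


# Registry mapping tool names to callables (an assumption for this sketch).
TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def fake_stream() -> Iterator[FakeChunk]:
    """Simulate a stream that emits text, then a tool call."""
    yield FakeChunk(content="Let me check. ")
    yield FakeChunk(
        tool_calls=[{"name": "get_weather", "arguments": {"city": "Oslo"}}]
    )


def run_tool_calls(stream: Iterator[FakeChunk]) -> Tuple[str, List[str]]:
    """Collect streamed text and execute each tool call as it arrives."""
    text_parts: List[str] = []
    results: List[str] = []
    for chunk in stream:
        if chunk.content:
            text_parts.append(chunk.content)
        for call in chunk.tool_calls or []:
            tool = TOOLS[call["name"]]
            results.append(tool(**call["arguments"]))
    return "".join(text_parts), results


text, results = run_tool_calls(fake_stream())
print(text)
print(results)
```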
Streaming with Reasoning
Models that support reasoning (like DeepSeek Reasoner or OpenAI o1) emit reasoning content before the final answer:
from definable.model import DeepSeekChat
from definable.model.message import Message
model = DeepSeekChat(id="deepseek-reasoner")
for chunk in model.invoke_stream(
    messages=[Message(role="user", content="What is 127 * 843?")],
    assistant_message=Message(role="assistant", content=""),
):
    if chunk.reasoning_content:
        print(f"[thinking] {chunk.reasoning_content}", end="", flush=True)
    if chunk.content:
        print(chunk.content, end="", flush=True)
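Because reasoning arrives in many small chunks, prefixing each one with a label repeats the label throughout the output. A small state variable can print each label once, at the point where the stream switches from reasoning to the answer. This is a sketch under stated assumptions: `FakeChunk` and `fake_stream` are stand-ins that mirror only the `reasoning_content` and `content` fields mentioned on this page.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class FakeChunk:
    """Stand-in for a streamed chunk; not the real Definable class."""
    reasoning_content: Optional[str] = None
    content: Optional[str] = None


def fake_stream() -> Iterator[FakeChunk]:
    """Simulate a reasoning model: reasoning chunks first, then the answer."""
    yield FakeChunk(reasoning_content="127 * 800 = 101600; ")
    yield FakeChunk(reasoning_content="127 * 43 = 5461.")
    yield FakeChunk(content="127 * 843 = ")
    yield FakeChunk(content="107061")


def render(stream: Iterator[FakeChunk]) -> str:
    """Emit each phase label once, when the stream enters that phase."""
    out: List[str] = []
    phase: Optional[str] = None
    for chunk in stream:
        if chunk.reasoning_content:
            if phase != "thinking":
                out.append("[thinking] ")
                phase = "thinking"
            out.append(chunk.reasoning_content)
        if chunk.content:
            if phase != "answer":
                # Start the answer on a new line if reasoning came first.
                out.append("\n[answer] " if phase else "[answer] ")
                phase = "answer"
            out.append(chunk.content)
    return "".join(out)


rendered = render(fake_stream())
print(rendered)
```

In a live stream, the `out.append` calls would be replaced by `print(..., end="", flush=True)` so that text is displayed as it arrives.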
Collecting the Full Response
To stream output to the user while also capturing the complete response:
from definable.model.message import Message
full_content = []
for chunk in model.invoke_stream(
    messages=[Message(role="user", content="Write a poem.")],
    assistant_message=Message(role="assistant", content=""),
):
    if chunk.content:
        full_content.append(chunk.content)
        print(chunk.content, end="", flush=True)
complete_text = "".join(full_content)
Streaming vs Non-Streaming
| | invoke() / ainvoke() | invoke_stream() / ainvoke_stream() |
|---|---|---|
| Latency | Waits for full response | First token arrives as soon as it is generated |
| Return type | Single ModelResponse | Iterator of ModelResponse chunks |
| Usage metrics | Available on response | Available on final chunk |
| Best for | Background processing, short responses | User-facing output, long responses |
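The table lists ainvoke_stream() alongside invoke_stream(). Assuming the async variant yields chunks through an async iterator, the consuming loop would use `async for` instead of `for`. The sketch below shows that pattern with a stand-in generator; `FakeChunk` and `fake_ainvoke_stream` are invented for this example, and the real method's signature should be taken from the API reference.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List, Optional


@dataclass
class FakeChunk:
    """Stand-in for a streamed chunk; not the real Definable class."""
    content: Optional[str] = None


async def fake_ainvoke_stream() -> AsyncIterator[FakeChunk]:
    """Simulate an async stream of content chunks."""
    for piece in ("Roses ", "are ", "red."):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield FakeChunk(content=piece)


async def collect(stream: AsyncIterator[FakeChunk]) -> str:
    """Print chunks as they arrive and return the joined text."""
    parts: List[str] = []
    async for chunk in stream:
        if chunk.content:
            parts.append(chunk.content)
            print(chunk.content, end="", flush=True)
    print()
    return "".join(parts)


complete_text = asyncio.run(collect(fake_ainvoke_stream()))
```

The async form lets one event loop serve many concurrent streams, which matters for web backends that relay tokens to several users at once.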