Streaming lets your application display tokens as they are generated instead of waiting for the full response. This dramatically reduces perceived latency for end users.

Basic Streaming

from definable.models import OpenAIChat

model = OpenAIChat(id="gpt-4o")

for chunk in model.invoke_stream(
    messages=[{"role": "user", "content": "Explain how DNS works."}]
):
    if chunk.content:
        print(chunk.content, end="", flush=True)

Each chunk is a ModelResponse object. During streaming, most chunks contain a small piece of the content. The final chunk includes usage metrics.
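
If you need those metrics, keep a reference to the last chunk you receive. A minimal sketch; the response_usage attribute name is an assumption, so check the ModelResponse fields in your installed version for the real name:

last_chunk = None

for chunk in model.invoke_stream(
    messages=[{"role": "user", "content": "Explain how DNS works."}]
):
    if chunk.content:
        print(chunk.content, end="", flush=True)
    last_chunk = chunk  # the final chunk is the one carrying usage metrics

# "response_usage" is illustrative only; inspect ModelResponse for the actual field.
if last_chunk is not None:
    print("\nUsage:", getattr(last_chunk, "response_usage", None))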

Streaming with Tools

When the model decides to call a tool during streaming, you’ll receive chunks with tool_calls instead of content:

for chunk in model.invoke_stream(
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[get_weather],
):
    if chunk.content:
        print(chunk.content, end="", flush=True)
    if chunk.tool_calls:
        print(f"\nTool call: {chunk.tool_calls}")
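
The get_weather tool above is assumed to be a plain Python function passed in the tools list. A minimal sketch of such a tool; the exact signature and any schema or decorator requirements depend on the library, so treat the details as assumptions:

def get_weather(city: str = "London") -> str:
    """Return a short weather summary for the given city."""
    # Stubbed for illustration; a real tool would call a weather API here.
    return f"It is sunny and 22°C in {city}."
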
When using agents, tool execution during streaming is handled automatically. You receive high-level events like ToolCallStartedEvent and ToolCallCompletedEvent instead of raw chunks. See Running Agents for details.
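
At the agent level, consuming those events looks roughly like the sketch below. The Agent class, the import paths, and the stream=True call are assumptions for illustration; only the event names come from the note above, and Running Agents documents the actual API:

# Hypothetical sketch: Agent, definable.agents, and run(..., stream=True) are assumed names.
from definable.agents import Agent, ToolCallStartedEvent, ToolCallCompletedEvent

agent = Agent(model=model, tools=[get_weather])

for event in agent.run("What's the weather in Paris?", stream=True):
    if isinstance(event, ToolCallStartedEvent):
        print(f"\n[tool started] {event}")
    elif isinstance(event, ToolCallCompletedEvent):
        print(f"[tool completed] {event}")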

Streaming with Reasoning

Models that support reasoning (like DeepSeek Reasoner or OpenAI o1) emit reasoning content before the final answer:

from definable.models import DeepSeekChat

model = DeepSeekChat(id="deepseek-reasoner")

for chunk in model.invoke_stream(
    messages=[{"role": "user", "content": "What is 127 * 843?"}]
):
    if chunk.reasoning_content:
        # Note: this prefixes every reasoning fragment, which is fine for a demo;
        # a real UI would typically label the reasoning stream just once.
        print(f"[thinking] {chunk.reasoning_content}", end="", flush=True)
    if chunk.content:
        print(chunk.content, end="", flush=True)

Collecting the Full Response

To stream output to the user while also capturing the complete response:

full_content = []

for chunk in model.invoke_stream(
    messages=[{"role": "user", "content": "Write a poem."}]
):
    if chunk.content:
        full_content.append(chunk.content)
        print(chunk.content, end="", flush=True)

complete_text = "".join(full_content)

Streaming vs Non-Streaming

                invoke() / ainvoke()                      invoke_stream() / ainvoke_stream()
Latency         Waits for full response                   First token arrives immediately
Return type     Single ModelResponse                      Iterator of ModelResponse chunks
Usage metrics   Available on response                     Available on final chunk
Best for        Background processing, short responses    User-facing output, long responses
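
The async variants follow the same pattern. A minimal sketch, assuming ainvoke_stream mirrors invoke_stream as an async generator:

import asyncio

from definable.models import OpenAIChat

async def main() -> None:
    model = OpenAIChat(id="gpt-4o")
    # Assumption: ainvoke_stream is the async counterpart of invoke_stream.
    async for chunk in model.ainvoke_stream(
        messages=[{"role": "user", "content": "Explain how DNS works."}]
    ):
        if chunk.content:
            print(chunk.content, end="", flush=True)

asyncio.run(main())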