Definable provides unified media types that work across all providers supporting multimodal input.

Images

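import asyncio

from definable.agent import Agent
from definable.media import Image

agent = Agent(model="gpt-4o", instructions="Describe images in detail.")


async def main() -> None:
    # arun is a coroutine, so it has to be awaited inside an async function.
    output = await agent.arun(
        "What do you see?",
        images=[Image(url="https://example.com/photo.jpg")],
    )
    print(output.content)


asyncio.run(main())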

Image Sources

image = Image(url="https://example.com/photo.jpg")
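
The line above shows only the URL form. When sources arrive as plain strings, a small helper can decide which keyword to pass. This is a minimal sketch: `image_kwargs` is a hypothetical helper, and the `filepath` keyword for `Image` is an assumption carried over from the `Audio` and `File` examples on this page, so verify it against the library before relying on it.

```python
from urllib.parse import urlparse


def image_kwargs(source: str) -> dict[str, str]:
    """Return keyword arguments for Image(...) based on the source string.

    Hypothetical helper, not part of Definable. The "filepath" keyword is
    assumed to work for Image the way it does for Audio and File.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return {"url": source}
    # Anything that is not an HTTP(S) URL is treated as a local path.
    return {"filepath": source}
```

Under those assumptions, the result could be unpacked directly, for example `Image(**image_kwargs(source))`.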

Audio

from definable.media import Audio

output = await agent.arun(
    "Transcribe this audio.",
    audio=[Audio(filepath="/path/to/audio.mp3")],
)
Most models do not accept raw audio input. Set audio_transcriber=True on the agent to transcribe audio to text automatically before it reaches the model.

Files

from definable.media import File

output = await agent.arun(
    "Summarize this document.",
    files=[File(filepath="/path/to/report.pdf")],
)
When readers are enabled on the agent, file content is automatically extracted and injected into the prompt.
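
Conceptually, extraction and injection amount to reading the document's text and placing it in the prompt alongside the question. The sketch below illustrates the idea for a plain-text file using only the standard library. `inject_file_text`, its prompt layout, and its truncation limit are hypothetical; this is not Definable's reader implementation, which would also need to handle formats such as the PDF in the example above.

```python
from pathlib import Path


def inject_file_text(question: str, filepath: str, max_chars: int = 4000) -> str:
    """Build a prompt that carries a text file's content with the question.

    Hypothetical illustration only; the layout and the truncation limit
    are assumptions, not Definable's behavior.
    """
    path = Path(filepath)
    text = path.read_text(encoding="utf-8")
    if len(text) > max_chars:
        # Truncate long documents so the prompt stays a manageable size.
        text = text[:max_chars] + "\n[truncated]"
    return f"Document: {path.name}\n---\n{text}\n---\n{question}"
```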

Video

from definable.media import Video

output = await agent.arun(
    "Describe what happens in this video.",
    videos=[Video(url="https://example.com/video.mp4")],
)
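
Because each media type has its own keyword argument (`images`, `audio`, `files`, `videos`), mixed attachments need to be sorted before the call. The sketch below groups paths by guessed MIME type using the standard library. `group_by_media_kwarg` is a hypothetical helper, and sending every file that is not an image, audio, or video to `files` is an assumption.

```python
import mimetypes
from collections import defaultdict

# Major MIME type -> keyword argument used by agent.arun in the examples above.
_KWARG_BY_MAJOR_TYPE = {"image": "images", "audio": "audio", "video": "videos"}


def group_by_media_kwarg(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths under the arun keyword they would be passed to."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        mime, _ = mimetypes.guess_type(path)
        major = mime.split("/")[0] if mime else ""
        # Unknown or document-like types fall back to "files" (an assumption).
        groups[_KWARG_BY_MAJOR_TYPE.get(major, "files")].append(path)
    return dict(groups)
```

Each path in a group would then be wrapped in the matching media class before the call, for example `Audio(filepath=path)` for entries under `audio`.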

Voice Note Transcription

For Telegram/Discord voice messages, enable the audio transcriber:
agent = Agent(model="gpt-4o", audio_transcriber=True)  # Uses OpenAI Whisper
See Agent configuration for details.