Multimodal Input - Definable AI

Definable provides unified media types that work across all providers supporting multimodal input.

Images

from definable.agent import Agent
from definable.media import Image

agent = Agent(model="gpt-4o", instructions="Describe images in detail.")

output = await agent.arun(
    "What do you see?",
    images=[Image(url="https://example.com/photo.jpg")],
)
print(output.content)
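
The snippet above uses `await` at the top level, which works in a notebook or an async REPL but not in a plain script. A minimal sketch of driving the same call from a script with `asyncio.run` follows. `StubAgent` and `StubOutput` are hypothetical stand-ins (not part of Definable) so the example runs without the library or an API key; in real code, replace them with the `Agent` shown above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubOutput:
    """Hypothetical stand-in for the object returned by agent.arun()."""
    content: str


class StubAgent:
    """Hypothetical stand-in for definable.agent.Agent (not the real class)."""

    async def arun(self, prompt: str, images=None) -> StubOutput:
        # Echo the prompt and report how many images were attached.
        count = len(images or [])
        return StubOutput(content=f"{prompt} ({count} image(s) attached)")


async def main() -> str:
    agent = StubAgent()  # replace with Agent(model="gpt-4o", ...) in real code
    output = await agent.arun(
        "What do you see?",
        images=["https://example.com/photo.jpg"],
    )
    return output.content


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs main() to completion, and closes it.
    print(asyncio.run(main()))
```

Keeping all awaits inside one `main()` coroutine means the event loop is created once per script run.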

Image Sources

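# From a URL
image = Image(url="https://example.com/photo.jpg")

# From a local file
image = Image(filepath="/path/to/photo.png")

# From a base64-encoded string (base64_string holds the encoded image data)
image = Image.from_base64(base64_string, format="png")

# From raw bytes (raw_bytes holds the image data)
image = Image(content=raw_bytes, format="jpeg")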
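
The base64 and raw-bytes variants assume you already have `base64_string` or `raw_bytes` in hand. As a sketch of where a base64 string might come from, the helper below encodes bytes using only the standard library. `to_base64_string` is a hypothetical helper name, and the PNG signature is a placeholder for real image data.

```python
import base64


def to_base64_string(raw_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(raw_bytes).decode("ascii")


# The 8-byte PNG file signature stands in for real image data here.
png_signature = b"\x89PNG\r\n\x1a\n"
base64_string = to_base64_string(png_signature)

# Round-trip check: decoding returns the original bytes.
assert base64.b64decode(base64_string) == png_signature

# With the library installed, the string would then be passed as shown above:
# image = Image.from_base64(base64_string, format="png")
```

For a file on disk, `pathlib.Path("/path/to/photo.png").read_bytes()` would supply the bytes to encode.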

Audio

from definable.media import Audio

output = await agent.arun(
    "Transcribe this audio.",
    audio=[Audio(filepath="/path/to/audio.mp3")],
)

Most models do not support raw audio input. Use audio_transcriber=True on the agent to automatically transcribe audio to text before the model sees it.

Files

from definable.media import File

output = await agent.arun(
    "Summarize this document.",
    files=[File(filepath="/path/to/report.pdf")],
)

When readers are enabled on the agent, file content is automatically extracted and injected into the prompt.
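
To illustrate the general idea of extracting file content and injecting it into the prompt, here is a conceptual sketch for a plain-text file. It is not Definable's implementation: `inject_file_text` and the delimiter format are assumptions made for illustration, and real readers would also need to handle formats such as PDF.

```python
import tempfile
from pathlib import Path


def inject_file_text(prompt: str, path: Path) -> str:
    """Append a text file's content to the prompt under a labeled delimiter."""
    text = path.read_text(encoding="utf-8")
    return f"{prompt}\n\n--- {path.name} ---\n{text}"


# Demonstrate with a temporary text file standing in for a real document.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"
    report.write_text("Q3 revenue grew.", encoding="utf-8")
    full_prompt = inject_file_text("Summarize this document.", report)

print(full_prompt)
```

The model then receives the document text as part of an ordinary text prompt, so it does not need native file support.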

Video

from definable.media import Video

output = await agent.arun(
    "Describe what happens in this video.",
    videos=[Video(url="https://example.com/video.mp4")],
)

Voice Note Transcription

For Telegram/Discord voice messages, enable the audio transcriber:

agent = Agent(model="gpt-4o", audio_transcriber=True)  # Uses OpenAI Whisper

See Agent configuration for details.
