Architecture
Three pipeline modes handle the voice-to-text-to-voice conversion differently:

| Mode | How It Works | Latency | Provider Lock |
|---|---|---|---|
| Managed | Twilio ConversationRelay handles STT/TTS | ~500ms | Twilio only |
| Cascading | Your STT → Agent → Your TTS | ~800-1200ms | Any provider |
| Realtime | OpenAI speech-to-speech (no text step) | ~200-300ms | OpenAI model |
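The table can be read as a small lookup: given a telephony provider and a latency budget, which modes qualify? The sketch below encodes the table's figures as plain data. The `MODES` dictionary and the `modes_within_budget` helper are illustrative names, not part of the library, and the provider sets follow the Telephony Providers section (Twilio supports all three modes, Plivo supports cascading and realtime).

```python
# Illustrative data only: latency ranges are the approximate figures from the table.
MODES = {
    "managed": {"latency_ms": (500, 500), "providers": {"twilio"}},
    "cascading": {"latency_ms": (800, 1200), "providers": {"twilio", "plivo"}},
    "realtime": {"latency_ms": (200, 300), "providers": {"twilio", "plivo"}},
}


def modes_within_budget(provider: str, budget_ms: int) -> list:
    """Return the modes a provider supports whose worst-case latency fits the budget."""
    return sorted(
        name
        for name, mode in MODES.items()
        if provider in mode["providers"] and mode["latency_ms"][1] <= budget_ms
    )


if __name__ == "__main__":
    print(modes_within_budget("twilio", 600))
    print(modes_within_budget("plivo", 1200))
```

Comparing against the upper bound of each range is a conservative choice; a deployment that tolerates occasional slow turns could compare against the lower bound instead.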
Setup
Install
The websockets dependency is needed for real-time audio streaming.
Twilio Setup
- Create a Twilio account and get a phone number
- Set environment variables: TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN
- Configure your Twilio phone number’s webhook:
  - Go to Phone Numbers → Manage → Active Numbers
  - Set "A call comes in" to your server URL: https://your-domain.com/call/incoming (POST)
Plivo Setup
- Create a Plivo account and get a phone number
- Set environment variables: PLIVO_AUTH_ID and PLIVO_AUTH_TOKEN
- Create a Plivo Application:
  - Go to Voice → Applications → New Application
  - Set Answer URL to: https://your-domain.com/call/incoming (POST)
- Assign your Plivo phone number to this application
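Both setup procedures depend on credentials supplied through environment variables, so it can help to check for them before starting the server. The following is a minimal sketch: the variable names are the ones this page lists as fallbacks, while `REQUIRED_ENV` and `missing_env` are hypothetical helper names introduced here for illustration.

```python
import os

# Variable names are taken from the parameter descriptions on this page.
REQUIRED_ENV = {
    "twilio": ("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN"),
    "plivo": ("PLIVO_AUTH_ID", "PLIVO_AUTH_TOKEN"),
}


def missing_env(provider: str, environ=None) -> list:
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_ENV:
        missing = missing_env(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(provider + ": " + status)
```

Passing a dictionary as `environ` makes the check easy to exercise in tests without touching the real environment.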
Quick Start: Managed Mode (Twilio)
The simplest mode. Twilio handles STT and TTS via ConversationRelay, so you only provide text.
Quick Start: Cascading Mode
Full control over STT and TTS providers. Works with both Twilio and Plivo.
Quick Start: Realtime Mode
Lowest latency, using OpenAI’s speech-to-speech Realtime API. Audio flows directly to the model with no intermediate text step.
CallInterface Parameters
Telephony
Telephony provider: "twilio" or "plivo".
Phone number to receive calls on (E.164 format, e.g. "+15551234567").
Twilio Account SID. Falls back to the TWILIO_ACCOUNT_SID env var.
Twilio or Plivo auth token. Falls back to the TWILIO_AUTH_TOKEN or PLIVO_AUTH_TOKEN env var.
Plivo Auth ID. Falls back to the PLIVO_AUTH_ID env var.
Pipeline
Voice pipeline mode: "managed", "cascading", or "realtime".
Speech-to-text provider for cascading mode. Required when pipeline="cascading".
Text-to-speech provider for cascading mode. Required when pipeline="cascading".
Realtime provider for speech-to-speech mode. Required when pipeline="realtime".
Voice Settings
Greeting spoken when a call connects.
Voice name or ID for TTS synthesis.
BCP-47 language code.
When the caller can interrupt: "none", "dtmf", "speech", or "any".
Barge-in sensitivity: "low", "medium", or "high".
Managed Mode Settings
STT provider name for managed mode (Twilio ConversationRelay).
TTS provider name for managed mode (Twilio ConversationRelay).
Server Paths
URL path for the incoming call webhook.
URL path for WebSocket audio streams.
Call Settings
Maximum call duration in seconds before automatic hangup (default: 3600, i.e. 1 hour).
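The parameter descriptions imply a few cross-field rules: cascading mode needs both an STT and a TTS provider, realtime mode needs a realtime provider, and managed mode is only available on Twilio. The sketch below expresses those rules as a standalone check. `CallConfig` is a hypothetical stand-in, not the library's `CallInterface`; only `pipeline` and `max_call_duration_seconds` are names that appear on this page, and the other field names and defaults are assumptions.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class CallConfig:
    """Hypothetical stand-in for the documented parameters (not the real class)."""
    provider: str = "twilio"            # assumed field name
    pipeline: str = "managed"
    stt: Optional[Any] = None           # assumed field name
    tts: Optional[Any] = None           # assumed field name
    realtime: Optional[Any] = None      # assumed field name
    max_call_duration_seconds: int = 3600


def validate(config: CallConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config is consistent."""
    errors = []
    if config.provider not in ("twilio", "plivo"):
        errors.append("provider must be 'twilio' or 'plivo'")
    if config.pipeline not in ("managed", "cascading", "realtime"):
        errors.append("pipeline must be 'managed', 'cascading', or 'realtime'")
    if config.pipeline == "managed" and config.provider != "twilio":
        errors.append("managed mode is only available with Twilio")
    if config.pipeline == "cascading":
        if config.stt is None:
            errors.append("cascading mode requires an STT provider")
        if config.tts is None:
            errors.append("cascading mode requires a TTS provider")
    if config.pipeline == "realtime" and config.realtime is None:
        errors.append("realtime mode requires a realtime provider")
    if config.max_call_duration_seconds <= 0:
        errors.append("max_call_duration_seconds must be positive")
    return errors
```

Returning every problem at once, rather than raising on the first, tends to make misconfiguration quicker to fix during setup.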
Pipeline Modes
Managed (Twilio Only)
The telephony provider handles STT and TTS natively. Your agent only sees text.
- Simplest to set up: no STT/TTS provider configuration needed
- ~500ms latency
- Limited to Twilio (uses ConversationRelay)
- Provider-dependent voice/model selection
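Because the agent only sees text in managed mode, the server's job reduces to turning a transcribed prompt into a text reply. The sketch below shows that shape as a pure function. The message fields (`type`, `voicePrompt`, `token`, `last`) reflect my understanding of Twilio's ConversationRelay WebSocket messages and should be treated as an assumption to verify against Twilio's reference; `handle_relay_message` and the `agent` callable are hypothetical names.

```python
import json
from typing import Callable, Optional


def handle_relay_message(raw: str, agent: Callable[[str], str]) -> Optional[str]:
    """Turn one incoming relay message into a text reply, or None if no reply is needed."""
    message = json.loads(raw)
    # Assumption: transcribed caller speech arrives as a "prompt" message.
    if message.get("type") != "prompt":
        return None
    reply = agent(message.get("voicePrompt", ""))
    # Assumption: a "text" message with last=True asks the provider to speak the reply.
    return json.dumps({"type": "text", "token": reply, "last": True})


if __name__ == "__main__":
    incoming = json.dumps({"type": "prompt", "voicePrompt": "What are your hours?"})
    print(handle_relay_message(incoming, lambda text: "You asked: " + text))
```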
Cascading
Raw audio flows through your own STT and TTS providers. Full control over every component.
- Works with both Twilio and Plivo
- Pluggable STT (DeepgramSTT) and TTS (CartesiaTTS)
- Automatic barge-in detection (speech during playback)
- ~800-1200ms latency
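The cascading flow and its barge-in behavior can be sketched with plain callables standing in for the providers. This is a simplified, synchronous illustration: `CascadingTurn` and its methods are hypothetical names, and real providers such as DeepgramSTT and CartesiaTTS stream audio asynchronously rather than returning whole results.

```python
from typing import Callable, Optional


class CascadingTurn:
    """One caller turn: STT -> agent -> TTS, with a flag used for barge-in."""

    def __init__(
        self,
        stt: Callable[[bytes], str],
        agent: Callable[[str], str],
        tts: Callable[[str], bytes],
    ) -> None:
        self.stt = stt
        self.agent = agent
        self.tts = tts
        self.playing = False  # True while synthesized audio is being played back

    def handle_audio(self, audio: bytes) -> Optional[bytes]:
        """Transcribe caller audio, ask the agent, and synthesize the reply."""
        text = self.stt(audio)
        if not text:
            return None  # silence or noise: nothing to answer
        speech = self.tts(self.agent(text))
        self.playing = True
        return speech

    def on_caller_speech(self) -> bool:
        """Barge-in: stop playback if the caller speaks over it. Returns True if interrupted."""
        if self.playing:
            self.playing = False
            return True
        return False
```

Each stage adds its own delay, which is why this mode sits at the high end of the latency range despite offering the most control.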
Realtime (OpenAI)
Audio flows directly to OpenAI’s Realtime API for speech-to-speech processing. No intermediate text conversion step.
- Lowest latency (~200-300ms)
- Native function calling (tools work without text intermediary)
- Server-side VAD (voice activity detection)
- Locked to OpenAI Realtime models (gpt-4o-realtime-preview)
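Server-side VAD is something the client requests when it configures the Realtime session. The sketch below builds such a configuration event as JSON. The field names follow the preview version of the Realtime API as I understand it (`session.update`, `turn_detection`, `g711_ulaw`) and may differ in newer API versions, so treat them as assumptions; `build_session_update` is a hypothetical helper, and the library presumably sends an equivalent message for you.

```python
import json


def build_session_update(instructions: str, voice: str = "alloy") -> str:
    """Build a session configuration event requesting server-side VAD and mu-law audio."""
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
            # 8 kHz mu-law matches what telephony media streams typically carry.
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            # Let the server decide when the caller has started and stopped speaking.
            "turn_detection": {"type": "server_vad"},
        },
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(build_session_update("You are a helpful phone agent."))
```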
Telephony Providers
Twilio
Supports all three pipeline modes. Uses Media Streams for cascading/realtime and ConversationRelay for managed mode.
Plivo
Supports cascading and realtime modes only. Uses bidirectional Audio Streaming over WebSocket.
- No managed mode (no ConversationRelay equivalent)
- Supports 16kHz PCM natively (Twilio only supports 8kHz mu-law)
- Uses HMAC-SHA256 V3 for webhook signatures (Twilio uses HMAC-SHA1)
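To make the signature difference concrete, here is a minimal sketch of the Twilio side only: the full webhook URL is concatenated with each POST parameter name and value (sorted by name), signed with HMAC-SHA1 using the auth token, Base64-encoded, and compared with the `X-Twilio-Signature` header. The function names are hypothetical, and in production the provider's own helper library is the safer choice. Plivo's V3 scheme uses HMAC-SHA256 and is not shown here.

```python
import base64
import hashlib
import hmac
from typing import Mapping


def twilio_signature(auth_token: str, url: str, params: Mapping[str, str]) -> str:
    """Compute the expected signature for a form-encoded Twilio webhook request."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(
    auth_token: str, url: str, params: Mapping[str, str], header_signature: str
) -> bool:
    """Compare the computed signature with the header value in constant time."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)
```

The URL must be the exact public URL Twilio called, which matters behind a reverse proxy where the server may see a different scheme or host.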
STT Providers
DeepgramSTT
Real-time streaming transcription via Deepgram’s WebSocket API.
TTS Providers
CartesiaTTS
Ultra-low latency streaming TTS via Cartesia’s WebSocket API (40-90ms TTFB).
Agent with Tools
Give your phone agent capabilities by attaching tools.
Deployment
CallInterface integrates with AgentRuntime, which provides a shared FastAPI server for webhooks and WebSocket connections.
- Use a reverse proxy (nginx, Caddy) with TLS termination
- Point your telephony provider’s webhook to https://your-domain.com/call/incoming
- The WebSocket endpoint is at wss://your-domain.com/call/stream
- Set max_call_duration_seconds to prevent runaway calls
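The idea behind a maximum call duration can be illustrated with a plain asyncio timeout. This is a sketch of the general technique, not the library's implementation: `run_with_max_duration` is a hypothetical helper, and the coroutine passed in stands in for whatever handles a live call.

```python
import asyncio
from typing import Awaitable


async def run_with_max_duration(call: Awaitable, max_call_duration_seconds: float) -> str:
    """Run a call handler, cancelling it if it exceeds the allowed duration."""
    try:
        await asyncio.wait_for(call, timeout=max_call_duration_seconds)
        return "completed"
    except asyncio.TimeoutError:
        # wait_for cancels the handler before raising, so the call is torn down here.
        return "hung_up"


if __name__ == "__main__":
    # A stand-in "call" that lasts 0.2 s, capped at 0.05 s.
    print(asyncio.run(run_with_max_duration(asyncio.sleep(0.2), 0.05)))
```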