Voice Architecture: Traditional vs Native¶
This guide compares the two fundamental approaches for real-time voice AI: the traditional STT+LLM+TTS pipeline and native voice-to-voice APIs.
Architecture Overview¶
Traditional Pipeline (STT → LLM → TTS)¶
Audio In → [STT Provider] → Text → [LLM Provider] → Text → [TTS Provider] → Audio Out
~200ms ~300-500ms ~200ms
Total: 700-900ms+
Three separate API calls, each adding latency:
- STT: Deepgram, Whisper, Google Speech, AssemblyAI
- LLM: Claude, GPT, Gemini (text mode)
- TTS: ElevenLabs, Deepgram Aura, Cartesia, OpenAI TTS
Native Voice-to-Voice¶
Single WebSocket connection, model processes audio directly.
Comparison Summary¶
| Aspect | Traditional (STT+LLM+TTS) | Native Voice-to-Voice |
|---|---|---|
| Latency | 500-1500ms | 100-200ms |
| API Calls | 3 separate calls | 1 WebSocket stream |
| Configuration | 3 providers to configure | 1 provider |
| Barge-in | Complex coordination | Native support |
| Turn detection | Manual VAD integration | Built-in VAD |
| Voice options | 1000s (clones, custom) | 5-11 preset voices |
Latency Breakdown¶
Traditional Pipeline¶
| Component | Latency | Notes |
|---|---|---|
| STT transcription | 150-300ms | Depends on utterance length |
| Network (STT response) | 20-50ms | |
| LLM inference | 200-500ms | First token latency |
| Network (LLM response) | 20-50ms | |
| TTS synthesis | 150-300ms | Time to first audio chunk |
| Network (TTS response) | 20-50ms | |
| Total | 560-1250ms | Before user hears response |
Native Voice-to-Voice¶
| Component | Latency | Notes |
|---|---|---|
| Audio buffering | 20-50ms | VAD + chunk collection |
| Model processing | 80-150ms | Direct audio-to-audio |
| Total | 100-200ms | Single round-trip |
Voice Quality Comparison¶
| Aspect | Traditional | Native |
|---|---|---|
| Voice options | 1000+ (ElevenLabs library, custom clones) | 5-11 preset voices |
| Voice cloning | Yes (ElevenLabs, PlayHT, Cartesia) | No |
| Custom voices | Yes (train on your audio) | No |
| Naturalness | Excellent (ElevenLabs) to Good | Good |
| Emotion control | Yes (some providers) | Limited |
| SSML support | Yes (most providers) | No |
Available Voices¶
OpenAI Realtime (11 voices): alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse
Gemini Live (5 voices): Puck, Charon, Kore, Fenrir, Aoede
ElevenLabs (Traditional): 5000+ voices including custom clones
STT Accuracy Comparison¶
| Aspect | Traditional | Native |
|---|---|---|
| Provider choice | Deepgram, Whisper, Google, AssemblyAI | Built-in only |
| Domain tuning | Yes (medical, legal, technical) | Limited |
| Custom vocabulary | Yes (Deepgram keywords, boost) | No |
| Language support | 100+ languages | Fewer (~30) |
| Diarization | Yes (speaker identification) | No |
| Word timestamps | Yes | Limited |
Best STT Accuracy by Domain¶
| Domain | Recommended Provider |
|---|---|
| General | Deepgram Nova-2, Whisper |
| Medical | Deepgram (with medical model) |
| Legal | Deepgram (with custom vocabulary) |
| Call Center | Deepgram, AssemblyAI |
| Low-resource languages | Whisper |
Cost Comparison¶
Per-Minute Costs (Approximate)¶
| Approach | Component | Cost/Minute |
|---|---|---|
| Traditional | ||
| Deepgram STT (Nova-2) | $0.0043 | |
| Claude Sonnet | $0.01-0.03 | |
| ElevenLabs TTS | $0.18 | |
| Subtotal | $0.19-0.21 | |
| Native | ||
| OpenAI Realtime (audio in) | $0.06 | |
| OpenAI Realtime (audio out) | $0.24 | |
| Subtotal | $0.30 |
Cost Optimization Strategies¶
Traditional Pipeline:
- Use Deepgram Aura for TTS (cheaper than ElevenLabs)
- Cache common responses
- Use smaller LLM for simple queries
Native Voice-to-Voice:
- Shorter conversations = lower cost
- Use Gemini Live (may be cheaper for some use cases)
Features Comparison¶
| Feature | Traditional | OpenAI Realtime | Gemini Live |
|---|---|---|---|
| Function calling | Via LLM | Yes | Yes |
| Streaming | Partial (TTS only) | Full duplex | Full duplex |
| Interruption (barge-in) | Manual | Native | Native |
| Context window | LLM dependent | 128k tokens | 1M tokens |
| Vision input | Separate API | No | Yes (video) |
| Google Search | No | No | Yes (grounding) |
| Code execution | No | No | Yes |
| Session persistence | Manual | Built-in | Built-in |
Barge-in Handling¶
Traditional Pipeline¶
Requires coordinating multiple systems:
// 1. Detect user speech via STT VAD events
sttEvents := sttProvider.StreamEvents()
// 2. Stop TTS playback when user speaks
for event := range sttEvents {
if event.Type == stt.EventSpeechStart {
ttsPipeline.Stop() // Stop current audio
// Clear audio buffers
// Signal LLM to handle interruption
}
}
Use the bargein package for this coordination:
import "github.com/plexusone/omnivoice-core/bargein"
detector := bargein.New(bargein.Config{
Mode: bargein.ModeImmediate,
MinSpeechDurationMs: 200,
})
detector.AttachTTS(ttsPipeline)
detector.AttachSTTEvents(sttEvents)
detector.OnInterrupt(handleInterrupt)
Native Voice-to-Voice¶
Barge-in is handled automatically:
// OpenAI Realtime - automatic interruption
// When user speaks, model stops and listens
// Gemini Live - explicit interrupt available
session.Interrupt() // Or automatic via VAD
When to Use Each¶
Use Native Voice-to-Voice When:¶
- Low latency is critical - Customer service, real-time IVR, voice assistants
- Natural conversation flow - Barge-in and turn-taking are important
- Simpler architecture - Fewer moving parts, easier to deploy
- Preset voices are acceptable - Don't need custom/cloned voices
Use Traditional Pipeline When:¶
- Custom voices required - Brand voice, cloned voices, specific persona
- Domain-specific STT - Medical, legal, technical terminology
- Language support - Languages not available in native APIs
- Best-of-breed mixing - Deepgram STT + Claude + ElevenLabs
- Cost optimization - Can be cheaper for low-volume or with caching
- Compliance requirements - Need specific provider certifications
Hybrid Approach¶
Combine both approaches for optimal results:
Real-time conversation: OpenAI Realtime (low latency)
IVR menus/announcements: ElevenLabs (high-quality branded voice)
Voicemail transcription: Deepgram (accuracy + timestamps)
Example configuration:
voice:
# Primary: Native voice-to-voice for conversation
realtime:
provider: openai
voice: alloy
# Fallback: Traditional for specific use cases
tts:
provider: elevenlabs
voice_id: branded-voice-id # For announcements
stt:
provider: deepgram
model: nova-2 # For voicemail transcription
Audio Format Reference¶
| Provider | Input Format | Output Format |
|---|---|---|
| OpenAI Realtime | PCM16 24kHz mono | PCM16 24kHz mono |
| Gemini Live | PCM16 16kHz mono | PCM16 24kHz mono |
| Deepgram STT | Various (mp3, wav, etc.) | Text |
| Deepgram TTS | Text | mp3, wav, pcm |
| ElevenLabs | Text | mp3, pcm |
| Twilio Media Streams | mulaw 8kHz | mulaw 8kHz |
Sample Rate Conversion¶
When connecting Twilio to native voice-to-voice:
// Twilio → OpenAI Realtime
twilioAudio := receive8kMulaw()
pcm16 := convertMulawToPCM16(twilioAudio)
pcm24k := resample8kTo24k(pcm16)
sendToOpenAI(pcm24k)
// OpenAI Realtime → Twilio
openaiAudio := receiveFromOpenAI() // 24kHz PCM16
pcm8k := resample24kTo8k(openaiAudio)
mulaw := convertPCM16ToMulaw(pcm8k)
sendToTwilio(mulaw)
Provider Packages¶
| Approach | Package | Documentation |
|---|---|---|
| Native Voice-to-Voice | ||
| OpenAI Realtime | github.com/plexusone/omni-openai/omnivoice/realtime |
Realtime Guide |
| Gemini Live | github.com/plexusone/omni-google/omnivoice |
Gemini Live Guide |
| Traditional STT | ||
| Deepgram | github.com/plexusone/omnivoice-core/stt/deepgram |
|
| Whisper | github.com/plexusone/omni-openai/omnivoice |
|
| Google Speech | github.com/plexusone/omni-google/omnivoice |
|
| Traditional TTS | ||
| ElevenLabs | github.com/plexusone/omnivoice-core/tts/elevenlabs |
|
| Deepgram Aura | github.com/plexusone/omnivoice-core/tts/deepgram |
|
| Cartesia | github.com/plexusone/omnivoice-core/tts/cartesia |
|
| Infrastructure | ||
| Barge-in Detection | github.com/plexusone/omnivoice-core/bargein |
Barge-in Guide |
| Session Storage | github.com/plexusone/omnivoice-core/storage |
Storage Guide |