Voice Architecture: Traditional vs Native

This guide compares the two fundamental approaches for real-time voice AI: the traditional STT+LLM+TTS pipeline and native voice-to-voice APIs.

Architecture Overview

Traditional Pipeline (STT → LLM → TTS)

Audio In → [STT Provider] → Text → [LLM Provider] → Text → [TTS Provider] → Audio Out
              ~200ms              ~300-500ms              ~200ms
                              Total: 700-900ms+

Three separate API calls, each adding latency:

STT: Deepgram, Whisper, Google Speech, AssemblyAI
LLM: Claude, GPT, Gemini (text mode)
TTS: ElevenLabs, Deepgram Aura, Cartesia, OpenAI TTS

Native Voice-to-Voice

Audio In → [OpenAI Realtime / Gemini Live] → Audio Out
                      ~100-200ms

A single WebSocket connection carries audio in both directions; the model processes audio directly, with no intermediate text step.

Comparison Summary

Aspect	Traditional (STT+LLM+TTS)	Native Voice-to-Voice
Latency	560-1250ms	100-200ms
API Calls	3 separate calls	1 WebSocket stream
Configuration	3 providers to configure	1 provider
Barge-in	Complex coordination	Native support
Turn detection	Manual VAD integration	Built-in VAD
Voice options	1000s (clones, custom)	5-11 preset voices

Latency Breakdown

Traditional Pipeline

Component	Latency	Notes
STT transcription	150-300ms	Depends on utterance length
Network (STT response)	20-50ms
LLM inference	200-500ms	First token latency
Network (LLM response)	20-50ms
TTS synthesis	150-300ms	Time to first audio chunk
Network (TTS response)	20-50ms
Total	560-1250ms	Before user hears response
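
The total row is the sum of the per-stage ranges. A minimal Go sketch of that arithmetic, using the figures from the table above; the `Stage` type and the `Budget` and `TraditionalStages` functions are illustrative names, not part of any package mentioned in this guide:

```go
package main

import "fmt"

// Stage is one hop in the voice pipeline with a best-case and worst-case
// latency in milliseconds. Illustrative type, not part of omnivoice-core.
type Stage struct {
	Name         string
	MinMs, MaxMs int
}

// Budget sums the best-case and worst-case latency across all stages.
func Budget(stages []Stage) (minMs, maxMs int) {
	for _, s := range stages {
		minMs += s.MinMs
		maxMs += s.MaxMs
	}
	return minMs, maxMs
}

// TraditionalStages mirrors the rows of the latency table above.
func TraditionalStages() []Stage {
	return []Stage{
		{"STT transcription", 150, 300},
		{"Network (STT response)", 20, 50},
		{"LLM inference", 200, 500},
		{"Network (LLM response)", 20, 50},
		{"TTS synthesis", 150, 300},
		{"Network (TTS response)", 20, 50},
	}
}

func main() {
	lo, hi := Budget(TraditionalStages())
	fmt.Printf("traditional pipeline: %d-%dms\n", lo, hi) // prints 560-1250ms
}
```

Replacing the figures with measurements from your own providers gives a quick check on which stage dominates the budget.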

Native Voice-to-Voice

Component	Latency	Notes
Audio buffering	20-50ms	VAD + chunk collection
Model processing	80-150ms	Direct audio-to-audio
Total	100-200ms	Single round-trip

Voice Quality Comparison

Aspect	Traditional	Native
Voice options	1000+ (ElevenLabs library, custom clones)	5-11 preset voices
Voice cloning	Yes (ElevenLabs, PlayHT, Cartesia)	No
Custom voices	Yes (train on your audio)	No
Naturalness	Excellent (ElevenLabs) to Good	Good
Emotion control	Yes (some providers)	Limited
SSML support	Yes (most providers)	No

Available Voices

OpenAI Realtime (11 voices): alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse

Gemini Live (5 voices): Puck, Charon, Kore, Fenrir, Aoede

ElevenLabs (Traditional): 5000+ voices including custom clones

STT Accuracy Comparison

Aspect	Traditional	Native
Provider choice	Deepgram, Whisper, Google, AssemblyAI	Built-in only
Domain tuning	Yes (medical, legal, technical)	Limited
Custom vocabulary	Yes (Deepgram keywords, boost)	No
Language support	100+ languages	Fewer (~30)
Diarization	Yes (speaker identification)	No
Word timestamps	Yes	Limited

Best STT Accuracy by Domain

Domain	Recommended Provider
General	Deepgram Nova-2, Whisper
Medical	Deepgram (with medical model)
Legal	Deepgram (with custom vocabulary)
Call Center	Deepgram, AssemblyAI
Low-resource languages	Whisper

Cost Comparison

Per-Minute Costs (Approximate)

Approach	Component	Cost/Minute
Traditional	Deepgram STT (Nova-2)	$0.0043
Traditional	Claude Sonnet	$0.01-0.03
Traditional	ElevenLabs TTS	$0.18
Traditional	Subtotal	$0.19-0.21
Native	OpenAI Realtime (audio in)	$0.06
Native	OpenAI Realtime (audio out)	$0.24
Native	Subtotal	$0.30
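
To turn the per-minute figures into a per-call estimate, multiply each subtotal by the call length. A rough Go sketch using the approximate prices from the table above; the constant and function names are illustrative, and real pricing should be checked against each provider:

```go
package main

import "fmt"

// Approximate per-minute prices in USD, copied from the table above.
const (
	deepgramSTT   = 0.0043
	claudeLow     = 0.01
	claudeHigh    = 0.03
	elevenLabsTTS = 0.18
	realtimeIn    = 0.06
	realtimeOut   = 0.24
)

// TraditionalCost returns the low and high estimate in USD for a call of the
// given length through the STT+LLM+TTS pipeline.
func TraditionalCost(minutes float64) (low, high float64) {
	base := deepgramSTT + elevenLabsTTS
	return (base + claudeLow) * minutes, (base + claudeHigh) * minutes
}

// NativeCost returns the estimate in USD for the same call through OpenAI
// Realtime, charging both audio directions for the full duration as the
// table does.
func NativeCost(minutes float64) float64 {
	return (realtimeIn + realtimeOut) * minutes
}

func main() {
	const minutes = 5
	low, high := TraditionalCost(minutes)
	fmt.Printf("traditional: $%.2f-$%.2f\n", low, high)
	fmt.Printf("native:      $%.2f\n", NativeCost(minutes))
}
```

For a 5-minute call this gives roughly $0.97-$1.07 for the traditional stack and $1.50 for OpenAI Realtime under the table's approximate prices.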

Cost Optimization Strategies

Traditional Pipeline:

Use Deepgram Aura for TTS (cheaper than ElevenLabs)
Cache common responses
Use smaller LLM for simple queries

Native Voice-to-Voice:

Keep conversations short, since cost scales with audio minutes
Consider Gemini Live, which may be cheaper for some use cases

Features Comparison

Feature	Traditional	OpenAI Realtime	Gemini Live
Function calling	Via LLM	Yes	Yes
Streaming	Partial (TTS only)	Full duplex	Full duplex
Interruption (barge-in)	Manual	Native	Native
Context window	LLM dependent	128k tokens	1M tokens
Vision input	Separate API	No	Yes (video)
Google Search	No	No	Yes (grounding)
Code execution	No	No	Yes
Session persistence	Manual	Built-in	Built-in

Barge-in Handling

Traditional Pipeline

Requires coordinating multiple systems:

// 1. Detect user speech via STT VAD events
sttEvents := sttProvider.StreamEvents()

// 2. Stop TTS playback when user speaks
for event := range sttEvents {
    if event.Type == stt.EventSpeechStart {
        ttsPipeline.Stop()  // Stop current audio
        // Clear audio buffers
        // Signal LLM to handle interruption
    }
}

Use the bargein package for this coordination:

import "github.com/plexusone/omnivoice-core/bargein"

detector := bargein.New(bargein.Config{
    Mode: bargein.ModeImmediate,
    MinSpeechDurationMs: 200,
})
detector.AttachTTS(ttsPipeline)
detector.AttachSTTEvents(sttEvents)
detector.OnInterrupt(handleInterrupt)

Native Voice-to-Voice

Barge-in is handled automatically:

// OpenAI Realtime - automatic interruption
// When user speaks, model stops and listens

// Gemini Live - explicit interrupt available
session.Interrupt()  // Or automatic via VAD

When to Use Each

Use Native Voice-to-Voice When:

Low latency is critical - Customer service, real-time IVR, voice assistants
Natural conversation flow - Barge-in and turn-taking are important
Simpler architecture - Fewer moving parts, easier to deploy
Preset voices are acceptable - Don't need custom/cloned voices

Use Traditional Pipeline When:

Custom voices required - Brand voice, cloned voices, specific persona
Domain-specific STT - Medical, legal, technical terminology
Language support - Languages not available in native APIs
Best-of-breed mixing - Deepgram STT + Claude + ElevenLabs
Cost optimization - Can be cheaper for low-volume or with caching
Compliance requirements - Need specific provider certifications
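
One possible reading of the two lists above, expressed as a small Go decision helper. The field names and the rule itself are assumptions made for illustration: any hard requirement that the native APIs cannot meet selects the traditional pipeline, and combining such a requirement with a strict latency target suggests splitting the workload as described in the hybrid section.

```go
package main

import "fmt"

// Requirements captures the decision criteria from the lists above.
// Field names are illustrative.
type Requirements struct {
	CustomVoice       bool // brand, cloned, or persona-specific voice
	DomainSTT         bool // medical, legal, or technical vocabulary
	UnsupportedLang   bool // language not available in the native APIs
	ProviderCertified bool // compliance needs a specific certified provider
	LatencyCritical   bool // e.g. real-time IVR or voice assistants
}

// Architecture names one of the approaches compared in this guide.
type Architecture string

const (
	Native      Architecture = "native voice-to-voice"
	Traditional Architecture = "traditional STT+LLM+TTS"
	Hybrid      Architecture = "hybrid"
)

// Choose applies the rule described in the lead-in. It is a starting point
// for discussion, not a definitive policy.
func Choose(r Requirements) Architecture {
	hard := r.CustomVoice || r.DomainSTT || r.UnsupportedLang || r.ProviderCertified
	switch {
	case hard && r.LatencyCritical:
		return Hybrid
	case hard:
		return Traditional
	default:
		return Native
	}
}

func main() {
	fmt.Println(Choose(Requirements{LatencyCritical: true}))
	fmt.Println(Choose(Requirements{CustomVoice: true}))
	fmt.Println(Choose(Requirements{DomainSTT: true, LatencyCritical: true}))
}
```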

Hybrid Approach

Combine both approaches for optimal results:

Real-time conversation: OpenAI Realtime (low latency)
IVR menus/announcements: ElevenLabs (high-quality branded voice)
Voicemail transcription: Deepgram (accuracy + timestamps)

Example configuration:

voice:
  # Primary: Native voice-to-voice for conversation
  realtime:
    provider: openai
    voice: alloy

  # Fallback: Traditional for specific use cases
  tts:
    provider: elevenlabs
    voice_id: branded-voice-id  # For announcements

  stt:
    provider: deepgram
    model: nova-2  # For voicemail transcription

Audio Format Reference

Provider	Input Format	Output Format
OpenAI Realtime	PCM16 24kHz mono	PCM16 24kHz mono
Gemini Live	PCM16 16kHz mono	PCM16 24kHz mono
Deepgram STT	Various (mp3, wav, etc.)	Text
Deepgram TTS	Text	mp3, wav, pcm
ElevenLabs	Text	mp3, pcm
Twilio Media Streams	mulaw 8kHz	mulaw 8kHz

Sample Rate Conversion

When connecting Twilio to native voice-to-voice:

// Twilio → OpenAI Realtime
twilioAudio := receive8kMulaw()
pcm16 := convertMulawToPCM16(twilioAudio)
pcm24k := resample8kTo24k(pcm16)
sendToOpenAI(pcm24k)

// OpenAI Realtime → Twilio
openaiAudio := receiveFromOpenAI()  // 24kHz PCM16
pcm8k := resample24kTo8k(openaiAudio)
mulaw := convertPCM16ToMulaw(pcm8k)
sendToTwilio(mulaw)
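
The helper names in the snippet above are placeholders. A minimal, dependency-free Go sketch of what they could do is shown here: standard G.711 mu-law decoding and encoding, plus a naive 3x resampler (linear interpolation up, three-sample averaging down). All function names are illustrative, the base64 framing used on the wire is omitted, and a production bridge would likely use a properly filtered resampler.

```go
package main

import "fmt"

// MulawToPCM16 decodes one G.711 mu-law byte to a 16-bit linear sample.
func MulawToPCM16(b byte) int16 {
	b = ^b // mu-law bytes are stored bit-inverted
	exponent := (b >> 4) & 0x07
	mantissa := b & 0x0F
	sample := (int32(mantissa)<<3+0x84)<<exponent - 0x84
	if b&0x80 != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// PCM16ToMulaw encodes one 16-bit linear sample as a G.711 mu-law byte.
func PCM16ToMulaw(s int16) byte {
	const bias, clip = 0x84, 32635
	v, sign := int32(s), byte(0)
	if v < 0 {
		v, sign = -v, 0x80
	}
	if v > clip {
		v = clip
	}
	v += bias
	// Find the segment: the position of the highest set bit above bit 7.
	exponent := byte(7)
	for mask := int32(0x4000); exponent > 0 && v&mask == 0; mask >>= 1 {
		exponent--
	}
	mantissa := byte((v >> (exponent + 3)) & 0x0F)
	return ^(sign | exponent<<4 | mantissa)
}

// Upsample3 converts 8 kHz PCM16 to 24 kHz by linear interpolation.
func Upsample3(in []int16) []int16 {
	out := make([]int16, 0, len(in)*3)
	for i, cur := range in {
		next := cur // hold the last sample at the end of the buffer
		if i+1 < len(in) {
			next = in[i+1]
		}
		d := int32(next) - int32(cur)
		out = append(out, cur, int16(int32(cur)+d/3), int16(int32(cur)+2*d/3))
	}
	return out
}

// Downsample3 converts 24 kHz PCM16 to 8 kHz by averaging groups of three
// samples (a crude low-pass filter); trailing samples that do not fill a
// group are dropped.
func Downsample3(in []int16) []int16 {
	out := make([]int16, 0, len(in)/3)
	for i := 0; i+2 < len(in); i += 3 {
		sum := int32(in[i]) + int32(in[i+1]) + int32(in[i+2])
		out = append(out, int16(sum/3))
	}
	return out
}

// TwilioToRealtime turns 8 kHz mu-law bytes into 24 kHz PCM16 samples.
func TwilioToRealtime(mulaw []byte) []int16 {
	pcm8k := make([]int16, len(mulaw))
	for i, b := range mulaw {
		pcm8k[i] = MulawToPCM16(b)
	}
	return Upsample3(pcm8k)
}

// RealtimeToTwilio turns 24 kHz PCM16 samples into 8 kHz mu-law bytes.
func RealtimeToTwilio(pcm24k []int16) []byte {
	pcm8k := Downsample3(pcm24k)
	out := make([]byte, len(pcm8k))
	for i, s := range pcm8k {
		out[i] = PCM16ToMulaw(s)
	}
	return out
}

func main() {
	frame := make([]byte, 160) // 20 ms of 8 kHz mu-law audio
	up := TwilioToRealtime(frame)
	fmt.Println(len(frame), "mu-law bytes ->", len(up), "PCM16 samples at 24 kHz")
	fmt.Println("back to Twilio:", len(RealtimeToTwilio(up)), "mu-law bytes")
}
```

Gemini Live takes 16 kHz input, so the same approach would need a 2x upsampler on the inbound path instead of 3x.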

Provider Packages

Approach	Package	Documentation
Native Voice-to-Voice
OpenAI Realtime	`github.com/plexusone/omni-openai/omnivoice/realtime`	Realtime Guide
Gemini Live	`github.com/plexusone/omni-google/omnivoice`	Gemini Live Guide
Traditional STT
Deepgram	`github.com/plexusone/omnivoice-core/stt/deepgram`
Whisper	`github.com/plexusone/omni-openai/omnivoice`
Google Speech	`github.com/plexusone/omni-google/omnivoice`
Traditional TTS
ElevenLabs	`github.com/plexusone/omnivoice-core/tts/elevenlabs`
Deepgram Aura	`github.com/plexusone/omnivoice-core/tts/deepgram`
Cartesia	`github.com/plexusone/omnivoice-core/tts/cartesia`
Infrastructure
Barge-in Detection	`github.com/plexusone/omnivoice-core/bargein`	Barge-in Guide
Session Storage	`github.com/plexusone/omnivoice-core/storage`	Storage Guide