OmniVoice Realtime¶
The omnivoice/realtime package provides native voice-to-voice capabilities via the OpenAI Realtime API.
Overview¶
OpenAI Realtime API enables:
- Ultra-low latency - ~100ms voice-to-voice response time
- Native audio - No separate STT/TTS, model handles audio directly
- Function calling - Execute tools during conversation
- Turn detection - Automatic voice activity detection (VAD)
Quick Start¶
import (
"context"
"os"
"github.com/plexusone/omni-openai/omnivoice/realtime"
)
func main() {
ctx := context.Background()
// Create provider
provider := realtime.NewProvider(os.Getenv("OPENAI_API_KEY"),
realtime.WithVoice("alloy"),
realtime.WithInstructions("You are a helpful assistant."),
)
// Create audio input channel
audioIn := make(chan []byte, 100)
// Start streaming
audioCh, transcriptCh, err := provider.ProcessAudioStream(ctx, audioIn, realtime.ProcessConfig{})
if err != nil {
log.Fatal(err)
}
// Send audio from microphone (PCM16 24kHz mono)
go streamMicrophoneAudio(audioIn)
// Process responses
for {
select {
case audio, ok := <-audioCh:
if !ok {
return
}
if len(audio.Audio) > 0 {
playAudio(audio.Audio) // PCM16 24kHz mono
}
if audio.IsFinal {
log.Println("Turn complete")
}
case transcript, ok := <-transcriptCh:
if !ok {
return
}
if transcript.IsInput {
log.Printf("User: %s", transcript.Text)
} else {
log.Printf("Assistant: %s", transcript.Text)
}
}
}
}
Registry Integration¶
Added in v0.4.0
Use the omnivoice-core registry for automatic provider discovery:
import (
omnivoice "github.com/plexusone/omnivoice-core"
"github.com/plexusone/omnivoice-core/registry"
_ "github.com/plexusone/omni-openai/omnivoice/realtime" // Auto-register
)
// Get realtime provider via registry
provider, err := omnivoice.GetRealtimeProvider("openai",
registry.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
registry.WithModel("gpt-4o-realtime-preview-2024-12-17"),
registry.WithVoice("alloy"),
registry.WithInstructions("You are a helpful assistant."),
)
if err != nil {
log.Fatal(err)
}
// Process audio streams
audioCh, transcriptCh, err := provider.ProcessAudioStream(ctx, audioIn, nil)
Type-Safe Registry Options¶
Provider-specific options for OpenAI Realtime configuration:
import "github.com/plexusone/omni-openai/omnivoice/realtime"
provider, err := omnivoice.GetRealtimeProvider("openai",
registry.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
// Type-safe OpenAI-specific options
realtime.WithRegistryTools(tools),
realtime.WithRegistryTurnDetection(turnDetectionConfig),
realtime.WithRegistryInputAudioFormat("pcm16"),
realtime.WithRegistryOutputAudioFormat("pcm16"),
realtime.WithRegistryModalities([]string{"text", "audio"}),
realtime.WithRegistryTemperature(0.8),
realtime.WithRegistryMaxResponseOutputTokens(4096),
)
Accessing Underlying Provider¶
Access the underlying OpenAI provider for full API access:
wrapper := provider.(*realtime.RealtimeWrapper)
openaiProvider := wrapper.Provider()
// Use OpenAI-specific methods
Configuration¶
Provider Options¶
provider := realtime.NewProvider(apiKey,
// Voice selection
realtime.WithVoice("nova"),
// System instructions
realtime.WithInstructions("You are a customer service agent."),
// Model selection (default: gpt-4o-realtime-preview)
realtime.WithModel("gpt-4o-realtime-preview-2024-12-17"),
// Response modalities
realtime.WithModalities("text", "audio"),
// Turn detection mode
realtime.WithTurnDetection(realtime.TurnDetectionServerVAD),
// Function tools
realtime.WithTools(
realtime.Tool{
Type: "function",
Name: "get_weather",
Description: "Get current weather",
Parameters: map[string]any{
"type": "object",
"properties": map[string]any{
"location": map[string]any{"type": "string"},
},
},
},
),
)
Available Voices¶
| Voice | Description |
|---|---|
| alloy | Neutral, balanced |
| ash | Warm, engaging |
| ballad | Melodic, expressive |
| coral | Clear, articulate |
| echo | Smooth, natural |
| fable | Storytelling, dramatic |
| nova | Bright, energetic |
| onyx | Deep, resonant |
| sage | Calm, wise |
| shimmer | Light, cheerful |
| verse | Poetic, rhythmic |
Audio Format¶
Input Audio¶
- Format: PCM16 (signed 16-bit little-endian)
- Sample Rate: 24kHz
- Channels: Mono
- Chunk Size: Recommended 20-100ms chunks
Output Audio¶
- Format: PCM16 (signed 16-bit little-endian)
- Sample Rate: 24kHz
- Channels: Mono
Function Calling¶
Handle function calls during the conversation:
config := realtime.ProcessConfig{
Tools: []realtime.Tool{
{
Type: "function",
Name: "lookup_order",
Description: "Look up an order by ID",
Parameters: map[string]any{
"type": "object",
"properties": map[string]any{
"order_id": map[string]any{"type": "string"},
},
"required": []string{"order_id"},
},
},
},
OnFunctionCall: func(name, arguments string) (string, error) {
switch name {
case "lookup_order":
var args struct {
OrderID string `json:"order_id"`
}
json.Unmarshal([]byte(arguments), &args)
order := lookupOrder(args.OrderID)
return json.Marshal(order)
}
return "", fmt.Errorf("unknown function: %s", name)
},
}
audioCh, transcriptCh, err := provider.ProcessAudioStream(ctx, audioIn, config)
Events¶
The WebSocket client sends and receives these event types:
Client Events¶
| Event | Description |
|---|---|
session.update |
Update session configuration |
input_audio_buffer.append |
Send audio chunk |
input_audio_buffer.commit |
Commit buffered audio |
input_audio_buffer.clear |
Clear audio buffer |
conversation.item.create |
Add conversation item |
response.create |
Request a response |
response.cancel |
Cancel in-progress response |
Server Events¶
| Event | Description |
|---|---|
session.created |
Session established |
session.updated |
Session config updated |
response.audio.delta |
Audio chunk |
response.audio.done |
Audio complete |
response.audio_transcript.delta |
Transcript chunk |
response.audio_transcript.done |
Transcript complete |
response.function_call_arguments.done |
Function call ready |
error |
Error occurred |
Integration with Call Systems¶
Twilio Media Streams¶
// Twilio sends mulaw 8kHz, needs conversion
twilioAudio := make(chan []byte) // mulaw 8kHz
// Convert to PCM16 24kHz for Realtime API
audioIn := make(chan []byte)
go func() {
for chunk := range twilioAudio {
pcm := convertMulawToPCM16(chunk)
upsampled := resample8kTo24k(pcm)
audioIn <- upsampled
}
}()
// Realtime API
audioCh, _, _ := provider.ProcessAudioStream(ctx, audioIn, config)
// Convert back for Twilio
for audio := range audioCh {
downsampled := resample24kTo8k(audio.Audio)
mulaw := convertPCM16ToMulaw(downsampled)
sendToTwilio(mulaw)
}
LiveKit¶
// LiveKit provides PCM16 48kHz stereo
livekitAudio := make(chan []byte)
// Resample and convert to mono 24kHz
audioIn := make(chan []byte)
go func() {
for chunk := range livekitAudio {
mono := stereoToMono(chunk)
resampled := resample48kTo24k(mono)
audioIn <- resampled
}
}()
// Output back to LiveKit
audioCh, _, _ := provider.ProcessAudioStream(ctx, audioIn, config)
for audio := range audioCh {
stereo := monoToStereo(audio.Audio)
upsampled := resample24kTo48k(stereo)
sendToLiveKit(upsampled)
}
Error Handling¶
audioCh, transcriptCh, err := provider.ProcessAudioStream(ctx, audioIn, config)
if err != nil {
// Connection failed
log.Fatal(err)
}
// Monitor for errors during streaming
for {
select {
case <-ctx.Done():
return ctx.Err()
case audio, ok := <-audioCh:
if !ok {
// Channel closed, check for error
return nil
}
// Process audio
}
}
vs Traditional Pipeline (STT+LLM+TTS)¶
OpenAI Realtime provides native voice-to-voice, eliminating the need for separate STT and TTS providers.
| Aspect | Traditional | OpenAI Realtime |
|---|---|---|
| Latency | 500-1500ms | ~100ms |
| API Calls | 3 (STT + LLM + TTS) | 1 WebSocket |
| Barge-in | Complex coordination | Native support |
| Voice options | 1000s (ElevenLabs, etc.) | 11 preset voices |
| Custom voices | Yes (cloning) | No |
| Domain STT | Yes (medical, legal) | No |
When to Use OpenAI Realtime¶
- Low latency is critical
- Natural conversation with interruptions
- Simpler architecture preferred
- Preset voices are acceptable
When to Use Traditional Pipeline¶
- Custom/cloned voices required
- Domain-specific STT accuracy needed
- Specific language support
- Cost optimization with caching
See Voice Architecture Guide for detailed comparison.
Best Practices¶
- Buffer audio input - Use buffered channel (100+ capacity)
- Handle disconnects - Implement reconnection logic
- Monitor latency - Log round-trip times for debugging
- Test with real audio - Synthetic tests miss edge cases
- Use appropriate voice - Match voice to use case