OmniVoice - Gemini Live API¶

The omnivoice package provides real-time voice-to-voice capabilities via the Gemini Live API.

Overview¶

The Gemini Live API enables:

Real-time voice-to-voice - ~200ms response latency
Native audio processing - Model handles audio directly
Function calling - Execute tools during conversation
Multimodal input - Audio + text + video simultaneously
Google Search grounding - Ground responses in search results
Code execution - Run Python code during conversation

Quick Start¶

package main

import (
    "context"
    "encoding/base64"
    "log"
    "os"

    "github.com/plexusone/omni-google/omnivoice/realtime"
)

func main() {
    ctx := context.Background()

    // Create client
    client, err := realtime.NewLiveClient(os.Getenv("GOOGLE_API_KEY"),
        realtime.WithVoice("Puck"),
        realtime.WithInstructions("You are a helpful assistant."),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Connect to session
    session, err := client.Connect(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    // Send audio (PCM16 16kHz mono); microphoneAudio is a placeholder for your capture source
    go func() {
        for chunk := range microphoneAudio {
            session.SendAudio(ctx, chunk)
        }
    }()

    // Receive events
    for event := range session.Events() {
        switch e := event.(type) {
        case *realtime.ServerContent:
            for _, part := range e.ModelTurn.Parts {
                if part.InlineData != nil {
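                    // Decode and play audio (PCM16 24kHz); playAudio is a placeholder for your output sink
                    audio, err := base64.StdEncoding.DecodeString(part.InlineData.Data)
                    if err != nil {
                        log.Printf("decode audio: %v", err)
                        continue
                    }
                    playAudio(audio)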
                }
                if part.Text != "" {
                    log.Printf("Assistant: %s", part.Text)
                }
            }
        case *realtime.ToolCall:
            // Handle function calls
            handleToolCall(session, e)
        }
    }
}

Using RealtimeProvider¶

For a higher-level interface compatible with OmniVoice patterns:

import "github.com/plexusone/omni-google/omnivoice/realtime"

// Create provider
provider := realtime.NewRealtimeProvider(os.Getenv("GOOGLE_API_KEY"),
    realtime.WithVoice("Puck"),
    realtime.WithInstructions("You are a helpful assistant."),
)

// Stream audio
audioIn := make(chan []byte, 100)
audioCh, transcriptCh, err := provider.ProcessAudioStream(ctx, audioIn, realtime.ProcessConfig{
    Functions: []realtime.FunctionDeclaration{
        {
            Name:        "get_weather",
            Description: "Get current weather",
            Parameters: map[string]any{
                "type": "object",
                "properties": map[string]any{
                    "location": map[string]any{"type": "string"},
                },
            },
        },
    },
    OnFunctionCall: func(id, name, args string) (any, error) {
        return map[string]any{"temp": 72, "condition": "sunny"}, nil
    },
})
if err != nil {
    log.Fatal(err)
}

Registry Integration¶

Added in v0.6.0

Use the omnivoice-core registry for automatic provider discovery:

import (
    omnivoice "github.com/plexusone/omnivoice-core"
    "github.com/plexusone/omnivoice-core/registry"
    _ "github.com/plexusone/omni-google/omnivoice/realtime" // Auto-register
)

// Get realtime provider via registry
provider, err := omnivoice.GetRealtimeProvider("gemini",
    registry.WithAPIKey(os.Getenv("GOOGLE_API_KEY")),
    registry.WithModel("gemini-2.0-flash-live"),
    registry.WithVoice("Puck"),
    registry.WithInstructions("You are a helpful assistant."),
)
if err != nil {
    log.Fatal(err)
}

// Process audio streams (audioIn is a chan []byte carrying PCM16 16kHz input)
audioCh, transcriptCh, err := provider.ProcessAudioStream(ctx, audioIn, nil)
if err != nil {
    log.Fatal(err)
}

Type-Safe Registry Options¶

Provider-specific options for Gemini Live configuration:

import (
    omnivoice "github.com/plexusone/omnivoice-core"
    "github.com/plexusone/omnivoice-core/registry"
    "github.com/plexusone/omni-google/omnivoice/realtime"
)

provider, err := omnivoice.GetRealtimeProvider("gemini",
    registry.WithAPIKey(os.Getenv("GOOGLE_API_KEY")),
    // Type-safe Gemini-specific options
    realtime.WithRegistryTools(tools),
    realtime.WithRegistryFunctions(functions),
    realtime.WithRegistryResponseModalities([]string{"AUDIO", "TEXT"}),
    realtime.WithRegistryTemperature(0.7),
    realtime.WithRegistryTopP(0.9),
    realtime.WithRegistryTopK(40),
    realtime.WithRegistryMaxOutputTokens(1024),
    realtime.WithRegistryGoogleSearch(),    // Enable grounding
    realtime.WithRegistryCodeExecution(),   // Enable code execution
)

Accessing Underlying Provider¶

Access the underlying Gemini provider for full API access:

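// Use the comma-ok form so a different provider type does not panic
wrapper, ok := provider.(*realtime.RealtimeWrapper)
if !ok {
    log.Fatal("provider is not a *realtime.RealtimeWrapper")
}
geminiProvider := wrapper.Provider()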

// Use Gemini-specific methods

Configuration¶

Client Options¶

client, err := realtime.NewLiveClient(apiKey,
    // Voice selection
    realtime.WithVoice("Puck"),

    // System instructions
    realtime.WithInstructions("You are a customer service agent."),

    // Model selection (default: gemini-2.0-flash-live)
    realtime.WithModel("gemini-2.0-flash-live"),

    // Response modalities
    realtime.WithResponseModalities("TEXT", "AUDIO"),

    // Temperature
    realtime.WithTemperature(0.7),

    // Enable Google Search
    realtime.WithGoogleSearch(),

    // Enable code execution
    realtime.WithCodeExecution(),

    // Custom functions
    realtime.WithFunctions(
        realtime.FunctionDeclaration{
            Name:        "lookup_order",
            Description: "Look up an order by ID",
            Parameters: map[string]any{
                "type": "object",
                "properties": map[string]any{
                    "order_id": map[string]any{"type": "string"},
                },
            },
        },
    ),
)

Available Voices¶

Voice	Description
Puck	Upbeat, lively
Charon	Informative, direct
Kore	Firm, authoritative
Fenrir	Enthusiastic, positive
Aoede	Bright, clear

Audio Format¶

Input Audio¶

Format: PCM16 (signed 16-bit little-endian)
Sample Rate: 16kHz
Channels: Mono
MIME Type: audio/pcm;rate=16000

Output Audio¶

Format: PCM16 (signed 16-bit little-endian)
Sample Rate: 24kHz
Channels: Mono
MIME Type: audio/pcm;rate=24000
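
The formats above determine how many bytes a chunk of a given duration occupies and how samples are laid out on the wire. The following is a minimal sketch, not part of the omnivoice package: the helper names `chunkBytes`, `encodePCM16`, and `decodePCM16` are hypothetical and only illustrate the arithmetic and byte order for mono PCM16.

```go
package main

import "encoding/binary"

const (
	inputSampleRate  = 16000 // Hz, input format listed above
	outputSampleRate = 24000 // Hz, output format listed above
	bytesPerSample   = 2     // PCM16, mono
)

// chunkBytes returns the size in bytes of a mono PCM16 chunk that lasts
// ms milliseconds at the given sample rate.
func chunkBytes(sampleRate, ms int) int {
	return sampleRate * ms / 1000 * bytesPerSample
}

// encodePCM16 packs samples as signed 16-bit little-endian bytes.
func encodePCM16(samples []int16) []byte {
	out := make([]byte, len(samples)*bytesPerSample)
	for i, s := range samples {
		binary.LittleEndian.PutUint16(out[i*bytesPerSample:], uint16(s))
	}
	return out
}

// decodePCM16 unpacks signed 16-bit little-endian bytes into samples.
// A trailing odd byte, if any, is ignored.
func decodePCM16(data []byte) []int16 {
	out := make([]int16, len(data)/bytesPerSample)
	for i := range out {
		out[i] = int16(binary.LittleEndian.Uint16(data[i*bytesPerSample:]))
	}
	return out
}
```

For example, a 20 ms input chunk at 16kHz is 320 samples, so `chunkBytes(inputSampleRate, 20)` is 640 bytes, while 20 ms of 24kHz output is 960 bytes.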

Function Calling¶

Handle function calls during the conversation:

for event := range session.Events() {
    switch e := event.(type) {
    case *realtime.ToolCall:
        for _, fc := range e.FunctionCalls {
            // Process the function call
            var result any
            switch fc.Name {
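            case "get_weather":
                var args struct {
                    Location string `json:"location"`
                }
                // Report malformed arguments back to the model instead of ignoring them
                if err := json.Unmarshal(fc.Args, &args); err != nil {
                    result = map[string]any{"error": err.Error()}
                    break
                }
                result = getWeather(args.Location)
            default:
                result = map[string]any{"error": "unknown function: " + fc.Name}
            }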

            // Send response back
            session.SendFunctionResponse(fc.ID, fc.Name, result)
        }
    }
}
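
As the number of functions grows, the inner `switch` can be replaced by a lookup table. The sketch below is one possible arrangement, assuming (as the snippet above implies) that arguments arrive as raw JSON bytes. The names `handler`, `toolRegistry`, `dispatch`, and `weatherHandler` are hypothetical, and returning failures as an `{"error": ...}` map is a convention chosen for this example, not something the package prescribes.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// handler decodes the JSON arguments of one function call and returns its result.
type handler func(args json.RawMessage) (any, error)

// toolRegistry maps function names to their handlers.
type toolRegistry map[string]handler

// dispatch runs the handler registered for name. Unknown names and handler
// errors are turned into an error payload so the model always gets a reply.
func (r toolRegistry) dispatch(name string, args json.RawMessage) any {
	h, ok := r[name]
	if !ok {
		return map[string]any{"error": fmt.Sprintf("unknown function %q", name)}
	}
	result, err := h(args)
	if err != nil {
		return map[string]any{"error": err.Error()}
	}
	return result
}

// weatherHandler is a stand-in implementation that returns fixed data.
func weatherHandler(args json.RawMessage) (any, error) {
	var in struct {
		Location string `json:"location"`
	}
	if err := json.Unmarshal(args, &in); err != nil {
		return nil, fmt.Errorf("decode get_weather args: %w", err)
	}
	if in.Location == "" {
		return nil, errors.New("location is required")
	}
	return map[string]any{"location": in.Location, "temp": 72, "condition": "sunny"}, nil
}
```

Inside the event loop, the body of the `FunctionCalls` loop would then reduce to a single `dispatch(fc.Name, fc.Args)` call followed by `SendFunctionResponse`.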

Message Types¶

Client Messages¶

Message	Description
`setup`	Initialize session with config
`realtimeInput`	Send audio/video chunks
`clientContent`	Send text content
`toolResponse`	Send function call response

Server Messages¶

Message	Description
`setupComplete`	Session initialized
`serverContent`	Audio/text response
`toolCall`	Function call request
`interrupted`	Turn was interrupted
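
When debugging at the wire level it can help to identify which message a raw frame carries. The sketch below assumes each server frame is a JSON object whose top-level key is one of the names in the table above; that framing is an assumption of this example and should be checked against the actual traffic. `classifyServerMessage` is a hypothetical helper, not part of the package.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// serverMessageKinds lists the server message names from the table above.
var serverMessageKinds = []string{"setupComplete", "serverContent", "toolCall", "interrupted"}

// classifyServerMessage returns the kind of a raw frame and its undecoded
// payload, assuming the kind appears as a top-level JSON key.
func classifyServerMessage(frame []byte) (string, json.RawMessage, error) {
	var envelope map[string]json.RawMessage
	if err := json.Unmarshal(frame, &envelope); err != nil {
		return "", nil, fmt.Errorf("decode frame: %w", err)
	}
	for _, kind := range serverMessageKinds {
		if payload, ok := envelope[kind]; ok {
			return kind, payload, nil
		}
	}
	return "", nil, errors.New("unrecognized server message")
}
```

The payload is returned as `json.RawMessage` so that the caller can decode it into whichever struct matches the kind.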

Interruptions¶

Handle user interruptions (barge-in):

// Send interrupt signal
session.Interrupt()

// Handle interruption events
for event := range session.Events() {
    switch e := event.(type) {
    case *realtime.ServerContent:
        if e.Interrupted {
            log.Println("Turn was interrupted")
        }
    }
}
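
On an interruption, audio that was received but not yet played usually has to be discarded, otherwise the assistant keeps talking over the user. A minimal sketch of that idea follows, assuming playback is fed from an in-memory queue; `playbackQueue` and its methods are hypothetical names for illustration.

```go
package main

import "sync"

// playbackQueue holds decoded audio chunks waiting to be played.
// It is safe for use by a receiving goroutine and a playback goroutine.
type playbackQueue struct {
	mu     sync.Mutex
	chunks [][]byte
}

// Push appends a chunk to the end of the queue.
func (q *playbackQueue) Push(chunk []byte) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.chunks = append(q.chunks, chunk)
}

// Pop removes and returns the oldest chunk, or false if the queue is empty.
func (q *playbackQueue) Pop() ([]byte, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.chunks) == 0 {
		return nil, false
	}
	chunk := q.chunks[0]
	q.chunks = q.chunks[1:]
	return chunk, true
}

// Flush drops all pending chunks and reports how many were discarded.
func (q *playbackQueue) Flush() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	n := len(q.chunks)
	q.chunks = nil
	return n
}
```

In the event loop above, `Flush` would be called in the `e.Interrupted` branch, while the playback goroutine keeps calling `Pop`.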

Integration with Call Systems¶

Twilio Media Streams¶

// Twilio sends mulaw 8kHz
twilioAudio := make(chan []byte)

// Convert to PCM16 16kHz for Gemini
audioIn := make(chan []byte)
go func() {
    for chunk := range twilioAudio {
        pcm := convertMulawToPCM16(chunk)
        upsampled := resample8kTo16k(pcm)
        audioIn <- upsampled
    }
}()

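// Gemini output is 24kHz; provider and config are created as shown in the earlier sections
audioCh, _, err := provider.ProcessAudioStream(ctx, audioIn, config)
if err != nil {
    log.Fatal(err)
}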
for audio := range audioCh {
    // Convert to Twilio format
    downsampled := resample24kTo8k(audio.Audio)
    mulaw := convertPCM16ToMulaw(downsampled)
    sendToTwilio(mulaw)
}
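
The conversion helpers in the snippet above are left undefined. The sketch below shows one way their core could look, working on individual samples and sample slices rather than byte chunks. It uses standard G.711 mu-law companding, linear interpolation for 8kHz to 16kHz, and three-sample averaging for 24kHz to 8kHz. The function names are hypothetical, and the resamplers are deliberately simple; a production system would likely use a proper low-pass filter.

```go
package main

const mulawBias = 0x84 // bias added before companding in G.711 mu-law

// mulawToPCM16 expands one G.711 mu-law byte to a linear PCM16 sample.
func mulawToPCM16(b byte) int16 {
	b = ^b // mu-law bytes are stored inverted
	exponent := (b >> 4) & 0x07
	mantissa := int(b & 0x0F)
	sample := ((mantissa<<3)+mulawBias)<<exponent - mulawBias
	if b&0x80 != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// pcm16ToMulaw compresses one linear PCM16 sample to a G.711 mu-law byte.
func pcm16ToMulaw(s int16) byte {
	const clip = 32635
	v, sign := int(s), 0
	if v < 0 {
		v, sign = -v, 0x80
	}
	if v > clip {
		v = clip
	}
	v += mulawBias
	// Find the segment: the position of the highest set bit above bit 7.
	exponent := 7
	for mask := 0x4000; exponent > 0 && v&mask == 0; mask >>= 1 {
		exponent--
	}
	mantissa := (v >> (exponent + 3)) & 0x0F
	return ^byte(sign | exponent<<4 | mantissa)
}

// upsample2x doubles the sample rate (e.g. 8kHz to 16kHz) by inserting the
// midpoint between each pair of neighbouring samples.
func upsample2x(in []int16) []int16 {
	out := make([]int16, 0, len(in)*2)
	for i, s := range in {
		next := s
		if i+1 < len(in) {
			next = in[i+1]
		}
		out = append(out, s, int16((int(s)+int(next))/2))
	}
	return out
}

// downsample3x cuts the sample rate to a third (e.g. 24kHz to 8kHz) by
// averaging each group of three samples; leftover samples are dropped.
func downsample3x(in []int16) []int16 {
	out := make([]int16, 0, len(in)/3)
	for i := 0; i+2 < len(in); i += 3 {
		out = append(out, int16((int(in[i])+int(in[i+1])+int(in[i+2]))/3))
	}
	return out
}
```

Because chunks are processed independently here, the interpolation does not look across chunk boundaries, which can introduce small discontinuities at the edges.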

Comparison with OpenAI Realtime¶

Feature	Gemini Live	OpenAI Realtime
Latency	~200ms	~100ms
Input audio	16kHz	24kHz
Output audio	24kHz	24kHz
Google Search	Yes	No
Code execution	Yes	No
Video input	Yes	No
Voices	5	11

Environment Variables¶

# Option 1: Google API key
export GOOGLE_API_KEY="your-api-key"

# Option 2: Gemini API key
export GEMINI_API_KEY="your-api-key"
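
An application that wants to accept either variable can resolve the key once at startup. This is a minimal sketch; the precedence order (`GOOGLE_API_KEY` first) and the helper names `resolveAPIKey` and `apiKeyFromEnv` are assumptions of this example, not documented behavior of the package.

```go
package main

import (
	"errors"
	"os"
)

// resolveAPIKey returns the first non-empty value among GOOGLE_API_KEY and
// GEMINI_API_KEY, using getenv to look up each name. Taking the lookup as a
// parameter keeps the function easy to test.
func resolveAPIKey(getenv func(string) string) (string, error) {
	for _, name := range []string{"GOOGLE_API_KEY", "GEMINI_API_KEY"} {
		if v := getenv(name); v != "" {
			return v, nil
		}
	}
	return "", errors.New("set GOOGLE_API_KEY or GEMINI_API_KEY")
}

// apiKeyFromEnv resolves the key from the process environment.
func apiKeyFromEnv() (string, error) {
	return resolveAPIKey(os.Getenv)
}
```

The resolved key can then be passed to `realtime.NewLiveClient` in place of the direct `os.Getenv("GOOGLE_API_KEY")` call.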

Comparison with Traditional Pipeline (STT+LLM+TTS)¶

Gemini Live provides native voice-to-voice, eliminating the need for separate STT and TTS providers.

Aspect	Traditional	Gemini Live
Latency	500-1500ms	~200ms
API Calls	3 (STT + LLM + TTS)	1 WebSocket
Barge-in	Complex coordination	Native support
Voice options	1000s (ElevenLabs, etc.)	5 preset voices
Google Search	No	Yes (grounding)
Code execution	No	Yes
Video input	No	Yes

When to Use Gemini Live¶

Low latency is critical
Need Google Search grounding for factual responses
Need code execution during conversation
Need video/multimodal input
Simpler architecture preferred

When to Use Traditional Pipeline¶

Custom/cloned voices required
Domain-specific STT accuracy needed
Specific language support
Note: if the lowest possible latency is the goal, OpenAI Realtime (~100ms) is the closer fit; a traditional pipeline is the slowest option at 500-1500ms

See Voice Architecture Guide for detailed comparison.

Best Practices¶

Buffer audio input - Use a buffered channel (capacity of 100 or more)
Handle disconnects - Implement reconnection logic
Use the appropriate sample rates - 16kHz input, 24kHz output
Enable Google Search - For factual queries
Test with real audio - Synthetic tests miss edge cases
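
The reconnection advice above can be sketched as a retry loop with exponential backoff. This is one possible shape under stated assumptions: `reconnect`, `backoffDelay`, and `errCancelled` are hypothetical names, and `connect` stands for whatever function re-establishes the session (for instance a closure around `client.Connect`).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errCancelled is returned when stop fires while waiting between attempts.
var errCancelled = errors.New("reconnect cancelled")

// backoffDelay returns the wait before retry n (1-based): base doubled for
// each previous retry and capped at max.
func backoffDelay(n int, base, max time.Duration) time.Duration {
	d := base
	for i := 1; i < n; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// reconnect calls connect until it succeeds, attempts are exhausted, or stop
// is closed. It sleeps with exponential backoff between failed attempts.
func reconnect(stop <-chan struct{}, connect func() error, attempts int, base, max time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for n := 1; n <= attempts; n++ {
		if lastErr = connect(); lastErr == nil {
			return nil
		}
		if n == attempts {
			break
		}
		select {
		case <-stop:
			return errCancelled
		case <-time.After(backoffDelay(n, base, max)):
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}
```

Passing `ctx.Done()` as `stop` ties the retry loop to the same context used for the session. Adding random jitter to the delay is a common refinement when many clients may reconnect at once.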