Local TTS Providers¶
OmniVoice supports local text-to-speech providers that run on your own hardware, providing privacy, zero API costs, and offline capability.
Overview¶
Local providers communicate via gRPC over a Unix domain socket (UDS), which gives low-latency, secure local IPC:
┌─────────────────────┐     gRPC/UDS      ┌─────────────────────┐
│      Go Client      │◄─────────────────►│    Python Server    │
│  (omnivoice-core)   │                   │   (MLX / PyTorch)   │
│                     │                   │                     │
│ - TTS interface     │                   │ - F5-TTS MLX        │
│ - Voice cloning     │                   │ - Model inference   │
│ - Profile caching   │                   │ - Audio generation  │
└─────────────────────┘                   └─────────────────────┘
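To make the transport concrete, here is a minimal sketch of a round trip over a Unix domain socket using only the Go standard library. It is not OmniVoice code and does not speak gRPC; the `echoOverUDS` helper and the temporary socket path are hypothetical, chosen only to show the kind of local IPC that sits underneath the gRPC layer.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// echoOverUDS starts a throwaway echo server on a Unix domain socket,
// sends msg from a client connection, and returns what the server echoed.
// It illustrates the transport only; real providers speak gRPC on top.
func echoOverUDS(msg string) (string, error) {
	dir, err := os.MkdirTemp("", "uds-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "demo.sock")
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// Server side: accept one connection and echo its bytes back.
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		io.Copy(conn, conn)
	}()

	// Client side: connect to the socket file, write, and read the echo.
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	reply, err := echoOverUDS("ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(reply)
}
```

Because the socket is a file on the local filesystem, access can be restricted with ordinary file permissions and no network port is opened.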
Available Local Providers¶
| Provider | Model | Hardware | Voice Cloning | Status |
|---|---|---|---|---|
| F5-TTS MLX | F5-TTS | Apple Silicon | Yes | Available |
| Whisper MLX | Whisper | Apple Silicon | N/A (STT) | Available |
| Qwen3-TTS | Qwen3-TTS | Apple Silicon / CUDA | Yes | Planned |
| Piper | Piper | CPU | No | Planned |
| Apple TTS | AVSpeechSynthesizer | macOS | No | Planned |
F5-TTS Provider¶
F5-TTS is a high-quality voice cloning model that supports zero-shot synthesis from reference audio.
Requirements¶
- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.11+ (ARM64)
- ~2GB disk space for model weights
Installation¶
# Navigate to the server directory
cd omnivoice-core/providers/f5tts-mlx/server
# Create ARM64 virtual environment
arch -arm64 python3 -m venv .venv
# Install dependencies
arch -arm64 .venv/bin/pip install -r requirements.txt
# Generate Python proto stubs
./generate_proto.sh
Starting the Server¶
# Start without auto-loading model (faster startup)
arch -arm64 .venv/bin/python3 f5tts_server.py
# Start with model pre-loaded (ready for immediate synthesis)
arch -arm64 .venv/bin/python3 f5tts_server.py --auto-load
The server listens on unix:///tmp/omnivoice-f5tts.sock by default.
Go Client Usage¶
import (
	"context"
	"log"

	"github.com/plexusone/omnivoice"
	_ "github.com/plexusone/omnivoice-core/providers/f5tts-mlx" // Auto-register
	// Also import the package that provides the tts.* types used below.
)

ctx := context.Background()

// Create provider using the registry
provider, err := omnivoice.GetTTSProvider("f5tts-mlx",
	omnivoice.WithEndpoint("unix:///tmp/omnivoice-f5tts.sock"),
)
if err != nil {
	log.Fatal(err)
}

// Load the model (downloads ~2GB on first run)
if loader, ok := provider.(tts.ModelManager); ok {
	result, err := loader.LoadModel(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Model loaded in %dms", result.LoadTimeMs)
}

// Synthesize speech
result, err := provider.Synthesize(ctx, "Hello from local TTS!", tts.SynthesisConfig{
	OutputFormat: "wav",
})
if err != nil {
	log.Fatal(err)
}
_ = result // Synthesized audio; write it to a file or play it back
Voice Cloning¶
F5-TTS supports zero-shot voice cloning from a reference audio sample:
// Option 1: Inline reference synthesis
if synth, ok := provider.(tts.ReferenceSynthesizer); ok {
	result, err := synth.SynthesizeWithReference(ctx, tts.ReferenceSynthesizeRequest{
		Text:           "Hello in your cloned voice!",
		ReferenceAudio: referenceWAV,                            // []byte of reference audio
		ReferenceText:  "This is what I said in the reference.", // Transcript of the reference audio
		Config: tts.SynthesisConfig{
			OutputFormat: "wav",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = result // Synthesized audio in the cloned voice
}

// Option 2: Pre-cache voice profile for faster repeated synthesis
if cacher, ok := provider.(tts.ProfileCacher); ok {
	// Prepare profile once
	if _, err := cacher.PrepareVoiceProfile(ctx, tts.PrepareVoiceProfileRequest{
		ProfileID:      "my-voice",
		ReferenceAudio: referenceWAV,
		ReferenceText:  "This is what I said in the reference.",
	}); err != nil {
		log.Fatal(err)
	}

	// Use cached profile for synthesis
	result, err := provider.Synthesize(ctx, "Hello!", tts.SynthesisConfig{
		VoiceID:      "my-voice", // Must match the ProfileID prepared above
		OutputFormat: "wav",
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = result // Synthesized audio using the cached profile
}
Capability Interfaces¶
Local providers implement additional capability interfaces beyond the base tts.Provider:
VoiceCloner¶
type VoiceCloner interface {
	CloneVoice(ctx context.Context, req CloneVoiceRequest) (*VoiceProfile, error)
}
Creates a reusable voice profile from reference audio.
ReferenceSynthesizer¶
type ReferenceSynthesizer interface {
	SynthesizeWithReference(ctx context.Context, req ReferenceSynthesizeRequest) (*SynthesisResult, error)
}
One-shot synthesis using inline reference audio (no pre-caching).
ProfileCacher¶
type ProfileCacher interface {
	PrepareVoiceProfile(ctx context.Context, req PrepareVoiceProfileRequest) (*PreparedProfile, error)
	ListPreparedProfiles(ctx context.Context) ([]*PreparedProfile, error)
	DeletePreparedProfile(ctx context.Context, profileID string) error
}
Pre-compute and cache voice embeddings for faster repeated synthesis.
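As an illustration of the caching idea only, the sketch below keeps prepared embeddings in a map keyed by profile ID, so the expensive computation runs once per profile. The `profileCache` type, its method names, and the `[]float32` embedding representation are all hypothetical; how the real provider stores profiles is not described here.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// profileCache is a hypothetical in-memory store of prepared voice profiles.
type profileCache struct {
	mu       sync.Mutex
	profiles map[string][]float32
}

func newProfileCache() *profileCache {
	return &profileCache{profiles: make(map[string][]float32)}
}

// Prepare returns the cached embedding for id, calling compute only on a miss.
// The lock is held during compute, which keeps the sketch simple but
// serializes concurrent preparations.
func (c *profileCache) Prepare(id string, compute func() []float32) []float32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if emb, ok := c.profiles[id]; ok {
		return emb
	}
	emb := compute()
	c.profiles[id] = emb
	return emb
}

// List returns the cached profile IDs in sorted order.
func (c *profileCache) List() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	ids := make([]string, 0, len(c.profiles))
	for id := range c.profiles {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// Delete removes a profile and reports whether it existed.
func (c *profileCache) Delete(id string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.profiles[id]
	delete(c.profiles, id)
	return ok
}

func main() {
	cache := newProfileCache()
	calls := 0
	compute := func() []float32 {
		calls++
		return []float32{0.1, 0.2, 0.3} // placeholder embedding
	}
	cache.Prepare("my-voice", compute)
	cache.Prepare("my-voice", compute) // served from the cache
	fmt.Println(calls, cache.List())
}
```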
ModelManager¶
type ModelManager interface {
	LoadModel(ctx context.Context) (*LoadModelResult, error)
	UnloadModel(ctx context.Context) (*UnloadModelResult, error)
	IsModelLoaded() bool
}
Control model lifecycle for memory management.
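One way a caller might use this interface is to load the model lazily, only when it is not already resident. The sketch below assumes simplified local stand-ins (`loadResult`, `modelManager`, `fakeManager`) rather than the real `tts` types, and `ensureLoaded` is a hypothetical helper, not part of the library.

```go
package main

import (
	"context"
	"fmt"
)

// loadResult and modelManager are simplified local stand-ins for the
// LoadModelResult type and ModelManager interface described above.
type loadResult struct{ LoadTimeMs int64 }

type modelManager interface {
	LoadModel(ctx context.Context) (*loadResult, error)
	IsModelLoaded() bool
}

// ensureLoaded loads the model only when it is not already resident.
// It reports whether this call triggered a load.
func ensureLoaded(ctx context.Context, m modelManager) (bool, error) {
	if m.IsModelLoaded() {
		return false, nil
	}
	if _, err := m.LoadModel(ctx); err != nil {
		return false, err
	}
	return true, nil
}

// fakeManager is a test double that counts how often LoadModel runs.
type fakeManager struct {
	loaded bool
	loads  int
}

func (f *fakeManager) LoadModel(ctx context.Context) (*loadResult, error) {
	f.loads++
	f.loaded = true
	return &loadResult{}, nil
}

func (f *fakeManager) IsModelLoaded() bool { return f.loaded }

// runDemo calls ensureLoaded twice against the fake and returns what happened.
func runDemo() (first, second bool, loads int) {
	ctx := context.Background()
	m := &fakeManager{}
	first, _ = ensureLoaded(ctx, m)
	second, _ = ensureLoaded(ctx, m)
	return first, second, m.loads
}

func main() {
	first, second, loads := runDemo()
	fmt.Println(first, second, loads) // prints: true false 1
}
```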
RuntimeChecker¶
Query runtime environment (device type, memory usage, framework version).
HealthChecker¶
Check provider health and model status.
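Since these capabilities are optional, callers discover them with type assertions, as in the usage examples earlier on this page. The self-contained sketch below shows the pattern with local stand-in interfaces; the method names `Name` and `Healthy` are assumptions made for illustration and may not match the real definitions.

```go
package main

import "fmt"

// Local stand-ins for the base provider and two optional capabilities.
// The method sets are assumptions, not the library's definitions.
type provider interface{ Name() string }
type modelManager interface{ IsModelLoaded() bool }
type healthChecker interface{ Healthy() bool }

// basicProvider implements only the base interface.
type basicProvider struct{}

func (basicProvider) Name() string { return "basic" }

// localProvider additionally implements modelManager.
type localProvider struct{ loaded bool }

func (localProvider) Name() string          { return "local" }
func (p localProvider) IsModelLoaded() bool { return p.loaded }

// capabilities reports which optional interfaces p satisfies,
// using one type assertion per capability.
func capabilities(p provider) []string {
	var caps []string
	if _, ok := p.(modelManager); ok {
		caps = append(caps, "ModelManager")
	}
	if _, ok := p.(healthChecker); ok {
		caps = append(caps, "HealthChecker")
	}
	return caps
}

func main() {
	fmt.Println(capabilities(basicProvider{}))
	fmt.Println(capabilities(localProvider{}))
}
```

Keeping each capability in its own small interface lets cloud and local providers share the base interface while local ones opt in to extras such as model management.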
Configuration¶
Environment Variables¶
| Variable | Description | Default |
|---|---|---|
| F5TTS_ENDPOINT | gRPC endpoint | unix:///tmp/omnivoice-f5tts.sock |
| F5TTS_TEST_REFERENCE_AUDIO | Path to reference audio for tests | - |
| F5TTS_TEST_REFERENCE_TEXT | Transcript for test reference | - |
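If you need the same fallback behavior in your own code, a sketch like the following reads `F5TTS_ENDPOINT` and falls back to the documented default. The `resolveEndpoint` helper is hypothetical, and treating an empty value as unset is an assumption rather than documented library behavior.

```go
package main

import (
	"fmt"
	"os"
)

// defaultEndpoint is the default listed in the table above.
const defaultEndpoint = "unix:///tmp/omnivoice-f5tts.sock"

// resolveEndpoint returns the value of F5TTS_ENDPOINT when it is set and
// non-empty, and the default otherwise. The lookup function is injected
// so the logic can be exercised without touching the real environment.
func resolveEndpoint(lookup func(string) (string, bool)) string {
	if v, ok := lookup("F5TTS_ENDPOINT"); ok && v != "" {
		return v
	}
	return defaultEndpoint
}

func main() {
	fmt.Println(resolveEndpoint(os.LookupEnv))
}
```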
Registry Options¶
// Custom endpoint
omnivoice.WithEndpoint("unix:///custom/path.sock")
// TCP endpoint (host:port; replace localhost with the server's hostname for a remote machine)
omnivoice.WithEndpoint("localhost:50051")
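The two endpoint forms above differ only in transport. As a rough sketch, an endpoint string could be mapped to the network and address arguments that `net.Dial` expects as follows. The `splitEndpoint` helper is hypothetical, handles only the two forms shown here, and is not necessarily how the OmniVoice client parses endpoints.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEndpoint maps an endpoint string onto the (network, address) pair
// used by net.Dial. A "unix://" prefix selects a Unix domain socket;
// anything else is treated as a TCP host:port.
func splitEndpoint(endpoint string) (network, address string) {
	const prefix = "unix://"
	if strings.HasPrefix(endpoint, prefix) {
		return "unix", strings.TrimPrefix(endpoint, prefix)
	}
	return "tcp", endpoint
}

func main() {
	for _, ep := range []string{
		"unix:///tmp/omnivoice-f5tts.sock",
		"localhost:50051",
	} {
		network, address := splitEndpoint(ep)
		fmt.Printf("%s -> network=%s address=%s\n", ep, network, address)
	}
}
```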
Performance¶
Latency (M1 Max, 32GB)¶
| Operation | First Run | Cached |
|---|---|---|
| Model Load | ~30s | ~5s |
| Synthesis (short) | ~2s | ~500ms |
| Synthesis (long) | ~5s | ~2s |
| Voice Cloning | ~3s | ~1s |
Memory Usage¶
| State | Memory |
|---|---|
| Server idle | ~200MB |
| Model loaded | ~2.2GB |
| During synthesis | ~3GB peak |
Troubleshooting¶
Server won't start¶
- Check Python architecture:
- Verify MLX is installed:
Model download fails¶
The model downloads from Hugging Face on first use. If it fails:
- Check network connectivity
- Try manual download:
gRPC connection refused¶
- Check socket exists:
  ls -la /tmp/omnivoice-f5tts.sock
- Restart server:
  pkill -f f5tts_server.py && ./run.sh
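The same check can be done from Go before dialing. The sketch below is a hypothetical diagnostic helper (`socketStatus` is not part of OmniVoice) that distinguishes a missing path, a path that is not a socket, a socket file with nothing accepting connections, and a live listener. The `listenTemp` function exists only to give the demo something to connect to.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net"
	"os"
	"path/filepath"
	"time"
)

// socketStatus classifies the state of a Unix domain socket path.
func socketStatus(path string) string {
	info, err := os.Stat(path)
	if errors.Is(err, fs.ErrNotExist) {
		return "missing"
	}
	if err != nil {
		return "error: " + err.Error()
	}
	if info.Mode()&fs.ModeSocket == 0 {
		return "not a socket"
	}
	// The file exists; check that something is actually accepting on it.
	conn, err := net.DialTimeout("unix", path, time.Second)
	if err != nil {
		return "stale"
	}
	conn.Close()
	return "listening"
}

// listenTemp opens a listener on a fresh socket in a temporary directory.
// It is demo scaffolding, standing in for a running server.
func listenTemp() (path string, cleanup func(), err error) {
	dir, err := os.MkdirTemp("", "sock-check")
	if err != nil {
		return "", nil, err
	}
	path = filepath.Join(dir, "demo.sock")
	ln, err := net.Listen("unix", path)
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	return path, func() { ln.Close(); os.RemoveAll(dir) }, nil
}

func main() {
	// Status of the default server socket on this machine.
	fmt.Println(socketStatus("/tmp/omnivoice-f5tts.sock"))

	// Status of a socket that definitely has a listener.
	path, cleanup, err := listenTemp()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer cleanup()
	fmt.Println(socketStatus(path))
}
```

A "stale" result typically means the socket file was left behind by a server that is no longer running, in which case restarting the server is the fix.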
Proto Definition¶
The local voice service is defined in proto/localvoice/v1/localvoice.proto:
service LocalVoice {
  rpc Synthesize(SynthesizeRequest) returns (stream AudioChunk);
  rpc SynthesizeWithReference(ReferenceSynthesizeRequest) returns (stream AudioChunk);
  rpc PrepareVoiceProfile(PrepareVoiceProfileRequest) returns (PrepareVoiceProfileResponse);
  rpc Health(HealthRequest) returns (HealthResponse);
  rpc LoadModel(LoadModelRequest) returns (LoadModelResponse);
  rpc UnloadModel(UnloadModelRequest) returns (UnloadModelResponse);
  rpc RuntimeInfo(RuntimeInfoRequest) returns (RuntimeInfoResponse);
}
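Both synthesis RPCs return a stream of `AudioChunk` messages, so a client has to drain the stream to obtain the full audio. The sketch below shows the usual receive loop for a gRPC server stream in Go, where `Recv` returns `io.EOF` once the server has finished. The `audioChunk` struct with a `Data` field, the `chunkStream` interface, and the `sliceStream` fake are assumptions standing in for the generated code, whose actual field names are not shown here.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// audioChunk is a hypothetical stand-in for the generated AudioChunk message.
type audioChunk struct{ Data []byte }

// chunkStream has the shape of a gRPC server-streaming client:
// Recv returns io.EOF once the server has sent every chunk.
type chunkStream interface {
	Recv() (*audioChunk, error)
}

// collectAudio drains the stream and concatenates the chunk payloads.
func collectAudio(s chunkStream) ([]byte, error) {
	var buf bytes.Buffer
	for {
		chunk, err := s.Recv()
		if errors.Is(err, io.EOF) {
			return buf.Bytes(), nil
		}
		if err != nil {
			return nil, err
		}
		buf.Write(chunk.Data)
	}
}

// sliceStream is a fake stream backed by a slice, used for the demo.
type sliceStream struct{ chunks [][]byte }

func (s *sliceStream) Recv() (*audioChunk, error) {
	if len(s.chunks) == 0 {
		return nil, io.EOF
	}
	c := s.chunks[0]
	s.chunks = s.chunks[1:]
	return &audioChunk{Data: c}, nil
}

func main() {
	stream := &sliceStream{chunks: [][]byte{[]byte("RI"), []byte("FF")}}
	audio, err := collectAudio(stream)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d bytes: %s\n", len(audio), audio)
}
```

Streaming lets a client start playback before synthesis finishes; buffering everything, as here, is the simpler choice when the audio is written to a file.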
See Also¶
- Voice Cloning Guide - General voice cloning concepts
- Provider Registry - How provider registration works
- Local Provider TRD - Technical design document