Local Voice Providers - Technical Requirements Document¶
This document describes the technical architecture for local TTS providers in OmniVoice.
Architecture Overview¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Go Application │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ omnivoice-core │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │providers/ │ │providers/ │ │ omni-elevenlabs │ │ │
│ │ │ f5tts │ │ qwen (TODO) │ │ (thick SDK client) │ │ │
│ │ │ (gRPC client)│ │ (gRPC client)│ │ │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ └──────────────────────┘ │ │
│ │ │ │ │ │
│ └─────────┼─────────────────┼──────────────────────────────────────┘ │
│ │ │ │
└────────────┼─────────────────┼──────────────────────────────────────────┘
│ │
│ gRPC/UDS │ gRPC/UDS
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ F5-TTS MLX Server │ │ Qwen3-TTS Server │
│ (Python + MLX) │ │ (Python + MLX) │
│ │ │ │
│ unix:///tmp/ │ │ unix:///tmp/ │
│ omnivoice-f5tts │ │ omnivoice-qwen3tts │
└─────────────────────┘ └─────────────────────┘
Communication Protocol: gRPC over Unix Domain Socket¶
Why gRPC over UDS¶
| Aspect | HTTP/JSON | gRPC/TCP | gRPC/UDS |
|---|---|---|---|
| Audio payload | Base64 (33% overhead) | Native bytes | Native bytes |
| Streaming | Chunked/SSE | Native | Native |
| Latency | ~5-10ms | ~2-5ms | ~1-2ms |
| Connection | Per-request | Persistent | Persistent |
| Contract | Informal | Protobuf | Protobuf |
Decision: gRPC over UDS provides lowest latency for local IPC while maintaining streaming support and strong typing.
Socket Paths¶
| Provider | Socket Path |
|---|---|
| F5-TTS MLX | unix:///tmp/omnivoice-f5tts.sock |
| Qwen3-TTS | unix:///tmp/omnivoice-qwen3tts.sock |
| Piper | unix:///tmp/omnivoice-piper.sock |
Protobuf Definition¶
Location: omnivoice-core/proto/localvoice/v1/localvoice.proto
syntax = "proto3";
package omnivoice.localvoice.v1;
option go_package = "github.com/plexusone/omnivoice-core/proto/localvoice/v1;localvoicev1";
// LocalVoice service for local TTS inference
service LocalVoice {
// Synthesize speech from text, streaming audio chunks
rpc Synthesize(SynthesizeRequest) returns (stream AudioChunk);
// Synthesize with a reference audio for voice cloning
rpc SynthesizeWithReference(ReferenceSynthesizeRequest) returns (stream AudioChunk);
// Prepare/cache a voice profile embedding for faster synthesis
rpc PrepareVoiceProfile(PrepareVoiceProfileRequest) returns (PrepareVoiceProfileResponse);
// Health check and model status
rpc Health(HealthRequest) returns (HealthResponse);
// Load model into memory
rpc LoadModel(LoadModelRequest) returns (LoadModelResponse);
// Unload model from memory
rpc UnloadModel(UnloadModelRequest) returns (UnloadModelResponse);
// Get runtime information (memory, device, etc.)
rpc RuntimeInfo(RuntimeInfoRequest) returns (RuntimeInfoResponse);
}
message SynthesizeRequest {
string text = 1;
string voice_id = 2; // Voice profile ID or "default"
AudioFormat format = 3;
optional float speed = 4; // Speech rate multiplier (default 1.0)
}
message ReferenceSynthesizeRequest {
string text = 1;
bytes reference_audio = 2; // WAV/PCM audio bytes
string reference_text = 3; // Transcript of reference audio
AudioFormat format = 4;
optional float speed = 5;
}
message AudioChunk {
bytes data = 1; // Raw audio bytes
bool is_final = 2; // True for last chunk
optional AudioMetadata metadata = 3; // Only on first chunk
}
message AudioMetadata {
AudioFormat format = 1;
int32 sample_rate = 2;
int32 channels = 3;
int32 bit_depth = 4;
}
enum AudioFormat {
AUDIO_FORMAT_UNSPECIFIED = 0;
AUDIO_FORMAT_WAV = 1;
AUDIO_FORMAT_PCM_S16LE = 2; // Raw PCM, signed 16-bit little-endian
AUDIO_FORMAT_PCM_F32LE = 3; // Raw PCM, float32 little-endian
AUDIO_FORMAT_MP3 = 4;
AUDIO_FORMAT_MULAW_8K = 5; // G.711 mu-law, 8kHz (telephony)
}
message PrepareVoiceProfileRequest {
string profile_id = 1;
bytes reference_audio = 2;
string reference_text = 3;
}
message PrepareVoiceProfileResponse {
string profile_id = 1;
bool cached = 2; // True if embedding was cached
int64 embedding_size_bytes = 3;
}
message HealthRequest {}
message HealthResponse {
bool healthy = 1;
bool model_loaded = 2;
string model_name = 3;
string model_version = 4;
}
message LoadModelRequest {
optional string model_path = 1; // Override default model path
}
message LoadModelResponse {
bool success = 1;
int64 load_time_ms = 2;
int64 memory_used_mb = 3;
}
message UnloadModelRequest {}
message UnloadModelResponse {
bool success = 1;
int64 memory_freed_mb = 2;
}
message RuntimeInfoRequest {}
message RuntimeInfoResponse {
string device_type = 1; // "mlx", "mps", "cpu"
int64 memory_used_mb = 2;
int64 memory_available_mb = 3;
string mlx_version = 4;
string python_version = 5;
}
Go Interface Extensions¶
Add capability interfaces to omnivoice-core for local provider features:
// omnivoice-core/tts/local.go
// StreamingSynthesizer supports streaming audio output
type StreamingSynthesizer interface {
SynthesizeStream(ctx context.Context, req Request) (<-chan AudioChunk, error)
}
// VoiceCloner supports voice cloning from reference audio
type VoiceCloner interface {
CloneVoice(ctx context.Context, req CloneVoiceRequest) (*VoiceProfile, error)
}
// ReferenceSynthesizer supports synthesis with reference audio
type ReferenceSynthesizer interface {
SynthesizeWithReference(ctx context.Context, req ReferenceSynthesizeRequest) (*Response, error)
}
// ProfileCacher supports pre-computing voice embeddings
type ProfileCacher interface {
PrepareVoiceProfile(ctx context.Context, req PrepareVoiceProfileRequest) (*PreparedProfile, error)
}
// ModelManager supports loading/unloading models
type ModelManager interface {
LoadModel(ctx context.Context) error
UnloadModel(ctx context.Context) error
IsModelLoaded() bool
}
// RuntimeChecker provides runtime information
type RuntimeChecker interface {
RuntimeInfo(ctx context.Context) (*RuntimeInfo, error)
}
// AudioChunk represents a chunk of streaming audio
type AudioChunk struct {
Data []byte
IsFinal bool
Metadata *AudioMetadata
}
// RuntimeInfo contains local runtime details
type RuntimeInfo struct {
DeviceType string // "mlx", "mps", "cpu"
MemoryUsedMB int64
MemoryAvailableMB int64
MLXVersion string
PythonVersion string
}
Provider Implementation: f5tts¶
Local providers are included in omnivoice-core/providers/ (thin clients without thick SDK dependencies). Thick providers using official SDKs are in separate omni-{provider} modules.
Provider Structure¶
omnivoice-core/providers/f5tts/
├── f5tts.go # TTSProvider implementation (gRPC client)
├── f5tts_test.go
├── README.md
└── server/
├── requirements.txt
├── f5tts_server.py # Python gRPC server
├── generate_proto.sh
└── run.sh # Server startup script
Go Client Implementation¶
// omnivoice-core/providers/f5tts/f5tts.go
package f5tts
import (
"context"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
omnivoice "github.com/plexusone/omnivoice-core"
"github.com/plexusone/omnivoice-core/tts"
pb "github.com/plexusone/omnivoice-core/proto/localvoice/v1"
)
const (
DefaultSocketPath = "unix:///tmp/omnivoice-f5tts.sock"
ProviderName = "f5tts"
)
func init() {
omnivoice.RegisterTTSProvider(ProviderName, NewProvider, omnivoice.PriorityThick)
}
type Provider struct {
conn *grpc.ClientConn
client pb.LocalVoiceClient
}
func NewProvider(cfg *omnivoice.ProviderConfig) (tts.Provider, error) {
endpoint := cfg.Endpoint
if endpoint == "" {
endpoint = DefaultSocketPath
}
conn, err := grpc.NewClient(endpoint,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)
if err != nil {
return nil, err
}
return &Provider{
conn: conn,
client: pb.NewLocalVoiceClient(conn),
}, nil
}
func (p *Provider) Name() string {
return ProviderName
}
func (p *Provider) Synthesize(ctx context.Context, req tts.Request) (*tts.Response, error) {
stream, err := p.client.Synthesize(ctx, &pb.SynthesizeRequest{
Text: req.Text,
VoiceId: req.Voice,
Format: toProtoFormat(req.Format),
})
if err != nil {
return nil, err
}
// Collect all chunks
var audio []byte
for {
chunk, err := stream.Recv()
if err != nil {
break
}
audio = append(audio, chunk.Data...)
if chunk.IsFinal {
break
}
}
return &tts.Response{
Audio: audio,
Format: req.Format,
SampleRate: 24000, // F5-TTS default
}, nil
}
// SynthesizeStream implements StreamingSynthesizer
func (p *Provider) SynthesizeStream(ctx context.Context, req tts.Request) (<-chan tts.AudioChunk, error) {
stream, err := p.client.Synthesize(ctx, &pb.SynthesizeRequest{
Text: req.Text,
VoiceId: req.Voice,
Format: toProtoFormat(req.Format),
})
if err != nil {
return nil, err
}
ch := make(chan tts.AudioChunk)
go func() {
defer close(ch)
for {
chunk, err := stream.Recv()
if err != nil {
return
}
ch <- tts.AudioChunk{
Data: chunk.Data,
IsFinal: chunk.IsFinal,
}
if chunk.IsFinal {
return
}
}
}()
return ch, nil
}
func (p *Provider) Close() error {
return p.conn.Close()
}
Python gRPC Server¶
# omni-f5tts/server/f5tts_server.py
import grpc
from concurrent import futures
import logging
import mlx
from f5_tts_mlx import F5TTS
from omnivoice.localvoice.v1 import localvoice_pb2 as pb
from omnivoice.localvoice.v1 import localvoice_pb2_grpc as pb_grpc
class LocalVoiceServicer(pb_grpc.LocalVoiceServicer):
def __init__(self):
self.model = None
self.model_loaded = False
def LoadModel(self, request, context):
try:
self.model = F5TTS()
self.model_loaded = True
return pb.LoadModelResponse(
success=True,
load_time_ms=0, # TODO: measure
memory_used_mb=0, # TODO: measure
)
except Exception as e:
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(e))
return pb.LoadModelResponse(success=False)
def Synthesize(self, request, context):
if not self.model_loaded:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("Model not loaded")
return
# Generate audio
audio = self.model.generate(
text=request.text,
# voice_id handling...
)
# Stream chunks
chunk_size = 4096
for i in range(0, len(audio), chunk_size):
chunk = audio[i:i + chunk_size]
is_final = i + chunk_size >= len(audio)
yield pb.AudioChunk(
data=chunk,
is_final=is_final,
)
def Health(self, request, context):
return pb.HealthResponse(
healthy=True,
model_loaded=self.model_loaded,
model_name="f5-tts-mlx",
model_version="1.0.0",
)
def RuntimeInfo(self, request, context):
return pb.RuntimeInfoResponse(
device_type="mlx",
memory_used_mb=0, # TODO
memory_available_mb=0, # TODO
mlx_version=mlx.__version__,
)
def serve(socket_path: str = "/tmp/omnivoice-f5tts.sock"):
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
pb_grpc.add_LocalVoiceServicer_to_server(LocalVoiceServicer(), server)
server.add_insecure_port(f"unix://{socket_path}")
server.start()
logging.info(f"F5-TTS gRPC server listening on {socket_path}")
server.wait_for_termination()
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
serve()
Voice Profile Storage¶
Directory Structure¶
~/.omnivoice/
├── config.yaml # Global config
└── voice-profiles/
└── {profile-id}/
├── manifest.json # Profile metadata
├── reference.wav # Canonical reference audio
├── reference.txt # Reference transcript
└── embeddings/
├── f5tts-mlx/
│ └── embedding.bin # Cached F5-TTS embedding
└── qwen3-tts/
└── embedding.bin # Cached Qwen3-TTS embedding
Manifest Schema¶
{
"id": "john",
"name": "John's Voice",
"language": "en",
"created_at": "2024-01-15T10:30:00Z",
"reference": {
"audio_file": "reference.wav",
"transcript_file": "reference.txt",
"duration_seconds": 8.5,
"sample_rate": 24000
},
"embeddings": {
"f5tts-mlx": {
"file": "embeddings/f5tts-mlx/embedding.bin",
"created_at": "2024-01-15T10:31:00Z",
"model_version": "1.0.0"
}
}
}
Error Handling¶
gRPC Status Codes¶
| Scenario | gRPC Code | Description |
|---|---|---|
| Model not loaded | FAILED_PRECONDITION |
Call LoadModel first |
| Invalid voice profile | NOT_FOUND |
Profile doesn't exist |
| Out of memory | RESOURCE_EXHAUSTED |
Not enough GPU memory |
| Synthesis failed | INTERNAL |
Model inference error |
| Server unavailable | UNAVAILABLE |
Server not running |
Go Error Types¶
// omnivoice-core/tts/errors.go
var (
ErrModelNotLoaded = errors.New("model not loaded")
ErrProfileNotFound = errors.New("voice profile not found")
ErrResourceExhausted = errors.New("insufficient memory")
ErrServerUnavailable = errors.New("local server unavailable")
)
Testing Strategy¶
Unit Tests (Go)¶
- Mock gRPC client for provider tests
- Test streaming audio collection
- Test error handling and retries
Integration Tests¶
- Start Python server in subprocess
- Run synthesis through full stack
- Verify audio output format
Benchmark Tests¶
- Measure time to first byte
- Measure total synthesis latency
- Measure gRPC overhead vs HTTP baseline
Security Considerations¶
- UDS permissions - Socket files should be user-readable only (0600)
- No network exposure - UDS is local-only by design
- Model integrity - Verify model checksums on load
- Profile isolation - Voice profiles stored in user home directory