Transcript Format
OmniVoice provides a canonical JSON transcript format for STT (speech-to-text) output. This standardized format enables consistent handling of transcription results across different providers and applications.
Overview
The Transcript format captures:
- Full transcription text
- Segment-level timing (sentences, phrases, utterances)
- Word-level timing (when enabled)
- Speaker diarization (when enabled)
- Confidence scores
- Provider metadata and options used
All duration fields use DurationMilliseconds, which serializes as integer milliseconds in JSON while retaining full time.Duration functionality in Go.
Quick Start
import "github.com/plexusone/omnivoice-core/stt"
// Convert a transcription result to canonical format
transcript := stt.NewTranscript(result, "deepgram", "nova-2", "audio.mp3", config)
// Save to JSON file
err := transcript.SaveJSON("output.transcript.json")
// Load from JSON file
loaded, err := stt.LoadTranscript("output.transcript.json")
// Access timing information
fmt.Printf("Total duration: %v\n", transcript.TotalDuration())
for _, seg := range transcript.Segments {
    fmt.Printf("[%v - %v] %s\n",
        seg.Start.Duration(),
        seg.End.Duration(),
        seg.Text)
}
JSON Format
{
  "$schema": "https://omnivoice.dev/schema/transcript-v1.json",
  "version": "1.0",
  "text": "Hello world. How are you today?",
  "language": "en-US",
  "language_confidence": 0.95,
  "duration_ms": 5000,
  "segments": [
    {
      "text": "Hello world.",
      "start_ms": 0,
      "end_ms": 1500,
      "speaker": "speaker_1",
      "confidence": 0.98,
      "words": [
        {
          "text": "Hello",
          "start_ms": 0,
          "end_ms": 600,
          "confidence": 0.99
        },
        {
          "text": "world.",
          "start_ms": 700,
          "end_ms": 1500,
          "confidence": 0.97
        }
      ]
    },
    {
      "text": "How are you today?",
      "start_ms": 2000,
      "end_ms": 4500,
      "speaker": "speaker_2",
      "confidence": 0.96
    }
  ],
  "metadata": {
    "provider": "deepgram",
    "model": "nova-2",
    "created_at": "2026-05-02T12:00:00Z",
    "audio_file": "conversation.mp3",
    "options": {
      "language": "en-US",
      "enable_punctuation": true,
      "enable_word_timestamps": true,
      "enable_speaker_diarization": true
    }
  }
}
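Because the format is plain JSON, a consumer that does not import the stt package can still read it with the standard library. The following is a minimal sketch under that assumption: the struct and function names (`transcript`, `segment`, `word`, `parseTranscript`, `uncoveredMillis`) are hypothetical and not part of OmniVoice; only the JSON keys are taken from the example above. The `_ms` fields are decoded as plain `int64` millisecond counts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical mirror types: only the JSON keys come from the format above.
type word struct {
	Text       string  `json:"text"`
	StartMS    int64   `json:"start_ms"`
	EndMS      int64   `json:"end_ms"`
	Confidence float64 `json:"confidence"`
}

type segment struct {
	Text    string `json:"text"`
	StartMS int64  `json:"start_ms"`
	EndMS   int64  `json:"end_ms"`
	Speaker string `json:"speaker"`
	Words   []word `json:"words"`
}

type transcript struct {
	Version    string    `json:"version"`
	Text       string    `json:"text"`
	DurationMS int64     `json:"duration_ms"`
	Segments   []segment `json:"segments"`
}

// sample is a trimmed copy of the example document above.
const sample = `{
  "version": "1.0",
  "text": "Hello world. How are you today?",
  "duration_ms": 5000,
  "segments": [
    {"text": "Hello world.", "start_ms": 0, "end_ms": 1500, "speaker": "speaker_1",
     "words": [
       {"text": "Hello", "start_ms": 0, "end_ms": 600, "confidence": 0.99},
       {"text": "world.", "start_ms": 700, "end_ms": 1500, "confidence": 0.97}
     ]},
    {"text": "How are you today?", "start_ms": 2000, "end_ms": 4500, "speaker": "speaker_2"}
  ]
}`

// parseTranscript decodes a transcript document; unknown keys are ignored.
func parseTranscript(data []byte) (transcript, error) {
	var tr transcript
	err := json.Unmarshal(data, &tr)
	return tr, err
}

// uncoveredMillis returns the audio time that no segment covers,
// assuming segments do not overlap.
func uncoveredMillis(tr transcript) int64 {
	covered := int64(0)
	for _, s := range tr.Segments {
		covered += s.EndMS - s.StartMS
	}
	return tr.DurationMS - covered
}

func main() {
	tr, err := parseTranscript([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(tr.Segments), uncoveredMillis(tr))
}
```

Go code that already depends on omnivoice-core would normally use stt.LoadTranscript instead; this sketch only illustrates how the keys map onto a consumer's own types.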
Types
Transcript
The root type containing the full transcription.
| Field | Type | JSON Key | Description |
|---|---|---|---|
| `Schema` | `string` | `$schema` | JSON Schema URL for validation |
| `Version` | `string` | `version` | Format version (currently "1.0") |
| `Text` | `string` | `text` | Complete transcription text |
| `Language` | `string` | `language` | BCP-47 language code (e.g., "en-US") |
| `LanguageConfidence` | `float64` | `language_confidence` | Language detection confidence (0.0-1.0) |
| `Duration` | `DurationMilliseconds` | `duration_ms` | Total audio duration |
| `Segments` | `[]TranscriptSegment` | `segments` | Transcript segments |
| `Metadata` | `TranscriptMetadata` | `metadata` | Provenance information |
TranscriptSegment
A segment of the transcript (sentence, phrase, or utterance).
| Field | Type | JSON Key | Description |
|---|---|---|---|
| `Text` | `string` | `text` | Segment text |
| `Start` | `DurationMilliseconds` | `start_ms` | Start time |
| `End` | `DurationMilliseconds` | `end_ms` | End time |
| `Speaker` | `string` | `speaker` | Speaker identifier |
| `Confidence` | `float64` | `confidence` | Average confidence (0.0-1.0) |
| `Language` | `string` | `language` | Segment language (if different) |
| `Words` | `[]TranscriptWord` | `words` | Word-level details |
TranscriptWord
A single word with timing information.
| Field | Type | JSON Key | Description |
|---|---|---|---|
| `Text` | `string` | `text` | The word |
| `Start` | `DurationMilliseconds` | `start_ms` | Start time |
| `End` | `DurationMilliseconds` | `end_ms` | End time |
| `Speaker` | `string` | `speaker` | Speaker identifier |
| `Confidence` | `float64` | `confidence` | Recognition confidence (0.0-1.0) |
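Segments and words both carry start and end times, so a consumer may want to sanity-check them before relying on the timing. The sketch below assumes two invariants that this page does not state explicitly: an end time is never earlier than its start time, and each word lies within its enclosing segment. The types `wordSpan` and `segmentSpan` and the function `checkTiming` are hypothetical and hold plain millisecond values.

```go
package main

import "fmt"

// wordSpan and segmentSpan are hypothetical, timing-only views of
// TranscriptWord and TranscriptSegment, in milliseconds.
type wordSpan struct{ StartMS, EndMS int64 }

type segmentSpan struct {
	StartMS, EndMS int64
	Words          []wordSpan
}

// checkTiming reports every violation of the assumed invariants:
// end >= start, and words contained in their segment.
func checkTiming(segs []segmentSpan) []string {
	var problems []string
	for i, s := range segs {
		if s.EndMS < s.StartMS {
			problems = append(problems,
				fmt.Sprintf("segment %d ends before it starts", i))
		}
		for j, w := range s.Words {
			if w.EndMS < w.StartMS {
				problems = append(problems,
					fmt.Sprintf("segment %d word %d ends before it starts", i, j))
			}
			if w.StartMS < s.StartMS || w.EndMS > s.EndMS {
				problems = append(problems,
					fmt.Sprintf("segment %d word %d lies outside its segment", i, j))
			}
		}
	}
	return problems
}

func main() {
	segs := []segmentSpan{
		{StartMS: 0, EndMS: 1500, Words: []wordSpan{{0, 600}, {700, 1500}}},
		{StartMS: 2000, EndMS: 4500},
	}
	fmt.Println(len(checkTiming(segs)), "problems found")
}
```

Providers differ in how strictly they align word and segment boundaries, so a real check might allow a small tolerance rather than exact containment.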
TranscriptMetadata
Provenance information about how the transcript was generated.
| Field | Type | JSON Key | Description |
|---|---|---|---|
| `Provider` | `string` | `provider` | STT provider (e.g., "deepgram", "openai") |
| `Model` | `string` | `model` | Provider-specific model |
| `CreatedAt` | `string` | `created_at` | ISO 8601 timestamp |
| `AudioFile` | `string` | `audio_file` | Original audio file path/URL |
| `Options` | `*TranscriptOptions` | `options` | Transcription options used |
TranscriptOptions
Records the options used for transcription.
| Field | Type | JSON Key | Description |
|---|---|---|---|
| `Language` | `string` | `language` | Requested language |
| `EnablePunctuation` | `bool` | `enable_punctuation` | Punctuation enabled |
| `EnableWordTimestamps` | `bool` | `enable_word_timestamps` | Word timestamps enabled |
| `EnableSpeakerDiarization` | `bool` | `enable_speaker_diarization` | Speaker diarization enabled |
DurationMilliseconds
Duration fields use duration.DurationMilliseconds from github.com/grokify/mogo/time/duration. This type:
- Wraps time.Duration for full Go duration functionality
- Serializes as integer milliseconds in JSON (not nanoseconds)
- Provides type safety to prevent mixing with raw integers
import "github.com/grokify/mogo/time/duration"
// Create from time.Duration
d := duration.FromDuration(5 * time.Second)
// Create from milliseconds (plain assignment: d is already declared above)
d = duration.FromMilliseconds(5000)
// Access as time.Duration
td := d.Duration()
td.Seconds() // 5.0
// Get milliseconds
d.Milliseconds() // 5000
// JSON marshaling
data, _ := json.Marshal(d) // data holds the bytes 5000
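To illustrate how a type can wrap time.Duration yet serialize as integer milliseconds, here is a minimal standard-library sketch. It is not the mogo implementation: `MillisDuration` and `roundTrip` are hypothetical names, and the real duration.DurationMilliseconds may differ in details such as rounding or null handling.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// MillisDuration is a hypothetical stand-in for duration.DurationMilliseconds.
// It stores nanoseconds like time.Duration but marshals as milliseconds.
type MillisDuration time.Duration

// Duration exposes the value as a regular time.Duration.
func (d MillisDuration) Duration() time.Duration { return time.Duration(d) }

// MarshalJSON writes the duration as an integer number of milliseconds.
func (d MillisDuration) MarshalJSON() ([]byte, error) {
	return json.Marshal(time.Duration(d).Milliseconds())
}

// UnmarshalJSON reads an integer number of milliseconds.
func (d *MillisDuration) UnmarshalJSON(data []byte) error {
	var ms int64
	if err := json.Unmarshal(data, &ms); err != nil {
		return err
	}
	*d = MillisDuration(time.Duration(ms) * time.Millisecond)
	return nil
}

// roundTrip encodes d to JSON and decodes the result again.
func roundTrip(d MillisDuration) (string, MillisDuration, error) {
	data, err := json.Marshal(d)
	if err != nil {
		return "", 0, err
	}
	var back MillisDuration
	if err := json.Unmarshal(data, &back); err != nil {
		return "", 0, err
	}
	return string(data), back, nil
}

func main() {
	encoded, decoded, err := roundTrip(MillisDuration(5 * time.Second))
	if err != nil {
		panic(err)
	}
	fmt.Println(encoded, decoded.Duration())
}
```

Declaring a distinct named type is what gives the type safety mentioned above: a raw integer variable cannot be assigned to a MillisDuration field without an explicit conversion.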
Methods
Transcript Methods
// TotalDuration returns the total duration as time.Duration
func (t *Transcript) TotalDuration() time.Duration
// ToJSON serializes the transcript to indented JSON
func (t *Transcript) ToJSON() ([]byte, error)
// SaveJSON writes the transcript to a JSON file
func (t *Transcript) SaveJSON(filePath string) error
TranscriptSegment Methods
// SegmentDuration returns the segment duration as time.Duration
func (s *TranscriptSegment) SegmentDuration() time.Duration
TranscriptWord Methods
// WordDuration returns the word duration as time.Duration
func (w *TranscriptWord) WordDuration() time.Duration
Functions
// NewTranscript creates a Transcript from a TranscriptionResult
func NewTranscript(
    result *TranscriptionResult,
    provider, model, audioFile string,
    config *TranscriptionConfig,
) *Transcript
// LoadTranscript reads a transcript from a JSON file
func LoadTranscript(filePath string) (*Transcript, error)
Schema Validation
The schema package provides an embedded JSON Schema for validation:
import "github.com/plexusone/omnivoice-core/schema"
// Get the embedded schema
schemaJSON := schema.TranscriptV1Schema
// Use with any JSON Schema validator library
// Example with github.com/santhosh-tekuri/jsonschema:
compiler := jsonschema.NewCompiler()
if err := compiler.AddResource("transcript.json", strings.NewReader(schemaJSON)); err != nil {
    log.Fatal(err)
}
sch, err := compiler.Compile("transcript.json")
if err != nil {
    log.Fatal(err)
}
// Validate a transcript
transcriptData, err := transcript.ToJSON()
if err != nil {
    log.Fatal(err)
}
// Depending on the jsonschema release, Validate may expect a decoded
// value rather than an io.Reader; check the version you import.
if err := sch.Validate(bytes.NewReader(transcriptData)); err != nil {
    log.Printf("Validation failed: %v", err)
}
Constants
// TranscriptFormatVersion is the current version of the format
const TranscriptFormatVersion = "1.0"
// TranscriptSchemaURL is the JSON Schema URL
const TranscriptSchemaURL = "https://omnivoice.dev/schema/transcript-v1.json"
Use Cases
Converting Provider Results
// After transcribing with any STT provider
result, err := provider.Transcribe(ctx, audioData, config)
if err != nil {
    return err
}
// Convert to canonical format
transcript := stt.NewTranscript(result, provider.Name(), "whisper-1", "recording.wav", config)
// Save for later analysis
return transcript.SaveJSON("transcripts/recording.transcript.json")
Building Subtitles
import "github.com/plexusone/omnivoice-core/subtitle"
// Load existing transcript
transcript, err := stt.LoadTranscript("recording.transcript.json")
if err != nil {
    return err
}
// Generate SRT subtitles from segments
// (subtitle package works with TranscriptionResult,
// but you can convert Transcript segments back)
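The snippet above stops short of producing subtitle text. As a rough illustration of what generating SRT from segment timing involves, here is a standalone sketch that formats cues by hand. It does not use the subtitle package, whose API is not shown on this page; `cue`, `srtTimestamp`, and `buildSRT` are hypothetical names, and each cue stands in for a segment's start time, end time, and text.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// cue is a hypothetical stand-in for the fields of a transcript segment
// that a subtitle entry needs.
type cue struct {
	Start, End time.Duration
	Text       string
}

// srtTimestamp formats a duration as HH:MM:SS,mmm, the SRT time format.
func srtTimestamp(d time.Duration) string {
	ms := d.Milliseconds()
	return fmt.Sprintf("%02d:%02d:%02d,%03d",
		ms/3600000, ms/60000%60, ms/1000%60, ms%1000)
}

// buildSRT renders cues as numbered SRT entries separated by blank lines.
func buildSRT(cues []cue) string {
	var b strings.Builder
	for i, c := range cues {
		fmt.Fprintf(&b, "%d\n%s --> %s\n%s\n\n",
			i+1, srtTimestamp(c.Start), srtTimestamp(c.End), c.Text)
	}
	return b.String()
}

func main() {
	cues := []cue{
		{Start: 0, End: 1500 * time.Millisecond, Text: "Hello world."},
		{Start: 2 * time.Second, End: 4500 * time.Millisecond, Text: "How are you today?"},
	}
	fmt.Print(buildSRT(cues))
}
```

With a loaded transcript, each cue would presumably be filled from seg.Start.Duration(), seg.End.Duration(), and seg.Text, the same accessors used in the Quick Start loop.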
Analyzing Speaker Turns
transcript, err := stt.LoadTranscript("meeting.transcript.json")
if err != nil {
    return err
}
// Sum speaking time per speaker
speakers := make(map[string]time.Duration)
for _, seg := range transcript.Segments {
    speakers[seg.Speaker] += seg.SegmentDuration()
}
// Map iteration order is unspecified, so the output order varies between runs
for speaker, duration := range speakers {
    fmt.Printf("%s spoke for %v\n", speaker, duration)
}
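Go does not define map iteration order, so the loop above can print speakers in a different order on each run. For a stable report, one option is to sort the totals. The sketch below is self-contained and uses hypothetical types (`turn`, `talkTime`) in place of transcript segments; grouping segments with an empty speaker under "unknown" is an assumption made here, not documented behavior.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// turn is a hypothetical stand-in for a transcript segment's speaker and timing.
type turn struct {
	Speaker    string
	Start, End time.Duration
}

// talkTime pairs a speaker with their total speaking time.
type talkTime struct {
	Speaker string
	Total   time.Duration
}

// rankSpeakers sums speaking time per speaker and returns the totals
// sorted by longest time first, breaking ties by speaker name.
func rankSpeakers(turns []turn) []talkTime {
	totals := make(map[string]time.Duration)
	for _, t := range turns {
		name := t.Speaker
		if name == "" {
			name = "unknown" // assumption: group undiarized segments together
		}
		totals[name] += t.End - t.Start
	}
	ranked := make([]talkTime, 0, len(totals))
	for name, total := range totals {
		ranked = append(ranked, talkTime{Speaker: name, Total: total})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Total != ranked[j].Total {
			return ranked[i].Total > ranked[j].Total
		}
		return ranked[i].Speaker < ranked[j].Speaker
	})
	return ranked
}

func main() {
	turns := []turn{
		{Speaker: "speaker_1", Start: 0, End: 1500 * time.Millisecond},
		{Speaker: "speaker_2", Start: 2 * time.Second, End: 4500 * time.Millisecond},
	}
	for _, r := range rankSpeakers(turns) {
		fmt.Printf("%s spoke for %v\n", r.Speaker, r.Total)
	}
}
```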
Cross-Application Interop
The canonical JSON format enables sharing transcripts between applications:
// Application A: Generate transcript
transcript := stt.NewTranscript(result, "deepgram", "nova-2", "audio.mp3", config)
if err := transcript.SaveJSON("shared/transcript.json"); err != nil {
    return err
}
// Application B: Load and process
loaded, err := stt.LoadTranscript("shared/transcript.json")
if err != nil {
    return err
}
fmt.Printf("Transcribed by %s using %s\n",
    loaded.Metadata.Provider,
    loaded.Metadata.Model)