Skip to content

Transcript Format

OmniVoice provides a canonical JSON transcript format for STT (speech-to-text) output. This standardized format enables consistent handling of transcription results across different providers and applications.

Overview

The Transcript format captures:

  • Full transcription text
  • Segment-level timing (sentences, phrases, utterances)
  • Word-level timing (when enabled)
  • Speaker diarization (when enabled)
  • Confidence scores
  • Provider metadata and options used

All duration fields use DurationMilliseconds which serializes as integer milliseconds in JSON while providing full time.Duration functionality in Go.

Quick Start

import "github.com/plexusone/omnivoice-core/stt"

// Convert a transcription result to canonical format
transcript := stt.NewTranscript(result, "deepgram", "nova-2", "audio.mp3", config)

// Save to JSON file
err := transcript.SaveJSON("output.transcript.json")

// Load from JSON file
loaded, err := stt.LoadTranscript("output.transcript.json")

// Access timing information
fmt.Printf("Total duration: %v\n", transcript.TotalDuration())
for _, seg := range transcript.Segments {
    fmt.Printf("[%v - %v] %s\n",
        seg.Start.Duration(),
        seg.End.Duration(),
        seg.Text)
}

JSON Format

{
  "$schema": "https://omnivoice.dev/schema/transcript-v1.json",
  "version": "1.0",
  "text": "Hello world. How are you today?",
  "language": "en-US",
  "language_confidence": 0.95,
  "duration_ms": 5000,
  "segments": [
    {
      "text": "Hello world.",
      "start_ms": 0,
      "end_ms": 1500,
      "speaker": "speaker_1",
      "confidence": 0.98,
      "words": [
        {
          "text": "Hello",
          "start_ms": 0,
          "end_ms": 600,
          "confidence": 0.99
        },
        {
          "text": "world.",
          "start_ms": 700,
          "end_ms": 1500,
          "confidence": 0.97
        }
      ]
    },
    {
      "text": "How are you today?",
      "start_ms": 2000,
      "end_ms": 4500,
      "speaker": "speaker_2",
      "confidence": 0.96
    }
  ],
  "metadata": {
    "provider": "deepgram",
    "model": "nova-2",
    "created_at": "2026-05-02T12:00:00Z",
    "audio_file": "conversation.mp3",
    "options": {
      "language": "en-US",
      "enable_punctuation": true,
      "enable_word_timestamps": true,
      "enable_speaker_diarization": true
    }
  }
}

Types

Transcript

The root type containing the full transcription.

Field Type JSON Key Description
Schema string $schema JSON Schema URL for validation
Version string version Format version (currently "1.0")
Text string text Complete transcription text
Language string language BCP-47 language code (e.g., "en-US")
LanguageConfidence float64 language_confidence Language detection confidence (0.0-1.0)
Duration DurationMilliseconds duration_ms Total audio duration
Segments []TranscriptSegment segments Transcript segments
Metadata TranscriptMetadata metadata Provenance information

TranscriptSegment

A segment of the transcript (sentence, phrase, or utterance).

Field Type JSON Key Description
Text string text Segment text
Start DurationMilliseconds start_ms Start time
End DurationMilliseconds end_ms End time
Speaker string speaker Speaker identifier
Confidence float64 confidence Average confidence (0.0-1.0)
Language string language Segment language (if different)
Words []TranscriptWord words Word-level details

TranscriptWord

A single word with timing information.

Field Type JSON Key Description
Text string text The word
Start DurationMilliseconds start_ms Start time
End DurationMilliseconds end_ms End time
Speaker string speaker Speaker identifier
Confidence float64 confidence Recognition confidence (0.0-1.0)

TranscriptMetadata

Provenance information about how the transcript was generated.

Field Type JSON Key Description
Provider string provider STT provider (e.g., "deepgram", "openai")
Model string model Provider-specific model
CreatedAt string created_at ISO 8601 timestamp
AudioFile string audio_file Original audio file path/URL
Options *TranscriptOptions options Transcription options used

TranscriptOptions

Records the options used for transcription.

Field Type JSON Key Description
Language string language Requested language
EnablePunctuation bool enable_punctuation Punctuation enabled
EnableWordTimestamps bool enable_word_timestamps Word timestamps enabled
EnableSpeakerDiarization bool enable_speaker_diarization Speaker diarization enabled

DurationMilliseconds

Duration fields use duration.DurationMilliseconds from github.com/grokify/mogo/time/duration. This type:

  • Wraps time.Duration for full Go duration functionality
  • Serializes as integer milliseconds in JSON (not nanoseconds)
  • Provides type safety to prevent mixing with raw integers
import "github.com/grokify/mogo/time/duration"

// Create from time.Duration
d := duration.FromDuration(5 * time.Second)

// Create from milliseconds
d := duration.FromMilliseconds(5000)

// Access as time.Duration
td := d.Duration()
td.Seconds() // 5.0

// Get milliseconds
d.Milliseconds() // 5000

// JSON marshaling
data, _ := json.Marshal(d) // "5000"

Methods

Transcript Methods

// TotalDuration returns the total duration as time.Duration
func (t *Transcript) TotalDuration() time.Duration

// ToJSON serializes the transcript to indented JSON
func (t *Transcript) ToJSON() ([]byte, error)

// SaveJSON writes the transcript to a JSON file
func (t *Transcript) SaveJSON(filePath string) error

TranscriptSegment Methods

// SegmentDuration returns the segment duration as time.Duration
func (s *TranscriptSegment) SegmentDuration() time.Duration

TranscriptWord Methods

// WordDuration returns the word duration as time.Duration
func (w *TranscriptWord) WordDuration() time.Duration

Functions

// NewTranscript creates a Transcript from a TranscriptionResult
func NewTranscript(
    result *TranscriptionResult,
    provider, model, audioFile string,
    config *TranscriptionConfig,
) *Transcript

// LoadTranscript reads a transcript from a JSON file
func LoadTranscript(filePath string) (*Transcript, error)

Schema Validation

The schema package provides an embedded JSON Schema for validation:

import "github.com/plexusone/omnivoice-core/schema"

// Get the embedded schema
schemaJSON := schema.TranscriptV1Schema

// Use with any JSON Schema validator library
// Example with github.com/santhosh-tekuri/jsonschema:
compiler := jsonschema.NewCompiler()
if err := compiler.AddResource("transcript.json", strings.NewReader(schemaJSON)); err != nil {
    log.Fatal(err)
}
sch, err := compiler.Compile("transcript.json")
if err != nil {
    log.Fatal(err)
}

// Validate a transcript
transcriptData, _ := transcript.ToJSON()
if err := sch.Validate(bytes.NewReader(transcriptData)); err != nil {
    log.Printf("Validation failed: %v", err)
}

Constants

// TranscriptFormatVersion is the current version of the format
const TranscriptFormatVersion = "1.0"

// TranscriptSchemaURL is the JSON Schema URL
const TranscriptSchemaURL = "https://omnivoice.dev/schema/transcript-v1.json"

Use Cases

Converting Provider Results

// After transcribing with any STT provider
result, err := provider.Transcribe(ctx, audioData, config)
if err != nil {
    return err
}

// Convert to canonical format
transcript := stt.NewTranscript(result, provider.Name(), "whisper-1", "recording.wav", config)

// Save for later analysis
return transcript.SaveJSON("transcripts/recording.transcript.json")

Building Subtitles

import "github.com/plexusone/omnivoice-core/subtitle"

// Load existing transcript
transcript, err := stt.LoadTranscript("recording.transcript.json")
if err != nil {
    return err
}

// Generate SRT subtitles from segments
// (subtitle package works with TranscriptionResult,
// but you can convert Transcript segments back)

Analyzing Speaker Turns

transcript, _ := stt.LoadTranscript("meeting.transcript.json")

speakers := make(map[string]time.Duration)
for _, seg := range transcript.Segments {
    speakers[seg.Speaker] += seg.SegmentDuration()
}

for speaker, duration := range speakers {
    fmt.Printf("%s spoke for %v\n", speaker, duration)
}

Cross-Application Interop

The canonical JSON format enables sharing transcripts between applications:

// Application A: Generate transcript
transcript := stt.NewTranscript(result, "deepgram", "nova-2", "audio.mp3", config)
transcript.SaveJSON("shared/transcript.json")

// Application B: Load and process
loaded, _ := stt.LoadTranscript("shared/transcript.json")
fmt.Printf("Transcribed by %s using %s\n",
    loaded.Metadata.Provider,
    loaded.Metadata.Model)