Dudoxx Omni — Wire Envelope v1.0.0 (Frozen)
Self-contained reference for the frozen Deepgram-shaped envelope that flows over WS / SSE between Dudoxx STT, TTS, Usage, LLM agents, and clients.
Status: frozen v1.0.0 (2026-04-24). Breaking changes require a v2 WS path (/v2/listen, /v2/speak).
Spec source: ddx-prd-specs/envelopes/.
Generated bindings: Pydantic v2 (ddx_mlx_envelopes) and TypeScript .d.ts + Zod under ddx-prd-specs/envelopes/dist/typescript/.
Schemas: JSON Schema 2020-12 — asr-frame, tts-frame, control-messages, usage-event.
Conformance: 17 fixtures, 34 tests green.
Why a frozen envelope
- Vendor neutrality — clients written against Deepgram-shaped frames work against MLX, CUDA, and any future backend.
- Adapter parity — LiveKit + Mastra-AI + TS/Py clients reuse the same shape.
- Safe evolution — additive fields only at v1; breaking changes go to v2 paths.
ASR frame (/v1/listen, /v1/listen/dg)
Sequence: Metadata (open) → N × Results → SpeechStarted → N × Results → UtteranceEnd → Metadata (final).
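The ordering above can be checked on a recorded session with a few lines of Python. This is a minimal sketch, not part of the spec: the helper name `check_asr_sequence` is hypothetical, and it assumes a loose reading of the sequence in which the session opens and closes with Metadata and only Results, SpeechStarted, and UtteranceEnd appear in between.

```python
from typing import Iterable

# Assumption: only these frame types may appear between the two Metadata frames.
MID_STREAM_TYPES = {"Results", "SpeechStarted", "UtteranceEnd"}


def check_asr_sequence(frame_types: Iterable[str]) -> list:
    """Return a list of problems found in an ordered list of ASR frame types."""
    types = list(frame_types)
    if len(types) < 2:
        return ["session needs an opening and a final Metadata frame"]
    problems = []
    if types[0] != "Metadata":
        problems.append(f"first frame is {types[0]!r}, expected 'Metadata'")
    if types[-1] != "Metadata":
        problems.append(f"last frame is {types[-1]!r}, expected 'Metadata'")
    # Everything strictly between the first and last frame must be mid-stream.
    for index, frame_type in enumerate(types[1:-1], start=1):
        if frame_type not in MID_STREAM_TYPES:
            problems.append(f"unexpected {frame_type!r} at position {index}")
    return problems
```

An empty list means the recorded order is compatible with the sequence under that reading.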
Metadata (open)
{
"type": "Metadata",
"transaction_key": "deprecated",
"request_id": "9c3e…",
"sha256": "00000000…",
"created": "2026-05-10T17:45:00.123Z",
"duration": 0.0,
"channels": 1,
"models": ["nova-2"]
}
Results (partial + final)
{
"type": "Results",
"channel": {
"alternatives": [{
"transcript": "Dudoxx streams speech",
"confidence": 0.97,
"words": [
{ "word": "Dudoxx", "start": 0.32, "end": 0.74, "confidence": 0.97, "punctuated_word": "Dudoxx", "speaker": 0 }
]
}]
},
"is_final": false,
"speech_final": false,
"from_finalize": false,
"start": 0.32,
"duration": 1.16,
"metadata": { "request_id": "9c3e…" }
}
is_final=true, from_finalize=true → frame was emitted in response to a client {"type":"Finalize"} control message.
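A client usually needs to tell partial, final, and Finalize-triggered frames apart. The sketch below works on plain dicts rather than the generated bindings; the function names are hypothetical, and taking the first alternative as the preferred transcript is an assumption, not something the envelope states.

```python
def classify_results(frame: dict) -> str:
    """Label a Results frame as 'partial', 'final', or 'final_from_finalize'."""
    if frame.get("type") != "Results":
        raise ValueError("not a Results frame")
    if not frame.get("is_final", False):
        return "partial"
    # is_final=true plus from_finalize=true marks a reply to a client Finalize.
    if frame.get("from_finalize", False):
        return "final_from_finalize"
    return "final"


def first_transcript(frame: dict) -> str:
    """Return the transcript of the first alternative (assumed to be the preferred one)."""
    alternatives = frame["channel"]["alternatives"]
    return alternatives[0]["transcript"] if alternatives else ""
```

Applied to the sample frame above, `classify_results` yields `"partial"` because `is_final` is false.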
SpeechStarted
{ "type": "SpeechStarted", "channel": [0], "timestamp": 0.32 }
UtteranceEnd
{ "type": "UtteranceEnd", "channel": [0], "last_word_end": 1.48 }
TTS frame (/v1/speak, /v1/speak/sse)
Sequence: Metadata → SynthesisStarted → Audio* → [Audio with visemes?] → SynthesisEnded.
Metadata / SynthesisStarted
{
"type": "SynthesisStarted",
"request_id": "9c3e…",
"model": "qwen3-tts",
"voice": "de_frenz",
"sample_rate": 24000,
"channels": 1,
"encoding": "pcm_s16le"
}
Audio
{
"type": "Audio",
"sequence": 0,
"start": 0.0,
"duration": 0.04,
"encoding": "pcm_s16le",
"sample_rate": 24000,
"channels": 1,
"audio": "<base64-pcm-bytes>"
}
When emit_visemes: true, the final Audio frame appends visemes: [{ time_ms, viseme }] (Preston-Blair-15).
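As a rough illustration of consuming an Audio frame, the sketch below decodes the base64 payload and derives its duration from the byte count. It assumes `audio` carries raw PCM in the declared `encoding`, and that `pcm_s16le` means 2 bytes per sample per channel; `decode_audio_frame` is a hypothetical helper, not part of the generated bindings.

```python
import base64

# Assumption: pcm_s16le is signed 16-bit little-endian, i.e. 2 bytes per sample.
BYTES_PER_SAMPLE = {"pcm_s16le": 2}


def decode_audio_frame(frame: dict) -> tuple:
    """Return (pcm_bytes, duration_seconds) for one Audio frame."""
    pcm = base64.b64decode(frame["audio"])
    # One sample frame holds one sample for every channel.
    frame_bytes = BYTES_PER_SAMPLE[frame["encoding"]] * frame["channels"]
    if len(pcm) % frame_bytes:
        raise ValueError("payload is not a whole number of sample frames")
    seconds = len(pcm) / frame_bytes / frame["sample_rate"]
    return pcm, seconds
```

Under these assumptions a 0.04 s mono frame at 24000 Hz holds 960 samples, or 1920 bytes, which can be compared against the frame's own `duration` field.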
SynthesisEnded
{
"type": "SynthesisEnded",
"request_id": "9c3e…",
"total_duration": 2.84,
"total_frames": 71,
"reason": "complete"
}
Error
{ "type": "Error", "request_id": "9c3e…", "code": "model_unavailable", "message": "Engine warmup failed" }
Control messages (client → server, on STT WS)
| Type | Server response | Use |
|---|---|---|
| {"type":"KeepAlive"} | none (resets 10s NET-0001 idle timer) | Send every 5–8s during long pauses |
| {"type":"Finalize"} | next Results has is_final=true, from_finalize=true; session stays open | Force final boundary mid-stream |
| {"type":"CloseStream"} | final Results (if any pending audio) → final Metadata → close 1000 | Clean shutdown |
Unknown control type → close 1008 DATA-0000.
JSON parse error on text frame → close 1008 DATA-0000.
10s no audio AND no control → close 1011 NET-0001.
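The three close rules above can be expressed as two small decision functions. This is a sketch of how a server or a test double might apply them, not the actual server code: the function names are hypothetical, and treating a JSON value that is not an object as an unknown control type is an assumption.

```python
import json
from typing import Optional

KNOWN_CONTROL_TYPES = {"KeepAlive", "Finalize", "CloseStream"}
IDLE_TIMEOUT_S = 10.0


def close_for_text_frame(text: str) -> Optional[tuple]:
    """Return (ws_close_code, error_code) if the text frame must close the socket."""
    try:
        message = json.loads(text)
    except json.JSONDecodeError:
        return (1008, "DATA-0000")  # JSON parse error on text frame
    # Assumption: non-object JSON is handled like an unknown control type.
    if not isinstance(message, dict) or message.get("type") not in KNOWN_CONTROL_TYPES:
        return (1008, "DATA-0000")
    return None


def close_for_idle(seconds_since_last_activity: float) -> Optional[tuple]:
    """Return (ws_close_code, error_code) once neither audio nor control arrived for 10 s."""
    if seconds_since_last_activity >= IDLE_TIMEOUT_S:
        return (1011, "NET-0001")
    return None
```

A client that sends KeepAlive every 5–8 s keeps `seconds_since_last_activity` well under the limit, so `close_for_idle` never fires during long pauses.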
Usage event (ddx-mlx-usage)
{
"type": "UsageEvent",
"request_id": "9c3e…",
"service": "stt",
"tenant_id": "default",
"user_id": "u_123",
"model": "parakeet-tdt-0.6b-v3",
"started_at": "2026-05-10T17:45:00.123Z",
"ended_at": "2026-05-10T17:45:32.018Z",
"duration_s": 31.895,
"tokens_in": null,
"tokens_out": null,
"audio_seconds_in": 31.5,
"audio_seconds_out": null,
"status": "ok"
}
tokens_* for LLM, audio_seconds_* for STT/TTS. status ∈ {ok, error, cancelled}.
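A consumer of usage events may want a quick sanity check before billing or aggregation. The sketch below validates `status` against the set above and compares `duration_s` with the two timestamps. That `duration_s` equals `ended_at` minus `started_at` is an assumption drawn from the sample event, where the values agree; `check_usage_event` is a hypothetical helper.

```python
from datetime import datetime

ALLOWED_STATUS = {"ok", "error", "cancelled"}


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, normalising a trailing 'Z' for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_usage_event(event: dict) -> list:
    """Return a list of problems found in a UsageEvent dict."""
    problems = []
    if event.get("status") not in ALLOWED_STATUS:
        problems.append(f"unknown status {event.get('status')!r}")
    elapsed = (parse_ts(event["ended_at"]) - parse_ts(event["started_at"])).total_seconds()
    # Assumption: duration_s is the wall-clock span between the two timestamps.
    if abs(elapsed - event["duration_s"]) > 0.001:
        problems.append(f"duration_s={event['duration_s']} but timestamps span {elapsed}")
    return problems
```

For the sample event, the timestamps span 31.895 s, which matches `duration_s`, so the check returns an empty list.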
TypeScript bindings
Generated to ddx-prd-specs/envelopes/dist/typescript/. Exports:
- AsrFrame, TtsFrame, ControlMessage, UsageEvent: .d.ts types
- AsrFrameSchema, TtsFrameSchema, …: Zod parsers
import { AsrFrameSchema, type AsrFrame } from '@dudoxx/envelopes';
ws.onmessage = (ev) => {
  if (typeof ev.data !== 'string') return;
  const parsed = AsrFrameSchema.safeParse(JSON.parse(ev.data));
  if (!parsed.success) return;
  const frame: AsrFrame = parsed.data;
  // narrow on frame.type
};
Pydantic v2 bindings
from ddx_mlx_envelopes import AsrEvent, TtsEvent, ControlMessage, UsageEvent
frame = AsrEvent.model_validate_json(message)
Versioning
- v1.0.0 is immutable in this repo.
- Additive fields only — clients must ignore unknown keys.
- Breaking changes ship at a v2 path (/v2/listen, /v2/speak) without removing v1.
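The "ignore unknown keys" rule is what lets additive v1 fields reach old clients safely. A hand-rolled parser can honour it by filtering the payload down to the fields it knows, as in this sketch; the `SpeechStarted` dataclass and `parse_speech_started` are illustrative stand-ins, not the generated bindings. Pydantic v2 models ignore extra keys by default, but whether the generated package keeps that default is not stated here.

```python
from dataclasses import dataclass, fields


@dataclass
class SpeechStarted:
    # Illustrative mirror of the SpeechStarted frame; not the generated binding.
    type: str
    channel: list
    timestamp: float


def parse_speech_started(payload: dict) -> SpeechStarted:
    """Build a SpeechStarted from a payload, dropping keys this client does not know."""
    known = {f.name for f in fields(SpeechStarted)}
    return SpeechStarted(**{k: v for k, v in payload.items() if k in known})
```

A payload that gains a new field in a later v1 revision still parses, while a payload missing a required field raises `TypeError` from the dataclass constructor.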
Reference
- Schemas: ddx-prd-specs/envelopes/schemas/{asr-frame,tts-frame,control-messages,usage-event}.schema.json
- Fixtures: ddx-prd-specs/envelopes/fixtures/
- Pydantic pkg: ddx-mlx-envelopes/ (regenerate via make gen only when intentional)
- TypeScript dist: ddx-prd-specs/envelopes/dist/typescript/
- Spec README: ddx-prd-specs/envelopes/README.md