# Dudoxx Omni — Wire Envelope v1.0.0 (Frozen)

> Self-contained reference for the frozen Deepgram-shaped envelope that flows over WS / SSE between Dudoxx STT, TTS, Usage, LLM agents, and clients.

- **Status**: frozen v1.0.0 (2026-04-24). Breaking changes require a v2 WS path (`/v2/listen`, `/v2/speak`).
- **Spec source**: `ddx-prd-specs/envelopes/`.
- **Generated bindings**: Pydantic v2 (`ddx_mlx_envelopes`) and TypeScript `.d.ts` + Zod under `ddx-prd-specs/envelopes/dist/typescript/`.
- **Schemas**: JSON Schema 2020-12 (`asr-frame`, `tts-frame`, `control-messages`, `usage-event`).
- **Conformance**: 17 fixtures, 34 tests green.

---

## Why a frozen envelope

1. **Vendor neutrality** — clients written against Deepgram-shaped frames work against MLX, CUDA, and any future backend.
2. **Adapter parity** — LiveKit + Mastra-AI + TS/Py clients reuse the same shape.
3. **Safe evolution** — additive fields only at v1; breaking changes go to v2 paths.

---

## ASR frame (`/v1/listen`, `/v1/listen/dg`)

Sequence: `Metadata (open) → N × Results → SpeechStarted → N × Results → UtteranceEnd → Metadata (final)`.

### Metadata (open)

```json
{
  "type": "Metadata",
  "transaction_key": "deprecated",
  "request_id": "9c3e…",
  "sha256": "00000000…",
  "created": "2026-05-10T17:45:00.123Z",
  "duration": 0.0,
  "channels": 1,
  "models": ["nova-2"]
}
```

### Results (partial + final)

```json
{
  "type": "Results",
  "channel": {
    "alternatives": [{
      "transcript": "Dudoxx streams speech",
      "confidence": 0.97,
      "words": [
        { "word": "Dudoxx", "start": 0.32, "end": 0.74, "confidence": 0.97, "punctuated_word": "Dudoxx", "speaker": 0 }
      ]
    }]
  },
  "is_final": false,
  "speech_final": false,
  "from_finalize": false,
  "start": 0.32,
  "duration": 1.16,
  "metadata": { "request_id": "9c3e…" }
}
```

A `Results` frame with `is_final=true, from_finalize=true` was emitted in response to a client `{"type":"Finalize"}` control message.
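The flag combinations above can be turned into a small classifier. This is a minimal sketch, not part of the generated bindings: `classify_results` and `best_transcript` are hypothetical helper names, and treating the first entry of `alternatives` as the preferred one is an assumption the envelope itself does not state.

```python
def classify_results(frame: dict) -> str:
    """Label a Results frame as 'partial', 'final', or 'finalize'."""
    if frame.get("type") != "Results":
        raise ValueError("not a Results frame")
    if not frame.get("is_final", False):
        return "partial"
    # from_finalize marks a final that was forced by a client Finalize message.
    return "finalize" if frame.get("from_finalize", False) else "final"


def best_transcript(frame: dict) -> str:
    """Return the transcript of the first alternative (assumed to be preferred)."""
    alternatives = frame["channel"]["alternatives"]
    return alternatives[0]["transcript"] if alternatives else ""
```

A client might replace its on-screen text on every `partial` and commit the text on `final` or `finalize`.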

### SpeechStarted

```json
{ "type": "SpeechStarted", "channel": [0], "timestamp": 0.32 }
```

### UtteranceEnd

```json
{ "type": "UtteranceEnd", "channel": [0], "last_word_end": 1.48 }
```
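For conformance-style checks, the ASR sequence can be sanity-checked from the ordered list of `type` values. The sketch below is deliberately loose and rests on assumptions: it only verifies that the stream opens and closes with `Metadata` and that everything in between is one of the three frame types documented in this section. `check_asr_sequence` is a hypothetical name, not a function from the spec repo.

```python
def check_asr_sequence(types: list[str]) -> list[str]:
    """Return a list of problems found in an ordered list of ASR frame types."""
    problems = []
    if not types or types[0] != "Metadata":
        problems.append("stream must open with Metadata")
    if len(types) < 2 or types[-1] != "Metadata":
        problems.append("stream must end with a final Metadata")
    allowed_between = {"Results", "SpeechStarted", "UtteranceEnd"}
    # Everything between the opening and closing Metadata frames.
    for position, frame_type in enumerate(types[1:-1], start=1):
        if frame_type not in allowed_between:
            problems.append(f"unexpected {frame_type} at position {position}")
    return problems
```

An empty list means the order looked plausible; it does not prove the stream is schema-valid.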

---

## TTS frame (`/v1/speak`, `/v1/speak/sse`)

Sequence: `Metadata → SynthesisStarted → Audio* → [Audio with visemes?] → SynthesisEnded`.

### Metadata / SynthesisStarted

```json
{
  "type": "SynthesisStarted",
  "request_id": "9c3e…",
  "model": "qwen3-tts",
  "voice": "de_frenz",
  "sample_rate": 24000,
  "channels": 1,
  "encoding": "pcm_s16le"
}
```

### Audio

```json
{
  "type": "Audio",
  "sequence": 0,
  "start": 0.0,
  "duration": 0.04,
  "encoding": "pcm_s16le",
  "sample_rate": 24000,
  "channels": 1,
  "audio": "<base64-pcm-bytes>"
}
```

When `emit_visemes: true` is set, the final `Audio` frame carries an additional `visemes: [{ time_ms, viseme }]` array (Preston-Blair-15 viseme set).
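Decoding an `Audio` frame on the client could look like the sketch below. It assumes `audio` is standard base64 over raw interleaved PCM, and it only knows the one encoding shown in this document (`pcm_s16le`, 2 bytes per sample). `decode_audio` is a hypothetical helper name.

```python
import base64

# Bytes per sample for each known encoding; only the one shown above is listed.
BYTES_PER_SAMPLE = {"pcm_s16le": 2}


def decode_audio(frame: dict) -> tuple[bytes, float]:
    """Return (raw PCM bytes, duration in seconds derived from the byte count)."""
    pcm = base64.b64decode(frame["audio"])
    width = BYTES_PER_SAMPLE[frame["encoding"]]
    samples_per_channel = len(pcm) // (width * frame["channels"])
    return pcm, samples_per_channel / frame["sample_rate"]
```

Under these assumptions, a 0.04 s mono frame at 24000 Hz holds 960 samples, or 1920 bytes, so the derived duration can be compared against the frame's own `duration` field.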

### SynthesisEnded

```json
{
  "type": "SynthesisEnded",
  "request_id": "9c3e…",
  "total_duration": 2.84,
  "total_frames": 71,
  "reason": "complete"
}
```
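A receiver may want to cross-check what it collected against `SynthesisEnded`. The sketch below assumes that `sequence` starts at 0 and increases by one per `Audio` frame, that `total_frames` counts `Audio` frames, and that `total_duration` is the sum of their `duration` values. The example values above are consistent with this reading (71 × 0.04 s = 2.84 s), but the envelope does not spell these rules out. `check_tts_totals` is a hypothetical name.

```python
def check_tts_totals(audio_frames: list[dict], ended: dict, tol: float = 1e-3) -> list[str]:
    """Compare collected Audio frames with the SynthesisEnded summary."""
    problems = []
    sequences = [frame["sequence"] for frame in audio_frames]
    if sequences != list(range(len(audio_frames))):
        problems.append("sequence numbers are not 0..N-1 in order")
    if len(audio_frames) != ended["total_frames"]:
        problems.append(
            f"expected {ended['total_frames']} frames, got {len(audio_frames)}"
        )
    total = sum(frame["duration"] for frame in audio_frames)
    # Durations are floats, so compare with a small tolerance.
    if abs(total - ended["total_duration"]) > tol:
        problems.append(
            f"expected {ended['total_duration']} s of audio, got {total:.3f} s"
        )
    return problems
```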

### Error

```json
{ "type": "Error", "request_id": "9c3e…", "code": "model_unavailable", "message": "Engine warmup failed" }
```

---

## Control messages (client → server, on STT WS)

| Type | Server response | Use |
|---|---|---|
| `{"type":"KeepAlive"}` | none (resets 10s NET-0001 idle timer) | Send every 5–8s during long pauses |
| `{"type":"Finalize"}` | next `Results` has `is_final=true, from_finalize=true`; session stays open | Force final boundary mid-stream |
| `{"type":"CloseStream"}` | final `Results` (if any pending audio) → final `Metadata` → close 1000 | Clean shutdown |

- Unknown control type → close 1008 `DATA-0000`.
- JSON parse error on a text frame → close 1008 `DATA-0000`.
- No audio and no control message for 10 s → close 1011 `NET-0001`.
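The close rules above can be expressed as two small pure functions, which is handy for test doubles or client-side pre-validation. This is an illustrative sketch, not the server implementation: `classify_text_frame` and `idle_expired` are hypothetical names, treating non-object JSON as an unknown control type is an assumption, and so is expiring at exactly 10 s rather than just after it.

```python
import json

KNOWN_CONTROL_TYPES = {"KeepAlive", "Finalize", "CloseStream"}
DATA_ERROR = ("close", 1008, "DATA-0000")


def classify_text_frame(text: str) -> tuple:
    """Return ("ok", control_type) or ("close", ws_code, error_code)."""
    try:
        message = json.loads(text)
    except json.JSONDecodeError:
        return DATA_ERROR  # JSON parse error on a text frame
    message_type = message.get("type") if isinstance(message, dict) else None
    if not isinstance(message_type, str) or message_type not in KNOWN_CONTROL_TYPES:
        return DATA_ERROR  # unknown control type
    return ("ok", message_type)


def idle_expired(last_activity_s: float, now_s: float, timeout_s: float = 10.0) -> bool:
    """True once neither audio nor a control message arrived for timeout_s seconds."""
    return now_s - last_activity_s >= timeout_s
```

With a 10 s timeout, sending `KeepAlive` every 5 to 8 s leaves at least 2 s of slack for network jitter.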

---

## Usage event (`ddx-mlx-usage`)

```json
{
  "type": "UsageEvent",
  "request_id": "9c3e…",
  "service": "stt",
  "tenant_id": "default",
  "user_id": "u_123",
  "model": "parakeet-tdt-0.6b-v3",
  "started_at": "2026-05-10T17:45:00.123Z",
  "ended_at":   "2026-05-10T17:45:32.018Z",
  "duration_s": 31.895,
  "tokens_in":  null,
  "tokens_out": null,
  "audio_seconds_in":  31.5,
  "audio_seconds_out": null,
  "status": "ok"
}
```

`tokens_*` are populated for LLM requests and `audio_seconds_*` for STT/TTS; fields that do not apply are `null`, as in the STT example above. `status` ∈ `{ok, error, cancelled}`.
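A consumer of usage events might run a quick consistency check before billing or aggregation. The sketch below assumes `duration_s` is the wall-clock difference between `ended_at` and `started_at`, which matches the example (31.895 s) but is not stated as a rule. `parse_ts` and `check_usage` are hypothetical names.

```python
from datetime import datetime

VALID_STATUS = {"ok", "error", "cancelled"}


def parse_ts(value: str) -> datetime:
    # fromisoformat() accepts a trailing "Z" only from Python 3.11, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_usage(event: dict, tol: float = 1e-3) -> list[str]:
    """Return a list of inconsistencies found in a UsageEvent payload."""
    problems = []
    if event["status"] not in VALID_STATUS:
        problems.append(f"unknown status {event['status']!r}")
    elapsed = (parse_ts(event["ended_at"]) - parse_ts(event["started_at"])).total_seconds()
    if abs(elapsed - event["duration_s"]) > tol:
        problems.append(f"duration_s {event['duration_s']} != elapsed {elapsed:.3f}")
    return problems
```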

---

## TypeScript bindings

Generated to `ddx-prd-specs/envelopes/dist/typescript/`. Exports:

- `AsrFrame`, `TtsFrame`, `ControlMessage`, `UsageEvent` — `.d.ts` types
- `AsrFrameSchema`, `TtsFrameSchema`, … — Zod parsers

```ts
import { AsrFrameSchema, type AsrFrame } from '@dudoxx/envelopes';

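ws.onmessage = (ev) => {
  if (typeof ev.data !== 'string') return; // skip binary frames
  let raw: unknown;
  try {
    raw = JSON.parse(ev.data); // JSON.parse throws on malformed text
  } catch {
    return;
  }
  const parsed = AsrFrameSchema.safeParse(raw);
  if (!parsed.success) return; // not a valid v1 ASR frame
  const frame: AsrFrame = parsed.data;
  // narrow on frame.type
};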
```

## Pydantic v2 bindings

```python
from ddx_mlx_envelopes import AsrEvent, TtsEvent, ControlMessage, UsageEvent

# `message` is the raw JSON text of one WS text frame
frame = AsrEvent.model_validate_json(message)
```

---

## Versioning

- v1.0.0 is **immutable** in this repo.
- Additive fields only — clients must ignore unknown keys.
- Breaking changes ship at a v2 path (`/v2/listen`, `/v2/speak`) without removing v1.
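The "ignore unknown keys" rule is what keeps hand-written clients working when v1 gains additive fields. The generated bindings may already handle this; the sketch below shows the idea with the standard library only, using a hand-written `SpeechStarted` dataclass and a hypothetical `from_wire` helper. The `future_field` key in the usage note is invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SpeechStarted:
    type: str
    channel: list
    timestamp: float


def from_wire(cls, payload: dict):
    """Build cls from a decoded JSON object, dropping keys it does not know."""
    known = {field.name for field in fields(cls)}
    return cls(**{key: value for key, value in payload.items() if key in known})
```

Calling `from_wire(SpeechStarted, {"type": "SpeechStarted", "channel": [0], "timestamp": 0.32, "future_field": 1})` yields a `SpeechStarted` without raising on the extra key, whereas `SpeechStarted(**payload)` would fail with a `TypeError`.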

---

## Reference

- Schemas: `ddx-prd-specs/envelopes/schemas/{asr-frame,tts-frame,control-messages,usage-event}.schema.json`
- Fixtures: `ddx-prd-specs/envelopes/fixtures/`
- Pydantic pkg: `ddx-mlx-envelopes/` (regenerate via `make gen` only when intentional)
- TypeScript dist: `ddx-prd-specs/envelopes/dist/typescript/`
- Spec README: `ddx-prd-specs/envelopes/README.md`
