# Dudoxx Omni — TTS Integration Skill

> Self-contained guide for integrating the Dudoxx Text-to-Speech service from Next.js 16, NestJS 11, and Python 3.12. Drop this file into any LLM context — it stands alone.

**Service**: `ddx-cuda-live-tts` (port `4650`) and `ddx-mlx-live-tts` (port `4150`).
**Public**: `https://tts.forge.dudoxx.com`
**Wire format**: Deepgram-shaped TTS envelope, frozen v1.0.0.
**Routes**: `POST /v1/synthesize` · `GET /v1/speak/sse` · `WS /v1/speak`.

---

## TL;DR

- **Browser**: never call TTS directly — the API key + host leak. Always proxy via Next.js / NestJS.
- **Auth**: `X-API-Key: <key>` (REST/SSE) or `?api_key=<key>` (WS).
- **Output frames**: `Metadata` → `SynthesisStarted` → `Audio*` → `SynthesisEnded`. `audio` is base64 `pcm_s16le`, mono, at the requested sample rate.
- **Sample rates**: 16000 / 22050 / 24000 / 48000.
- **Voices**: pick from `/v1/voices` or omit and let the engine pick a default per language.
- **Visemes**: pass `emit_visemes: true` for Preston-Blair-15 lipsync frames (CUDA backend).
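
The frame sequence above can be modeled client-side as a small discriminated union. Only `type` and `audio` appear in this guide; the index signatures are a hedge for whatever extra fields the frozen schema defines, which this sketch does not enumerate:

```typescript
// Minimal client-side model of the v1.0.0 frame sequence.
// Only `type` and `audio` come from this guide; other fields stay untyped.
type TtsFrame =
  | { type: 'Metadata'; [k: string]: unknown }
  | { type: 'SynthesisStarted'; [k: string]: unknown }
  | { type: 'Audio'; audio: string; [k: string]: unknown } // base64 pcm_s16le
  | { type: 'SynthesisEnded'; [k: string]: unknown }
  | { type: 'Error'; [k: string]: unknown };

// Collect the base64 audio chunks from a frame stream; ignore everything else.
function collectAudio(frames: TtsFrame[]): string[] {
  return frames
    .filter((f): f is Extract<TtsFrame, { type: 'Audio' }> => f.type === 'Audio')
    .map((f) => f.audio);
}
```

Narrowing on `type` keeps the dispatch in the SSE/WS consumers below exhaustive and type-safe.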

---

## Request body

```jsonc
{
  "text": "Hallo mein Freund, hoffe es geht dir gut.",
  "voice": "de_frenz",            // optional
  "language": "de",
  "sample_rate": 24000,
  "speed": 1.0,
  "ref_audio_b64": null,          // optional cloning audio
  "emit_visemes": false,
  "stream": true,
  "gen_params": {                 // optional sampler overrides (CUDA only)
    "subtalker_dosample": false,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "decode_window_frames": 25,
    "emit_every_frames": 2
  }
}
```

`gen_params` is optional. Server defaults already prevent the Qwen3-TTS codec from emitting an early stop token (no last-word truncation).

---

## Browser anti-pattern (DO NOT DO)

```ts
// ❌ exposes the API key + host
const ws = new WebSocket('wss://tts.forge.dudoxx.com/v1/speak?api_key=...');
```

```ts
// ✅ proxy through your own server
const ws = new WebSocket('/api/tts/stream'); // same-origin, no key
```

---

## Next.js 16 — SSE proxy (`app/api/tts/stream/route.ts`)

```ts
import type { NextRequest } from 'next/server';

export const runtime = 'nodejs';
export const dynamic = 'force-dynamic';

interface TtsBody {
  text: string;
  voice?: string;
  language?: string;
  sample_rate?: 16000 | 22050 | 24000 | 48000;
  speed?: number;
  emit_visemes?: boolean;
  gen_params?: Record<string, string | number | boolean>;
}

export async function POST(req: NextRequest): Promise<Response> {
  const body = (await req.json()) as TtsBody;
  const upstream = await fetch(
    `${process.env.TTS_URL}/v1/speak/sse?` +
      new URLSearchParams({
        text: body.text,
        language: body.language ?? 'en',
        sample_rate: String(body.sample_rate ?? 24000),
        speed: String(body.speed ?? 1.0),
        emit_visemes: String(body.emit_visemes ?? false),
        // omit `voice` entirely so the engine picks its per-language default
        ...(body.voice ? { voice: body.voice } : {}),
        ...(body.gen_params ? { gen_params: JSON.stringify(body.gen_params) } : {}),
      }),
    {
      method: 'GET',
      headers: { 'X-API-Key': process.env.TTS_API_KEY! },
      cache: 'no-store',
    },
  );
  if (!upstream.ok || !upstream.body) {
    return new Response(`tts upstream ${upstream.status}`, { status: 502 });
  }
  return new Response(upstream.body, {
    status: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
      'X-Accel-Buffering': 'no',
    },
  });
}
```

Browser consumes the SSE:

```ts
const r = await fetch('/api/tts/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello world', language: 'en' }),
});
const reader = r.body!.getReader();
const dec = new TextDecoder();
let buf = '';
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += dec.decode(value, { stream: true });
  const lines = buf.split('\n');
  buf = lines.pop() ?? '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const frame = JSON.parse(line.slice(6));
    // dispatch frame.type === 'Audio' | 'Metadata' | …
  }
}
```

---

## Next.js 16 — one-shot POST (small text, no streaming UI)

```ts
// app/api/tts/synthesize/route.ts
import type { NextRequest } from 'next/server';

export const runtime = 'nodejs';
export const dynamic = 'force-dynamic';

export async function POST(req: NextRequest): Promise<Response> {
  const body = await req.json();
  const r = await fetch(`${process.env.TTS_URL}/v1/synthesize`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.TTS_API_KEY!,
    },
    body: JSON.stringify(body),
    cache: 'no-store',
  });
  return new Response(await r.text(), {
    status: r.status,
    headers: { 'Content-Type': 'application/json' },
  });
}
```

Browser receives `{ audio_b64, sample_rate, duration_s, visemes, metadata }`.
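
As a sketch (assuming a browser where `atob` is available), `audio_b64` can be decoded into normalized Float32 samples ready for Web Audio:

```typescript
// Decode base64 pcm_s16le (mono) into Float32 samples in [-1, 1).
function pcm16ToFloat32(b64: string): Float32Array {
  const raw = atob(b64);
  const out = new Float32Array(raw.length / 2);
  for (let i = 0; i < out.length; i++) {
    // little-endian signed 16-bit
    let v = raw.charCodeAt(2 * i) | (raw.charCodeAt(2 * i + 1) << 8);
    if (v >= 0x8000) v -= 0x10000; // sign-extend
    out[i] = v / 0x8000; // scale to float
  }
  return out;
}
```

From there, create a one-channel `AudioBuffer` at the response's `sample_rate`, fill it with `copyToChannel`, and play it through an `AudioContext` buffer source.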

---

## NestJS 11 — service + controller

```ts
// src/modules/tts/tts.service.ts
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

export interface SpeakOpts {
  voice?: string;
  language?: string;
  sampleRate?: 16000 | 22050 | 24000 | 48000;
  genParams?: Record<string, string | number | boolean>;
}

@Injectable()
export class TtsService {
  constructor(private readonly cfg: ConfigService) {}

  async synthesize(text: string, opts: SpeakOpts = {}): Promise<Buffer> {
    const r = await fetch(`${this.cfg.get('TTS_URL')}/v1/synthesize`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('TTS_API_KEY'),
      },
      body: JSON.stringify({
        text,
        voice: opts.voice,
        language: opts.language ?? 'en',
        sample_rate: opts.sampleRate ?? 24000,
        gen_params: opts.genParams,
      }),
    });
    if (!r.ok) throw new Error(`tts ${r.status}`);
    const json = (await r.json()) as { audio_b64: string };
    return Buffer.from(json.audio_b64, 'base64');
  }

  async streamSse(text: string, opts: SpeakOpts, sink: NodeJS.WritableStream): Promise<void> {
    const params = new URLSearchParams({
      text,
      language: opts.language ?? 'en',
      sample_rate: String(opts.sampleRate ?? 24000),
      ...(opts.voice ? { voice: opts.voice } : {}),
      ...(opts.genParams ? { gen_params: JSON.stringify(opts.genParams) } : {}),
    });
    const r = await fetch(`${this.cfg.get('TTS_URL')}/v1/speak/sse?${params}`, {
      headers: { 'X-API-Key': this.cfg.getOrThrow('TTS_API_KEY') },
    });
    if (!r.ok || !r.body) throw new Error(`tts ${r.status}`);
    for await (const chunk of r.body as unknown as AsyncIterable<Uint8Array>) {
      sink.write(chunk);
    }
    sink.end();
  }
}
```

```ts
// src/modules/tts/tts.controller.ts
import { Body, Controller, Post, Res } from '@nestjs/common';
import type { Response } from 'express';

import { TtsService } from './tts.service';
import type { SpeakDto } from './speak.dto'; // DTO mirroring SpeakOpts + text (path assumed)

@Controller('tts')
export class TtsController {
  constructor(private readonly tts: TtsService) {}

  @Post('stream')
  async stream(@Body() body: SpeakDto, @Res() res: Response): Promise<void> {
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache, no-transform');
    res.setHeader('Connection', 'keep-alive');
    res.setHeader('X-Accel-Buffering', 'no');
    // forward the rest of the DTO, not just genParams
    await this.tts.streamSse(
      body.text,
      { voice: body.voice, language: body.language, genParams: body.genParams },
      res,
    );
  }
}
```

---

## Python 3.12 — async one-shot + WS streaming

```python
import base64, httpx

async def synthesize(text: str, *, voice: str | None = None,
                     language: str = "en", base_url: str, api_key: str) -> bytes:
    async with httpx.AsyncClient(timeout=60) as cli:
        r = await cli.post(
            f"{base_url}/v1/synthesize",
            headers={"X-API-Key": api_key},
            json={"text": text, "voice": voice, "language": language, "stream": False},
        )
        r.raise_for_status()
        return base64.b64decode(r.json()["audio_b64"])
```

```python
import asyncio, base64, json, websockets

async def stream_tts(text: str, *, base_url: str, api_key: str,
                     language: str = "en", voice: str | None = None,
                     gen_params: dict | None = None) -> bytes:
    # replace only the scheme: http -> ws, https -> wss
    url = base_url.replace("http", "ws", 1) + f"/v1/speak?api_key={api_key}"
    pcm = bytearray()
    async with websockets.connect(url, max_size=None, ping_interval=None) as ws:
        await ws.send(json.dumps({
            "text": text, "voice": voice, "language": language,
            "sample_rate": 24000, "speed": 1.0,
            "emit_visemes": False, "stream": True,
            **({"gen_params": gen_params} if gen_params else {}),
        }))
        async for msg in ws:
            obj = json.loads(msg)
            if obj.get("type") == "Audio":
                pcm += base64.b64decode(obj["audio"])
            elif obj.get("type") in ("SynthesisEnded", "Error"):
                break
    return bytes(pcm)
```

---

## Backend parity (CUDA vs MLX)

| Feature | CUDA `:4650` | MLX `:4150` |
|---|---|---|
| Wire format (envelope v1.0.0) | ✅ | ✅ |
| `gen_params` request field | ✅ | ❌ (single sampler, defaults already deterministic) |
| `subtalker_dosample=false` default fix | ✅ | n/a |
| Voice cloning (`ref_audio_b64`) | ✅ | ✅ |

Cross-backend clients can send `gen_params` unconditionally: MLX silently ignores unknown request fields, so no per-backend branching is needed.

---

## Operational notes

- **Tail truncation is fixed by server defaults**: clients no longer need to pass `subtalker_dosample=false` to get a complete utterance.
- **Override sparingly**: passing `subtalker_dosample=true` restores upstream `qwen-tts` stochastic sampling, and last-word truncation may return.
- **TTFB tuning**: lower `emit_every_frames` (1–2) for lowest first-byte latency, higher (4–8) for fewer WS messages on long utterances.
- **Quality tuning**: raise `decode_window_frames` (25 → 50–80) for cleaner tail audio at the cost of latency.
- **Cold-start**: first frame ~600 ms after model warmup; subsequent requests reuse the warm engine.
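
The two tuning bullets can be captured as client-side presets. The field names (`emit_every_frames`, `decode_window_frames`) come from the API above; the preset names and groupings are this guide's own shorthand, not part of the service:

```typescript
// Illustrative gen_params presets derived from the tuning notes above.
const GEN_PRESETS = {
  lowLatency: { emit_every_frames: 1, decode_window_frames: 25 }, // fastest first byte
  balanced:   { emit_every_frames: 2, decode_window_frames: 25 }, // matches the request-body example
  cleanTail:  { emit_every_frames: 4, decode_window_frames: 60 }, // fewer WS messages, cleaner tail
} as const;

type GenPreset = keyof typeof GEN_PRESETS;

// Return a mutable copy suitable for the `gen_params` request field (CUDA only).
function genParamsFor(preset: GenPreset): Record<string, number> {
  return { ...GEN_PRESETS[preset] };
}
```

MLX ignores `gen_params`, so the same presets can be sent to either backend unchanged.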

---

## Failure modes

| Symptom | Cause | Fix |
|---|---|---|
| HTTP 401 | missing `X-API-Key` / `?api_key` | Add the header / query param |
| HTTP 422 unsupported `language` | not in `SpeakRequest._LANGUAGE_PATTERN` | Use BCP-47 short tag (`en`, `fr`, `de`, `it`) |
| HTTP 422 unsupported `sample_rate` | not in `{16000, 22050, 24000, 48000}` | Pick a supported rate |
| Last word missing in CUDA TTS | `subtalker_dosample=true` override | Remove override or set `false` |
| WS closes 1011 mid-utterance | Engine exception (Qwen3 / Kokoro init) | Check `logs/tts.log`; restart `./ddx-manage.sh restart --prod tts` |
| SSE frames buffer at NGINX | proxy-buffering on | Set `proxy_buffering off; proxy_set_header X-Accel-Buffering no;` |
| `audio_b64` payload too large | one-shot response holding 10s+ utterance | Switch to `/v1/speak/sse` for long text |

---

## Reference

- Service docs: `ddx-cuda-live-tts/TTS_API_USAGE.md`, `TTS_FULL_CAPABILITIES.md`, `TTS_API_ENDPOINTS.md`, `TTS_GEN_PARAMS.md`
- Frozen wire format: `ddx-prd-specs/envelopes/README.md`, schema `tts-frame.schema.json`
- Live dashboard playground: `http://localhost:4650/v1/metrics/dashboard`
