Dudoxx Omni — TTS Integration Skill
Self-contained guide for integrating the Dudoxx Text-to-Speech service from Next.js 16, NestJS 11, and Python 3.12. Drop this file into any LLM context — it stands alone.
Service: ddx-cuda-live-tts (port 4650) and ddx-mlx-live-tts (port 4150).
Public: https://tts.forge.dudoxx.com
Wire format: Deepgram-shaped TTS envelope, frozen v1.0.0.
Routes: POST /v1/synthesize · GET /v1/speak/sse · WS /v1/speak.
TL;DR
- Browser: never call TTS directly — the API key + host leak. Always proxy via Next.js / NestJS.
- Auth:
X-API-Key: <key>(REST/SSE) or?api_key=<key>(WS). - Output frames:
Metadata→SynthesisStarted→Audio*→SynthesisEnded.audiois base64pcm_s16lemono at requested sample rate. - Sample rates: 16000 / 22050 / 24000 / 48000.
- Voices: pick from
/v1/voicesor omit and let the engine pick a default per language. - Visemes: pass
emit_visemes: truefor Preston-Blair-15 lipsync frames (CUDA backend).
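The frame ordering above can be sketched as a small accumulator. A minimal sketch in Python — only the `type` and `audio` fields are taken from the envelope description; the helper name and the sample payloads are illustrative:

```python
import base64

def collect_pcm(frames: list[dict]) -> bytes:
    """Accumulate raw PCM from a Metadata -> SynthesisStarted -> Audio* -> SynthesisEnded sequence."""
    pcm = bytearray()
    for frame in frames:
        if frame.get("type") == "Audio":
            # 'audio' carries base64 pcm_s16le mono at the requested sample rate
            pcm += base64.b64decode(frame["audio"])
        elif frame.get("type") in ("SynthesisEnded", "Error"):
            break
    return bytes(pcm)

# 8 zero samples of pcm_s16le per chunk, split across two Audio frames
chunk = base64.b64encode(b"\x00\x00" * 8).decode()
frames = [
    {"type": "Metadata"},
    {"type": "SynthesisStarted"},
    {"type": "Audio", "audio": chunk},
    {"type": "Audio", "audio": chunk},
    {"type": "SynthesisEnded"},
]
print(len(collect_pcm(frames)))  # 32 bytes = 16 samples
```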
Request body
```jsonc
{
  "text": "Hallo mein Freund, hoffe es geht dir gut.",
  "voice": "de_frenz",          // optional
  "language": "de",
  "sample_rate": 24000,
  "speed": 1.0,
  "ref_audio_b64": null,        // optional cloning audio
  "emit_visemes": false,
  "stream": true,
  "gen_params": {               // optional sampler overrides (CUDA only)
    "subtalker_dosample": false,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "decode_window_frames": 25,
    "emit_every_frames": 2
  }
}
```

`gen_params` is optional. Server defaults already prevent the Qwen3-TTS codec from emitting an early stop token (no last-word truncation).
Browser anti-pattern (DO NOT DO)

```ts
// ❌ exposes the API key + host
const ws = new WebSocket('wss://tts.forge.dudoxx.com/v1/speak?api_key=...');

// ✅ proxy through your own server
const ws = new WebSocket('/api/tts/stream'); // same-origin, no key
```

Next.js 16 — SSE proxy (`app/api/tts/stream/route.ts`)
```ts
import type { NextRequest } from 'next/server';

export const runtime = 'nodejs';
export const dynamic = 'force-dynamic';

interface TtsBody {
  text: string;
  voice?: string;
  language?: string;
  sample_rate?: 16000 | 22050 | 24000 | 48000;
  speed?: number;
  emit_visemes?: boolean;
  gen_params?: Record<string, string | number | boolean>;
}

export async function POST(req: NextRequest): Promise<Response> {
  const body = (await req.json()) as TtsBody;
  const upstream = await fetch(
    `${process.env.TTS_URL}/v1/speak/sse?` +
      new URLSearchParams({
        text: body.text,
        voice: body.voice ?? '',
        language: body.language ?? 'en',
        sample_rate: String(body.sample_rate ?? 24000),
        speed: String(body.speed ?? 1.0),
        emit_visemes: String(body.emit_visemes ?? false),
        ...(body.gen_params ? { gen_params: JSON.stringify(body.gen_params) } : {}),
      }),
    {
      method: 'GET',
      headers: { 'X-API-Key': process.env.TTS_API_KEY! },
      cache: 'no-store',
    },
  );
  if (!upstream.ok || !upstream.body) {
    return new Response(`tts upstream ${upstream.status}`, { status: 502 });
  }
  return new Response(upstream.body, {
    status: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
      'X-Accel-Buffering': 'no',
    },
  });
}
```

Browser consumes the SSE:
```ts
const r = await fetch('/api/tts/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello world', language: 'en' }),
});
const reader = r.body!.getReader();
const dec = new TextDecoder();
let buf = '';
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += dec.decode(value, { stream: true });
  const lines = buf.split('\n');
  buf = lines.pop() ?? '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const frame = JSON.parse(line.slice(6));
    // dispatch frame.type === 'Audio' | 'Metadata' | …
  }
}
```

Next.js 16 — one-shot POST (small text, no streaming UI)
```ts
// app/api/tts/synthesize/route.ts
import type { NextRequest } from 'next/server';

export async function POST(req: NextRequest): Promise<Response> {
  const body = await req.json();
  const r = await fetch(`${process.env.TTS_URL}/v1/synthesize`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.TTS_API_KEY!,
    },
    body: JSON.stringify(body),
    cache: 'no-store',
  });
  return new Response(await r.text(), {
    status: r.status,
    headers: { 'Content-Type': 'application/json' },
  });
}
```

Browser receives `{ audio_b64, sample_rate, duration_s, visemes, metadata }`.
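`audio_b64` is raw `pcm_s16le`, not a playable file. Wrapping it in a WAV container with Python's stdlib `wave` module is a minimal sketch — the helper itself is illustrative, not part of the service:

```python
import base64, io, wave

def pcm_to_wav(audio_b64: str, sample_rate: int = 24000) -> bytes:
    """Wrap base64 pcm_s16le mono audio in a WAV container."""
    pcm = base64.b64decode(audio_b64)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)         # mono
        w.setsampwidth(2)         # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

wav = pcm_to_wav(base64.b64encode(b"\x00\x00" * 24000).decode())  # 1 s of silence
print(wav[:4])  # b'RIFF'
```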
NestJS 11 — service + controller
```ts
// src/modules/tts/tts.service.ts
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

export interface SpeakOpts {
  voice?: string;
  language?: string;
  sampleRate?: 16000 | 22050 | 24000 | 48000;
  genParams?: Record<string, string | number | boolean>;
}

@Injectable()
export class TtsService {
  constructor(private readonly cfg: ConfigService) {}

  async synthesize(text: string, opts: SpeakOpts = {}): Promise<Buffer> {
    const r = await fetch(`${this.cfg.get('TTS_URL')}/v1/synthesize`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('TTS_API_KEY'),
      },
      body: JSON.stringify({
        text,
        voice: opts.voice,
        language: opts.language ?? 'en',
        sample_rate: opts.sampleRate ?? 24000,
        gen_params: opts.genParams,
      }),
    });
    if (!r.ok) throw new Error(`tts ${r.status}`);
    const json = (await r.json()) as { audio_b64: string };
    return Buffer.from(json.audio_b64, 'base64');
  }

  async streamSse(text: string, opts: SpeakOpts, sink: NodeJS.WritableStream): Promise<void> {
    const params = new URLSearchParams({
      text,
      language: opts.language ?? 'en',
      sample_rate: String(opts.sampleRate ?? 24000),
      ...(opts.voice ? { voice: opts.voice } : {}),
      ...(opts.genParams ? { gen_params: JSON.stringify(opts.genParams) } : {}),
    });
    const r = await fetch(`${this.cfg.get('TTS_URL')}/v1/speak/sse?${params}`, {
      headers: { 'X-API-Key': this.cfg.getOrThrow('TTS_API_KEY') },
    });
    if (!r.ok || !r.body) throw new Error(`tts ${r.status}`);
    for await (const chunk of r.body as unknown as AsyncIterable<Uint8Array>) {
      sink.write(chunk);
    }
    sink.end();
  }
}
```

```ts
// src/modules/tts/tts.controller.ts
import { Body, Controller, Post, Res } from '@nestjs/common';
import type { Response } from 'express';
import { TtsService } from './tts.service';
import { SpeakDto } from './speak.dto'; // DTO with text + optional genParams, defined alongside

@Controller('tts')
export class TtsController {
  constructor(private readonly tts: TtsService) {}

  @Post('stream')
  async stream(@Body() body: SpeakDto, @Res() res: Response): Promise<void> {
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache, no-transform');
    res.setHeader('Connection', 'keep-alive');
    res.setHeader('X-Accel-Buffering', 'no');
    await this.tts.streamSse(body.text, { genParams: body.genParams }, res);
  }
}
```

Python 3.12 — async one-shot + WS streaming
```python
import base64, httpx

async def synthesize(text: str, *, voice: str | None = None,
                     language: str = "en", base_url: str, api_key: str) -> bytes:
    async with httpx.AsyncClient(timeout=60) as cli:
        r = await cli.post(
            f"{base_url}/v1/synthesize",
            headers={"X-API-Key": api_key},
            json={"text": text, "voice": voice, "language": language, "stream": False},
        )
        r.raise_for_status()
        return base64.b64decode(r.json()["audio_b64"])
```

```python
import asyncio, base64, json, websockets

async def stream_tts(text: str, *, base_url: str, api_key: str,
                     language: str = "en", voice: str | None = None,
                     gen_params: dict | None = None) -> bytes:
    url = base_url.replace("http", "ws") + f"/v1/speak?api_key={api_key}"
    pcm = bytearray()
    async with websockets.connect(url, max_size=None, ping_interval=None) as ws:
        await ws.send(json.dumps({
            "text": text, "voice": voice, "language": language,
            "sample_rate": 24000, "speed": 1.0,
            "emit_visemes": False, "stream": True,
            **({"gen_params": gen_params} if gen_params else {}),
        }))
        async for msg in ws:
            obj = json.loads(msg)
            if obj.get("type") == "Audio":
                pcm += base64.b64decode(obj["audio"])
            elif obj.get("type") in ("SynthesisEnded", "Error"):
                break
    return bytes(pcm)
```

Backend parity (CUDA vs MLX)
| Feature | CUDA :4650 | MLX :4150 |
|---|---|---|
| Wire format (envelope v1.0.0) | ✅ | ✅ |
| `gen_params` request field | ✅ | ❌ (single sampler, defaults already deterministic) |
| `subtalker_dosample=false` default fix | ✅ | n/a |
| Voice cloning (`ref_audio_b64`) | ✅ | ✅ |

Cross-backend clients can pass `gen_params` unconditionally — MLX silently ignores unknown fields.
Operational notes
- Tail truncation is fixed by server defaults: clients no longer need to pass `subtalker_dosample=false` to get a complete utterance.
- Override sparingly: passing `subtalker_dosample=true` restores the upstream `qwen-tts` stochastic behavior — last-word truncation may return.
- TTFB tuning: lower `emit_every_frames` (1–2) for lowest first-byte latency, higher (4–8) for fewer WS messages on long utterances.
- Quality tuning: raise `decode_window_frames` (25 → 50–80) for cleaner tail audio at the cost of latency.
- Cold-start: first frame ~600 ms after model warmup; subsequent requests reuse the warm engine.
Failure modes
| Symptom | Cause | Fix |
|---|---|---|
| HTTP 401 | missing `X-API-Key` / `?api_key` | Add the header / query param |
| HTTP 422 `unsupported language` | not in `SpeakRequest._LANGUAGE_PATTERN` | Use a BCP-47 short tag (`en`, `fr`, `de`, `it`) |
| HTTP 422 `unsupported sample_rate` | not in {16000, 22050, 24000, 48000} | Pick a supported rate |
| Last word missing in CUDA TTS | `subtalker_dosample=true` override | Remove the override or set it to `false` |
| WS closes 1011 mid-utterance | Engine exception (Qwen3 / Kokoro init) | Check `logs/tts.log`; restart: `./ddx-manage.sh restart --prod tts` |
| SSE frames buffer at NGINX | proxy buffering on | Set `proxy_buffering off;` and `proxy_set_header X-Accel-Buffering no;` |
| `audio_b64` payload too large | one-shot response holding a 10 s+ utterance | Switch to `/v1/speak/sse` for long text |
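The size pressure behind the last row is easy to quantify: `pcm_s16le` mono at 24 kHz is 48 kB per second raw, and base64 inflates that by 4/3. A quick sketch (the helper is illustrative, not part of the API):

```python
def one_shot_payload_bytes(duration_s: float, sample_rate: int = 24000) -> int:
    """Approximate size of the base64 audio_b64 field for a one-shot response."""
    pcm = int(duration_s * sample_rate * 2)   # 2 bytes per pcm_s16le sample
    return ((pcm + 2) // 3) * 4               # base64: 4 output chars per 3 input bytes

print(one_shot_payload_bytes(10.0))  # 640000 — ~0.6 MB of JSON for a 10 s utterance
```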
Reference
- Service docs: `ddx-cuda-live-tts/TTS_API_USAGE.md`, `TTS_FULL_CAPABILITIES.md`, `TTS_API_ENDPOINTS.md`, `TTS_GEN_PARAMS.md`
- Frozen wire format: `ddx-prd-specs/envelopes/README.md`, schema `tts-frame.schema.json`
- Live dashboard playground: http://localhost:4650/v1/metrics/dashboard