Dudoxx Omni — TTS Integration Skill
Self-contained guide for integrating the Dudoxx Text-to-Speech service from Next.js 16, NestJS 11, and Python 3.12. Drop this file into any LLM context — it stands alone.
Service: ddx-cuda-live-tts (port 4650) and ddx-mlx-live-tts (port 4150).
Public: https://tts.forge.dudoxx.com
Wire format: Deepgram-shaped TTS envelope, frozen v1.0.0.
Routes: POST /v1/synthesize · GET /v1/speak/sse · WS /v1/speak.
TL;DR
- Browser: never call TTS directly — the API key + host leak. Always proxy via Next.js / NestJS.
- Auth:
X-API-Key: <key>(REST/SSE) or?api_key=<key>(WS). - Output frames:
Metadata→SynthesisStarted→Audio*→SynthesisEnded.audiois base64pcm_s16lemono at requested sample rate. - Sample rates: 16000 / 22050 / 24000 / 48000.
- Voices: pick from
/v1/voicesor omit and let the engine pick a default per language. - Visemes: pass
emit_visemes: truefor Preston-Blair-15 lipsync frames (CUDA backend).
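The frame ordering above can be sketched as a small accumulator. A minimal sketch in Python — only the `type` and `audio` fields are taken from the envelope description; the helper name and the sample payloads are illustrative:

```python
import base64

def collect_pcm(frames: list[dict]) -> bytes:
    """Accumulate raw PCM from a Metadata -> SynthesisStarted -> Audio* -> SynthesisEnded sequence."""
    pcm = bytearray()
    for frame in frames:
        if frame.get("type") == "Audio":
            # 'audio' carries base64 pcm_s16le mono at the requested sample rate
            pcm += base64.b64decode(frame["audio"])
        elif frame.get("type") in ("SynthesisEnded", "Error"):
            break
    return bytes(pcm)

# 8 zero samples of pcm_s16le per chunk, split across two Audio frames
chunk = base64.b64encode(b"\x00\x00" * 8).decode()
frames = [
    {"type": "Metadata"},
    {"type": "SynthesisStarted"},
    {"type": "Audio", "audio": chunk},
    {"type": "Audio", "audio": chunk},
    {"type": "SynthesisEnded"},
]
print(len(collect_pcm(frames)))  # 32 bytes = 16 samples
```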
Request body
```jsonc
{
  "text": "Hallo mein Freund, hoffe es geht dir gut.",
  "voice": "de_frenz",          // optional
  "language": "de",
  "sample_rate": 24000,
  "speed": 1.0,
  "ref_audio_b64": null,        // optional cloning audio
  "emit_visemes": false,
  "stream": true,
  "gen_params": {               // optional sampler overrides (CUDA only)
    "subtalker_dosample": false,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "decode_window_frames": 25,
    "emit_every_frames": 2
  }
}
```

`gen_params` is optional. Server defaults already prevent the Qwen3-TTS codec from emitting an early stop token (no last-word truncation).
Browser anti-pattern (DO NOT DO)

```ts
// ❌ exposes the API key + host
const ws = new WebSocket('wss://tts.forge.dudoxx.com/v1/speak?api_key=...');

// ✅ proxy through your own server
const ws = new WebSocket('/api/tts/stream'); // same-origin, no key
```

Next.js 16 — SSE proxy (`app/api/tts/stream/route.ts`)
```ts
import type { NextRequest } from 'next/server';

export const runtime = 'nodejs';
export const dynamic = 'force-dynamic';

interface TtsBody {
  text: string;
  voice?: string;
  language?: string;
  sample_rate?: 16000 | 22050 | 24000 | 48000;
  speed?: number;
  emit_visemes?: boolean;
  gen_params?: Record<string, string | number | boolean>;
}

export async function POST(req: NextRequest): Promise<Response> {
  const body = (await req.json()) as TtsBody;
  const upstream = await fetch(
    `${process.env.TTS_URL}/v1/speak/sse?` +
      new URLSearchParams({
        text: body.text,
        voice: body.voice ?? '',
        language: body.language ?? 'en',
        sample_rate: String(body.sample_rate ?? 24000),
        speed: String(body.speed ?? 1.0),
        emit_visemes: String(body.emit_visemes ?? false),
        ...(body.gen_params ? { gen_params: JSON.stringify(body.gen_params) } : {}),
      }),
    {
      method: 'GET',
      headers: { 'X-API-Key': process.env.TTS_API_KEY! },
      cache: 'no-store',
    },
  );
  if (!upstream.ok || !upstream.body) {
    return new Response(`tts upstream ${upstream.status}`, { status: 502 });
  }
  return new Response(upstream.body, {
    status: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
      'X-Accel-Buffering': 'no',
    },
  });
}
```

Browser consumes the SSE:
```ts
const r = await fetch('/api/tts/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello world', language: 'en' }),
});
const reader = r.body!.getReader();
const dec = new TextDecoder();
let buf = '';
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += dec.decode(value, { stream: true });
  const lines = buf.split('\n');
  buf = lines.pop() ?? '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const frame = JSON.parse(line.slice(6));
    // dispatch frame.type === 'Audio' | 'Metadata' | …
  }
}
```

Next.js 16 — one-shot POST (small text, no streaming UI)
```ts
// app/api/tts/synthesize/route.ts
import type { NextRequest } from 'next/server';

export async function POST(req: NextRequest): Promise<Response> {
  const body = await req.json();
  const r = await fetch(`${process.env.TTS_URL}/v1/synthesize`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.TTS_API_KEY!,
    },
    body: JSON.stringify(body),
    cache: 'no-store',
  });
  return new Response(await r.text(), {
    status: r.status,
    headers: { 'Content-Type': 'application/json' },
  });
}
```

Browser receives `{ audio_b64, sample_rate, duration_s, visemes, metadata }`.
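`audio_b64` is raw `pcm_s16le`, not a playable file. Wrapping it in a WAV container with Python's stdlib `wave` module is a minimal sketch — the helper itself is illustrative, not part of the service:

```python
import base64, io, wave

def pcm_to_wav(audio_b64: str, sample_rate: int = 24000) -> bytes:
    """Wrap base64 pcm_s16le mono audio in a WAV container."""
    pcm = base64.b64decode(audio_b64)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)         # mono
        w.setsampwidth(2)         # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

wav = pcm_to_wav(base64.b64encode(b"\x00\x00" * 24000).decode())  # 1 s of silence
print(wav[:4])  # b'RIFF'
```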
NestJS 11 — service + controller
```ts
// src/modules/tts/tts.service.ts
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

export interface SpeakOpts {
  voice?: string;
  language?: string;
  sampleRate?: 16000 | 22050 | 24000 | 48000;
  genParams?: Record<string, string | number | boolean>;
}

@Injectable()
export class TtsService {
  constructor(private readonly cfg: ConfigService) {}

  async synthesize(text: string, opts: SpeakOpts = {}): Promise<Buffer> {
    const r = await fetch(`${this.cfg.get('TTS_URL')}/v1/synthesize`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('TTS_API_KEY'),
      },
      body: JSON.stringify({
        text,
        voice: opts.voice,
        language: opts.language ?? 'en',
        sample_rate: opts.sampleRate ?? 24000,
        gen_params: opts.genParams,
      }),
    });
    if (!r.ok) throw new Error(`tts ${r.status}`);
    const json = (await r.json()) as { audio_b64: string };
    return Buffer.from(json.audio_b64, 'base64');
  }

  async streamSse(text: string, opts: SpeakOpts, sink: NodeJS.WritableStream): Promise<void> {
    const params = new URLSearchParams({
      text,
      language: opts.language ?? 'en',
      sample_rate: String(opts.sampleRate ?? 24000),
      ...(opts.voice ? { voice: opts.voice } : {}),
      ...(opts.genParams ? { gen_params: JSON.stringify(opts.genParams) } : {}),
    });
    const r = await fetch(`${this.cfg.get('TTS_URL')}/v1/speak/sse?${params}`, {
      headers: { 'X-API-Key': this.cfg.getOrThrow('TTS_API_KEY') },
    });
    if (!r.ok || !r.body) throw new Error(`tts ${r.status}`);
    for await (const chunk of r.body as unknown as AsyncIterable<Uint8Array>) {
      sink.write(chunk);
    }
    sink.end();
  }
}
```

```ts
// src/modules/tts/tts.controller.ts
import { Body, Controller, Post, Res } from '@nestjs/common';
import type { Response } from 'express';
import { TtsService } from './tts.service';
import { SpeakDto } from './speak.dto'; // DTO with text + optional genParams, defined alongside

@Controller('tts')
export class TtsController {
  constructor(private readonly tts: TtsService) {}

  @Post('stream')
  async stream(@Body() body: SpeakDto, @Res() res: Response): Promise<void> {
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache, no-transform');
    res.setHeader('Connection', 'keep-alive');
    res.setHeader('X-Accel-Buffering', 'no');
    await this.tts.streamSse(body.text, { genParams: body.genParams }, res);
  }
}
```

Python 3.12 — async one-shot + WS streaming
```python
import base64, httpx

async def synthesize(text: str, *, voice: str | None = None,
                     language: str = "en", base_url: str, api_key: str) -> bytes:
    async with httpx.AsyncClient(timeout=60) as cli:
        r = await cli.post(
            f"{base_url}/v1/synthesize",
            headers={"X-API-Key": api_key},
            json={"text": text, "voice": voice, "language": language, "stream": False},
        )
        r.raise_for_status()
        return base64.b64decode(r.json()["audio_b64"])
```

```python
import asyncio, base64, json, websockets

async def stream_tts(text: str, *, base_url: str, api_key: str,
                     language: str = "en", voice: str | None = None,
                     gen_params: dict | None = None) -> bytes:
    url = base_url.replace("http", "ws") + f"/v1/speak?api_key={api_key}"
    pcm = bytearray()
    async with websockets.connect(url, max_size=None, ping_interval=None) as ws:
        await ws.send(json.dumps({
            "text": text, "voice": voice, "language": language,
            "sample_rate": 24000, "speed": 1.0,
            "emit_visemes": False, "stream": True,
            **({"gen_params": gen_params} if gen_params else {}),
        }))
        async for msg in ws:
            obj = json.loads(msg)
            if obj.get("type") == "Audio":
                pcm += base64.b64decode(obj["audio"])
            elif obj.get("type") in ("SynthesisEnded", "Error"):
                break
    return bytes(pcm)
```

Backend parity (CUDA vs MLX)
| Feature | CUDA :4650 | MLX :4150 |
|---|---|---|
| Wire format (envelope v1.0.0) | ✅ | ✅ |
| `gen_params` request field | ✅ | ❌ (single sampler, defaults already deterministic) |
| `subtalker_dosample=false` default fix | ✅ | n/a |
| Voice cloning (`ref_audio_b64`) | ✅ | ✅ |

Cross-backend clients can pass `gen_params` unconditionally — MLX silently ignores unknown fields.
Operational notes
- Tail truncation is fixed by server defaults: clients no longer need to pass `subtalker_dosample=false` to get a complete utterance.
- Override sparingly: passing `subtalker_dosample=true` restores the upstream `qwen-tts` stochastic behavior — last-word truncation may return.
- TTFB tuning: lower `emit_every_frames` (1–2) for lowest first-byte latency, higher (4–8) for fewer WS messages on long utterances.
- Quality tuning: raise `decode_window_frames` (25 → 50–80) for cleaner tail audio at the cost of latency.
- Cold-start: first frame ~600 ms after model warmup; subsequent requests reuse the warm engine.
Failure modes
| Symptom | Cause | Fix |
|---|---|---|
| HTTP 401 | missing `X-API-Key` / `?api_key` | Add the header / query param |
| HTTP 422 `unsupported language` | not in `SpeakRequest._LANGUAGE_PATTERN` | Use a BCP-47 short tag (`en`, `fr`, `de`, `it`) |
| HTTP 422 `unsupported sample_rate` | not in {16000, 22050, 24000, 48000} | Pick a supported rate |
| Last word missing in CUDA TTS | `subtalker_dosample=true` override | Remove the override or set it to `false` |
| WS closes 1011 mid-utterance | Engine exception (Qwen3 / Kokoro init) | Check `logs/tts.log`; restart: `./ddx-manage.sh restart --prod tts` |
| SSE frames buffer at NGINX | proxy buffering on | Set `proxy_buffering off;` and `proxy_set_header X-Accel-Buffering no;` |
| `audio_b64` payload too large | one-shot response holding a 10 s+ utterance | Switch to `/v1/speak/sse` for long text |
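The size pressure behind the last row is easy to quantify: `pcm_s16le` mono at 24 kHz is 48 kB per second raw, and base64 inflates that by 4/3. A quick sketch (the helper is illustrative, not part of the API):

```python
def one_shot_payload_bytes(duration_s: float, sample_rate: int = 24000) -> int:
    """Approximate size of the base64 audio_b64 field for a one-shot response."""
    pcm = int(duration_s * sample_rate * 2)   # 2 bytes per pcm_s16le sample
    return ((pcm + 2) // 3) * 4               # base64: 4 output chars per 3 input bytes

print(one_shot_payload_bytes(10.0))  # 640000 — ~0.6 MB of JSON for a 10 s utterance
```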
Reference
- Service docs: `ddx-cuda-live-tts/TTS_API_USAGE.md`, `TTS_FULL_CAPABILITIES.md`, `TTS_API_ENDPOINTS.md`, `TTS_GEN_PARAMS.md`
- Frozen wire format: `ddx-prd-specs/envelopes/README.md`, schema `tts-frame.schema.json`
- Live dashboard playground: http://localhost:4650/v1/metrics/dashboard