TTS — Help
Generate speech from text in three languages (English, German, French) with two operating modes, batch synthesis, voice cloning, and studio-grade post-processing.
This page covers everything you can do from the /tts panel on omni-demo. For HTTP integration examples (Next.js / NestJS / Python), see ddx-cuda-live-tts/TTS_API_USAGE.md.
Realtime vs HQ
The TTS panel has two synthesis modes, selected from the segmented control at the top.
Realtime streams audio over a WebSocket as it's generated. First sound plays in ~300 ms (TTFB); visemes arrive in sliding-window batches you can size with the Viseme window slider. Use Realtime for live demos, voice agents, or anywhere you need the mouth to start moving before the sentence finishes.
HQ posts the whole text in one HTTP request, waits for synthesis to complete, then plays a single decoded audio blob with word-level alignment. Latency is higher (full utterance time), but you get loudness normalization, container/bit-depth choices, head/tail silence, deterministic seeds, and click-to-seek word chips. Use HQ for podcast cuts, voice-over exports, or anything you'll save to disk.
| | Realtime | HQ |
|---|---|---|
| Transport | WS /v1/speak | POST /v1/synthesize/hq |
| TTFB | ~300 ms | full-utterance |
| Visemes | sliding window (slider) | terminal, word-aligned |
| Studio settings | hidden | visible |
| Word chips | no | yes (click to seek) |
Switching modes mid-utterance is blocked while audio is playing — press Stop first.
Studio settings
When you switch to HQ mode, a collapsible Studio settings panel appears below the main controls. It's hidden in Realtime because those parameters don't apply to a streaming WS contract.
Open it to set loudness target, container format, bit depth, and head/tail silence. The values you choose ride along with every HQ request until you change them — they're not saved across page reloads (yet).
Example: for a -16 LUFS, 24-bit FLAC export with 250 ms head silence, open Studio settings and set Loudness EBU R128, Target LUFS -16, Container flac, Bits 24-bit, Head silence 250.
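
As a rough sketch, the example above could translate into an HQ request body like the one built below. The keys `normalize_loudness`, `target_lufs`, `bits_per_sample`, `head_silence_ms`, and `tail_silence_ms` are taken from the ETag parameter list on this page; the `format` key and the `"ebu_r128"` value string are assumptions, so check TTS_API_ENDPOINTS.md for the actual schema.

```python
import json

ALLOWED_LUFS = (-23, -16, -14)  # values listed under Loudness normalization


def build_hq_body(text, target_lufs=-16, container="flac", bits=24,
                  head_ms=250, tail_ms=0):
    """Build a JSON-serializable body for POST /v1/synthesize/hq (sketch)."""
    if target_lufs not in ALLOWED_LUFS:
        raise ValueError(f"target_lufs must be one of {ALLOWED_LUFS}")
    for name, value in (("head_ms", head_ms), ("tail_ms", tail_ms)):
        if not 0 <= value <= 5000:
            raise ValueError(f"{name} must be within 0..5000 ms")
    return {
        "text": text,
        "normalize_loudness": "ebu_r128",  # assumed value string
        "target_lufs": target_lufs,
        "format": container,               # assumed key name
        "bits_per_sample": bits,
        "head_silence_ms": head_ms,
        "tail_silence_ms": tail_ms,
    }


body = build_hq_body("Welcome to Dudoxx Omni.")
print(json.dumps(body, indent=2))
```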
Loudness normalization
Three options:
- None — raw engine output. Levels vary by voice and prompt.
- Peak — ffmpeg single-pass peak normalize to −1 dBFS. Fast, prevents clipping, ignores Target LUFS.
- EBU R128 — single-pass `loudnorm` to your chosen target LUFS with `TP=-1.5`, `LRA=11`. Slower (a few hundred ms extra), but the level you target is the level you get.
Pick EBU R128 for anything that will be mixed with music or other voice tracks; pick Peak for fast batch exports where you just need a consistent ceiling.
Target LUFS is only shown when EBU R128 is selected. Allowed values: -23 (broadcast spec), -16 (podcast loud), -14 (streaming-platform default — Spotify / Apple Music).
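
For readers who want to reproduce the EBU R128 option offline, the sketch below builds an ffmpeg `loudnorm` filter string with the parameters quoted above (`I` is the filter's integrated-loudness target). How the server actually invokes ffmpeg is not documented here, so the argument list and the file names are assumptions.

```python
def loudnorm_filter(target_lufs: float, true_peak: float = -1.5,
                    lra: float = 11) -> str:
    """Return an ffmpeg audio-filter string for single-pass loudnorm."""
    return f"loudnorm=I={target_lufs:g}:TP={true_peak:g}:LRA={lra:g}"


def ffmpeg_args(src: str, dst: str, target_lufs: float) -> list:
    """Assumed command line; pass the list to subprocess.run to execute."""
    return ["ffmpeg", "-i", src, "-af", loudnorm_filter(target_lufs), dst]


print(loudnorm_filter(-16))
print(" ".join(ffmpeg_args("in.wav", "out.wav", -23)))
```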
Container format
Four output containers, all written from the same 24 kHz PCM master:
| Format | Bits | Use case |
|---|---|---|
| mp3 | 16 only | smallest file; lossy; max compatibility |
| wav | 16 / 24 | lossless; large; archival |
| flac | 16 / 24 | lossless compressed; ~50% of WAV |
| opus | 16 only | lossy; near-transparent speech at low bitrates; ideal for streaming |
The Studio panel Container dropdown chooses one. If you pick mp3 or opus, the bit-depth selector is constrained to 16 by the server.
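
A client can mirror the table above before sending a request. The sketch below assumes the server silently falls back to 16-bit for `mp3` and `opus`; it may instead reject the request, so treat the fallback as an assumption.

```python
# Bit depths per container, copied from the table above.
SUPPORTED_BITS = {
    "mp3": (16,),
    "wav": (16, 24),
    "flac": (16, 24),
    "opus": (16,),
}


def effective_bits(container: str, requested_bits: int) -> int:
    """Return the bit depth that will be used for the given container."""
    allowed = SUPPORTED_BITS[container]
    return requested_bits if requested_bits in allowed else allowed[0]


print(effective_bits("mp3", 24))
print(effective_bits("flac", 24))
```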
Bit depth & sample rate
Bit depth sets the dynamic range of each PCM sample:
- 16-bit — 96 dB SNR; CD-quality. Default. Use for speech, demos, web playback.
- 24-bit — 144 dB SNR; mastering grade. Use when you'll mix with other tracks, apply EQ, or hand off to a producer.
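
The 96 dB and 144 dB figures follow from the standard rule of thumb for linear PCM, roughly 6.02 dB per bit. A short check:

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits)


for bits in (16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
```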
Sample rate is fixed at 24 kHz for the engine and downsampled / upsampled by ffmpeg on the way out. The realtime WS contract negotiates sample_rate separately (16000 / 22050 / 24000 / 48000) — see Realtime vs HQ.
Head & tail silence
Two number inputs (0–5000 ms, step 50) pad pure silence at the start and end of the rendered audio:
- Head silence — useful before a voice-over so the editor has a frame of pre-roll, or to stop a hard click at playback start.
- Tail silence — lets the utterance settle into quiet rather than truncating on the last phoneme. ~200 ms is usually enough; 1000 ms feels like a hand-off pause.
Both default to 0. Server enforces the 5000 ms upper bound.
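
To make the padding concrete, here is a minimal sketch that pads raw 16-bit mono PCM at the 24 kHz master rate. The server does this for you; the function is illustrative only and the raw-PCM input format is an assumption.

```python
SAMPLE_RATE = 24_000   # engine master rate stated on this page
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM (assumed input format)


def pad_silence(pcm: bytes, head_ms: int = 0, tail_ms: int = 0) -> bytes:
    """Prepend and append digital silence (zero samples) to raw PCM."""
    for ms in (head_ms, tail_ms):
        if not 0 <= ms <= 5000:
            raise ValueError("silence must be within 0..5000 ms")

    def silence(ms: int) -> bytes:
        samples = SAMPLE_RATE * ms // 1000
        return b"\x00" * (samples * BYTES_PER_SAMPLE)

    return silence(head_ms) + pcm + silence(tail_ms)


padded = pad_silence(b"", head_ms=250)
print(len(padded))  # 250 ms -> 6000 samples -> 12000 bytes
```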
SSML — prosody & phoneme
The SSML lite toggle (gear icon in Controls) lets you embed two tag families inline:
```xml
<prosody rate="slow" pitch="-2st">Please listen carefully.</prosody>
<phoneme alphabet="ipa" ph="ˈdjuːdɒks">Dudoxx</phoneme>
```

Supported rate values: x-slow | slow | medium | fast | x-fast (or any positive decimal, e.g. 0.85).
Supported pitch values: ±Nst (semitones, range -12st…+12st), x-low | low | medium | high | x-high.
Phoneme alphabet accepts ipa only; ph is the IPA string spoken in place of the child text.
When SSML lite is OFF, tags are stripped and the text inside is spoken verbatim — safe default for user-pasted content.
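
The rules above can be sketched client-side as two validators plus a tag stripper. This is not the server's normalizer: the regexes, the optional sign on pitch, and the choice to strip only `prosody` and `phoneme` tags are assumptions.

```python
import re

RATE_KEYWORDS = {"x-slow", "slow", "medium", "fast", "x-fast"}
PITCH_KEYWORDS = {"x-low", "low", "medium", "high", "x-high"}
_TAG = re.compile(r"</?(?:prosody|phoneme)\b[^>]*>")


def valid_rate(value: str) -> bool:
    """Keyword or positive decimal such as 0.85."""
    if value in RATE_KEYWORDS:
        return True
    return bool(re.fullmatch(r"\d+(?:\.\d+)?", value)) and float(value) > 0


def valid_pitch(value: str) -> bool:
    """Keyword or semitone offset within -12st..+12st."""
    if value in PITCH_KEYWORDS:
        return True
    match = re.fullmatch(r"[+-]?(\d+)st", value)
    return match is not None and int(match.group(1)) <= 12


def strip_ssml_lite(text: str) -> str:
    """Drop the two supported tag families, keeping the inner text."""
    return _TAG.sub("", text)


print(strip_ssml_lite('<prosody rate="slow" pitch="-2st">Please listen carefully.</prosody>'))
```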
Word alignment & visemes
Every HQ response carries an alignment[] of {word, start, end} triples plus a visemes[] array of {viseme, start, duration} frames. The panel uses both:
- Word chips render under the player; the chip whose `[start, end]` brackets `<audio>.currentTime` is highlighted. Click a chip to seek the player to that word's `start`.
- VisemeFace mouth-shape ticks against the same `<audio>.currentTime` via a single requestAnimationFrame loop — no buffering, plays immediately.
In Realtime mode visemes stream in sliding windows (default 2 s, range 0.5–5.0). Lower window = tighter lip-sync, higher TTFB; higher window = more audio per batch, looser sync. Toggle Emit visemes off entirely if you only need audio (saves bandwidth + aligner CPU on the server).
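
The chip highlighting described above amounts to an interval lookup on `alignment[]`. A minimal sketch, assuming the entries are sorted by `start` and that times are in seconds (the timings below are made-up illustration values, not real engine output):

```python
from bisect import bisect_right


def active_word_index(alignment, current_time):
    """Index of the word whose [start, end] contains current_time, else None."""
    starts = [w["start"] for w in alignment]
    i = bisect_right(starts, current_time) - 1
    if i >= 0 and current_time <= alignment[i]["end"]:
        return i
    return None


alignment = [  # hypothetical timings
    {"word": "Welcome", "start": 0.00, "end": 0.42},
    {"word": "to", "start": 0.45, "end": 0.55},
    {"word": "Dudoxx", "start": 0.58, "end": 1.10},
]

print(active_word_index(alignment, 0.50))  # inside "to"
print(active_word_index(alignment, 0.43))  # gap between words
```

Clicking a chip is the inverse operation: set the player position to `alignment[i]["start"]`.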
Batch synthesis
The Batch panel synthesizes up to 32 items in one HTTP round-trip via POST /v1/synthesize/batch.
Two input modes:
- Pipe (default) — one item per line, `id|text`:

  ```
  intro|Welcome to Dudoxx Omni.
  pitch|We turn raw audio into structured records.
  close|Talk soon.
  ```

  If you omit the `id|`, a random id is generated.

- JSON — paste an array of `{id, text}` objects:

  ```json
  [
    {"id": "intro", "text": "Welcome to Dudoxx Omni."},
    {"id": "pitch", "text": "We turn raw audio into structured records."}
  ]
  ```
Defaults at the top (voice, language, speed, loudness, target LUFS, bits, format) apply to every item — the per-request body is built server-side from your defaults plus each item's text.
Results render in input order: one <audio> + Download link per success, a red badge with {error.code}: {error.message} per failure. Blob URLs are revoked when you submit again or leave the page. Over 32 items shows a warning and disables Synthesize.
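
For scripted use, the pipe format can be parsed along these lines. This is a sketch of the behaviour described above, not the panel's actual parser: the 8-character hex id and the handling of lines without a `|` are assumptions, and a line whose text itself contains `|` but has no id would be split at the first `|`.

```python
import uuid

MAX_ITEMS = 32  # batch limit stated above


def parse_pipe_input(raw: str) -> list:
    """Turn 'id|text' lines into [{'id': ..., 'text': ...}, ...]."""
    items = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        item_id, sep, text = line.partition("|")
        if not sep:  # no "id|" prefix: the whole line is the text
            item_id, text = uuid.uuid4().hex[:8], line
        items.append({"id": item_id.strip(), "text": text.strip()})
    if len(items) > MAX_ITEMS:
        raise ValueError(f"batch accepts at most {MAX_ITEMS} items")
    return items


items = parse_pipe_input("intro|Welcome to Dudoxx Omni.\nTalk soon.")
print(items)
```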
Voice cloning (style + tier)
Two cloning slots are wired through the request body:
- `ref_audio_b64` — the primary voice prompt (3–10 s, 16 kHz+ mono). The engine matches timbre + cadence.
- `style_ref_audio_b64` — optional secondary clip whose style (emotion, intensity, prosody pattern) is transferred onto the primary voice.
Two tuning knobs:
- `clone_strength` (0.0–1.0, default `0.7`) — how strictly the clone tracks the prompt. Lower = more engine personality; higher = closer mimicry.
- `clone_steps` (8–32, default `16`) — ICL steps. More steps = sharper match at the cost of TTFB.
The web UI uses the engine defaults; programmatic clients can pass both fields in the request body. See TTS_API_ENDPOINTS.md for the field schemas.
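
A programmatic client might assemble the cloning fields as follows. The four field names and their ranges come from this section; encoding the whole audio file as standard base64 is an assumption about what the server expects.

```python
import base64
from typing import Optional


def build_clone_fields(ref_audio: bytes,
                       style_ref_audio: Optional[bytes] = None,
                       clone_strength: float = 0.7,
                       clone_steps: int = 16) -> dict:
    """Return the cloning-related part of a request body (sketch)."""
    if not 0.0 <= clone_strength <= 1.0:
        raise ValueError("clone_strength must be within 0.0..1.0")
    if not 8 <= clone_steps <= 32:
        raise ValueError("clone_steps must be within 8..32")
    fields = {
        "ref_audio_b64": base64.b64encode(ref_audio).decode("ascii"),
        "clone_strength": clone_strength,
        "clone_steps": clone_steps,
    }
    if style_ref_audio is not None:
        fields["style_ref_audio_b64"] = base64.b64encode(style_ref_audio).decode("ascii")
    return fields


fields = build_clone_fields(b"RIFF")  # placeholder bytes, not a real WAV
print(sorted(fields))
```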
Voice catalogue (31 voices, 3 languages)
The full live catalogue is at GET /v1/voices and ddx-web fetches it on every TTS / Translator page load (cache: 'no-store') — new voices appear automatically without a deploy.
English (10): ddx_bella F, ddx_heart F, ddx_adam M (DDX clones), plus en_eleanor, en_charlotte, en_victoria F and en_william, en_george, en_arthur M (LibriVox-sourced 2026-05-17).
German (11): de_katharina F, de_frenz, de_hans, de_karlsson, de_hokuspokus M (legacy), plus de_alice_anna, de_alice_maria, de_alice_klara F and de_grimm_max, de_grimm_otto, de_grimm_kurt M (LibriVox-sourced 2026-05-17).
French (10): fr_sonia, fr_ezwa, fr_nadine F, fr_jean M (legacy), plus fr_camille, fr_juliette, fr_margot F and fr_jules, fr_louis, fr_henri M (LibriVox-sourced 2026-05-17).
Open-license voices (ids matching `*_alice_*` and `*_grimm_*`, plus en_william/george/arthur/eleanor/charlotte/victoria and fr_jules/louis/henri/camille/juliette/margot) are derived from public-domain LibriVox recordings, preprocessed with a fixed pipeline (high-pass 80 Hz → spectral denoise → loudness-normalize to -16 LUFS → 20 s trim → 24 kHz mono PCM-16). Each voice ships a `voices/<id>.<lang>.manifest.json` next to the WAV with the source URL, license, and processing parameters. License: public domain (LibriVox). For attribution in commercial use, the manifest's `license_url` field points to the source Internet Archive item.
The VoiceSettingsPanel side sheet groups voices by language and shows engine / gender / accent metadata. Click Set default to pin one to sessionStorage['ddx-tts-voice-pref']. The Translator board (ddx-web/src/components/translator/) and Batch panel both read the same catalogue.
Spell-out tokens
The text normalizer expands numerals, units, and dates into the spoken form before synthesis. If a token MUST be spelled letter-by-letter (e.g. an acronym you don't want pronounced as a word, or a serial number), wrap it in the spell_out request field:
```json
{
  "text": "Your case number is BFG-9000.",
  "spell_out": ["BFG-9000"]
}
```

The normalizer then emits `B F G dash nine zero zero zero` to the engine instead of the literal token. Multiple entries are matched case-insensitively in order.
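
The expansion can be approximated with a few lines of Python. This is a sketch that reproduces the `BFG-9000` example, not the server's normalizer; the symbol table and the pass-through of unlisted characters are assumptions.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
SYMBOLS = {"-": "dash"}  # assumed; extend as needed


def spell_out_token(token: str) -> str:
    """Spell a token character by character."""
    parts = []
    for ch in token:
        if ch in "0123456789":
            parts.append(DIGITS[int(ch)])
        elif ch in SYMBOLS:
            parts.append(SYMBOLS[ch])
        else:
            parts.append(ch.upper())
    return " ".join(parts)


def apply_spell_out(text: str, spell_out: list) -> str:
    """Replace each listed token, case-insensitively and in order."""
    for token in spell_out:
        text = re.sub(re.escape(token),
                      lambda m: spell_out_token(m.group(0)),
                      text, flags=re.IGNORECASE)
    return text


print(apply_spell_out("Your case number is BFG-9000.", ["BFG-9000"]))
```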
Usage receipts
Every HQ response carries a usage block:
```json
{
  "characters": 142,
  "audio_seconds": 9.83,
  "engine": "qwen3-12hz-0.6b",
  "gpu_seconds": 2.4
}
```

The panel shows a Usage card next to the player after each synthesis. `gpu_seconds` is hidden when the backend doesn't report it (MLX does not — CUDA does). Use the receipt to:
- Predict cost (`characters` for prompt-billed plans, `audio_seconds` for runtime-billed).
- Detect a stuck engine (`gpu_seconds` >> `audio_seconds` means the model is thrashing).
- Capacity-plan (`audio_seconds / wall_seconds` = real-time factor).
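
The last two checks can be computed from the receipt as sketched below. `wall_seconds` is not part of the usage block, so the client has to measure it; the 3.0 value and the 3x threshold for "stuck" are arbitrary assumptions for illustration.

```python
def real_time_factor(usage: dict, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return usage["audio_seconds"] / wall_seconds


def looks_stuck(usage: dict, ratio: float = 3.0) -> bool:
    """True when gpu_seconds far exceeds audio_seconds (threshold assumed)."""
    gpu = usage.get("gpu_seconds")
    if gpu is None:  # e.g. a backend that does not report GPU time
        return False
    return gpu > ratio * usage["audio_seconds"]


usage = {"characters": 142, "audio_seconds": 9.83,
         "engine": "qwen3-12hz-0.6b", "gpu_seconds": 2.4}
print(round(real_time_factor(usage, wall_seconds=3.0), 2))
print(looks_stuck(usage))
```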
ETag cache (advanced)
The HQ and /v1/render endpoints honor HTTP cache validation. Server hashes {text, voice, language, speed, seed, quality, normalize_loudness, target_lufs, bits_per_sample, lexicon_hash, ssml_lite, spell_out, head_silence_ms, tail_silence_ms, ref_audio_sha256, style_ref_audio_sha256} into a strong ETag.
Subsequent requests sending If-None-Match: "<etag>" for the same parameters get a 304 Not Modified with zero body — useful for:
- Idempotent retries — a flaky network won't double-bill.
- Browser caching — same prompt across two tabs reuses one synthesis.
- CDN warm-up — pre-fetch popular prompts; subsequent users hit the edge.
The web UI does not surface this directly; pass If-None-Match from your own client and you'll see the 304 in DevTools.
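
To illustrate the validation flow, here is a toy model of how a strong ETag could be derived and checked. The server's real hashing scheme is not documented on this page: SHA-256 over canonical JSON is an assumption, and the comparison ignores `*` and multi-value `If-None-Match` headers.

```python
import hashlib
import json


def strong_etag(params: dict) -> str:
    """Hash the synthesis parameters into a quoted strong ETag (assumed scheme)."""
    canonical = json.dumps(params, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f'"{digest}"'


def respond(params: dict, if_none_match=None):
    """Return (status, headers, body) the way a validating server might."""
    etag = strong_etag(params)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, b"<audio bytes>"  # placeholder body


params = {"text": "Welcome to Dudoxx Omni.", "voice": "ddx_bella", "seed": 7}
status, headers, _ = respond(params)
print(status)
print(respond(params, if_none_match=headers["ETag"])[0])
```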
Reference
- Frontend component map: `ddx-web/TTS_FRONTEND.md`
- Backend wire contract: `ddx-cuda-live-tts/TTS_API_ENDPOINTS.md`
- Integration recipes (Next.js / NestJS / Python): `ddx-cuda-live-tts/TTS_API_USAGE.md`
- Frozen envelope schema: `ddx-prd-specs/envelopes/schemas/tts-frame.schema.json`