Dudoxx Omni — LLM Integration Skill
Self-contained guide for integrating the Dudoxx LLM service from Next.js 16, NestJS 11, and Python 3.12. OpenAI-compatible chat with model swap, streaming, and tool calling.
Service: `ddx-mlx-llm` (port 4250).
Public: a deployed demo (proxied) is available at https://omni-demo.forge.dudoxx.com.
Routes: `GET /v1/models` · `GET /v1/models/current` · `POST /v1/models/load` · `POST /v1/chat/completions`.
Modes: mock (default — canned tool-call response) · real (`DDX_LLM_USE_REAL_MODEL=1`, mounts the upstream mlx-omni-server chat router).
TL;DR
- Auth: `X-API-Key: <key>` header OR `?api_key=<key>` query. Constant-time HMAC compare against `DDX_LLM_API_KEYS` (see the sketch after this list). When the env var is empty, auth is bypassed (warning logged in prod).
- Model registry: `GET /v1/models` returns the curated list with dudoxx extras (`family`, `arch`, `total_params_b`, `quantization`, `context_window`, `multimodal`, `label`).
- Hot swap: `POST /v1/models/load { model_id }` — the old model is auto-unloaded.
- Streaming: `POST /v1/chat/completions` with `stream: true` → SSE chunks ending with `data: [DONE]`.
- Disconnection: the server polls `request.is_disconnected()` between SSE chunks and `aclose()`s the upstream generator.
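For orientation, a minimal sketch of what a constant-time key check looks like on the Node side — `verifyApiKey` and the `'pepper'` secret are illustrative, not the service's actual implementation:

```ts
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical mirror of the service's constant-time compare.
// HMAC-ing both sides first yields equal-length buffers, so
// timingSafeEqual never throws and the compare leaks no timing info.
function verifyApiKey(presented: string, allowedKeys: string[]): boolean {
  const mac = (s: string) => createHmac('sha256', 'pepper').update(s).digest();
  const digest = mac(presented);
  return allowedKeys.some((key) => timingSafeEqual(digest, mac(key)));
}
```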
Endpoints
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /health /healthz | none | { status, model, model_state, model_load_elapsed_ms, … } |
| GET | /metrics | none | Prometheus exposition |
| GET | /v1/models | none | List registry (OpenAI-compat + dudoxx extras) |
| GET | /v1/models/current | none | { model_id, loaded, last_load_ms } |
| POST | /v1/models/load | API key | { model_id } → { swapped, load_ms, loaded } |
| POST | /v1/chat/completions | API key | OpenAI-compatible chat (mock or real) |
Status aggregation in `/health`: `ready` → `ok`, `loading` → `loading`, `idle`/`failed` → `degraded`. Always returns HTTP 200.
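Because `/health` always answers 200, clients must branch on the JSON `status` field rather than the HTTP code. A minimal readiness gate, assuming only the `status` field described above (`waitForLlm` is a hypothetical helper, not part of any SDK):

```ts
// Poll /health until the aggregated status is 'ok' (model ready).
// The endpoint always returns HTTP 200, so inspect the body instead.
async function waitForLlm(baseUrl: string, timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${baseUrl}/health`);
    const { status } = (await res.json()) as { status: string };
    if (status === 'ok') return;                    // ready
    await new Promise((r) => setTimeout(r, 1_000)); // still loading / degraded
  }
  throw new Error(`LLM not ready after ${timeoutMs} ms`);
}
```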
Next.js 16 — service-side OpenAI-compat client
```ts
// app/lib/llm.ts
import 'server-only';
import OpenAI from 'openai';

export function llmClient(): OpenAI {
  return new OpenAI({
    baseURL: `${process.env.LLM_URL}/v1`,
    apiKey: process.env.LLM_API_KEY ?? 'dummy',
  });
}

export async function chat(prompt: string): Promise<string> {
  const r = await llmClient().chat.completions.create({
    model: process.env.LLM_MODEL ?? 'mlx-community/Qwen3-4B-Instruct-2507-4bit',
    messages: [{ role: 'user', content: prompt }],
    stream: false,
  });
  return r.choices[0]?.message?.content ?? '';
}
```
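Usage from a React Server Component — the page below is a hypothetical example, not part of the service:

```tsx
// app/demo/page.tsx — hypothetical server component using the helper above
import { chat } from '@/app/lib/llm';

export default async function DemoPage() {
  const answer = await chat('Say hello in one sentence.');
  return <pre>{answer}</pre>;
}
```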
Next.js 16 — streaming proxy (app/api/llm/chat/route.ts)
```ts
import type { NextRequest } from 'next/server';

export const runtime = 'nodejs';
export const dynamic = 'force-dynamic';

export async function POST(req: NextRequest): Promise<Response> {
  const body = await req.json();
  const upstream = await fetch(`${process.env.LLM_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.LLM_API_KEY!,
    },
    body: JSON.stringify({ ...body, stream: true }),
    cache: 'no-store',
  });
  if (!upstream.ok || !upstream.body) {
    return new Response(`llm upstream ${upstream.status}`, { status: 502 });
  }
  return new Response(upstream.body, {
    status: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
      'X-Accel-Buffering': 'no',
    },
  });
}
```
The browser consumes the SSE the same way as TTS (`data:` prefix, `[DONE]` terminator).
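A minimal browser-side consumer, as a sketch — it assumes OpenAI-style delta chunks and the proxy route above:

```ts
// Read the proxied SSE stream and surface text deltas as they arrive.
async function consumeChat(prompt: string, onDelta: (text: string) => void) {
  const res = await fetch('/api/llm/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep any partial line for the next read
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice('data: '.length).trim();
      if (payload === '[DONE]') return; // stream terminator
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) onDelta(delta);
    }
  }
}
```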
NestJS 11 — chat service
```ts
// src/modules/llm/llm.service.ts
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

export interface ChatMessage {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
  tool_call_id?: string;
}

export interface ChatRequest {
  messages: ChatMessage[];
  model?: string;
  temperature?: number;
  max_tokens?: number;
  tools?: ChatTool[];
  stream?: boolean;
}

export interface ChatTool {
  type: 'function';
  function: { name: string; description?: string; parameters: object };
}

export interface ChatResponse {
  id: string;
  choices: Array<{
    message: { role: string; content: string | null; tool_calls?: ToolCall[] };
    finish_reason: string;
  }>;
}

export interface ToolCall {
  id: string;
  type: 'function';
  function: { name: string; arguments: string };
}

@Injectable()
export class LlmService {
  constructor(private readonly cfg: ConfigService) {}

  async chat(req: ChatRequest): Promise<ChatResponse> {
    const r = await fetch(`${this.cfg.get('LLM_URL')}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('LLM_API_KEY'),
      },
      body: JSON.stringify({ ...req, stream: false }),
    });
    if (!r.ok) throw new Error(`llm ${r.status}`);
    return (await r.json()) as ChatResponse;
  }

  async stream(req: ChatRequest, sink: NodeJS.WritableStream): Promise<void> {
    const r = await fetch(`${this.cfg.get('LLM_URL')}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('LLM_API_KEY'),
      },
      body: JSON.stringify({ ...req, stream: true }),
    });
    if (!r.ok || !r.body) throw new Error(`llm ${r.status}`);
    for await (const chunk of r.body as unknown as AsyncIterable<Uint8Array>) {
      sink.write(chunk);
    }
    sink.end();
  }
}
```
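One way to expose `stream()` over HTTP — a hypothetical controller; the class and route names are illustrative:

```ts
// src/modules/llm/llm.controller.ts — sketch, not part of the service
import { Body, Controller, Post, Res } from '@nestjs/common';
import type { Response } from 'express';
import { LlmService, type ChatRequest } from './llm.service';

@Controller('llm')
export class LlmController {
  constructor(private readonly llm: LlmService) {}

  @Post('chat/stream')
  async stream(@Body() req: ChatRequest, @Res() res: Response): Promise<void> {
    // The raw Express response satisfies the NodeJS.WritableStream sink.
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache, no-transform');
    await this.llm.stream(req, res);
  }
}
```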
Python 3.12 — async client (openai SDK + httpx)
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://127.0.0.1:4250/v1",
    api_key="dummy",
)

async def ask(prompt: str) -> str:
    r = await client.chat.completions.create(
        model="mlx-community/Qwen3-4B-Instruct-2507-4bit",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content or ""
```
Streaming:
```python
async def stream(prompt: str):
    async for chunk in await client.chat.completions.create(
        model="mlx-community/Qwen3-4B-Instruct-2507-4bit",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```
Hot model swap (raw httpx — not in the OpenAI SDK):
```python
import httpx

async def swap_model(model_id: str) -> dict:
    async with httpx.AsyncClient(timeout=120) as cli:
        r = await cli.post(
            "http://127.0.0.1:4250/v1/models/load",
            headers={"X-API-Key": "dummy"},
            json={"model_id": model_id},
        )
        r.raise_for_status()
        return r.json()  # { swapped, load_ms, loaded }
```
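The Node equivalent for Next.js/NestJS callers — a sketch, since the OpenAI JS SDK likewise has no models/load endpoint:

```ts
// Hot-swap the served model via plain fetch; mirrors the httpx example above.
async function swapModel(modelId: string) {
  const res = await fetch(`${process.env.LLM_URL}/v1/models/load`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.LLM_API_KEY!,
    },
    body: JSON.stringify({ model_id: modelId }),
  });
  if (!res.ok) throw new Error(`model swap failed: ${res.status}`);
  return res.json(); // { swapped, load_ms, loaded }
}
```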
Tool calling (mock mode default response)
In mock mode, `POST /v1/chat/completions` returns a canned response with one tool call:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\":\"Berlin\"}" }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```
Use this to verify your client tool-call parsing before flipping `DDX_LLM_USE_REAL_MODEL=1`.
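A sketch of the full round trip — parse the tool call, run your own implementation, and feed the result back as a `tool` message. The `get_weather` stub and canned result are illustrative, and the exact `tool_calls` typing varies slightly across openai SDK versions; the wire format is standard OpenAI tool calling:

```ts
import OpenAI from 'openai';

// Verify tool-call parsing against the mock endpoint, then close the loop.
async function verifyToolCalls(client: OpenAI, model: string) {
  const tools: OpenAI.Chat.ChatCompletionTool[] = [{
    type: 'function',
    function: {
      name: 'get_weather',
      parameters: { type: 'object', properties: { city: { type: 'string' } } },
    },
  }];
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    { role: 'user', content: 'Weather in Berlin?' },
  ];

  const first = await client.chat.completions.create({ model, messages, tools });
  const msg = first.choices[0].message;
  if (msg.tool_calls?.length) {
    messages.push(msg); // keep the assistant turn that requested the tool
    for (const call of msg.tool_calls) {
      if (call.type !== 'function') continue;
      const { city } = JSON.parse(call.function.arguments); // mock: {"city":"Berlin"}
      messages.push({
        role: 'tool',
        tool_call_id: call.id,
        content: JSON.stringify({ city, temp_c: 21 }), // stand-in tool result
      });
    }
  }
  return client.chat.completions.create({ model, messages }); // final answer
}
```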
In real mode, the upstream mlx-omni-server chat router is mounted as-is — request/response shapes match standard OpenAI.
Errors
Shape (`{ code, message, detail? }`):
| Exception | HTTP | code |
|---|---|---|
| RequestInvalid | 422 | request_invalid |
| ContextTooLong | 422 | context_too_long |
| ModelNotFound | 404 | model_not_found |
| ModelUnavailable | 503 | model_unavailable |
| other LlmError | 500 | llm_error |
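Client code can branch on `code` instead of parsing messages. A sketch — the `LlmApiError` class is illustrative, not part of any SDK:

```ts
// Surface the service's { code, message, detail? } error body as a typed error.
class LlmApiError extends Error {
  constructor(public status: number, public code: string, message: string) {
    super(`${status} ${code}: ${message}`);
  }
}

async function chatOrThrow(body: unknown) {
  const res = await fetch(`${process.env.LLM_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.LLM_API_KEY!,
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    const err = (await res.json()) as { code: string; message: string };
    // e.g. on 'context_too_long', truncate history and retry
    throw new LlmApiError(res.status, err.code, err.message);
  }
  return res.json();
}
```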
Endpoints NOT exposed
The dudoxx layer adds no completions/embeddings router. In real mode, whatever the upstream `mlx_omni_server.chat.openai.router` exports is mounted as-is.
- `POST /v1/completions` (legacy text completions) — not provided
- `POST /v1/embeddings` — not provided
- `POST /v1/audio/*`, `POST /v1/images/*` — not provided
For embeddings, use a separate service (e.g. the ddx-mlx-llm registry's multimodal models, once added).
Failure modes
| Symptom | Cause | Fix |
|---|---|---|
| HTTP 404 model_not_found | requested model_id not in registry | GET /v1/models first, pick an id |
| HTTP 404 + detail="server not in real-model mode" on /v1/models/load | DDX_LLM_USE_REAL_MODEL=0 | Set the env var, then restart: ./ddx-manage.sh restart --prod llm |
| HTTP 422 context_too_long | total tokens exceed the model's context_window | Truncate history or pick a larger-context model |
| HTTP 503 model_unavailable | model file missing / load failed | Check model_error in /health; verify models/cache/ |
| Stream stalls at first chunk | upstream still loading the model on first request | Poll /v1/models/current.loaded === true before streaming (sketch below) |
| CORS error in browser | origin not in DDX_LLM_CORS_ORIGINS | Add origin or proxy through Next.js (preferred) |
| No assistant content in mock mode | mock always returns tool_calls | Switch to real mode or handle tool_calls in client |
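For the stalled-stream row, a minimal gate on `/v1/models/current` — a sketch assuming the `{ model_id, loaded, last_load_ms }` shape from the endpoints table:

```ts
// Block until /v1/models/current reports loaded === true,
// so the first streaming request doesn't sit behind a model load.
async function waitForModel(baseUrl: string, attempts = 30): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(`${baseUrl}/v1/models/current`);
    const { loaded } = (await res.json()) as { loaded: boolean };
    if (loaded) return;
    await new Promise((r) => setTimeout(r, 2_000)); // model still loading
  }
  throw new Error('model did not finish loading');
}
```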
Reference
- Service docs: `ddx-mlx-llm/LLM_API_USAGE.md`, `LLM_API_ENDPOINTS.md`, `LLM_FULL_CAPABILITIES.md`
- OpenAPI: `http://127.0.0.1:4250/openapi.json`; security schemes `APIKeyHeader` + `APIKeyQuery`
- Headers: every response carries `X-Request-ID` (echoed if the client supplied one; header name configurable via `DDX_LLM_REQUEST_ID_HEADER`)
- CORS: `DDX_LLM_CORS_ORIGINS` (default `http://localhost:3100`)