Dudoxx Omni — LLM Integration Skill
Self-contained guide for integrating the Dudoxx LLM service from Next.js 16, NestJS 11, and Python 3.12. OpenAI-compatible chat with model swap, streaming, and tool calling.
Service: `ddx-mlx-llm` (port 4250).
Public: a deployed demo (proxied) is available at https://omni-demo.forge.dudoxx.com.
Routes: `GET /v1/models` · `GET /v1/models/current` · `POST /v1/models/load` · `POST /v1/chat/completions`.
Modes: mock (default — canned tool-call response) · real (`DDX_LLM_USE_REAL_MODEL=1`, mounts the upstream mlx-omni-server chat router).
TL;DR
- Auth: `X-API-Key: <key>` header OR `?api_key=<key>` query. Constant-time HMAC compare against `DDX_LLM_API_KEYS` (see the sketch after this list). When the env var is empty, auth is bypassed (warning logged in prod).
- Model registry: `GET /v1/models` returns the curated list with dudoxx extras (`family`, `arch`, `total_params_b`, `quantization`, `context_window`, `multimodal`, `label`).
- Hot swap: `POST /v1/models/load { model_id }` — the old model is auto-unloaded.
- Streaming: `POST /v1/chat/completions` with `stream: true` → SSE chunks ending with `data: [DONE]`.
- Disconnection: the server polls `request.is_disconnected()` between SSE chunks and `aclose()`s the upstream generator.
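For orientation, a minimal sketch of what a constant-time key check looks like on the Node side — `verifyApiKey` and the `'pepper'` secret are illustrative, not the service's actual implementation:

```ts
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical mirror of the service's constant-time compare.
// HMAC-ing both sides first yields equal-length buffers, so
// timingSafeEqual never throws and the compare leaks no timing info.
function verifyApiKey(presented: string, allowedKeys: string[]): boolean {
  const mac = (s: string) => createHmac('sha256', 'pepper').update(s).digest();
  const digest = mac(presented);
  return allowedKeys.some((key) => timingSafeEqual(digest, mac(key)));
}
```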
Endpoints
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /health /healthz | none | { status, model, model_state, model_load_elapsed_ms, … } |
| GET | /metrics | none | Prometheus exposition |
| GET | /v1/models | none | List registry (OpenAI-compat + dudoxx extras) |
| GET | /v1/models/current | none | { model_id, loaded, last_load_ms } |
| POST | /v1/models/load | API key | { model_id } → { swapped, load_ms, loaded } |
| POST | /v1/chat/completions | API key | OpenAI-compatible chat (mock or real) |
Status aggregation in `/health`: `ready` → `ok`, `loading` → `loading`, `idle`/`failed` → `degraded`. Always returns HTTP 200.
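Because `/health` always answers 200, clients must branch on the JSON `status` field rather than the HTTP code. A minimal readiness gate, assuming only the `status` field described above (`waitForLlm` is a hypothetical helper, not part of any SDK):

```ts
// Poll /health until the aggregated status is 'ok' (model ready).
// The endpoint always returns HTTP 200, so inspect the body instead.
async function waitForLlm(baseUrl: string, timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${baseUrl}/health`);
    const { status } = (await res.json()) as { status: string };
    if (status === 'ok') return;                    // ready
    await new Promise((r) => setTimeout(r, 1_000)); // still loading / degraded
  }
  throw new Error(`LLM not ready after ${timeoutMs} ms`);
}
```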
Next.js 16 — service-side OpenAI-compat client
```ts
// app/lib/llm.ts
import 'server-only';
import OpenAI from 'openai';

export function llmClient(): OpenAI {
  return new OpenAI({
    baseURL: `${process.env.LLM_URL}/v1`,
    apiKey: process.env.LLM_API_KEY ?? 'dummy',
  });
}

export async function chat(prompt: string): Promise<string> {
  const r = await llmClient().chat.completions.create({
    model: process.env.LLM_MODEL ?? 'mlx-community/Qwen3-4B-Instruct-2507-4bit',
    messages: [{ role: 'user', content: prompt }],
    stream: false,
  });
  return r.choices[0]?.message?.content ?? '';
}
```
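Usage from a React Server Component — the page below is a hypothetical example, not part of the service:

```tsx
// app/demo/page.tsx — hypothetical server component using the helper above
import { chat } from '@/app/lib/llm';

export default async function DemoPage() {
  const answer = await chat('Say hello in one sentence.');
  return <pre>{answer}</pre>;
}
```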
Next.js 16 — streaming proxy (app/api/llm/chat/route.ts)
```ts
import type { NextRequest } from 'next/server';

export const runtime = 'nodejs';
export const dynamic = 'force-dynamic';

export async function POST(req: NextRequest): Promise<Response> {
  const body = await req.json();
  const upstream = await fetch(`${process.env.LLM_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.LLM_API_KEY!,
    },
    body: JSON.stringify({ ...body, stream: true }),
    cache: 'no-store',
  });
  if (!upstream.ok || !upstream.body) {
    return new Response(`llm upstream ${upstream.status}`, { status: 502 });
  }
  return new Response(upstream.body, {
    status: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
      'X-Accel-Buffering': 'no',
    },
  });
}
```
The browser consumes the SSE the same way as TTS (`data:` prefix, `[DONE]` terminator).
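A minimal browser-side consumer, as a sketch — it assumes OpenAI-style delta chunks and the proxy route above:

```ts
// Read the proxied SSE stream and surface text deltas as they arrive.
async function consumeChat(prompt: string, onDelta: (text: string) => void) {
  const res = await fetch('/api/llm/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep any partial line for the next read
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice('data: '.length).trim();
      if (payload === '[DONE]') return; // stream terminator
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) onDelta(delta);
    }
  }
}
```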
NestJS 11 — chat service
```ts
// src/modules/llm/llm.service.ts
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

export interface ChatMessage {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
  tool_call_id?: string;
}

export interface ChatRequest {
  messages: ChatMessage[];
  model?: string;
  temperature?: number;
  max_tokens?: number;
  tools?: ChatTool[];
  stream?: boolean;
}

export interface ChatTool {
  type: 'function';
  function: { name: string; description?: string; parameters: object };
}

export interface ChatResponse {
  id: string;
  choices: Array<{
    message: { role: string; content: string | null; tool_calls?: ToolCall[] };
    finish_reason: string;
  }>;
}

export interface ToolCall {
  id: string;
  type: 'function';
  function: { name: string; arguments: string };
}

@Injectable()
export class LlmService {
  constructor(private readonly cfg: ConfigService) {}

  async chat(req: ChatRequest): Promise<ChatResponse> {
    const r = await fetch(`${this.cfg.get('LLM_URL')}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('LLM_API_KEY'),
      },
      body: JSON.stringify({ ...req, stream: false }),
    });
    if (!r.ok) throw new Error(`llm ${r.status}`);
    return (await r.json()) as ChatResponse;
  }

  async stream(req: ChatRequest, sink: NodeJS.WritableStream): Promise<void> {
    const r = await fetch(`${this.cfg.get('LLM_URL')}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('LLM_API_KEY'),
      },
      body: JSON.stringify({ ...req, stream: true }),
    });
    if (!r.ok || !r.body) throw new Error(`llm ${r.status}`);
    for await (const chunk of r.body as unknown as AsyncIterable<Uint8Array>) {
      sink.write(chunk);
    }
    sink.end();
  }
}
```
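One way to expose `stream()` over HTTP — a hypothetical controller; the class and route names are illustrative:

```ts
// src/modules/llm/llm.controller.ts — sketch, not part of the service
import { Body, Controller, Post, Res } from '@nestjs/common';
import type { Response } from 'express';
import { LlmService, type ChatRequest } from './llm.service';

@Controller('llm')
export class LlmController {
  constructor(private readonly llm: LlmService) {}

  @Post('chat/stream')
  async stream(@Body() req: ChatRequest, @Res() res: Response): Promise<void> {
    // The raw Express response satisfies the NodeJS.WritableStream sink.
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache, no-transform');
    await this.llm.stream(req, res);
  }
}
```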
Python 3.12 — async client (openai SDK + httpx)
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://127.0.0.1:4250/v1",
    api_key="dummy",
)

async def ask(prompt: str) -> str:
    r = await client.chat.completions.create(
        model="mlx-community/Qwen3-4B-Instruct-2507-4bit",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content or ""
```
Streaming:
```python
async def stream(prompt: str):
    async for chunk in await client.chat.completions.create(
        model="mlx-community/Qwen3-4B-Instruct-2507-4bit",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```
Hot model swap (raw httpx — not in the OpenAI SDK):
```python
import httpx

async def swap_model(model_id: str) -> dict:
    async with httpx.AsyncClient(timeout=120) as cli:
        r = await cli.post(
            "http://127.0.0.1:4250/v1/models/load",
            headers={"X-API-Key": "dummy"},
            json={"model_id": model_id},
        )
        r.raise_for_status()
        return r.json()  # { swapped, load_ms, loaded }
```
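The Node equivalent for Next.js/NestJS callers — a sketch, since the OpenAI JS SDK likewise has no models/load endpoint:

```ts
// Hot-swap the served model via plain fetch; mirrors the httpx example above.
async function swapModel(modelId: string) {
  const res = await fetch(`${process.env.LLM_URL}/v1/models/load`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.LLM_API_KEY!,
    },
    body: JSON.stringify({ model_id: modelId }),
  });
  if (!res.ok) throw new Error(`model swap failed: ${res.status}`);
  return res.json(); // { swapped, load_ms, loaded }
}
```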
Tool calling (mock mode default response)
In mock mode, `POST /v1/chat/completions` returns a canned response with one tool call:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\":\"Berlin\"}" }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```
Use this to verify your client tool-call parsing before flipping `DDX_LLM_USE_REAL_MODEL=1`.
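A sketch of the full round trip — parse the tool call, run your own implementation, and feed the result back as a `tool` message. The `get_weather` stub and canned result are illustrative, and the exact `tool_calls` typing varies slightly across openai SDK versions; the wire format is standard OpenAI tool calling:

```ts
import OpenAI from 'openai';

// Verify tool-call parsing against the mock endpoint, then close the loop.
async function verifyToolCalls(client: OpenAI, model: string) {
  const tools: OpenAI.Chat.ChatCompletionTool[] = [{
    type: 'function',
    function: {
      name: 'get_weather',
      parameters: { type: 'object', properties: { city: { type: 'string' } } },
    },
  }];
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    { role: 'user', content: 'Weather in Berlin?' },
  ];

  const first = await client.chat.completions.create({ model, messages, tools });
  const msg = first.choices[0].message;
  if (msg.tool_calls?.length) {
    messages.push(msg); // keep the assistant turn that requested the tool
    for (const call of msg.tool_calls) {
      if (call.type !== 'function') continue;
      const { city } = JSON.parse(call.function.arguments); // mock: {"city":"Berlin"}
      messages.push({
        role: 'tool',
        tool_call_id: call.id,
        content: JSON.stringify({ city, temp_c: 21 }), // stand-in tool result
      });
    }
  }
  return client.chat.completions.create({ model, messages }); // final answer
}
```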
In real mode, the upstream mlx-omni-server chat router is mounted as-is — request/response shapes match standard OpenAI.
Errors
Shape (`{ code, message, detail? }`):
| Exception | HTTP | code |
|---|---|---|
| RequestInvalid | 422 | request_invalid |
| ContextTooLong | 422 | context_too_long |
| ModelNotFound | 404 | model_not_found |
| ModelUnavailable | 503 | model_unavailable |
| other LlmError | 500 | llm_error |
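Client code can branch on `code` instead of parsing messages. A sketch — the `LlmApiError` class is illustrative, not part of any SDK:

```ts
// Surface the service's { code, message, detail? } error body as a typed error.
class LlmApiError extends Error {
  constructor(public status: number, public code: string, message: string) {
    super(`${status} ${code}: ${message}`);
  }
}

async function chatOrThrow(body: unknown) {
  const res = await fetch(`${process.env.LLM_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.LLM_API_KEY!,
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    const err = (await res.json()) as { code: string; message: string };
    // e.g. on 'context_too_long', truncate history and retry
    throw new LlmApiError(res.status, err.code, err.message);
  }
  return res.json();
}
```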
Endpoints NOT exposed
The dudoxx layer adds no completions/embeddings router. In real mode, whatever the upstream `mlx_omni_server.chat.openai.router` exports is mounted as-is.
- `POST /v1/completions` (legacy text completions) — not provided
- `POST /v1/embeddings` — not provided
- `POST /v1/audio/*`, `POST /v1/images/*` — not provided
For embeddings, use a separate service (e.g. the ddx-mlx-llm registry's multimodal models, once added).
Failure modes
| Symptom | Cause | Fix |
|---|---|---|
| HTTP 404 model_not_found | requested model_id not in registry | GET /v1/models first, pick an id |
| HTTP 404 + detail="server not in real-model mode" on /v1/models/load | DDX_LLM_USE_REAL_MODEL=0 | Set the env var, then restart: ./ddx-manage.sh restart --prod llm |
| HTTP 422 context_too_long | total tokens exceed the model's context_window | Truncate history or pick a larger-context model |
| HTTP 503 model_unavailable | model file missing / load failed | Check model_error in /health; verify models/cache/ |
| Stream stalls at first chunk | upstream still loading the model on first request | Poll /v1/models/current.loaded === true before streaming (sketch below) |
| CORS error in browser | origin not in DDX_LLM_CORS_ORIGINS | Add origin or proxy through Next.js (preferred) |
| No assistant content in mock mode | mock always returns tool_calls | Switch to real mode or handle tool_calls in client |
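For the stalled-stream row, a minimal gate on `/v1/models/current` — a sketch assuming the `{ model_id, loaded, last_load_ms }` shape from the endpoints table:

```ts
// Block until /v1/models/current reports loaded === true,
// so the first streaming request doesn't sit behind a model load.
async function waitForModel(baseUrl: string, attempts = 30): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(`${baseUrl}/v1/models/current`);
    const { loaded } = (await res.json()) as { loaded: boolean };
    if (loaded) return;
    await new Promise((r) => setTimeout(r, 2_000)); // model still loading
  }
  throw new Error('model did not finish loading');
}
```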
Reference
- Service docs: `ddx-mlx-llm/LLM_API_USAGE.md`, `LLM_API_ENDPOINTS.md`, `LLM_FULL_CAPABILITIES.md`
- OpenAPI: `http://127.0.0.1:4250/openapi.json`; security schemes `APIKeyHeader` + `APIKeyQuery`
- Headers: every response carries `X-Request-ID` (echoed if the client supplied one; header name configurable via `DDX_LLM_REQUEST_ID_HEADER`)
- CORS: `DDX_LLM_CORS_ORIGINS` (default `http://localhost:3100`)