# Dudoxx Omni — LLM Integration Skill

> Self-contained guide for integrating the Dudoxx LLM service from Next.js 16, NestJS 11, and Python 3.12. OpenAI-compatible chat with model swap, streaming, and tool calling.

**Service**: `ddx-mlx-llm` (port `4250`).
**Public**: deployed demo (proxied) at `https://omni-demo.forge.dudoxx.com`.
**Routes**: `GET /v1/models` · `GET /v1/models/current` · `POST /v1/models/load` · `POST /v1/chat/completions`.
**Modes**: `mock` (default — canned tool-call response) · `real` (`DDX_LLM_USE_REAL_MODEL=1`, mounts upstream `mlx-omni-server` chat router).

---

## TL;DR

- **Auth**: `X-API-Key: <key>` header OR `?api_key=<key>` query param. Constant-time HMAC comparison against `DDX_LLM_API_KEYS`. When that variable is empty, auth is bypassed (a warning is logged in prod).
- **Model registry**: `GET /v1/models` returns the curated list with dudoxx extras (`family`, `arch`, `total_params_b`, `quantization`, `context_window`, `multimodal`, `label`).
- **Hot swap**: `POST /v1/models/load { model_id }` — old model is auto-unloaded.
- **Streaming**: `POST /v1/chat/completions` with `stream: true` → SSE chunks ending with `data: [DONE]`.
- **Disconnection**: server polls `request.is_disconnected()` between SSE chunks and `aclose()`s the upstream generator.

---

## Endpoints

| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | `/health` `/healthz` | none | `{ status, model, model_state, model_load_elapsed_ms, … }` |
| GET | `/metrics` | none | Prometheus exposition |
| GET | `/v1/models` | none | List registry (OpenAI-compat + dudoxx extras) |
| GET | `/v1/models/current` | none | `{ model_id, loaded, last_load_ms }` |
| POST | `/v1/models/load` | API key | `{ model_id }` → `{ swapped, load_ms, loaded }` |
| POST | `/v1/chat/completions` | API key | OpenAI-compatible chat (mock or real) |

Status aggregation in `/health`: `ready → ok`, `loading → loading`, `idle/failed → degraded`. The endpoint always returns HTTP 200, so check the `status` field rather than the HTTP code.
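
A minimal readiness probe and registry listing, as a TypeScript sketch (the `LLM_URL` env name matches the client examples below; response field names follow the table above):

```ts
// llm-probe.ts: sketch that interprets /health and lists the model registry.
const base = process.env.LLM_URL ?? 'http://127.0.0.1:4250';

export async function llmStatus(): Promise<'ok' | 'loading' | 'degraded'> {
  // /health always answers HTTP 200; the aggregated state lives in the body.
  const r = await fetch(`${base}/health`, { cache: 'no-store' });
  const body = (await r.json()) as { status: 'ok' | 'loading' | 'degraded' };
  return body.status;
}

export async function listModelIds(): Promise<string[]> {
  // /v1/models needs no API key; dudoxx extras ride alongside the OpenAI fields.
  const r = await fetch(`${base}/v1/models`, { cache: 'no-store' });
  if (!r.ok) throw new Error(`models ${r.status}`);
  const body = (await r.json()) as { data: Array<{ id: string }> };
  return body.data.map((m) => m.id);
}
```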

---

## Next.js 16 — service-side OpenAI-compat client

```ts
// app/lib/llm.ts
import 'server-only';
import OpenAI from 'openai';

export function llmClient(): OpenAI {
  return new OpenAI({
    baseURL: `${process.env.LLM_URL}/v1`,
    apiKey: process.env.LLM_API_KEY ?? 'dummy',
  });
}

export async function chat(prompt: string): Promise<string> {
  const r = await llmClient().chat.completions.create({
    model: process.env.LLM_MODEL ?? 'mlx-community/Qwen3-4B-Instruct-2507-4bit',
    messages: [{ role: 'user', content: prompt }],
    stream: false,
  });
  return r.choices[0]?.message?.content ?? '';
}
```
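
Used from a Route Handler, as a sketch (the `app/api/llm/ask` path and the `@/app/lib/llm` alias are illustrative):

```ts
// app/api/llm/ask/route.ts: sketch of non-streaming usage of the chat() helper.
import { chat } from '@/app/lib/llm';

export const runtime = 'nodejs';

export async function POST(req: Request): Promise<Response> {
  const { prompt } = (await req.json()) as { prompt: string };
  return Response.json({ answer: await chat(prompt) });
}
```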

---

## Next.js 16 — streaming proxy (`app/api/llm/chat/route.ts`)

```ts
import type { NextRequest } from 'next/server';

export const runtime = 'nodejs';
export const dynamic = 'force-dynamic';

export async function POST(req: NextRequest): Promise<Response> {
  const body = await req.json();
  const upstream = await fetch(`${process.env.LLM_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.LLM_API_KEY!,
    },
    body: JSON.stringify({ ...body, stream: true }),
    cache: 'no-store',
  });
  if (!upstream.ok || !upstream.body) {
    return new Response(`llm upstream ${upstream.status}`, { status: 502 });
  }
  return new Response(upstream.body, {
    status: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
      'X-Accel-Buffering': 'no',
    },
  });
}
```

The browser consumes the SSE stream the same way as for TTS (`data: ` prefix, `[DONE]` terminator); a minimal reader is sketched below.
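
A browser-side reader for that proxy, sketched with plain `fetch` and `ReadableStream` (the `/api/llm/chat` path matches the route above; the callback signature is illustrative):

```ts
// Sketch: consume the Next.js proxy's SSE stream in the browser.
export async function streamChat(prompt: string, onDelta: (text: string) => void): Promise<void> {
  const res = await fetch('/api/llm/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'mlx-community/Qwen3-4B-Instruct-2507-4bit',
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok || !res.body) throw new Error(`chat ${res.status}`);
  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buf = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buf += value;
    const lines = buf.split('\n');
    buf = lines.pop() ?? ''; // keep a possibly incomplete trailing line in the buffer
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice('data: '.length).trim();
      if (payload === '[DONE]') return;
      const delta = JSON.parse(payload)?.choices?.[0]?.delta?.content as string | undefined;
      if (delta) onDelta(delta);
    }
  }
}
```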

---

## NestJS 11 — chat service

```ts
// src/modules/llm/llm.service.ts
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

export interface ChatMessage {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
  tool_call_id?: string;
}

export interface ChatRequest {
  messages: ChatMessage[];
  model?: string;
  temperature?: number;
  max_tokens?: number;
  tools?: ChatTool[];
  stream?: boolean;
}

export interface ChatTool {
  type: 'function';
  function: { name: string; description?: string; parameters: object };
}

export interface ChatResponse {
  id: string;
  choices: Array<{
    message: { role: string; content: string | null; tool_calls?: ToolCall[] };
    finish_reason: string;
  }>;
}

export interface ToolCall {
  id: string;
  type: 'function';
  function: { name: string; arguments: string };
}

@Injectable()
export class LlmService {
  constructor(private readonly cfg: ConfigService) {}

  async chat(req: ChatRequest): Promise<ChatResponse> {
    const r = await fetch(`${this.cfg.get('LLM_URL')}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('LLM_API_KEY'),
      },
      body: JSON.stringify({ ...req, stream: false }),
    });
    if (!r.ok) throw new Error(`llm ${r.status}`);
    return (await r.json()) as ChatResponse;
  }

  async stream(req: ChatRequest, sink: NodeJS.WritableStream): Promise<void> {
    const r = await fetch(`${this.cfg.get('LLM_URL')}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-API-Key': this.cfg.getOrThrow('LLM_API_KEY'),
      },
      body: JSON.stringify({ ...req, stream: true }),
    });
    if (!r.ok || !r.body) throw new Error(`llm ${r.status}`);
    // fetch's web ReadableStream is async-iterable in Node; the cast only satisfies the typings.
    for await (const chunk of r.body as unknown as AsyncIterable<Uint8Array>) {
      sink.write(chunk);
    }
    sink.end();
  }
}
```
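
Wiring the service to HTTP, as a sketch (Express platform assumed; the `llm` route prefix is illustrative):

```ts
// src/modules/llm/llm.controller.ts: sketch exposing chat and an SSE pass-through.
import { Body, Controller, Post, Res } from '@nestjs/common';
import type { Response } from 'express';
import { ChatRequest, ChatResponse, LlmService } from './llm.service';

@Controller('llm')
export class LlmController {
  constructor(private readonly llm: LlmService) {}

  @Post('chat')
  chat(@Body() req: ChatRequest): Promise<ChatResponse> {
    return this.llm.chat(req);
  }

  @Post('chat/stream')
  async stream(@Body() req: ChatRequest, @Res() res: Response): Promise<void> {
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache, no-transform');
    await this.llm.stream(req, res); // the service writes the SSE chunks and calls end()
  }
}
```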

---

## Python 3.12 — async client (`openai` SDK + httpx)

```python
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://127.0.0.1:4250/v1",
    api_key="dummy",
)

async def ask(prompt: str) -> str:
    r = await client.chat.completions.create(
        model="mlx-community/Qwen3-4B-Instruct-2507-4bit",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content or ""
```

Streaming:

```python
async def stream(prompt: str):
    async for chunk in await client.chat.completions.create(
        model="mlx-community/Qwen3-4B-Instruct-2507-4bit",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

Hot model swap (raw httpx — not in OpenAI SDK):

```python
import httpx

async def swap_model(model_id: str) -> dict:
    async with httpx.AsyncClient(timeout=120) as cli:
        r = await cli.post(
            "http://127.0.0.1:4250/v1/models/load",
            headers={"X-API-Key": "dummy"},
            json={"model_id": model_id},
        )
        r.raise_for_status()
        return r.json()  # { swapped, load_ms, loaded }
```

---

## Tool calling (mock mode default response)

In mock mode, `POST /v1/chat/completions` returns a canned response with one tool call:

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\":\"Berlin\"}" }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```

Use this to verify your client tool-call parsing before flipping `DDX_LLM_USE_REAL_MODEL=1`.
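
One way to handle that shape on the client, as a TypeScript sketch (the `get_weather` dispatch is illustrative; the `ToolCall` type mirrors the NestJS interface above):

```ts
// Sketch: parse mock tool calls and build the follow-up role:"tool" messages.
type ToolCall = { id: string; type: 'function'; function: { name: string; arguments: string } };

function runTool(call: ToolCall): string {
  const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
  // Illustrative dispatch; swap in your real tool implementations.
  if (call.function.name === 'get_weather') return JSON.stringify({ city: args.city, temp_c: 21 });
  throw new Error(`unknown tool ${call.function.name}`);
}

export function toolResultMessages(toolCalls: ToolCall[]) {
  // OpenAI convention: one role:"tool" message per call, keyed by tool_call_id.
  return toolCalls.map((c) => ({ role: 'tool' as const, tool_call_id: c.id, content: runTool(c) }));
}
```

Append these messages after the assistant message that carried the `tool_calls`, then issue a second `/v1/chat/completions` request; in real mode the model produces the final answer from the tool output.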

In real mode, the upstream `mlx-omni-server` chat router is mounted as-is — request/response shapes match standard OpenAI.

---

## Errors

Shape (`{ code, message, detail? }`):

| Exception | HTTP | code |
|---|---|---|
| `RequestInvalid` | 422 | `request_invalid` |
| `ContextTooLong` | 422 | `context_too_long` |
| `ModelNotFound` | 404 | `model_not_found` |
| `ModelUnavailable` | 503 | `model_unavailable` |
| other `LlmError` | 500 | `llm_error` |
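
A sketch that surfaces this shape as a typed error on the client (the `LlmApiError` class name is illustrative):

```ts
// Sketch: map { code, message, detail? } error bodies to a typed error.
export class LlmApiError extends Error {
  constructor(
    public readonly status: number,
    public readonly code: string,
    message: string,
    public readonly detail?: unknown,
  ) {
    super(message);
  }
}

export async function llmFetch(url: string, init?: RequestInit): Promise<Response> {
  const r = await fetch(url, init);
  if (r.ok) return r;
  const body = (await r.json().catch(() => null)) as
    | { code: string; message: string; detail?: unknown }
    | null;
  throw new LlmApiError(r.status, body?.code ?? 'llm_error', body?.message ?? `llm ${r.status}`, body?.detail);
}
```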

---

## Endpoints NOT exposed

The dudoxx layer adds **no** completions/embeddings router. In real mode, whatever the upstream `mlx_omni_server.chat.openai.router` exports is mounted as-is.

- `POST /v1/completions` (legacy text completions) — not provided
- `POST /v1/embeddings` — not provided
- `POST /v1/audio/*`, `POST /v1/images/*` — not provided

For embeddings, use a separate service (e.g. the `ddx-mlx-llm` registry's multimodal models, once they are added).

---

## Failure modes

| Symptom | Cause | Fix |
|---|---|---|
| HTTP 404 `model_not_found` | requested `model_id` not in registry | `GET /v1/models` first, pick an `id` |
| HTTP 404 + `detail="server not in real-model mode"` on `/v1/models/load` | `DDX_LLM_USE_REAL_MODEL=0` | Set the env var, then `./ddx-manage.sh restart --prod llm` |
| HTTP 422 `context_too_long` | total tokens exceed model's `context_window` | Truncate history or pick a larger-context model |
| HTTP 503 `model_unavailable` | model file missing / load failed | Check `model_error` in `/health`; verify `models/cache/` |
| Stream stalls at first chunk | upstream still loading the model on first request | Poll `GET /v1/models/current` until `loaded` is `true` before streaming (sketch after this table) |
| CORS error in browser | origin not in `DDX_LLM_CORS_ORIGINS` | Add origin or proxy through Next.js (preferred) |
| No assistant content in mock mode | mock always returns tool_calls | Switch to real mode or handle `tool_calls` in client |
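
A sketch of the poll-before-streaming fix (timeout and interval are illustrative):

```ts
// Sketch: wait until /v1/models/current reports loaded before opening a stream.
export async function waitForModelLoaded(base: string, timeoutMs = 120_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const r = await fetch(`${base}/v1/models/current`, { cache: 'no-store' });
    if (r.ok) {
      const { loaded } = (await r.json()) as { loaded: boolean };
      if (loaded) return;
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000)); // retry once per second
  }
  throw new Error('model did not finish loading in time');
}
```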

---

## Reference

- Service docs: `ddx-mlx-llm/LLM_API_USAGE.md`, `LLM_API_ENDPOINTS.md`, `LLM_FULL_CAPABILITIES.md`
- OpenAPI: `http://127.0.0.1:4250/openapi.json`, security schemes `APIKeyHeader` + `APIKeyQuery`
- Headers: every response carries `X-Request-ID` (echoed if the client supplied one; header name configurable via `DDX_LLM_REQUEST_ID_HEADER`)
- CORS: `DDX_LLM_CORS_ORIGINS` (default `http://localhost:3100`)
