MLX Knife Server Handbook
Version: 2.0.4-beta.9 (WIP)
Status: ⚠️ WORK IN PROGRESS - This document will evolve until 2.1 stable release
Last Updated: 2026-02-02
Audience: Server operators, DevOps, API consumers
For implementation details: See ARCHITECTURE.md and docs/ADR/ (developer documentation)
Quick Start
# Basic server
mlxk serve --port 8000
# JSON logging (production)
mlxk serve --port 8000 --log-json
# Custom host
mlxk serve --host 0.0.0.0 --port 8000
Requirements:
- Python 3.10-3.12 (Text, Vision, Audio)
- mlx-lm ≥0.30.5
- mlx-vlm ≥0.3.10 (PyPI) for Vision
- mlx-audio ≥0.3.1 (PyPI) for Audio STT (pip install mlx-knife[audio])
OpenAI API Compatibility
MLX Knife implements a subset of the OpenAI API with documented behavioral differences.
Supported Endpoints
| Endpoint | Status | Notes |
|---|---|---|
| /v1/chat/completions | ✅ Supported | Text, Vision (image_url), Audio (input_audio) |
| /v1/completions | ✅ Supported | Legacy text completion |
| /v1/audio/transcriptions | ✅ Supported | OpenAI Whisper API (beta.9+) |
| /v1/models | ✅ Supported | Extended with context_length field |
| /health | ✅ Custom | MLX Knife extension |
Authentication
MLX Knife ignores authentication headers. The server accepts but does not validate:
- Authorization: Bearer ...
- Any API key
Note: For production deployments requiring authentication, use a reverse proxy (nginx, Caddy).
⚠️ Client Implementers: When adding reverse proxy authentication, ensure your client sends authentication headers to all endpoints, including:
- /v1/chat/completions
- /v1/completions
- /v1/audio/transcriptions (file upload endpoint)
- /v1/models
A common mistake is implementing auth for JSON endpoints but forgetting multipart/form-data endpoints like audio transcription.
Request Headers
Content-Type: application/json (required)
Authorization: Bearer ... (optional, ignored)
Response Headers
X-Request-ID: <unique-id> (all responses, MLX Knife extension)
X-Request-ID (MLX Knife extension):
- Present on every response (success and error)
- Same ID appears in error response body as "request_id"
- Use for request correlation and distributed tracing (e.g., BROKE Cluster log aggregation)
Behavioral Deviations from OpenAI
These are intentional design choices, not bugs:
| Behavior | OpenAI | MLX Knife | Reason |
|---|---|---|---|
| Vision history | Full history to model | Only last user message | Prevents pattern reproduction (hallucinations) |
| Image URLs | HTTP URLs + Base64 + File IDs | Base64 data URLs only | No external fetching |
| Audio+Vision | Both processed | Audio silently ignored | mlx-vlm limitation |
| Multi-audio | Supported | 1 per request | mlx-vlm limitation |
| Error format | {"error": {"message", "type", "code"}} | ADR-004 envelope (see below) | Richer error context |
| max_completion_tokens | Preferred | Not supported (use max_tokens) | Legacy compatibility |
| HTTP 507 | Not used | Memory constraint | Explicit OOM prevention |
Error Response Format
MLX Knife uses an extended error envelope (ADR-004), not the OpenAI format:
{
"status": "error",
"error": {
"type": "validation_error",
"message": "No user message found",
"retryable": false
},
"request_id": "abc123..."
}
Error types: validation_error, model_not_found, internal_error, server_shutdown, insufficient_memory, access_denied
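A client that talks to both OpenAI and MLX Knife may want to normalize the two error shapes. The following is a minimal sketch, assuming only the envelope fields shown above; `parse_error` and the shape of the returned dict are illustrative names, not part of MLX Knife.

```python
import json


def parse_error(body: str) -> dict:
    """Normalize an ADR-004 envelope or an OpenAI-style error body."""
    data = json.loads(body)
    err = data.get("error") or {}
    return {
        "type": err.get("type", "unknown"),
        "message": err.get("message", ""),
        # OpenAI-style errors carry no "retryable" flag; assume False.
        "retryable": bool(err.get("retryable", False)),
        # Only the ADR-004 envelope includes a request_id.
        "request_id": data.get("request_id"),
    }


example = (
    '{"status": "error", "error": {"type": "validation_error", '
    '"message": "No user message found", "retryable": false}, '
    '"request_id": "abc123"}'
)
info = parse_error(example)
print(info["type"], info["request_id"])
```

Logging `request_id` next to the X-Request-ID response header makes it easier to correlate client-side failures with server logs.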
API Endpoints
POST /v1/chat/completions
OpenAI-compatible chat completion endpoint.
Request:
{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": null,
"temperature": 0.7,
"stream": false
}
Vision Request (Base64 Images):
{
"model": "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
],
"max_tokens": 2048,
"temperature": 0.4,
"chunk": 1
}
Audio Request (OpenAI input_audio format):
{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe what is spoken in this audio"},
{
"type": "input_audio",
"input_audio": {
"data": "<base64-encoded>",
"format": "wav"
}
}
]
}
],
"max_tokens": 2048,
"temperature": 0.0
}
Supported audio formats: wav, mp3 (or mpeg alias)
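Building the audio message by hand is error-prone because the payload must be Base64 text, not raw bytes. A small sketch of a client-side helper follows; `audio_message` is a hypothetical name, and the format check mirrors the list above rather than the server's actual validation code.

```python
import base64


def audio_message(audio_bytes: bytes, fmt: str, prompt: str) -> dict:
    """Build a user message carrying one Base64-encoded audio clip."""
    if fmt not in ("wav", "mp3", "mpeg"):
        raise ValueError(f"unsupported audio format: {fmt}")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {
                    # The API expects Base64 text, so decode the bytes to str.
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                    "format": fmt,
                },
            },
        ],
    }


message = audio_message(b"RIFF", "wav", "Transcribe what is spoken in this audio")
print(message["content"][1]["input_audio"]["format"])
```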
mlx-knife Extension Parameters:
- chunk (integer, optional): Batch size for vision processing (default: 1). Controls how many images are processed per inference session. Higher values may trigger OOM on resource-constrained systems. Maximum: 5 (enforced by server).
Chunk size precedence (highest priority first):
- Request parameter chunk
- Server startup: mlxk serve --chunk N
- Environment: MLXK2_VISION_CHUNK_SIZE=N
- Default: 1 (maximum safety)
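The precedence above can be sketched as a single lookup. This is an illustration, not the server's code: `resolve_chunk` is a hypothetical helper, and rejecting out-of-range values (rather than clamping them) is an assumption, since the handbook only states that the maximum is enforced.

```python
from typing import Optional

MAX_CHUNK = 5  # server-enforced maximum stated in this handbook


def resolve_chunk(request_chunk: Optional[int], cli_chunk: Optional[int], env: dict) -> int:
    """Return the first configured chunk size in priority order."""
    env_value = env.get("MLXK2_VISION_CHUNK_SIZE")
    env_chunk = int(env_value) if env_value else None
    for candidate in (request_chunk, cli_chunk, env_chunk):
        if candidate is not None:
            chunk = candidate
            break
    else:
        chunk = 1  # default: maximum safety
    if not 1 <= chunk <= MAX_CHUNK:
        # Assumption: out-of-range values are rejected, not clamped.
        raise ValueError(f"chunk must be between 1 and {MAX_CHUNK}")
    return chunk


print(resolve_chunk(None, 3, {"MLXK2_VISION_CHUNK_SIZE": "4"}))
```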
Response:
{
"id": "chatcmpl-...",
"object": "chat.completion",
"created": 1702345678,
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 8,
"total_tokens": 20
}
}
POST /v1/completions
Legacy completion endpoint (text-only, no chat template).
Request:
{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"prompt": "Once upon a time",
"max_tokens": 100,
"temperature": 0.7
}
POST /v1/audio/transcriptions
OpenAI Whisper API compatible audio transcription (beta.9+).
Use this endpoint for direct file upload transcription with STT models (Whisper, Voxtral).
Request (multipart/form-data):
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-large" \
  -F "language=en" \
  -F "response_format=json"
Form Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| file | File | ✅ | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| model | String | ✅ | Model ID (e.g., whisper-large, mlx-community/whisper-large-v3-turbo-4bit) |
| language | String | ❌ | Language code (e.g., en, de). Auto-detect if omitted. |
| prompt | String | ❌ | Optional context to guide transcription |
| response_format | String | ❌ | json (default), text, verbose_json |
| temperature | Float | ❌ | Sampling temperature (default: 0.0 for greedy) |
Response (JSON - default):
{
"text": "A man said to the universe, Sir, I exist."
}
Response (text):
A man said to the universe, Sir, I exist.
Response (verbose_json):
{
"task": "transcribe",
"language": "en",
"duration": 0.57,
"text": "A man said to the universe, Sir, I exist."
}
Supported Models:
- Whisper: whisper-large, mlx-community/whisper-large-v3-turbo-4bit
- Voxtral: mlx-community/Voxtral-Mini-3B-2507-bf16 (upstream tokenizer issues)
Note: This endpoint requires mlx-audio (pip install mlx-knife[audio]).
vs. /v1/chat/completions with input_audio:
| Feature | /v1/audio/transcriptions | /v1/chat/completions |
|---|---|---|
| Format | Multipart file upload | Base64 in JSON |
| Models | STT only (Whisper, Voxtral) | Multimodal (Gemma-3n) |
| Use case | Pure transcription | Chat with audio context |
| OpenAI API | Whisper API | Chat Completions API |
GET /v1/models
List available models.
Returns all cached models that are healthy and runtime-compatible. Models are sorted with preloaded model first (if any), then alphabetically.
Response:
{
"object": "list",
"data": [
{
"id": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"object": "model",
"owned_by": "mlx-knife-2.0",
"permission": [],
"context_length": 8192
}
]
}
Fields:
- id: Model identifier (HuggingFace name or workspace path)
- object: Always "model" (OpenAI-compatible)
- owned_by: "mlx-knife-2.0" for cached models, "workspace" for local directories
- permission: Empty array (OpenAI legacy field)
- context_length: Maximum context window in tokens (may be null if unavailable)
Why context_length matters:
MLX Knife uses client-side context management (unlike OpenAI's server-side history):
- Vision models: Fully stateless - client holds entire conversation history
- Text models: Shift-window (context_length / 2 reserved for history on server)
- Clients need this to manage conversation pruning and token budgets
- Load balancing: BROKE Cluster and similar tools use this for scheduling decisions
Note: LM Studio exposes a similar field named max_context_length.
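To illustrate how a client might use context_length for pruning, here is a rough sketch. Everything in it is an assumption for illustration: `estimate_tokens` uses a crude 4-characters-per-token guess instead of a real tokenizer, and the budget of half the context simply mirrors the server-side reservation described above.

```python
def estimate_tokens(message: dict) -> int:
    """Very rough estimate (assumption: about 4 characters per token)."""
    content = message.get("content", "")
    text = content if isinstance(content, str) else str(content)
    return max(1, len(text) // 4)


def prune_history(messages: list, context_length: int) -> list:
    """Drop the oldest non-system messages until history fits half the context."""
    budget = context_length // 2
    kept = list(messages)
    while len(kept) > 1 and sum(estimate_tokens(m) for m in kept) > budget:
        # Never remove the latest message (the last element).
        for i, m in enumerate(kept[:-1]):
            if m.get("role") != "system":
                del kept[i]
                break
        else:
            break  # only system messages and the latest message remain
    return kept


history = [
    {"role": "system", "content": "s" * 40},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
print([m["role"] for m in prune_history(history, 300)])
```

For Vision workflows, pruning assistant messages that contain mapping markers would affect image ID reconstruction, so a real client would likely exempt those messages.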
GET /health
Server health check (200 OK if server is running).
Features & Capabilities
Vision Support (2.0.4-beta.1)
See examples/vision_pipe.sh for a practical Vision→Text pipeline example (CLI).
Supported:
- ✅ Base64 data URLs (data:image/jpeg;base64,...)
- ✅ Multiple images (up to 5 per request)
- ✅ Formats: JPEG, PNG, GIF, WebP
Limits:
- Per-image: 20 MB max
- Count: 5 images max per request
Important Characteristics:
- Stateless Server: No server-side state required
- Sequential Images: Only images from the last user message are processed (OpenAI API compliant)
- Each request is independent: No "shift-window" context like text models (Metal memory limitations)
Stable Image IDs (History-Based)
Problem: How to maintain stable "Image 1, 2, 3..." numbering across multiple requests?
Solution: The conversation history IS the session.
The server scans the full messages[] array (which clients send with each request per OpenAI API) and assigns IDs chronologically based on content hash:
Request 1: beach.jpg (hash: 5c691ddb) → Image 1
Request 2: beach.jpg + mountain.jpg in history → Image 1, Image 2
Request 3: Re-upload beach.jpg → Still Image 1 (hash match)
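The numbering rule can be sketched in a few lines. This is an illustration of content-hash deduplication in general; the choice of SHA-256 truncated to 8 hex characters is an assumption (the handbook shows 8-character hashes but does not name the algorithm), and `assign_image_ids` is a hypothetical helper.

```python
import hashlib


def assign_image_ids(images_in_order: list) -> dict:
    """Map each distinct image (by content hash) to a stable 1-based ID."""
    ids = {}
    for data in images_in_order:
        # Assumption: SHA-256 truncated to 8 hex characters.
        digest = hashlib.sha256(data).hexdigest()[:8]
        if digest not in ids:
            ids[digest] = len(ids) + 1  # first appearance decides the ID
    return ids


beach, mountain = b"beach-bytes", b"mountain-bytes"
ids = assign_image_ids([beach, mountain, beach])  # the re-upload keeps ID 1
print(sorted(ids.values()))
```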
Properties:
- ✅ Standard messages[] format — no custom headers or protocol extensions
- ✅ Stateless server — no registry, no TTL, no cleanup
- ✅ Content-hash deduplication — same image always gets same ID
- ✅ Cross-model workflows — "Image 1" stable across Vision↔Text model switches
Client Responsibility:
- Maintain full conversation history in messages[] array
- Same content = same ID (content-hash based)
Python Version:
- ✅ Python 3.10+ required (mlx-vlm dependency)
- ❌ Python 3.9: Vision requests → HTTP 501
Audio Support (2.0.4-beta.9)
Two methods for audio transcription:
Method 1: /v1/audio/transcriptions (Whisper API)
Direct file upload for STT models (Whisper, Voxtral). Recommended for pure transcription.
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-large"
Supported:
- ✅ File upload (multipart/form-data)
- ✅ Formats: WAV, MP3, M4A, FLAC, OGG
- ✅ Response formats: json, text, verbose_json
- ✅ Language detection or explicit language parameter
Models: Whisper, Voxtral (requires pip install mlx-knife[audio])
Method 2: /v1/chat/completions with input_audio
Base64-encoded audio in chat messages for multimodal models (Gemma-3n).
{
"model": "gemma-3n-E2B-it-4bit",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio"},
{"type": "input_audio", "input_audio": {"data": "<base64>", "format": "wav"}}
]
}]
}
Supported:
- ✅ OpenAI input_audio format (Base64-encoded)
- ✅ Formats: WAV, MP3
- ✅ Temperature 0.0 (greedy sampling for transcription consistency)
Limits (both methods):
- Per-audio: 50 MB max for transcriptions endpoint, 5 MB for chat
- Count: 1 audio per request
Models: Gemma-3n (Vision + Audio + Text)
Important Characteristics:
- Stateless Server: Same as Vision — no server-side state
- Single Audio: Only one audio file per request
- Audio+Vision: When both present in chat, audio is silently ignored (mlx-vlm behavior)
- Temperature: Fixed at 0.0 for transcription consistency
History Handling:
When switching from Audio to Text model mid-conversation:
- Server filters input_audio content blocks
- Text model sees [n audio(s) were attached] placeholder
Python Version:
- ✅ Python 3.10+ required (same as Vision)
- ❌ Python 3.9: Audio requests → HTTP 501
Token Limits: Text vs Multimodal Models
Critical Difference: Text and multimodal (Vision/Audio) models use different max_tokens strategies.
Text Models (MLXRunner)
Strategy: Shift-window context management
- Conversation history maintained in context buffer
- Server reserves space for history
Defaults:
- Server: context_length / 2 (reserve half for history, half for generation)
- CLI: context_length (full context, no reservation)
Example:
- Llama-3.2-3B (128K context) → Server default: 64K max_tokens
Vision/Audio Models (VisionRunner)
Strategy: Stateless processing
- Each request is independent (no conversation history in context)
- Metal limitations prevent context preservation
Defaults:
- Server/CLI: 2048 tokens (conservative, works for all models)
Rationale:
- No need for /2 division (no history to reserve)
- Multimodal inference is slow → 2048 adequate for descriptions/transcriptions
- Prevents accidentally generating 64K+ tokens
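The two default strategies can be summarized in one function. This is a sketch of the rules as described in this section, not the runner code; `default_max_tokens` is a hypothetical name, and returning None for an unknown context length is an assumption.

```python
from typing import Optional

MULTIMODAL_DEFAULT = 2048  # Vision/Audio default from this handbook


def default_max_tokens(context_length: Optional[int], multimodal: bool, server: bool) -> Optional[int]:
    """Return the default max_tokens when the request does not set one."""
    if multimodal:
        return MULTIMODAL_DEFAULT  # stateless: no history to reserve
    if context_length is None:
        return None  # assumption: behavior for unknown context is unspecified
    # Server reserves half the window for history; the CLI uses all of it.
    return context_length // 2 if server else context_length


print(default_max_tokens(131072, multimodal=False, server=True))
```

With a 128K (131072-token) context, the server default works out to 65536 tokens, matching the 64K figure above.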
Override:
{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"messages": [...],
"max_tokens": 4096 // Explicit override
}
Memory-Aware Loading (ADR-016)
Pre-load memory checks prevent OOM crashes.
Vision Models
- Threshold: 70% system RAM
- Behavior: Model size > 70% → HTTP 507 (Insufficient Storage)
- Rationale: Vision Encoder has unpredictable per-image overhead
Example (64GB system):
- Llama-3.2-11B-Vision (5.6GB) → ✅ Loads (8.75% of RAM)
- Llama-3.2-90B-Vision (46.4GB) → ❌ HTTP 507 (72.5% of RAM)
Text Models
- Threshold: 70% system RAM
- Behavior: Model size > 70% → Warning only (backwards compatible)
- Rationale: Text models swap gracefully, no hard memory spikes
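The gate logic for both model types can be sketched as follows. This only restates the rules above as code; `check_memory` and its return shape are illustrative assumptions, and the real check may measure model size and RAM differently.

```python
def check_memory(model_gb: float, system_ram_gb: float, is_vision: bool, threshold: float = 0.70):
    """Return (http_status, message) for a pre-load memory check."""
    ratio = model_gb / system_ram_gb
    if ratio <= threshold:
        return 200, f"ok ({ratio:.2%} of RAM)"
    if is_vision:
        # Vision models are refused outright above the threshold.
        return 507, f"insufficient memory ({ratio:.2%} of RAM)"
    # Text models load anyway, with a warning.
    return 200, f"warning: model uses {ratio:.2%} of RAM"


for size_gb in (5.6, 46.4):
    print(size_gb, check_memory(size_gb, 64, is_vision=True)[0])
```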
Streaming (SSE - Server-Sent Events)
Text Models
- ✅ True streaming: Tokens streamed as generated
- Format: SSE (data: {...}\n\n)
- Completion: data: [DONE]\n\n
Vision Models
- ✅ Per-chunk streaming: Real SSE events as each image chunk completes (2.0.4-beta.7+)
- Multiple images: Each chunk (1-5 images) streams as it finishes processing
- Single image: Behaves like batch mode (one SSE event)
- Format: OpenAI-compatible SSE with per-chunk deltas
Audio Models
- ⚠️ Batch mode only: Single SSE event with complete response
- Reason: Single audio per request, no chunking needed
- Format: Same as Vision single-image mode
Request:
{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"messages": [...],
"stream": true
}
Response (SSE stream):
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1702345678,"model":"mlx-community/Llama-3.2-3B-Instruct-4bit","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1702345678,"model":"mlx-community/Llama-3.2-3B-Instruct-4bit","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1702345678,"model":"mlx-community/Llama-3.2-3B-Instruct-4bit","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Note: stream_options.include_usage is not supported.
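On the client side, the stream can be reassembled by reading the data: lines until the [DONE] sentinel. A minimal sketch, assuming the lines have already been read from the HTTP response; `collect_stream` is a hypothetical helper and the sample chunks are shortened versions of the ones above.

```python
import json


def collect_stream(lines) -> str:
    """Concatenate delta content from an iterable of SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The final chunk has an empty delta, so default to "".
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_stream(sample))
```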
Configuration
Environment Variables
# Server binding
MLXK2_HOST=0.0.0.0
MLXK2_PORT=8000
# Logging
MLXK2_LOG_JSON=1 # JSON logs (production)
MLXK2_LOG_LEVEL=info # debug|info|warning|error
# Feature gates (beta features)
MLXK2_ENABLE_PIPES=1 # Unix pipe integration (2.0.4-beta.1)
Supervised Mode (Default)
Behavior:
- Handles Ctrl-C gracefully (clean shutdown with 5s timeout)
- Runs server in subprocess for improved signal handling
- Logs go to stderr
- --log-json produces 100% JSON output
- Note: No auto-restart on crashes (use systemd/supervisor for production)
Start:
mlxk serve --port 8000 --log-json
Direct Mode (Development)
Behavior:
- No auto-restart
- Direct uvicorn process
Start:
python -m mlxk2.core.server_base
HTTP Status Codes
Success
- 200 OK: Request successful
- 201 Created: Resource created (future)
Client Errors (4xx)
- 400 Bad Request: Invalid input (e.g., too many images, invalid format, validation failures)
- 404 Not Found: Model not found in cache
Server Errors (5xx)
- 500 Internal Server Error: Unexpected backend failure
- 501 Not Implemented: Feature not supported (e.g., vision on Python 3.9)
- 503 Service Unavailable: Server shutting down
- 507 Insufficient Storage: Memory constraints violated (vision model >70% RAM)
Performance Characteristics
Model Loading
- Time: ~5-10 seconds (first request only)
- Caching: Model stays loaded until server restart or model switch
- Memory: Held in RAM until explicitly unloaded
Inference Speed
Text Models:
- Typical: 20-50 tokens/sec (depends on model size, hardware)
- Streaming: Real-time token output
Vision Models:
- Slower than text: Vision Encoder adds overhead
- Per-image: ~2-5 seconds baseline + generation time
- Multiple images: Processed in chunks (default: 1, max: 5 via --chunk)
- Streaming: Each chunk delivers results immediately (see Streaming section above)
Concurrent Requests
- Current: Sequential processing (one request at a time)
- Reason: Metal backend, single GPU
- Future: May add request queuing
Troubleshooting
Multimodal Request Fails on Python 3.9
Symptom: HTTP 501 "Vision/Audio models require Python 3.10+"
Solution:
# Upgrade Python (3.10-3.12 required)
pyenv install 3.10
pyenv local 3.10
# Install with Vision support
pip install mlx-knife[vision]
# Install with Audio STT support (Whisper)
pip install mlx-knife[audio]
# Install with everything
pip install mlx-knife[all]
Memory Constraint Errors (HTTP 507)
Symptom: Model requires XGB but only YGB available (70% of system RAM)
Solutions:
- Use smaller quantized model (e.g., 4-bit instead of 8-bit)
- Add more system RAM
- Try different model architecture
Vision Responses Too Short
Symptom: Responses truncated mid-sentence
Cause: Default max_tokens: 2048 might be too low for complex descriptions
Solution:
{
"model": "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
"messages": [...],
"max_tokens": 4096 // Increase limit
}
Image Upload Fails (HTTP 400)
Common causes:
- Image size > 20 MB per image
- More than 5 images per request
- Unsupported format (use JPEG, PNG, GIF, WebP)
- External URLs (not supported, use Base64 data URLs)
- Invalid Base64 encoding
Solution: Resize images, reduce count, or check encoding
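Checking these conditions before sending saves a round trip. The sketch below mirrors the causes listed above; `validate_image_urls` is a hypothetical client-side helper, and treating 20 MB as 20 × 1024 × 1024 bytes is an assumption.

```python
import base64

MAX_IMAGES = 5
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: MB means MiB here
ALLOWED_MIME = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def validate_image_urls(urls: list) -> list:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if len(urls) > MAX_IMAGES:
        problems.append(f"too many images: {len(urls)} > {MAX_IMAGES}")
    for i, url in enumerate(urls, start=1):
        if not url.startswith("data:"):
            problems.append(f"image {i}: only Base64 data URLs are supported")
            continue
        header, _, payload = url.partition(",")
        mime = header[len("data:"):].split(";")[0]
        if mime not in ALLOWED_MIME:
            problems.append(f"image {i}: unsupported format {mime}")
        try:
            size = len(base64.b64decode(payload, validate=True))
        except ValueError:  # binascii.Error is a ValueError subclass
            problems.append(f"image {i}: invalid Base64")
            continue
        if size > MAX_IMAGE_BYTES:
            problems.append(f"image {i}: larger than 20 MB")
    return problems


good = "data:image/png;base64," + base64.b64encode(b"abc").decode("ascii")
print(validate_image_urls([good, "https://example.com/a.png"]))
```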
Audio Errors
Audio Request Fails (HTTP 400)
Common causes:
- Audio size > 5 MB
- More than 1 audio per request (multi-audio not supported)
- Unsupported format (use WAV or MP3)
- Invalid Base64 encoding
Solution: Compress audio, ensure single audio per request, use supported format
Audio Model Not Found
Symptom: Model does not support audio input
Cause: Model lacks audio capability
Solution: Use an audio-capable model:
mlxk list | grep +audio
Note: Some HuggingFace models may require mlxk convert --repair-index before use.
Audio Output is Garbled/Multilingual
Symptom: Transcription includes unexpected languages (Arabic, Hindi, etc.)
Cause: Temperature too high (default text temperature 0.7 causes drift)
Solution: Use temperature 0.0 for audio:
{
"temperature": 0.0
}
Transcription Endpoint Returns Wrong Model Error
Symptom: Model 'xxx' is not an audio transcription model
Cause: /v1/audio/transcriptions only works with STT models (Whisper, Voxtral)
Solution: Use the correct model type:
# For transcription endpoint: STT models
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-large"
# For multimodal chat: Gemma-3n (use chat/completions instead)
# See "Audio Messages Format" in Appendix
mlx-audio Not Installed
Symptom: STT models require mlx-audio
Solution:
pip install mlx-knife[audio]
Limits Summary
| Resource | Limit | Reason |
|---|---|---|
| Images per request | 5 | Metal OOM prevention |
| Image size | 20 MB | Metal OOM prevention |
| Total image size | 50 MB | Metal OOM prevention |
| Audio per request (chat) | 1 | mlx-vlm limitation |
| Audio size (chat) | 5 MB | Token count constraint |
| Audio size (transcriptions) | 50 MB | ~15 min @ 16kHz mono |
| Vision model RAM | 70% system | Metal OOM prevention |
| Text model RAM | 70% (warning) | Swap tolerance |
| Vision max_tokens | 2048 (default) | Stateless, slow inference |
| Audio max_tokens | 2048 (default) | Stateless, like Vision |
| Text max_tokens | context_length/2 | Shift-window reservation |
Migration Guide
From 2.0.3 → 2.0.4
New Features:
| Feature | Endpoint | Requirements |
|---|---|---|
| Vision (images) | /v1/chat/completions | pip install mlx-knife[vision] |
| Audio Chat (Gemma-3n) | /v1/chat/completions | pip install mlx-knife[vision] |
| Audio STT (Whisper) | /v1/audio/transcriptions | pip install mlx-knife[audio] |
| Memory pre-load checks | All endpoints | Built-in (HTTP 507) |
| Server audio preload | mlxk serve --model whisper-large | Built-in |
Breaking Changes:
| Change | Before | After | Impact |
|---|---|---|---|
| Python version | 3.9+ | 3.10-3.12 | Upgrade required |
| Vision max_tokens default | 1024 | 2048 | Longer responses |
| Memory checks (Vision) | None | 70% RAM limit | HTTP 507 possible |
New Dependencies (auto-installed):
- mlx-vlm>=0.3.10 (Vision + Gemma-3n audio)
- mlx-audio>=0.3.1 (Whisper STT)
- python-multipart>=0.0.9 (file uploads)
Client Updates Required:
- Handle HTTP 507 (Insufficient Storage) for large Vision models
- Update clients expecting max_tokens: 1024 to handle 2048
- Use temperature: 0.0 for audio transcription consistency
Recommendations:
- Pure transcription: Use /v1/audio/transcriptions with Whisper
- Multimodal chat: Use /v1/chat/completions with input_audio
- Test Vision/Audio workflows on Python 3.10+
References
- API Schema: docs/json-api-specification.md
- Architecture Principles: docs/ARCHITECTURE.md
- Testing Details: TESTING-DETAILS.md
- ADR-012: Vision Support (development decisions)
- ADR-016: Memory-Aware Loading (development decisions)
Appendix: Client Requirements
Audience: Client developers integrating with MLX Knife server
OpenAI API Compliance
Clients MUST follow the OpenAI Chat Completions API format. MLX Knife is designed to work with any OpenAI-compatible client.
Conversation History
Clients MUST send the full message list with each request:
{
"model": "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
"messages": [
{"role": "user", "content": [...]},
{"role": "assistant", "content": "..."},
{"role": "user", "content": [...]}
]
}
Why: The server reconstructs stable image IDs from the history. Without full history, image numbering restarts at 1 with each request.
What "full history" means:
- ✅ All messages with correct roles (user, assistant, system)
- ✅ Complete assistant responses (including <!-- mlxk:filenames --> markers)
- ⚠️ Media payloads (Base64) can be dropped after first Vision request (see Image ID Persistence)
Note: For Vision models, the server only forwards the last user message to the model (stateless prompt), but still scans the full history for image ID reconstruction.
Vision Messages Format
Multimodal content uses the OpenAI array format:
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
Image URLs:
- ✅ Base64 Data URLs: data:image/jpeg;base64,/9j/4AAQ...
- ❌ HTTP URLs: Not supported (no external fetching)
Supported formats: JPEG, PNG, GIF, WebP
Vision: Stateless Prompt, History-Based IDs
Important architectural distinction for Vision requests:
| Aspect | Behavior | Reason |
|---|---|---|
| Prompt to model | Only last user message | Prevents pattern reproduction (model copying old mappings) |
| Image ID assignment | Full history scanned | Consistent numbering across session (Image 1, 2, 3...) |
What this means:
- The Vision model does NOT see previous assistant responses
- But image numbering remains stable across the conversation
- Follow-up questions about image descriptions should use a Text model (which has full history)
Recommended workflow:
1. Vision model: User sends beach.jpg → "Image 1 shows a beach..."
2. Vision model: User sends mountain.jpg → "Image 2 shows a mountain..."
3. Text model: User asks "Compare these two locations" → Full context available
Rationale:
- Vision models can't "see" previous images anyway (Metal memory limitations)
- Sending history caused pattern reproduction (model hallucinating mappings)
- Clean separation: Vision=describe, Text=discuss
Image Deduplication
Same image content = same ID (content-hash based).
Client behavior:
- Re-uploading the same image → Server assigns same ID
- No client-side deduplication needed
Image ID Persistence (Stateless)
Problem: How do Image IDs remain stable across Vision→Text→Vision workflows when clients drop Base64 data from history (storage optimization)?
Solution: The server reads its own filename mapping tables from assistant responses.
Workflow:
- Request 1 (Vision): Client sends beach.jpg
   {"role": "user", "content": [
     {"type": "text", "text": "describe"},
     {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
   ]}
- Server Response: Includes filename mapping table (wrapped in <details>)
   <details>
   <summary>📸 Image Metadata (1 image)</summary>
   <!-- mlxk:filenames -->
   | Image | Filename | Original | Location | Date | Camera |
   |-------|----------|----------|----------|------|--------|
   | 1 | image_5733332c.jpeg | beach.jpg | 📍 34.0522°N, 118.2437°W | 📅 2024-06-15 | iPhone 14 |
   </details>
   A sandy beach with blue water.
   Note: EXIF columns (Original, Location, Date, Camera) are enabled by default. Disable with MLXK2_EXIF_METADATA=0 for minimal output (Image, Filename only).
- Client Storage Optimization: Client can drop Base64 from history, keep only:
   {"role": "user", "content": "describe"}
   {"role": "assistant", "content": "A sandy beach...\n\n<!-- mlxk:filenames -->\n..."}
- Request 3 (Vision after Text): Client sends mountain.jpg with text-only history
   {
     "messages": [
       {"role": "user", "content": "describe"}, // No Base64!
       {"role": "assistant", "content": "Beach...\n\n| 1 | image_5733332c.jpeg |"},
       {"role": "user", "content": "What color?"},
       {"role": "assistant", "content": "Blue."},
       {"role": "user", "content": [
         {"type": "text", "text": "new picture"},
         {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
       ]}
     ]
   }
- Server Reconstruction: Server scans history:
   - Finds <!-- mlxk:filenames --> marker in assistant response
   - Parses: image_5733332c.jpeg → Image ID 1
   - Assigns: mountain.jpg → Image ID 2 ✅
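The reconstruction step can be approximated with a small parser. This is a sketch of the idea only: the regular expression, `known_image_ids`, and `next_image_id` are illustrative assumptions, and the server's actual parsing of the mapping table may differ.

```python
import re

MARKER = "<!-- mlxk:filenames -->"
# Matches table rows such as "| 1 | image_5733332c.jpeg | ..."
ROW = re.compile(r"^\|\s*(\d+)\s*\|\s*([^|\s]+)\s*\|", re.MULTILINE)


def known_image_ids(messages: list) -> dict:
    """Collect filename -> image ID from marked assistant responses."""
    mapping = {}
    for message in messages:
        content = message.get("content")
        if message.get("role") != "assistant" or not isinstance(content, str):
            continue
        if MARKER not in content:
            continue
        table = content.split(MARKER, 1)[1]
        for image_id, filename in ROW.findall(table):
            mapping[filename] = int(image_id)
    return mapping


def next_image_id(messages: list) -> int:
    """Return the ID the next new image would receive."""
    return max(known_image_ids(messages).values(), default=0) + 1


history = [
    {"role": "user", "content": "describe"},
    {"role": "assistant", "content": "A sandy beach.\n\n" + MARKER
        + "\n| Image | Filename |\n|-------|----------|\n| 1 | image_5733332c.jpeg |\n"},
]
print(known_image_ids(history), next_image_id(history))
```

Because the parser depends on the marker, a client that strips HTML comments from assistant responses would reset numbering to 1.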
Benefits:
- ✅ Zero client changes - Works with standard OpenAI message format
- ✅ Storage optimization - Client can drop large Base64 data (2 MB → 2 KB)
- ✅ No protocol extensions - Standard messages[] array, no custom headers
- ✅ Stateless server - No server-side session state required
- ✅ Scales to 100+ images - Clients only store small text mappings
Client Recommendations:
- After first Vision request: Drop Base64 image_url from history, keep text + assistant response
- Store locally: Small thumbnails for UI (~20 KB/image via IndexedDB)
- History format: Text-only user messages + full assistant responses (with mapping tables)
- ⚠️ Preserve verbatim: Do not sanitize or strip HTML comments from assistant responses. The <!-- mlxk:filenames --> markers are required for ID reconstruction
Example client storage (100 images):
- ❌ Before: 100 images × 2 MB Base64 = 200 MB (exceeds browser limits)
- ✅ After: 100 thumbnails × 20 KB + text history = ~2 MB (fits in IndexedDB)
Audio Messages Format
Audio content uses the OpenAI input_audio format:
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio"},
{
"type": "input_audio",
"input_audio": {
"data": "<base64-encoded>",
"format": "wav"
}
}
]
}
Supported formats: wav, mp3 (or mpeg alias)
Limitations:
- ❌ Only 1 audio per request (multi-audio causes mlx-vlm token mismatch)
- ❌ Audio + Vision combined: audio is silently ignored
Audio Transcriptions (File Upload)
For direct STT transcription with dedicated models (Whisper, Voxtral), use the /v1/audio/transcriptions endpoint:
Request (multipart/form-data):
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-large" \
  -F "language=en" \
  -F "response_format=json"
Form Fields:
| Field | Required | Description |
|---|---|---|
| file | ✅ | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| model | ✅ | Model ID (e.g., whisper-large, full HF path) |
| language | ❌ | Language code (en, de, etc.). Auto-detect if omitted. |
| response_format | ❌ | json (default), text, verbose_json |
| temperature | ❌ | Sampling temperature (default: 0.0) |
Response Formats:
// json (default)
{"text": "Hello world."}
// verbose_json
{"task": "transcribe", "language": "en", "duration": 2.5, "text": "Hello world."}
// text
Hello world.
When to use which endpoint:
| Use Case | Endpoint | Model Type | Format |
|---|---|---|---|
| Pure transcription | /v1/audio/transcriptions | STT (Whisper, Voxtral) | File upload |
| Chat with audio context | /v1/chat/completions | Multimodal (Gemma-3n) | Base64 JSON |
| Long audio (>30s) | /v1/audio/transcriptions | STT (Whisper) | File upload |
Client Implementation Notes:
- Use multipart/form-data content type (not application/json)
- File field name must be file
- Maximum file size: 50 MB (~15 min @ 16kHz mono)
- Requires mlx-audio on server (pip install mlx-knife[audio])
Cross-Model Workflows (Vision/Audio → Text)
When switching from Vision or Audio to Text model mid-conversation:
- Client: Continue sending full message list (media payloads can be stripped if mapping tables exist)
- Server: Automatically filters any remaining media for text models, replaces with placeholders
- Result: Text model sees [n image(s) were attached] or [n audio(s) were attached]
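The server-side filtering can be pictured roughly as follows. This is a sketch under assumptions: `strip_media_for_text_model` is a hypothetical name, and appending the placeholder on its own line after the text is a guess at the layout, since the handbook only specifies the placeholder wording.

```python
def strip_media_for_text_model(messages: list) -> list:
    """Replace media blocks with placeholders so a text model gets plain strings."""
    result = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            result.append(message)  # already plain text
            continue
        texts = [b.get("text", "") for b in content if b.get("type") == "text"]
        images = sum(1 for b in content if b.get("type") == "image_url")
        audios = sum(1 for b in content if b.get("type") == "input_audio")
        if images:
            texts.append(f"[{images} image(s) were attached]")
        if audios:
            texts.append(f"[{audios} audio(s) were attached]")
        result.append({**message, "content": "\n".join(texts)})
    return result


vision_turn = {"role": "user", "content": [
    {"type": "text", "text": "compare"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]}
print(strip_media_for_text_model([vision_turn])[0]["content"])
```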
Example workflow:
1. Vision model: User sends 2 images → Model describes both
2. Vision model: User asks "What's different?" → Model compares
3. Switch to Text model: User asks "Which is better for vacation?"
4. Text model: Sees "[2 image(s) were attached]" in history, can reference the conversation
Storage optimization: After the first Vision request, clients can drop Base64 payloads from history while preserving assistant responses with <!-- mlxk:filenames --> markers. The server reconstructs image IDs from these markers.
Changelog
- 2026-01-31: 2.0.4-beta.9
  - NEW: /v1/audio/transcriptions endpoint (OpenAI Whisper API compatible)
  - Direct file upload for STT models (Whisper, Voxtral)
  - Server preload support for audio models
  - Response formats: json, text, verbose_json
  - Supported audio formats: WAV, MP3, M4A, FLAC, OGG
- 2026-01-20: 2.0.4-beta.8
  - NEW: Audio input support via OpenAI input_audio format (chat completions)
  - Supported formats: WAV, MP3
  - Audio-capable models: Gemma-3n (others as available)
  - Limits: 5 MB per audio, 1 audio per request
  - Temperature: 0.0 for transcription consistency
  - History filter: input_audio → [n audio(s) were attached]
- 2025-12-15: 2.0.4-beta.1 WIP
  - Vision support: Base64 images, multiple images, limits
  - History-based stable image IDs (stateless, OpenAI-compatible)
  - NEW: Server reads mapping tables from assistant responses (Image ID persistence without Base64)
  - Vision: Stateless prompt + history-based IDs (pattern reproduction fix)
  - Vision: temperature=0.0 (greedy sampling, reduces hallucinations)
  - Vision vs Text max_tokens strategy
  - Memory-aware loading (HTTP 507)
  - Feature gates and troubleshooting
📝 Note: This handbook will be updated continuously until 2.1 stable release. Check version header for freshness.