mirror of https://github.com/cloudstack-llc/mlx-knife.git synced 2026-07-01 20:44:14 -04:00

The BROKE Cluster Team 5751545b8b Release 2.0.4-beta.7: Server robustness + Vision per-chunk streaming

- Server: exit codes, /v1/models crash fix, vision routing, MLXK2_MAX_TOKENS
- Vision: true SSE streaming, hallucination fix (local numbering)
- Workspace: list prefix-match, push ambiguous pattern handling
- Docs: SERVER-HANDBOOK accuracy updates

See CHANGELOG.md for details.

2026-01-18 16:57:32 +01:00


MLX Knife Server Handbook

Version: 2.0.4-beta.1 (WIP)
Status: ⚠️ WORK IN PROGRESS - This document will evolve until the 2.1 stable release
Last Updated: 2025-12-15

Audience: Server operators, DevOps, API consumers
For implementation details: See ARCHITECTURE.md and docs/ADR/ (developer documentation)

Quick Start

# Basic server
mlxk serve --port 8000

# JSON logging (production)
mlxk serve --port 8000 --log-json

# Custom host
mlxk serve --host 0.0.0.0 --port 8000

Requirements:

Python 3.9+ (Text models)
Python 3.10+ (Vision models)
mlx-lm 0.28.4+
mlx-vlm 0.3.9+ (optional, for vision; beta.3 recommends commit c4ea290e47e2155b67d94c708c662f8ab64e1b37)

API Endpoints

POST /v1/chat/completions

OpenAI-compatible chat completion endpoint.

Request:

{
  "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": null,
  "temperature": 0.7,
  "stream": false
}

Vision Request (Base64 Images):

{
  "model": "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.4,
  "chunk": 1
}
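A client has to turn raw image bytes into the Base64 data URL used in the request above. The following is a minimal sketch of that step; `image_part` and `vision_message` are hypothetical helper names, not part of MLX Knife.

```python
import base64

def image_part(data: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}

def vision_message(text: str, images: list, mime: str = "image/jpeg") -> dict:
    """Build one user message: the text part first, then one part per image."""
    parts = [{"type": "text", "text": text}]
    parts.extend(image_part(img, mime) for img in images)
    return {"role": "user", "content": parts}

# Four bytes of a JPEG header stand in for a real file here.
msg = vision_message("What's in this image?", [b"\xff\xd8\xff\xe0"])
print(msg["content"][1]["image_url"]["url"])
```

In practice the bytes would come from `open(path, "rb").read()`, and `mime` should match the actual file format.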

mlx-knife Extension Parameters:

chunk (integer, optional): Batch size for vision processing (default: 1). Controls how many images are processed per inference session. Higher values may trigger OOM on resource-constrained systems. Maximum: 5 (enforced by server).

Chunk size resolution (highest priority first):

Request parameter chunk (highest priority)
Server startup: mlxk serve --chunk N
Environment: MLXK2_VISION_CHUNK_SIZE=N
Default: 1 (maximum safety)
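The precedence above can be sketched as a small resolver. This illustrates the documented order only and is not the server's actual code; `resolve_chunk_size` is a hypothetical name, and rejecting out-of-range values with an error (rather than clamping them) is an assumption.

```python
import os
from typing import Mapping, Optional

MAX_CHUNK = 5      # documented server-enforced maximum
DEFAULT_CHUNK = 1  # documented default

def resolve_chunk_size(request_chunk: Optional[int] = None,
                       cli_chunk: Optional[int] = None,
                       env: Mapping[str, str] = os.environ) -> int:
    """Pick the first configured value: request, then CLI flag, then environment."""
    env_value = env.get("MLXK2_VISION_CHUNK_SIZE")
    candidates = (request_chunk, cli_chunk, int(env_value) if env_value else None)
    chunk = next((c for c in candidates if c is not None), DEFAULT_CHUNK)
    if not 1 <= chunk <= MAX_CHUNK:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError(f"chunk must be between 1 and {MAX_CHUNK}, got {chunk}")
    return chunk
```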

Response:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1702345678,
  "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
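For clients that do not use an OpenAI SDK, the request and response shapes above can be handled with the Python standard library alone. This is a minimal sketch; the helper names and the 300-second timeout are assumptions, and the base URL depends on how the server was started.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def first_reply(response: dict) -> str:
    """Extract the assistant text from a chat.completion response body."""
    return response["choices"][0]["message"]["content"]

def chat(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the reply (requires a running server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as http_response:
        return first_reply(json.load(http_response))
```

Usage against a local server would look like `chat("http://localhost:8000", "mlx-community/Llama-3.2-3B-Instruct-4bit", "Hello!")`.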

POST /v1/completions

Legacy completion endpoint (text-only, no chat template).

Request:

{
  "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
  "prompt": "Once upon a time",
  "max_tokens": 100,
  "temperature": 0.7
}

GET /v1/models

List available models.

Returns all cached models that are healthy and runtime-compatible. Models are sorted with the preloaded model first (if any), then alphabetically.

Response:

{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Llama-3.2-3B-Instruct-4bit",
      "object": "model",
      "owned_by": "mlx-knife-2.0",
      "permission": [],
      "context_length": 8192
    }
  ]
}

Fields:

id: Model identifier (HuggingFace name or workspace path)
object: Always "model" (OpenAI-compatible)
owned_by: "mlx-knife-2.0" for cached models, "workspace" for local directories
permission: Empty array (OpenAI legacy field)
context_length: Maximum context window in tokens (may be null if unavailable)

Why context_length matters:

MLX Knife uses client-side context management (unlike OpenAI's server-side history):

Vision models: Fully stateless - client holds entire conversation history
Text models: Shift-window (context_length / 2 reserved for history on server)
Clients need this to manage conversation pruning and token budgets
Load balancing: BROKE Cluster and similar tools use this for scheduling decisions

Note: LM Studio exposes a similar field named max_context_length.
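A text-model client could use context_length to prune old turns before sending a request. The sketch below is a rough illustration: the 4-characters-per-token estimate is a crude assumption (a real client would use the model's tokenizer), and `prune_history` is a hypothetical helper.

```python
from typing import Optional

def estimate_tokens(message: dict) -> int:
    """Very rough heuristic (assumption): about 4 characters per token."""
    content = message["content"]
    if isinstance(content, list):  # multimodal message: count text parts only
        content = " ".join(p.get("text", "") for p in content if p.get("type") == "text")
    return max(1, len(content) // 4)

def prune_history(messages: list, context_length: Optional[int], reserve_for_reply: int) -> list:
    """Keep the newest messages that fit in context_length minus the reply reserve."""
    if context_length is None:  # the field may be null; nothing to budget against
        return list(messages)
    budget = context_length - reserve_for_reply
    kept, used = [], 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if kept and used + cost > budget:
            break  # the latest message is always kept, older ones only if they fit
        kept.append(message)
        used += cost
    return list(reversed(kept))
```

Dropping assistant messages that carry image mapping tables would also change image numbering, so this kind of pruning is best limited to text-only conversations.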

GET /health

Server health check (200 OK if server is running).

Features & Capabilities

Vision Support (2.0.4-beta.1)

See examples/vision_pipe.sh for a practical Vision→Text pipeline example (CLI).

Supported:

✅ Base64 data URLs (data:image/jpeg;base64,...)
✅ Multiple images (up to 5 per request)
✅ Formats: JPEG, PNG, GIF, WebP

Limits:

Per-image: 20 MB max
Count: 5 images max per request

Important Characteristics:

Stateless Server: No server-side state required
Sequential Images: Only images from the last user message are processed (OpenAI API compliant)
Each request is independent: No "shift-window" context like text models (Metal memory limitations)

Stable Image IDs (History-Based)

Problem: How to maintain stable "Image 1, 2, 3..." numbering across multiple requests?

Solution: The conversation history IS the session.

The server scans the full messages[] array (which clients send with each request per OpenAI API) and assigns IDs chronologically based on content hash:

Request 1: beach.jpg (hash: 5c691ddb) → Image 1
Request 2: beach.jpg + mountain.jpg in history → Image 1, Image 2
Request 3: Re-upload beach.jpg → Still Image 1 (hash match)
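The numbering scheme in the example can be sketched as a single pass over messages[]. This illustrates the idea only and is not the server's implementation: hashing the data URL with SHA-256 and keeping 8 hex digits is an assumption, and the function names are hypothetical.

```python
import hashlib

def short_hash(data_url: str) -> str:
    """Assumed content hash: the first 8 hex digits of SHA-256 over the data URL."""
    return hashlib.sha256(data_url.encode("utf-8")).hexdigest()[:8]

def assign_image_ids(messages: list) -> dict:
    """Walk the history in order; the first sighting of a hash fixes its ID."""
    ids = {}
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-text message, no images
        for part in content:
            if part.get("type") == "image_url":
                key = short_hash(part["image_url"]["url"])
                ids.setdefault(key, len(ids) + 1)
    return ids
```

Because IDs depend only on the order of first appearance, re-sending the same history always produces the same numbering.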

Properties:

✅ 100% OpenAI API compatible — standard messages[] format, no custom headers
✅ Stateless server — no registry, no TTL, no cleanup
✅ Content-hash deduplication — same image always gets same ID
✅ Cross-model workflows — "Image 1" stable across Vision↔Text model switches

Client Responsibility:

Maintain full conversation history in messages[] array
Same content = same ID (content-hash based)

Python Version:

✅ Python 3.10+ required (mlx-vlm dependency)
❌ Python 3.9: Vision requests → HTTP 501

Token Limits: Vision vs Text Models

Critical Difference: Vision and text models use different max_tokens strategies.

Text Models (MLXRunner)

Strategy: Shift-window context management

Conversation history maintained in context buffer
Server reserves space for history

Defaults:

Server: context_length / 2 (reserve half for history, half for generation)
CLI: context_length (full context, no reservation)

Example:

Llama-3.2-3B (128K context) → Server default: 64K max_tokens

Vision Models (VisionRunner)

Strategy: Stateless processing

Each request is independent (no conversation history in context)
Metal limitations prevent context preservation

Defaults:

Server/CLI: 2048 tokens (conservative, works for all models)

Rationale:

No need for /2 division (no history to reserve)
Vision inference is slow → 2048 adequate for image descriptions
Prevents accidentally generating 64K+ tokens

Override:

{
  "model": "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
  "messages": [...],
  "max_tokens": 4096  // Explicit override
}
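The default rules for both model types fit in one function. This is a sketch of the documented behavior, not the server's code; `default_max_tokens` is a hypothetical name, and 128K is taken here as 131072 tokens.

```python
from typing import Optional

VISION_DEFAULT_MAX_TOKENS = 2048  # documented vision default (server and CLI)

def default_max_tokens(is_vision: bool, context_length: int,
                       server: bool = True, requested: Optional[int] = None) -> int:
    """Return the effective max_tokens for a request."""
    if requested is not None:
        return requested                  # an explicit max_tokens always wins
    if is_vision:
        return VISION_DEFAULT_MAX_TOKENS  # stateless: no history to reserve for
    # Text models: the server keeps half the window for history, the CLI uses all of it.
    return context_length // 2 if server else context_length

print(default_max_tokens(False, 131072))                # 65536
print(default_max_tokens(True, 131072, requested=4096)) # 4096
```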

Memory-Aware Loading (ADR-016)

Pre-load memory checks prevent OOM crashes.

Vision Models

Threshold: 70% system RAM
Behavior: Model size > 70% → HTTP 507 (Insufficient Storage)
Rationale: Vision Encoder has unpredictable per-image overhead

Example (64GB system):

Llama-3.2-11B-Vision (5.6GB) → ✅ Loads (8.75% of RAM)
Llama-3.2-90B-Vision (46.4GB) → ❌ HTTP 507 (72.5% of RAM)

Text Models

Threshold: 70% system RAM
Behavior: Model size > 70% → Warning only (backwards compatible)
Rationale: Text models swap gracefully, no hard memory spikes
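The two policies differ only in what happens above the threshold. A minimal sketch of the decision, reusing the 64GB figures from the example; the function name and return labels are hypothetical.

```python
RAM_THRESHOLD = 0.70  # documented threshold: 70% of system RAM

def check_memory(model_gb: float, system_gb: float, is_vision: bool) -> str:
    """Return 'load', 'warn' (text models) or 'reject_507' (vision models)."""
    ratio = model_gb / system_gb
    if ratio <= RAM_THRESHOLD:
        return "load"
    return "reject_507" if is_vision else "warn"

print(check_memory(5.6, 64, is_vision=True))    # load       (8.75% of RAM)
print(check_memory(46.4, 64, is_vision=True))   # reject_507 (72.5% of RAM)
print(check_memory(46.4, 64, is_vision=False))  # warn
```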

Streaming (SSE - Server-Sent Events)

Text Models

✅ True streaming: Tokens streamed as generated
Format: SSE (data: {...}\n\n)
Completion: data: [DONE]\n\n

Vision Models

✅ Per-chunk streaming: Real SSE events as each image chunk completes (2.0.4-beta.7+)
Multiple images: Each chunk (1-5 images) streams as it finishes processing
Single image: Behaves like batch mode (one SSE event)
Format: OpenAI-compatible SSE with per-chunk deltas

Request:

{
  "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
  "messages": [...],
  "stream": true
}

Response (SSE stream):

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Hello"},"index":0}],...}

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":" there"},"index":0}],...}

data: [DONE]
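A client consumes the stream line by line until the [DONE] sentinel. The following sketch parses already-received lines; `iter_deltas` is a hypothetical helper, and the sample lines are shortened versions of the events above.

```python
import json
from typing import Iterable, Iterator

def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the content fragments of an OpenAI-style SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break     # end-of-stream sentinel
        chunk = json.loads(payload)
        content = chunk["choices"][0].get("delta", {}).get("content")
        if content:
            yield content

sample = [
    'data: {"id":"chatcmpl-1","choices":[{"delta":{"content":"Hello"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-1","choices":[{"delta":{"content":" there"},"index":0}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # prints: Hello there
```

With a live connection, `lines` would be the decoded lines of the HTTP response body instead of a list.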

Configuration

Environment Variables

# Server binding
MLXK2_HOST=0.0.0.0
MLXK2_PORT=8000

# Logging
MLXK2_LOG_JSON=1          # JSON logs (production)
MLXK2_LOG_LEVEL=info      # debug|info|warning|error

# Feature gates (beta features)
MLXK2_ENABLE_PIPES=1      # Unix pipe integration (2.0.4-beta.1)

Supervised Mode (Default)

Behavior:

Handles Ctrl-C gracefully (clean shutdown with 5s timeout)
Runs server in subprocess for improved signal handling
Logs go to stderr
--log-json produces 100% JSON output
Note: No auto-restart on crashes (use systemd/supervisor for production)

Start:

mlxk serve --port 8000 --log-json

Direct Mode (Development)

Behavior:

No auto-restart
Direct uvicorn process

Start:

python -m mlxk2.core.server_base

HTTP Status Codes

Success

200 OK: Request successful
201 Created: Resource created (future)

Client Errors (4xx)

400 Bad Request: Invalid input (e.g., missing images for vision model, too many images)
404 Not Found: Model not found in cache

Server Errors (5xx)

500 Internal Server Error: Unexpected backend failure
501 Not Implemented: Feature not supported (e.g., vision on Python 3.9)
503 Service Unavailable: Server shutting down
507 Insufficient Storage: Memory constraints violated (vision model >70% RAM)
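Clients can map these codes to a reaction. The table below is a sketch: the hints paraphrase the list above, while the `retry` flags are assumptions about sensible client behavior, not something the server prescribes.

```python
from typing import NamedTuple

class Advice(NamedTuple):
    retry: bool  # assumption: whether resending the same request can succeed later
    hint: str

STATUS_ADVICE = {
    400: Advice(False, "fix the request (image count, size, format, encoding)"),
    404: Advice(False, "model is not in the cache"),
    500: Advice(False, "unexpected backend failure; inspect the server logs"),
    501: Advice(False, "feature not supported, e.g. vision on Python 3.9"),
    503: Advice(True, "server is shutting down; retry once it is back"),
    507: Advice(False, "model exceeds the memory threshold; use a smaller one"),
}

def advise(status: int) -> Advice:
    """Look up how a client might react to an HTTP status from the server."""
    if 200 <= status < 300:
        return Advice(False, "ok")
    return STATUS_ADVICE.get(status, Advice(False, "unhandled status"))
```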

Performance Characteristics

Model Loading

Time: ~5-10 seconds (first request only)
Caching: Model stays loaded until server restart or model switch
Memory: Held in RAM until explicitly unloaded

Inference Speed

Text Models:

Typical: 20-50 tokens/sec (depends on model size, hardware)
Streaming: Real-time token output

Vision Models:

Slower than text: Vision Encoder adds overhead
Per-image: ~2-5 seconds baseline + generation time
Multiple images: Processed in chunks (default: 1, max: 5 via --chunk)
Streaming: Each chunk delivers results immediately (see Streaming section above)

Concurrent Requests

Current: Sequential processing (one request at a time)
Reason: Metal backend, single GPU
Future: May add request queuing

Troubleshooting

Vision Model Fails on Python 3.9

Symptom: HTTP 501 "Vision models require Python 3.10+"

Solution:

# Upgrade Python
pyenv install 3.10
pyenv local 3.10
pip install mlx-lm mlx-vlm

# Beta.3 (pre-0.3.10 fix)
pip install mlx-lm "mlx-vlm @ git+https://github.com/Blaizzy/mlx-vlm.git@c4ea290e47e2155b67d94c708c662f8ab64e1b37"

Memory Constraint Errors (HTTP 507)

Symptom: Model requires XGB but only YGB available (70% of system RAM)

Solutions:

Use smaller quantized model (e.g., 4-bit instead of 8-bit)
Add more system RAM
Try different model architecture

Vision Responses Too Short

Symptom: Responses truncated mid-sentence

Cause: Default max_tokens: 2048 might be too low for complex descriptions

Solution:

{
  "model": "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
  "messages": [...],
  "max_tokens": 4096  // Increase limit
}

Image Upload Fails (HTTP 400)

Common causes:

Image size > 20 MB per image
More than 5 images per request
Unsupported format (use JPEG, PNG, GIF, WebP)
External URLs (not supported, use Base64 data URLs)
Invalid Base64 encoding

Solution: Resize images, reduce count, or check encoding

Limits Summary

| Resource | Limit | Reason |
|----------|-------|--------|
| Images per request | 5 | Metal OOM prevention |
| Image size | 20 MB | Metal OOM prevention |
| Total image size | 50 MB | Metal OOM prevention |
| Vision model RAM | 70% system | Metal OOM prevention |
| Text model RAM | 70% (warning) | Swap tolerance |
| Vision max_tokens | 2048 (default) | Stateless, slow inference |
| Text max_tokens | context_length/2 | Shift-window reservation |

Migration Guide

From 2.0.3 → 2.0.4-beta.1

New Features:

✅ Vision support (Python 3.10+)
✅ Memory pre-load checks (HTTP 507)
✅ Unix pipe integration (MLXK2_ENABLE_PIPES=1)

Breaking Changes:

⚠️ Vision models: max_tokens default changed from 1024 → 2048
⚠️ Memory checks: Vision models >70% RAM now blocked (was: no check)

Recommendations:

Update clients expecting vision max_tokens: 1024 to handle 2048
Monitor for HTTP 507 errors (memory constraints)
Test vision workflows on Python 3.10+

References

API Schema: docs/json-api-specification.md
Architecture Principles: docs/ARCHITECTURE.md
Testing Details: TESTING-DETAILS.md
ADR-012: Vision Support (development decisions)
ADR-016: Memory-Aware Loading (development decisions)

Appendix: Client Requirements

Audience: Client developers integrating with MLX Knife server

OpenAI API Compliance

Clients MUST follow the OpenAI Chat Completions API format. MLX Knife is designed to work with any OpenAI-compatible client.

Conversation History

Clients MUST send the full conversation history with each request:

{
  "model": "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
  "messages": [
    {"role": "user", "content": [...]},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": [...]}
  ]
}

Why: The server reconstructs stable image IDs from the history. Without full history, image numbering restarts at 1 with each request.

Vision: Stateless Prompt, History-Based IDs

Important architectural distinction for Vision requests:

| Aspect | Behavior | Reason |
|--------|----------|--------|
| Prompt to model | Only last user message | Prevents pattern reproduction (model copying old mappings) |
| Image ID assignment | Full history scanned | Consistent numbering across session (Image 1, 2, 3...) |

What this means:

The Vision model does NOT see previous assistant responses
But image numbering remains stable across the conversation
Follow-up questions about image descriptions should use a Text model (which has full history)

Recommended workflow:

1. Vision model: User sends beach.jpg → "Image 1 shows a beach..."
2. Vision model: User sends mountain.jpg → "Image 2 shows a mountain..."
3. Text model: User asks "Compare these two locations" → Full context available

Rationale:

Vision models can't "see" previous images anyway (Metal memory limitations)
Sending history caused pattern reproduction (model hallucinating mappings)
Clean separation: Vision=describe, Text=discuss

Vision Messages Format

Multimodal content uses the OpenAI array format:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
  ]
}

Image URLs:

✅ Base64 Data URLs: data:image/jpeg;base64,/9j/4AAQ...
❌ HTTP URLs: Not supported (no external fetching)

Supported formats: JPEG, PNG, GIF, WebP

Cross-Model Workflows (Vision → Text)

When switching from Vision to Text model mid-conversation:

Client: Continue sending full history (including previous image_url content)
Server: Automatically filters images for text models, replaces with placeholders
Result: Text model sees [n image(s) were attached] instead of binary data

Example workflow:

1. Vision model: User sends 2 images → Model describes both
2. Vision model: User asks "What's different?" → Model compares
3. Switch to Text model: User asks "Which is better for vacation?"
4. Text model: Sees "[2 image(s) were attached]" in history, can reference the conversation
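The server-side filtering described above can be pictured as follows. This is a sketch of the observable result only: where the placeholder is placed within the message and how text parts are joined are assumptions, and `strip_images_for_text_model` is a hypothetical name.

```python
def strip_images_for_text_model(messages: list) -> list:
    """Replace image parts with a '[n image(s) were attached]' placeholder."""
    result = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            result.append(message)  # already plain text
            continue
        texts = [p["text"] for p in content if p.get("type") == "text"]
        n_images = sum(1 for p in content if p.get("type") == "image_url")
        if n_images:
            texts.append(f"[{n_images} image(s) were attached]")
        result.append({**message, "content": "\n".join(texts)})
    return result
```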

Image Deduplication

Same image content = same ID (content-hash based).

Client behavior:

Re-uploading the same image → Server assigns same ID
No client-side deduplication needed

Image ID Persistence (100% OpenAI-Compatible)

Problem: How do Image IDs remain stable across Vision→Text→Vision workflows when clients drop Base64 data from history (storage optimization)?

Solution: The server reads its own filename mapping tables from assistant responses.

Workflow:

Request 1 (Vision): Client sends beach.jpg

{"role": "user", "content": [
  {"type": "text", "text": "describe"},
  {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]}

Server Response: Includes filename mapping table

A sandy beach with blue water.

<!-- mlxk:filenames -->
| Image | Filename |
|-------|----------|
| 1 | image_5733332c.jpeg |

Client Storage Optimization: Client can drop Base64 from history, keep only:

{"role": "user", "content": "describe"}
{"role": "assistant", "content": "A sandy beach...\n\n<!-- mlxk:filenames -->\n..."}

Request 3 (Vision, after an intervening Text request): Client sends mountain.jpg with text-only history

{
  "messages": [
    {"role": "user", "content": "describe"},  // No Base64!
    {"role": "assistant", "content": "Beach...\n\n| 1 | image_5733332c.jpeg |"},
    {"role": "user", "content": "What color?"},
    {"role": "assistant", "content": "Blue."},
    {"role": "user", "content": [
      {"type": "text", "text": "new picture"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]}
  ]
}

Server Reconstruction: Server scans history:
- Finds the <!-- mlxk:filenames --> marker in assistant response
- Parses: image_5733332c.jpeg → Image ID 1
- Assigns: mountain.jpg → Image ID 2 ✅
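The reconstruction steps above can be sketched with a regular expression over the assistant messages. This is an illustration under assumptions: the row pattern is inferred from the example table, the function names are hypothetical, and the real parser may be stricter.

```python
import re

MARKER = "<!-- mlxk:filenames -->"
# Matches data rows such as "| 1 | image_5733332c.jpeg |"; the header and
# separator rows do not match because their first cell is not a number.
ROW = re.compile(r"^\|\s*(\d+)\s*\|\s*(\S+)\s*\|\s*$", re.MULTILINE)

def parse_filename_table(assistant_text: str) -> dict:
    """Return {filename: image_id} from the table that follows the marker."""
    _, found, table = assistant_text.partition(MARKER)
    if not found:
        return {}
    return {name: int(num) for num, name in ROW.findall(table)}

def next_image_id(messages: list) -> int:
    """The next free ID is one past the highest ID found in the history."""
    highest = 0
    for message in messages:
        content = message.get("content")
        if message.get("role") == "assistant" and isinstance(content, str):
            highest = max([highest, *parse_filename_table(content).values()])
    return highest + 1
```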

Benefits:

✅ Zero client changes - Works with standard OpenAI API
✅ Storage optimization - Client can drop large Base64 data (2 MB → 2 KB)
✅ 100% OpenAI-compatible - No protocol extensions needed
✅ Stateless server - No server-side session state required
✅ Scales to 100+ images - Clients only store small text mappings

Client Recommendations:

After first Vision request: Drop Base64 image_url from history, keep text + assistant response
Store locally: Small thumbnails for UI (~20 KB/image via IndexedDB)
History format: Text-only user messages + full assistant responses (with mapping tables)

Example client storage (100 images):

❌ Before: 100 images × 2 MB Base64 = 200 MB (exceeds browser limits)
✅ After: 100 thumbnails × 20 KB + text history = ~2 MB (fits in IndexedDB)
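The storage optimization recommended above amounts to rewriting stored user messages as text-only. A minimal client-side sketch, assuming the history is held as a list of message dicts; `compact_history` is a hypothetical helper, and it should be applied to stored turns after the response arrives, not to the request that carries a new image.

```python
def compact_history(messages: list) -> list:
    """Drop Base64 image parts from user messages; keep assistant replies intact."""
    compacted = []
    for message in messages:
        content = message.get("content")
        if message.get("role") == "user" and isinstance(content, list):
            # Keep only the text parts; the mapping table in the assistant
            # reply is what preserves the image numbering.
            text = " ".join(p["text"] for p in content if p.get("type") == "text")
            compacted.append({"role": "user", "content": text})
        else:
            compacted.append(message)
    return compacted
```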

Changelog

2025-12-15: 2.0.4-beta.1 WIP
- Vision support: Base64 images, multiple images, limits
- History-based stable image IDs (stateless, OpenAI-compatible)
- NEW: Server reads mapping tables from assistant responses (Image ID persistence without Base64)
- Vision: Stateless prompt + history-based IDs (pattern reproduction fix)
- Vision: temperature=0.0 (greedy sampling, reduces hallucinations)
- Vision vs Text max_tokens strategy
- Memory-aware loading (HTTP 507)
- Feature gates and troubleshooting

📝 Note: This handbook will be updated continuously until 2.1 stable release. Check version header for freshness.
