MLX Knife
A lightweight, ollama-like CLI for managing and running MLX models on Apple Silicon. Designed for personal, local use - perfect for individual developers and researchers working with MLX models.
Current Version: 1.0-rc1 (August 2025)
Features
Core Functionality
- List & Manage Models: Browse your HuggingFace cache with MLX-specific filtering
- Model Information: Detailed model metadata including quantization info
- Download Models: Pull models from HuggingFace with progress tracking
- Run Models: Native MLX execution with streaming and chat modes
- Health Checks: Verify model integrity and completeness
- Cache Management: Clean up and organize your model storage
Local Server & Web Interface
- OpenAI-Compatible API: Local REST API with /v1/chat/completions, /v1/completions, /v1/models
- Web Chat Interface: Built-in HTML chat interface with markdown rendering
- Single-User Design: Optimized for personal use, not multi-user production environments
- Conversation Context: Full chat history maintained for follow-up questions
- Streaming Support: Real-time token streaming via Server-Sent Events
- Configurable Limits: Set default max tokens via the --max-tokens parameter
- Model Hot-Swapping: Switch between models per conversation
- Tool Integration: Compatible with OpenAI-compatible clients (Cursor IDE, etc.)
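For readers who want to consume the streaming endpoint without a client library, here is a minimal sketch of parsing the event stream. It assumes the server emits the standard OpenAI chunk format (lines of the form `data: {json}` carrying `choices[0].delta.content`, terminated by `data: [DONE]`). The README states OpenAI compatibility but does not spell out the wire format, so treat that as an assumption. The function name `iter_stream_text` is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of OpenAI-style SSE lines.

    Assumption: each event is a single 'data: ...' line, as in the
    OpenAI streaming format; other SSE fields are ignored.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Example with canned lines instead of a live connection:
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
streamed_text = "".join(iter_stream_text(sample_lines))
```

In a real client the `lines` argument would be the decoded lines of the HTTP response body from a request sent with `"stream": true`.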
Run Experience
- Direct MLX Integration: Models load and run natively without subprocess overhead
- Real-time Streaming: Watch tokens generate with proper spacing and formatting
- Interactive Chat: Full conversational mode with history tracking
- Memory Insights: See GPU memory usage after model loading and generation
- Dynamic Stop Tokens: Automatic detection and filtering of model-specific stop tokens
- Customizable Generation: Control temperature, max_tokens, top_p, and repetition penalty
- RAII Memory Management: Context manager pattern ensures automatic cleanup and no memory leaks
- Exception-Safe: Robust error handling with guaranteed resource cleanup
Installation
Requirements
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.9+ (native macOS version or newer)
- 8GB+ RAM recommended, plus enough free memory to hold the model you want to run
Python Compatibility
MLX Knife has been comprehensively tested and verified on:
✅ Python 3.9.6 (native macOS) - Primary target
✅ Python 3.10-3.13 - Fully compatible
All versions include full MLX model execution testing with real models.
Install from Source
# Clone the repository
git clone https://github.com/mzau/mlx-knife.git
cd mlx-knife
# Install in development mode
pip install -e .
# Or install normally
pip install .
# Install with development tools (ruff, mypy, tests)
pip install -e ".[dev,test]"
Install Dependencies Only
pip install -r requirements.txt
Quick Start
CLI Usage
# List all MLX models in your cache
mlxk list
# Show detailed info about a model
mlxk show Phi-3-mini-4k-instruct-4bit
# Download a new model
mlxk pull mlx-community/Mistral-7B-Instruct-v0.3-4bit
# Run a model with a prompt
mlxk run Phi-3-mini "What is the capital of France?"
# Start interactive chat
mlxk run Phi-3-mini
# Check model health
mlxk health
Web Chat Interface
MLX Knife includes a built-in web interface for easy model interaction:
# Start the OpenAI-compatible API server
mlxk server --port 8000 --max-tokens 4000
# Open web chat interface in your browser
open simple_chat.html
Features:
- No installation required - Pure HTML/CSS/JS
- Real-time streaming - Watch tokens appear as they're generated
- Model selection - Choose any MLX model from your cache
- Conversation history - Full context for follow-up questions
- Markdown rendering - Proper formatting for code, lists, tables
- Mobile-friendly - Responsive design works on all devices
Local API Server Integration
The MLX Knife server provides OpenAI-compatible endpoints for local development and personal use:
# Start local server (single-user, no authentication)
mlxk server --host 127.0.0.1 --port 8000
# Test with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-3-mini-4k-instruct-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
# Integration with development tools (community-tested):
# - Cursor IDE: Set API URL to http://localhost:8000/v1
# - LibreChat: Configure as custom OpenAI endpoint
# - Open WebUI: Add as local OpenAI-compatible API
# - SillyTavern: Add as OpenAI API with custom URL
Note: Tool integrations are community-tested. Some tools may require specific configuration or have compatibility limitations. Please report issues via GitHub.
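The same request as the curl call above can be sent from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `chat` are hypothetical, and the response shape (`choices[0].message.content`) is assumed to follow the OpenAI chat completions format that the server advertises.

```python
import json
import urllib.request


def build_chat_request(model, messages, base_url="http://localhost:8000/v1",
                       max_tokens=None):
    """Build a POST request for the /v1/chat/completions endpoint."""
    body = {"model": model, "messages": messages}
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_chat_request(model, messages, **kwargs)
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Usage (with `mlxk server` running on port 8000):
#   history = [{"role": "user", "content": "Hello!"}]
#   answer = chat("Phi-3-mini-4k-instruct-4bit", history)
#   history.append({"role": "assistant", "content": answer})
```

Appending each reply to the `messages` list before the next call is how a client keeps conversation context across turns.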
Command Reference
Available Commands
list - Browse Models
mlxk list # Show MLX models only (short names)
mlxk list --verbose # Show MLX models with full paths
mlxk list --all # Show all models with framework info
mlxk list --all --verbose # All models with full paths
mlxk list --health # Include health status
mlxk list Phi-3 # Filter by model name
mlxk list --verbose Phi-3 # Show detailed info (same as show)
show - Model Details
mlxk show <model> # Display model information
mlxk show <model> --files # Include file listing
mlxk show <model> --config # Show config.json content
pull - Download Models
mlxk pull <model> # Download from HuggingFace
mlxk pull <org>/<model> # Full model path
run - Execute Models
mlxk run <model> "prompt" # Single prompt (minimal output)
mlxk run <model> "prompt" --verbose # Show loading, memory, and stats
mlxk run <model> # Interactive chat
mlxk run <model> "prompt" --no-stream # Batch output
mlxk run <model> --max-tokens 1000 # Custom length
mlxk run <model> --temperature 0.9 # Higher creativity
mlxk run <model> --no-chat-template # Raw completion mode
rm - Remove Models
mlxk rm <model> # Delete a model
mlxk rm <model> --force # Skip confirmation
health - Check Integrity
mlxk health # Check all models
mlxk health <model> # Check specific model
server - Start API Server
mlxk server # Start on localhost:8000
mlxk server --port 8001 # Custom port
mlxk server --host 0.0.0.0 --port 8000 # Allow external access
mlxk server --max-tokens 4000 # Set default max tokens (default: 2000)
mlxk server --reload # Development mode with auto-reload
Command Aliases
After installation, these commands are equivalent:
- mlxk (recommended)
- mlx-knife
- mlx_knife
Project Structure
mlx_knife/
├── __init__.py # Package metadata and version
├── cli.py # Command-line interface and argument parsing
├── cache_utils.py # Core model management functionality
├── mlx_runner.py # Native MLX model execution
├── server.py # OpenAI-compatible API server with FastAPI
├── hf_download.py # HuggingFace download integration
├── throttled_download_worker.py # Background download worker
├── requirements.txt # Python dependencies
├── pyproject.toml # Package configuration
├── simple_chat.html # Built-in web chat interface
└── README.md # This file
Module Overview
- cli.py: Entry point handling command parsing and dispatch
- cache_utils.py: Model discovery, metadata extraction, and cache operations
- mlx_runner.py: MLX model loading, token generation, and streaming
- server.py: FastAPI-based REST API server with OpenAI compatibility
- simple_chat.html: Standalone web chat interface for immediate use
- hf_download.py: Robust downloading with progress tracking
- throttled_download_worker.py: Prevents network overload during downloads
Configuration
Cache Location
By default, models are stored in ~/.cache/huggingface/hub. Configure with:
# Set custom cache location
export HF_HOME="/path/to/your/cache"
# Example: External SSD
export HF_HOME="/Volumes/ExternalSSD/models"
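To see where a given model would end up on disk, the following sketch resolves the hub directory and the per-model folder name. It follows the Hugging Face cache convention (models live under `$HF_HOME/hub` in folders named `models--<org>--<name>`). Whether MLX Knife resolves `HF_HOME` in exactly this way is an assumption, and both function names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Return the hub cache directory, honoring HF_HOME if it is set.

    Assumption: the 'hub' subfolder is appended to HF_HOME, as in the
    Hugging Face convention.
    """
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME")
    root = Path(hf_home) if hf_home else Path.home() / ".cache" / "huggingface"
    return root / "hub"


def repo_cache_dirname(repo_id):
    """Map 'org/name' to the folder name used inside the hub cache."""
    return "models--" + repo_id.replace("/", "--")


# Example: full path of a model folder on an external SSD.
model_dir = hub_cache_dir({"HF_HOME": "/Volumes/ExternalSSD/models"}) / repo_cache_dirname(
    "mlx-community/Phi-3-mini-4k-instruct-4bit"
)
```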
Model Name Expansion
Short names are automatically expanded for MLX models:
- Phi-3-mini-4k-instruct-4bit → mlx-community/Phi-3-mini-4k-instruct-4bit
- Models already containing / are used as-is
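The two rules above can be sketched in a few lines. This is an illustration of the stated behavior, not the tool's actual code. The function name `expand_model_name` and the constant `DEFAULT_ORG` are hypothetical, and any fuzzy matching of partial names is left out.

```python
DEFAULT_ORG = "mlx-community"  # assumed default namespace, per the example above


def expand_model_name(name):
    """Prefix short names with the default org; leave full repo ids alone."""
    if "/" in name:
        return name
    return f"{DEFAULT_ORG}/{name}"


expanded = expand_model_name("Phi-3-mini-4k-instruct-4bit")
```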
Advanced Usage
Generation Parameters
# Creative writing (high temperature, diverse output)
mlxk run Mistral-7B "Write a story" --temperature 0.9 --top-p 0.95
# Precise tasks (low temperature, focused output)
mlxk run Phi-3-mini "Extract key points" --temperature 0.3 --top-p 0.9
# Long-form generation
mlxk run Mixtral-8x7B "Explain quantum computing" --max-tokens 2000
# Reduce repetition
mlxk run model "prompt" --repetition-penalty 1.2
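To build intuition for why a low temperature gives focused output and a high one gives diverse output, here is a small, generic sketch of temperature scaling followed by top-p (nucleus) filtering on a toy set of logits. It illustrates the general technique only. It is not MLX Knife's sampler, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized survivors


toy_logits = [2.0, 1.0, 0.0]
focused = top_p_filter(softmax_with_temperature(toy_logits, 0.3), 0.9)
diverse = top_p_filter(softmax_with_temperature(toy_logits, 0.9), 0.95)
```

With these toy logits, the low-temperature setting leaves a single candidate token after filtering, while the high-temperature setting keeps all three in play.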
Working with Specific Commits
# Use specific model version
mlxk show model@commit_hash
mlxk run model@commit_hash "prompt"
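The `model@commit_hash` form can be split into its two parts as sketched below. The function name `parse_model_spec` is hypothetical, and the snippet only illustrates the syntax shown above, not how MLX Knife parses it internally.

```python
def parse_model_spec(spec):
    """Split 'name@commit' into (name, commit); commit is None if absent."""
    name, separator, commit = spec.partition("@")
    return name, (commit if separator and commit else None)


name, commit = parse_model_spec("model@commit_hash")
```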
Non-MLX Model Handling
The tool automatically detects framework compatibility:
# Attempting to run PyTorch model
mlxk run bert-base-uncased
# Error: Model bert-base-uncased is not MLX-compatible (Framework: PyTorch)!
# Use MLX-Community models: https://huggingface.co/mlx-community
Testing
MLX Knife includes comprehensive test coverage with 86/86 tests passing across all supported Python versions.
Verification Status
✅ All tests verified on Python 3.9-3.13
✅ Real MLX model execution testing (Phi-3-mini-4k-instruct-4bit)
✅ Full MLX Knife functionality coverage
✅ Code quality standards maintained
# Quick test run
pip install -e ".[test]"
pytest
# Code quality check
pip install -e ".[dev]"
ruff check mlx_knife/ && mypy mlx_knife/
# Multi-Python verification (requires multiple Python versions)
./test-multi-python.sh
For detailed testing information, development workflows, and multi-Python version testing, see TESTING.md.
Technical Details
Token Decoding
MLX Knife uses context-aware decoding to handle tokenizers that encode spaces as separate tokens:
# Sliding window approach maintains context for proper spacing
window_tokens = generated_tokens[-10:] # Last 10 tokens
window_text = tokenizer.decode(window_tokens)
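The spacing problem is easiest to see with a toy tokenizer. The sketch below is not MLX Knife's implementation. It assumes a SentencePiece-like tokenizer (the hypothetical `ToyTokenizer`) that marks word starts with `▁` and strips the leading space of whatever sequence it decodes. Decoding tokens one at a time then loses the spaces, while decoding a window with and without the new token and emitting the difference keeps them.

```python
class ToyTokenizer:
    """Stand-in tokenizer: '▁' marks a word start, leading space is stripped."""

    def __init__(self, vocab):
        self.vocab = vocab

    def decode(self, token_ids):
        text = "".join(self.vocab[i] for i in token_ids).replace("\u2581", " ")
        return text.lstrip(" ")


def stream_text(tokenizer, token_ids, window=10):
    """Yield the text contributed by each new token, using a sliding window."""
    generated = []
    for token in token_ids:
        context = generated[-window:]          # last `window` tokens so far
        before = tokenizer.decode(context)
        after = tokenizer.decode(context + [token])
        generated.append(token)
        yield after[len(before):]              # only the newly added text


toy = ToyTokenizer(["\u2581Hello", "\u2581world", "!"])
ids = [0, 1, 2]
naive = "".join(toy.decode([t]) for t in ids)   # per-token decoding
windowed = "".join(stream_text(toy, ids))       # context-aware decoding
```

Here `naive` loses the space between the two words, while `windowed` reproduces the full sentence.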
Stop Token Detection
Stop tokens are dynamically extracted from each model's tokenizer:
- Primary: tokenizer.eos_token
- Secondary: tokenizer.pad_token (if different)
- Additional: Special tokens containing 'end', 'stop', or 'eot'
- Common tokens verified as single-token entities
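A rough sketch of the first three rules follows. It uses a stand-in object in place of a real tokenizer and relies on the `eos_token`, `pad_token`, and `all_special_tokens` attributes that Hugging Face tokenizers expose. The function name `collect_stop_tokens` is hypothetical, and the single-token verification step from the last bullet is not covered here.

```python
from types import SimpleNamespace

STOP_KEYWORDS = ("end", "stop", "eot")


def collect_stop_tokens(tokenizer):
    """Gather stop tokens in priority order, without duplicates."""
    stops = []

    def add(token):
        if token and token not in stops:
            stops.append(token)

    add(getattr(tokenizer, "eos_token", None))   # primary
    add(getattr(tokenizer, "pad_token", None))   # secondary, only if different
    for token in getattr(tokenizer, "all_special_tokens", []):
        if any(keyword in token.lower() for keyword in STOP_KEYWORDS):
            add(token)                           # additional keyword matches
    return stops


# Stand-in for a real tokenizer, for illustration only.
fake_tokenizer = SimpleNamespace(
    eos_token="</s>",
    pad_token="</s>",
    all_special_tokens=["<s>", "</s>", "<|eot_id|>", "<|end|>"],
)
stop_tokens = collect_stop_tokens(fake_tokenizer)
```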
Memory Management
- RAII Pattern: Context manager ensures automatic resource cleanup
- Exception-Safe: Model cleanup guaranteed even on errors
- Baseline Tracking: Memory captured before model loading
- Real-time Monitoring: GPU memory tracking via mlx.core.get_active_memory()
- Memory Statistics: Detailed usage displayed after generation
- Leak Prevention: Automatic mx.clear_cache() and garbage collection
# Context manager pattern (automatic cleanup)
with MLXRunner(model_path) as runner:
    response = runner.generate_batch(prompt)
# Model automatically cleaned up here
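The cleanup guarantee comes from Python's context manager protocol: `__exit__` runs whether the block finishes normally or raises. The sketch below shows the pattern with a hypothetical `ManagedRunner` class and a placeholder in place of real model loading. It is not the actual `MLXRunner`, and the MLX cache clearing mentioned above is only indicated in a comment.

```python
import gc


class ManagedRunner:
    """Toy runner that releases its model on exit, even after an error."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None

    def __enter__(self):
        # Placeholder for real model loading.
        self.model = {"path": self.model_path}
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.model = None   # drop the reference to the weights
        gc.collect()        # a real runner might also clear the MLX cache here
        return False        # do not swallow exceptions

    def generate_batch(self, prompt):
        if self.model is None:
            raise RuntimeError("model is not loaded")
        return f"echo: {prompt}"


# Cleanup still happens when the body raises.
leaked_runner = None
try:
    with ManagedRunner("/path/to/model") as runner:
        leaked_runner = runner
        raise ValueError("simulated failure during generation")
except ValueError:
    pass
released = leaked_runner.model is None
```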
Troubleshooting
Model Not Found
# If model isn't found, try full path
mlxk pull mlx-community/Model-Name-4bit
# List available models
mlxk list --all
Performance Issues
- Ensure sufficient RAM for model size
- Close other applications to free memory
- Use smaller quantized models (4-bit recommended)
Streaming Issues
- Some models may have spacing issues - this is handled automatically
- Use --no-stream for batch output if needed
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Quick Start:
- Fork and clone the repository
- Install with development tools: pip install -e ".[dev,test]"
- Make your changes and add tests
- Run tests: pytest
- Check code style: ruff check mlx_knife/ --fix
- Submit a pull request
We prioritize compatibility with Python 3.9 (native macOS) but welcome contributions tested on any version 3.9+.
Security
For security concerns, please see SECURITY.md or contact us at broke@gmx.eu.
MLX Knife runs entirely locally - no data is sent to external servers except when downloading models from HuggingFace.
License
MIT License - see LICENSE file for details
Copyright (c) 2025 The BROKE team 🦫
Acknowledgments
- Built for Apple Silicon using the MLX framework
- Models hosted by the MLX Community on HuggingFace
- Inspired by ollama's user experience
Made with ❤️ by The BROKE team
Version 1.0-rc1 | August 2025
