
MLX Knife

A lightweight, ollama-like CLI for managing and running MLX models on Apple Silicon. Designed for personal, local use - perfect for individual developers and researchers working with MLX models.

Current Version: 1.0-rc1 (August 2025)

Features

Core Functionality

  • List & Manage Models: Browse your HuggingFace cache with MLX-specific filtering
  • Model Information: Detailed model metadata including quantization info
  • Download Models: Pull models from HuggingFace with progress tracking
  • Run Models: Native MLX execution with streaming and chat modes
  • Health Checks: Verify model integrity and completeness
  • Cache Management: Clean up and organize your model storage

Local Server & Web Interface

  • OpenAI-Compatible API: Local REST API with /v1/chat/completions, /v1/completions, /v1/models
  • Web Chat Interface: Built-in HTML chat interface with markdown rendering
  • Single-User Design: Optimized for personal use, not multi-user production environments
  • Conversation Context: Full chat history maintained for follow-up questions
  • Streaming Support: Real-time token streaming via Server-Sent Events
  • Configurable Limits: Set default max tokens via --max-tokens parameter
  • Model Hot-Swapping: Switch between models per conversation
  • Tool Integration: Compatible with OpenAI-compatible clients (Cursor IDE, etc.)
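
As a rough illustration of the streaming feature, the sketch below parses Server-Sent Event lines into text fragments. It assumes the server emits OpenAI-style chunks (`choices[0].delta.content`, terminated by `data: [DONE]`); that payload shape is the OpenAI convention and is not confirmed by this README. The function name `iter_stream_text` and the sample events are made up for the example.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style Server-Sent Event lines.

    Assumption: each event is a line of the form `data: <json>` and the
    stream ends with `data: [DONE]`.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text


# Hand-written sample events (not captured from a real server).
sample_events = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
streamed = "".join(iter_stream_text(sample_events))
```

In a real client the `lines` argument would be the decoded lines of the HTTP response body rather than a list.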

Run Experience

  • Direct MLX Integration: Models load and run natively without subprocess overhead
  • Real-time Streaming: Watch tokens generate with proper spacing and formatting
  • Interactive Chat: Full conversational mode with history tracking
  • Memory Insights: See GPU memory usage after model loading and generation
  • Dynamic Stop Tokens: Automatic detection and filtering of model-specific stop tokens
  • Customizable Generation: Control temperature, max_tokens, top_p, and repetition penalty
  • RAII Memory Management: Context manager pattern ensures automatic cleanup and no memory leaks
  • Exception-Safe: Robust error handling with guaranteed resource cleanup

Installation

Requirements

  • macOS with Apple Silicon (M1/M2/M3)
  • Python 3.9+ (the version bundled with macOS, or newer)
  • 8 GB+ RAM recommended, plus enough additional RAM for the model you run

Python Compatibility

MLX Knife has been comprehensively tested and verified on:

Python 3.9.6 (native macOS) - Primary target
Python 3.10-3.13 - Fully compatible

All versions include full MLX model execution testing with real models.

Install from Source

# Clone the repository
git clone https://github.com/mzau/mlx-knife.git
cd mlx-knife

# Install in development mode
pip install -e .

# Or install normally
pip install .

# Install with development tools (ruff, mypy, tests)
pip install -e ".[dev,test]"

Install Dependencies Only

pip install -r requirements.txt

Quick Start

CLI Usage

# List all MLX models in your cache
mlxk list

# Show detailed info about a model
mlxk show Phi-3-mini-4k-instruct-4bit

# Download a new model
mlxk pull mlx-community/Mistral-7B-Instruct-v0.3-4bit

# Run a model with a prompt
mlxk run Phi-3-mini "What is the capital of France?"

# Start interactive chat
mlxk run Phi-3-mini

# Check model health
mlxk health

Web Chat Interface

MLX Knife includes a built-in web interface for easy model interaction:

# Start the OpenAI-compatible API server
mlxk server --port 8000 --max-tokens 4000

# Open web chat interface in your browser
open simple_chat.html

Features:

  • No installation required - Pure HTML/CSS/JS
  • Real-time streaming - Watch tokens appear as they're generated
  • Model selection - Choose any MLX model from your cache
  • Conversation history - Full context for follow-up questions
  • Markdown rendering - Proper formatting for code, lists, tables
  • Mobile-friendly - Responsive design works on all devices

Local API Server Integration

The MLX Knife server provides OpenAI-compatible endpoints for local development and personal use:

# Start local server (single-user, no authentication)
mlxk server --host 127.0.0.1 --port 8000

# Test with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-3-mini-4k-instruct-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'

# Integration with development tools (community-tested):
# - Cursor IDE: Set API URL to http://localhost:8000/v1
# - LibreChat: Configure as custom OpenAI endpoint  
# - Open WebUI: Add as local OpenAI-compatible API
# - SillyTavern: Add as OpenAI API with custom URL

Note: Tool integrations are community-tested. Some tools may require specific configuration or have compatibility limitations. Please report issues via GitHub.
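
The same request as the curl example can be made from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `send_chat` are invented for the example, and `send_chat` assumes the response follows the OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=120):
    """Send the request and return the assistant's reply text.

    Requires a running `mlxk server`. Assumes an OpenAI-style response body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Follow-up questions work by resending the full message history each time.
history = [{"role": "user", "content": "Hello!"}]
request = build_chat_request(
    "http://localhost:8000", "Phi-3-mini-4k-instruct-4bit", history
)
# reply = send_chat(request)  # uncomment once the server is running
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.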

Command Reference

Available Commands

list - Browse Models

mlxk list                    # Show MLX models only (short names)
mlxk list --verbose          # Show MLX models with full paths
mlxk list --all              # Show all models with framework info
mlxk list --all --verbose    # All models with full paths
mlxk list --health           # Include health status
mlxk list Phi-3              # Filter by model name
mlxk list --verbose Phi-3    # Show detailed info (same as show)

show - Model Details

mlxk show <model>            # Display model information
mlxk show <model> --files    # Include file listing
mlxk show <model> --config   # Show config.json content

pull - Download Models

mlxk pull <model>            # Download from HuggingFace
mlxk pull <org>/<model>      # Full model path

run - Execute Models

mlxk run <model> "prompt"              # Single prompt (minimal output)
mlxk run <model> "prompt" --verbose    # Show loading, memory, and stats
mlxk run <model>                       # Interactive chat
mlxk run <model> "prompt" --no-stream  # Batch output
mlxk run <model> --max-tokens 1000     # Custom length
mlxk run <model> --temperature 0.9     # Higher creativity
mlxk run <model> --no-chat-template    # Raw completion mode

rm - Remove Models

mlxk rm <model>              # Delete a model
mlxk rm <model> --force      # Skip confirmation

health - Check Integrity

mlxk health                  # Check all models
mlxk health <model>          # Check specific model

server - Start API Server

mlxk server                           # Start on localhost:8000
mlxk server --port 8001               # Custom port
mlxk server --host 0.0.0.0 --port 8000  # Allow external access
mlxk server --max-tokens 4000         # Set default max tokens (default: 2000)
mlxk server --reload                  # Development mode with auto-reload

Command Aliases

After installation, these commands are equivalent:

  • mlxk (recommended)
  • mlx-knife
  • mlx_knife

Project Structure

mlx_knife/
├── __init__.py                    # Package metadata and version
├── cli.py                         # Command-line interface and argument parsing
├── cache_utils.py                 # Core model management functionality
├── mlx_runner.py                  # Native MLX model execution
├── server.py                      # OpenAI-compatible API server with FastAPI
├── hf_download.py                 # HuggingFace download integration
├── throttled_download_worker.py   # Background download worker
├── requirements.txt               # Python dependencies
├── pyproject.toml                 # Package configuration
├── simple_chat.html               # Built-in web chat interface
└── README.md                      # This file

Module Overview

  • cli.py: Entry point handling command parsing and dispatch
  • cache_utils.py: Model discovery, metadata extraction, and cache operations
  • mlx_runner.py: MLX model loading, token generation, and streaming
  • server.py: FastAPI-based REST API server with OpenAI compatibility
  • simple_chat.html: Standalone web chat interface for immediate use
  • hf_download.py: Robust downloading with progress tracking
  • throttled_download_worker.py: Prevents network overload during downloads

Configuration

Cache Location

By default, models are stored in ~/.cache/huggingface/hub. Configure with:

# Set custom cache location
export HF_HOME="/path/to/your/cache"

# Example: External SSD
export HF_HOME="/Volumes/ExternalSSD/models"
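
To see where a model would land on disk, the sketch below resolves the cache directory the way the standard Hugging Face layout does: the hub cache lives in `hub` under `HF_HOME`, and each repository gets a `models--<org>--<name>` folder. That MLX Knife resolves paths exactly this way is an assumption; the function names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Return the hub cache directory, honoring HF_HOME when it is set."""
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME")
    root = Path(hf_home) if hf_home else Path.home() / ".cache" / "huggingface"
    return root / "hub"


def model_cache_dir(repo_id, env=None):
    """Return the folder for one repository, e.g. models--org--name."""
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


default_dir = hub_cache_dir({})
external_dir = model_cache_dir(
    "mlx-community/Phi-3-mini-4k-instruct-4bit",
    {"HF_HOME": "/Volumes/ExternalSSD/models"},
)
```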

Model Name Expansion

Short names are automatically expanded for MLX models:

  • Phi-3-mini-4k-instruct-4bit → mlx-community/Phi-3-mini-4k-instruct-4bit
  • Models already containing / are used as-is
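
The two rules above can be sketched in a few lines. This is an illustration of the documented behavior only; `expand_model_name` is a hypothetical name, not necessarily the function MLX Knife uses, and any fuzzy matching of partial names is left out.

```python
def expand_model_name(name, default_org="mlx-community"):
    """Prefix short names with the default organization.

    Names that already contain "/" are returned unchanged.
    """
    if "/" in name:
        return name
    return f"{default_org}/{name}"


short = expand_model_name("Phi-3-mini-4k-instruct-4bit")
full = expand_model_name("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
```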

Advanced Usage

Generation Parameters

# Creative writing (high temperature, diverse output)
mlxk run Mistral-7B "Write a story" --temperature 0.9 --top-p 0.95

# Precise tasks (low temperature, focused output)
mlxk run Phi-3-mini "Extract key points" --temperature 0.3 --top-p 0.9

# Long-form generation
mlxk run Mixtral-8x7B "Explain quantum computing" --max-tokens 2000

# Reduce repetition
mlxk run model "prompt" --repetition-penalty 1.2

Working with Specific Commits

# Use specific model version
mlxk show model@commit_hash
mlxk run model@commit_hash "prompt"
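
A `model@commit_hash` argument can be split into its two parts as shown below. This is a sketch of the syntax only; the helper name `split_model_spec` is hypothetical and how MLX Knife parses the argument internally is not documented here.

```python
def split_model_spec(spec):
    """Split "model@commit_hash" into (model, commit); commit is None if absent."""
    name, _, commit = spec.partition("@")
    return name, (commit or None)


pinned = split_model_spec("model@commit_hash")
unpinned = split_model_spec("model")
```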

Non-MLX Model Handling

The tool automatically detects framework compatibility:

# Attempting to run PyTorch model
mlxk run bert-base-uncased
# Error: Model bert-base-uncased is not MLX-compatible (Framework: PyTorch)!
# Use MLX-Community models: https://huggingface.co/mlx-community

Testing

MLX Knife includes comprehensive test coverage with 86/86 tests passing across all supported Python versions.

Quick Start

Prerequisites:

  • Apple Silicon Mac (M1/M2/M3)
  • Python 3.9+
  • At least one MLX model: mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit

Run Tests:

pip install -e ".[test]"
pytest

Why Local Testing?

MLX requires Apple Silicon hardware and real models (4GB+) for testing. This is standard for MLX projects and ensures tests reflect real-world usage.

For detailed testing documentation, development workflows, and multi-Python verification, see TESTING.md.

Part of the BROKE Ecosystem 🦫

MLX Knife is the first component of BROKE Cluster, our research project for intelligent LLM routing across heterogeneous Apple Silicon networks.

  • Use MLX Knife: For single Mac setups (available now)
  • Use BROKE Cluster: For multi-Mac environments (in development)

Technical Details

Token Decoding

MLX Knife uses context-aware decoding to handle tokenizers that encode spaces as separate tokens:

# Sliding window approach maintains context for proper spacing
window_tokens = generated_tokens[-10:]  # Last 10 tokens
window_text = tokenizer.decode(window_tokens)
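
The following self-contained sketch shows why the window matters. It uses a toy decoder (an assumption standing in for `tokenizer.decode`) that drops the leading space, as SentencePiece-style decoders commonly do. Decoding each token alone loses the spaces, while decoding the window with and without the newest token and emitting the difference keeps them. This illustrates the idea and is not MLX Knife's actual implementation.

```python
def toy_decode(pieces):
    """Toy stand-in for tokenizer.decode.

    "\u2581" marks a word-initial space; the leading space of the result is dropped.
    """
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")


def stream_with_window(tokens, decode, window=10):
    """Yield the text contributed by each new token, using a sliding window."""
    generated = []
    for token in tokens:
        generated.append(token)
        window_tokens = generated[-window:]
        with_new = decode(window_tokens)
        without_new = decode(window_tokens[:-1])
        # Only the suffix added by the newest token is emitted.
        yield with_new[len(without_new):]


pieces = ["\u2581Hello", "\u2581world", "!"]
naive = "".join(toy_decode([p]) for p in pieces)
windowed = "".join(stream_with_window(pieces, toy_decode))
```

Here `naive` runs the words together, while `windowed` preserves the space between them.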

Stop Token Detection

Stop tokens are dynamically extracted from each model's tokenizer:

  • Primary: tokenizer.eos_token
  • Secondary: tokenizer.pad_token (if different)
  • Additional: Special tokens containing 'end', 'stop', or 'eot'
  • Common tokens verified as single-token entities
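
The first three rules can be sketched against any tokenizer-like object. The example assumes the tokenizer exposes `eos_token`, `pad_token`, and `all_special_tokens` (attribute names used by Hugging Face tokenizers); `FakeTokenizer` and `collect_stop_tokens` are hypothetical names for the example, and the single-token verification step is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTokenizer:
    """Minimal stand-in for a real tokenizer, for illustration only."""
    eos_token: str = "</s>"
    pad_token: str = "<pad>"
    all_special_tokens: list = field(default_factory=list)


def collect_stop_tokens(tokenizer, keywords=("end", "stop", "eot")):
    """Gather stop tokens: EOS, PAD (if different), then keyword matches."""
    stops = []

    def add(token):
        if token and token not in stops:
            stops.append(token)

    add(getattr(tokenizer, "eos_token", None))
    add(getattr(tokenizer, "pad_token", None))
    for token in getattr(tokenizer, "all_special_tokens", []):
        if any(word in token.lower() for word in keywords):
            add(token)
    return stops


tok = FakeTokenizer(
    eos_token="<|end_of_text|>",
    pad_token="<|end_of_text|>",
    all_special_tokens=["<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>"],
)
stop_tokens = collect_stop_tokens(tok)
```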

Memory Management

  • RAII Pattern: Context manager ensures automatic resource cleanup
  • Exception-Safe: Model cleanup guaranteed even on errors
  • Baseline Tracking: Memory captured before model loading
  • Real-time Monitoring: GPU memory tracking via mlx.core.get_active_memory()
  • Memory Statistics: Detailed usage displayed after generation
  • Leak Prevention: Automatic mx.clear_cache() and garbage collection

# Context manager pattern (automatic cleanup)
with MLXRunner(model_path) as runner:
    response = runner.generate_batch(prompt)
# Model automatically cleaned up here
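
The exception-safety guarantee comes from Python's context manager protocol. The sketch below demonstrates it without MLX: `ManagedRunner` is a hypothetical stand-in for `MLXRunner` that holds a placeholder object instead of a real model, and its `__exit__` runs even when the body raises.

```python
import gc


class ManagedRunner:
    """Hypothetical stand-in for MLXRunner; holds a fake "model" object."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None

    def __enter__(self):
        # Placeholder for a real model load.
        self.model = {"path": self.model_path}
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Drop the reference so the memory can be reclaimed, then collect.
        self.model = None
        gc.collect()
        return False  # do not swallow exceptions from the with-body


runner = ManagedRunner("some/model")
try:
    with runner:
        raise RuntimeError("generation failed")
except RuntimeError:
    pass

cleaned_up = runner.model is None
```

Returning `False` from `__exit__` lets the original error propagate to the caller after cleanup has run.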

Troubleshooting

Model Not Found

# If model isn't found, try full path
mlxk pull mlx-community/Model-Name-4bit

# List available models
mlxk list --all

Performance Issues

  • Ensure sufficient RAM for model size
  • Close other applications to free memory
  • Use smaller quantized models (4-bit recommended)

Streaming Issues

  • Some models may have spacing issues - this is handled automatically
  • Use --no-stream for batch output if needed

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Quick Start:

  1. Fork and clone the repository
  2. Install with development tools: pip install -e ".[dev,test]"
  3. Make your changes and add tests
  4. Run tests locally on Apple Silicon: pytest
  5. Check code style: ruff check mlx_knife/ --fix
  6. Submit a pull request

We prioritize compatibility with Python 3.9 (native macOS) but welcome contributions tested on any version 3.9+.

Security

For security concerns, please see SECURITY.md or contact us at broke@gmx.eu.

MLX Knife runs entirely locally - no data is sent to external servers except when downloading models from HuggingFace.

License

MIT License - see LICENSE file for details

Copyright (c) 2025 The BROKE team 🦫

Acknowledgments


Made with ❤️ by The BROKE team
Version 1.0-rc1 | August 2025
