# Check specific model
```
#### `server` - Start API Server
```bash
mlxk server # Start on localhost:8000
mlxk server --port 8001 # Custom port
mlxk server --host 0.0.0.0 --port 8000 # Allow external access
mlxk server --max-tokens 4000 # Set default max tokens (default: 2000)
mlxk server --reload # Development mode with auto-reload
```
### Command Aliases
After installation, these commands are equivalent:
- `mlxk` (recommended)
- `mlx-knife`
- `mlx_knife`
## Project Structure
```
mlx_knife/
├── __init__.py # Package metadata and version
├── cli.py # Command-line interface and argument parsing
├── cache_utils.py # Core model management functionality
├── mlx_runner.py # Native MLX model execution
├── server.py # OpenAI-compatible API server with FastAPI
├── hf_download.py # HuggingFace download integration
├── throttled_download_worker.py # Background download worker
├── requirements.txt # Python dependencies
├── pyproject.toml # Package configuration
├── simple_chat.html # Built-in web chat interface
└── README.md # This file
```
### Module Overview
- **`cli.py`**: Entry point handling command parsing and dispatch
- **`cache_utils.py`**: Model discovery, metadata extraction, and cache operations
- **`mlx_runner.py`**: MLX model loading, token generation, and streaming
- **`server.py`**: FastAPI-based REST API server with OpenAI compatibility
- **`simple_chat.html`**: Standalone web chat interface for immediate use
- **`hf_download.py`**: Robust downloading with progress tracking
- **`throttled_download_worker.py`**: Prevents network overload during downloads
## Configuration
### Cache Location
By default, models are stored in `~/.cache/huggingface/hub`. Configure with:
```bash
# Set custom cache location
export HF_HOME="/path/to/your/cache"
# Example: External SSD
export HF_HOME="/Volumes/ExternalSSD/models"
```
### Model Name Expansion
Short names are automatically expanded for MLX models:
- `Phi-3-mini-4k-instruct-4bit` → `mlx-community/Phi-3-mini-4k-instruct-4bit`
- Models already containing `/` are used as-is
## Advanced Usage
### Generation Parameters
```bash
# Creative writing (high temperature, diverse output)
mlxk run Mistral-7B "Write a story" --temperature 0.9 --top-p 0.95
# Precise tasks (low temperature, focused output)
mlxk run Phi-3-mini "Extract key points" --temperature 0.3 --top-p 0.9
# Long-form generation
mlxk run Mixtral-8x7B "Explain quantum computing" --max-tokens 2000
# Reduce repetition
mlxk run model "prompt" --repetition-penalty 1.2
```
### Working with Specific Commits
```bash
# Use specific model version
mlxk show model@commit_hash
mlxk run model@commit_hash "prompt"
```
### Non-MLX Model Handling
The tool automatically detects framework compatibility:
```bash
# Attempting to run PyTorch model
mlxk run bert-base-uncased
# Error: Model bert-base-uncased is not MLX-compatible (Framework: PyTorch)!
# Use MLX-Community models: https://huggingface.co/mlx-community
```
## Testing
MLX Knife includes comprehensive test coverage across all supported Python versions.
### Quick Start
**Prerequisites:**
- Apple Silicon Mac (M1/M2/M3)
- Python 3.9+
- At least one MLX model: `mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit`
**Run Tests:**
```bash
pip install -e ".[test]"
pytest
```
### Why Local Testing?
MLX requires Apple Silicon hardware and real models (4GB+) for testing. This is standard for MLX projects and ensures tests reflect real-world usage.
For detailed testing documentation, development workflows, and multi-Python verification, see **[TESTING.md](TESTING.md)**.
## Part of the BROKE Ecosystem 🦫
MLX Knife is the first component of [BROKE Cluster](https://github.com/mzau/broke-cluster),
our research project for intelligent LLM routing across heterogeneous Apple Silicon networks.
- **Use MLX Knife**: For single Mac setups (available now)
- **Use BROKE Cluster**: For multi-Mac environments (in development)
## Technical Details
### Token Decoding
MLX Knife uses context-aware decoding to handle tokenizers that encode spaces as separate tokens:
```python
# Sliding window approach maintains context for proper spacing
window_tokens = generated_tokens[-10:] # Last 10 tokens
window_text = tokenizer.decode(window_tokens)
```
### Stop Token Detection
Stop tokens are dynamically extracted from each model's tokenizer:
- Primary: `tokenizer.eos_token`
- Secondary: `tokenizer.pad_token` (if different)
- Additional: Special tokens containing 'end', 'stop', or 'eot'
- Common tokens verified as single-token entities
### Memory Management
- **Context Managers**: Automatic resource cleanup with Python context managers
- **Exception-Safe**: Model cleanup guaranteed even on errors
- **Baseline Tracking**: Memory captured before model loading
- **Real-time Monitoring**: GPU memory tracking via `mlx.core.get_active_memory()`
- **Memory Statistics**: Detailed usage displayed after generation
- **Leak Prevention**: Automatic `mx.clear_cache()` and garbage collection
```python
# Context manager pattern (automatic cleanup)
with MLXRunner(model_path) as runner:
response = runner.generate_batch(prompt)
# Model automatically cleaned up here
```
## Troubleshooting
### Model Not Found
```bash
# If model isn't found, try full path
mlxk pull mlx-community/Model-Name-4bit
# List available models
mlxk list --all
```
### Performance Issues
- Ensure sufficient RAM for model size
- Close other applications to free memory
- Use smaller quantized models (4-bit recommended)
### Streaming Issues
- Some models may have spacing issues - this is handled automatically
- Use `--no-stream` for batch output if needed
## Contributing
Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
**Quick Start:**
1. Fork and clone the repository
2. Install with development tools: `pip install -e ".[dev,test]"`
3. Make your changes and add tests
4. Run tests locally on Apple Silicon: `pytest`
5. Check code style: `ruff check mlx_knife/ --fix`
6. Submit a pull request
We prioritize compatibility with Python 3.9 (native macOS) but welcome contributions tested on any version 3.9+.
## Security
For security concerns, please see [SECURITY.md](SECURITY.md) or contact us at broke@gmx.eu.
MLX Knife runs entirely locally - no data is sent to external servers except when downloading models from HuggingFace.
## License
MIT License - see [LICENSE](LICENSE) file for details
Copyright (c) 2025 The BROKE team 🦫
## Acknowledgments
- Built for Apple Silicon using the [MLX framework](https://github.com/ml-explore/mlx)
- Models hosted by the [MLX Community](https://huggingface.co/mlx-community) on HuggingFace
- Inspired by [ollama](https://ollama.ai)'s user experience
---
Made with ❤️ by The BROKE team 
Version 1.0-rc3 | August 2025
🔮 Next: BROKE Cluster for multi-node deployments