MLX Knife Testing Guide
Current Status
✅ 150/150 tests passing (August 2025) - STABLE RELEASE 🚀
✅ Apple Silicon verified (M1/M2/M3)
✅ Python 3.9-3.13 compatible
✅ Production ready - comprehensive testing with real model execution
✅ Isolated test system - user cache stays pristine with temp cache isolation
✅ 3-category test strategy - optimized for performance and safety
Quick Start
# Install with test dependencies
pip install -e ".[test]"
# Download test model (optional - most tests use isolated cache)
mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit
# Run all tests
pytest
# Fast unit tests only
pytest tests/unit/
# Before committing
ruff check mlx_knife/ --fix && mypy mlx_knife/ && pytest
Why Local Testing?
MLX Knife requires Apple Silicon hardware and real MLX models for comprehensive testing:
- Hardware Requirement: MLX framework only runs on Apple Silicon (M1/M2/M3)
- Model Requirement: Server tests use actual models (4GB+) for realistic validation; most other tests download a small (~12MB) test model
- Industry Standard: Local testing is normal for MLX projects
- Quality Assurance: Real hardware testing ensures actual functionality
This approach ensures our tests reflect real-world usage, not mocked behavior.
Test Structure
tests/
├── conftest.py # Shared fixtures and utilities
├── integration/ # System-level integration tests (78 tests)
│ ├── test_core_functionality.py # Basic CLI operations (isolated cache)
│ ├── test_health_checks.py # Model corruption detection (isolated cache)
│ ├── test_lock_cleanup_bug.py # Issue #23: Lock cleanup (isolated cache)
│ ├── test_process_lifecycle.py # Process management (isolated cache)
│ ├── test_real_model_lifecycle.py # Full model lifecycle (isolated cache)
│ ├── test_run_command_advanced.py # Run command edge cases (isolated cache)
│ ├── test_server_functionality.py # Server lifecycle tests
│ ├── test_end_token_issue.py # Issue #20: End-token filtering (@server)
│ ├── test_issue_14.py # Issue #14: Chat self-conversation (@server)
│ └── test_issue_15_16.py # Issues #15/#16: Dynamic token limits (@server)
└── unit/ # Module-level unit tests (72 tests)
├── test_cache_utils.py # Cache management & Issue #21/#23 tests
├── test_cli.py # CLI argument parsing
└── test_mlx_runner_memory.py # Memory management tests
3-Category Test Strategy (MLX Knife 1.1.0+)
MLX Knife uses a 3-category test strategy to balance test isolation, performance, and user cache protection:
🏠 CATEGORY 1: ISOLATED CACHE (Most Tests)
✅ User cache stays pristine - Tests use temporary isolated caches with automatic cleanup
Implemented Tests (78 tests):
- ✅ test_real_model_lifecycle.py - Full model lifecycle with tiny-random-gpt2 (~12MB download)
- ✅ test_core_functionality.py - Basic CLI operations with patch_model_cache isolation
- ✅ test_process_lifecycle.py - Process management with isolated cache + MODEL_CACHE patching
- ✅ test_run_command_advanced.py - Run command edge cases with mock_model_cache in isolation
- ✅ test_lock_cleanup_bug.py - Lock cleanup testing with temporary MODEL_CACHE override
- ✅ test_health_checks.py - Mock corruption testing with isolated temp_cache_dir
Technical Pattern:
@pytest.mark.usefixtures("temp_cache_dir")
class TestBasicLifecycle:
    def test_something(self, temp_cache_dir, patch_model_cache):
        with patch_model_cache(temp_cache_dir / "hub"):
            # Test operates in complete isolation
            # User cache never touched, automatic cleanup
            ...
Benefits:
- ✅ Clean User Cache: No test artifacts or broken models ever
- ✅ Parallel Testing: No cache conflicts between test runs
- ✅ Reproducible: No dependency on existing models in user cache
- ✅ Fast CI: Small models (12MB vs 4GB) for most tests
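The isolation pattern above could be backed by helpers along the following lines. This is a minimal sketch, not the project's actual conftest.py: the `cache_utils` namespace is a hypothetical stand-in for whichever module holds `MODEL_CACHE`, and the helpers are written as plain context managers using only the standard library (in a real suite they would presumably be exposed as pytest fixtures).

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the module that stores the cache location.
cache_utils = SimpleNamespace(
    MODEL_CACHE=Path.home() / ".cache" / "huggingface" / "hub"
)


@contextmanager
def temp_cache_dir():
    """Yield a throwaway cache root that is deleted when the block exits."""
    with tempfile.TemporaryDirectory(prefix="mlxk_test_") as tmp:
        root = Path(tmp)
        (root / "hub").mkdir()
        yield root


@contextmanager
def patch_model_cache(hub_path):
    """Point MODEL_CACHE at hub_path, restoring the old value afterwards."""
    with mock.patch.object(cache_utils, "MODEL_CACHE", Path(hub_path)):
        yield Path(hub_path)


def demo():
    """Show that the user cache path is untouched after the test body runs."""
    original = cache_utils.MODEL_CACHE
    with temp_cache_dir() as root, patch_model_cache(root / "hub") as hub:
        inside = cache_utils.MODEL_CACHE
        (hub / "models--demo").mkdir()  # test artifacts land in the temp dir
    return original, inside, root.exists(), cache_utils.MODEL_CACHE
```

Because both helpers restore state in their exit paths, cleanup happens even when the test body raises, which is what keeps the user cache free of test artifacts.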
🏥 CATEGORY 2: USER CACHE (Framework Diversity)
📋 Reserved for future - Real model diversity that cannot be mocked
Future Framework Validation Tests:
- Multiple framework detection (MLX + PyTorch + Tokenizer-only models)
- Health check diversity testing with naturally corrupted models
- Cross-framework model compatibility validation
Currently: All health/framework tests use mock_model_cache and are Category 1 (isolated)
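To illustrate what a mocked model cache might contain, the sketch below builds a fake model directory in the Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<revision>/`) and optionally corrupts its config. The function names, the placeholder revision, and the single-file "health check" are assumptions made for illustration; the real `mock_model_cache` fixture and MLX Knife's health checks may work differently.

```python
import json
import tempfile
from pathlib import Path


def make_fake_model(hub_dir, repo_id, revision="0" * 40, healthy=True):
    """Create a minimal fake model snapshot under hub_dir (hypothetical helper)."""
    model_dir = Path(hub_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshot = model_dir / "snapshots" / revision
    snapshot.mkdir(parents=True)
    if healthy:
        (snapshot / "config.json").write_text(json.dumps({"model_type": "demo"}))
    else:
        # Simulate corruption with a truncated, unparseable config file.
        (snapshot / "config.json").write_text("{ not valid json")
    # Placeholder weights file; the contents are never loaded.
    (snapshot / "model.safetensors").write_bytes(b"\0" * 16)
    return snapshot


def config_is_readable(snapshot):
    """Very rough stand-in for one health check: can config.json be parsed?"""
    try:
        json.loads((Path(snapshot) / "config.json").read_text())
        return True
    except (OSError, json.JSONDecodeError):
        return False


def demo():
    """Build one healthy and one corrupted fake model and check both."""
    with tempfile.TemporaryDirectory() as tmp:
        good = make_fake_model(tmp, "test-org/good-model")
        bad = make_fake_model(tmp, "test-org/bad-model", healthy=False)
        return good.parent.parent.name, config_is_readable(good), config_is_readable(bad)
```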
🖥️ CATEGORY 3: SERVER CACHE (Performance Tests)
🔒 Large models, user cache expected - Marked with @pytest.mark.server
Server Tests (Excluded from default pytest):
- 🔒 test_issue_14.py - Chat self-conversation regression tests
- 🔒 test_issue_15_16.py - Dynamic token limit validation
- 🔒 test_end_token_issue.py - End-token filtering consistency
- 🔒 test_server_functionality.py - OpenAI API compliance (basic tests only)
Technical Pattern:
@pytest.mark.server  # Excluded from default pytest
def test_server_feature(mlx_server, model_name: str):
    # Uses real models in user cache
    # Requires significant RAM and time
    ...
Characteristics:
- 🔒 Not run by default - Must use pytest -m server
- 💾 RAM-aware - Auto-skip models exceeding available memory
- ⏱️ Longer execution - 20-40 minutes for full suite
- 🎯 Model diversity - Tests across different model sizes/architectures
Test Prerequisites
Required Setup
- Apple Silicon Mac (M1/M2/M3)
- Python 3.9 or newer
- Test dependencies installed:
pip install -e ".[test]"
That's it! Most tests (Category 1) use isolated caches and download small test models automatically (~12MB).
Optional Setup (Server Tests Only)
For server tests (@pytest.mark.server - excluded by default):
# Medium model for server testing
mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit
# Different architecture for variety
mlxk pull mlx-community/Mistral-7B-Instruct-v0.3-4bit
Note: Server tests are excluded from default pytest and require manual execution with pytest -m server.
Test Commands
Basic Test Execution
# All tests (recommended before commits)
pytest
# Only integration tests (system-level)
pytest tests/integration/
# Only unit tests (fast)
pytest tests/unit/
# Verbose output
pytest -v
# Show test coverage
pytest --cov=mlx_knife --cov-report=html
Specific Test Categories
# Process lifecycle tests (critical for production)
pytest tests/integration/test_process_lifecycle.py -v
# Health check robustness (model corruption detection)
pytest tests/integration/test_health_checks.py -v
# Core functionality (basic CLI commands)
pytest tests/integration/test_core_functionality.py -v
# Issue #20: End-token filtering consistency (new in 1.1.0-beta2)
pytest tests/integration/test_end_token_issue.py -m server -v
# Advanced run command tests
pytest tests/integration/test_run_command_advanced.py -v
# Server functionality tests
pytest tests/integration/test_server_functionality.py -v
# Lock cleanup bug tests (Issue #23 - new in 1.1.0-beta3)
pytest tests/integration/test_lock_cleanup_bug.py -v
Test Filtering
# Run only basic operations tests
pytest -k "TestBasicOperations" -v
# Server tests are excluded by default (marked with @pytest.mark.server)
# Run server tests manually (requires large models in user cache)
pytest -m server -v
# Skip server tests explicitly (default behavior)
pytest -m "not server" -v
# Run only process lifecycle tests
pytest -k "process_lifecycle or zombie" -v
# Run health check tests only
pytest -k "health" -v
Timeout and Performance
# Set custom timeout (default: 300s)
pytest --timeout=60
# Show slowest tests
pytest --durations=10
# Parallel execution (if pytest-xdist installed)
pytest -n auto
Server Tests (Advanced)
⚠️ Warning: Server tests require significant system resources and time.
# Run comprehensive Issue #20 server tests (48 tests, ~30 minutes)
pytest tests/integration/test_end_token_issue.py -m server -v
# All server-marked tests (includes above + server functionality)
pytest -m server -v
# Quick server functionality test only
pytest tests/integration/test_server_functionality.py -v
# Server tests are RAM-aware - automatically skip models that don't fit
Server Test Requirements:
- RAM: 8GB+ recommended (16GB+ for large models)
- Time: 20-40 minutes for full suite
- Models: Multiple 4-bit quantized models (1B-30B parameters)
- Coverage: Streaming vs non-streaming consistency, token limits, API compliance
Python Version Compatibility
Verification Results (August 2025)
✅ 150/150 tests passing - All standard tests validated on Apple Silicon with isolated cache system
| Python Version | Status | Tests Passing |
|---|---|---|
| 3.9.6 (macOS) | ✅ Verified | 150/150 |
| 3.10.x | ✅ Verified | 150/150 |
| 3.11.x | ✅ Verified | 150/150 |
| 3.12.x | ✅ Verified | 150/150 |
| 3.13.x | ✅ Verified | 150/150 |
All versions tested with isolated cache system. Real MLX execution verified separately with server/run commands.
Manual Multi-Python Testing
If you have multiple Python versions installed, you can verify compatibility:
# Run the multi-Python verification script
./test-multi-python.sh
# Or manually test specific versions
python3.9 -m venv test_39
source test_39/bin/activate
pip install -e ".[test]"
pytest
deactivate && rm -rf test_39
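Before looping over versions manually, it can help to see which interpreters are actually installed. The following is a small sketch that looks up `python3.9` through `python3.13` on `PATH`; it is not the contents of `test-multi-python.sh`, and the function name and the injectable `which` parameter are illustrative choices.

```python
import shutil

# Versions listed as supported in this guide.
SUPPORTED = ("3.9", "3.10", "3.11", "3.12", "3.13")


def find_interpreters(versions=SUPPORTED, which=shutil.which):
    """Map each version to the path of its pythonX.Y executable, if found."""
    found = {}
    for version in versions:
        path = which(f"python{version}")
        if path:
            found[version] = path
    return found
```

Calling `find_interpreters()` returns a dict such as `{"3.11": "/path/to/python3.11"}`, containing only the versions present on the machine; each path can then be used in place of `python3.9` in the venv commands above.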
Code Quality & Development
Code Quality Tools
MLX Knife includes comprehensive code quality tools:
# Install development dependencies
pip install -e ".[dev]"
# Automatic code formatting and linting
ruff check mlx_knife/ --fix
# Type checking with mypy
mypy mlx_knife/
# Complete development workflow
ruff check mlx_knife/ --fix && mypy mlx_knife/ && pytest
Development Workflow
Before committing changes:
#!/bin/bash
# pre-commit-check.sh - Run before committing
set -e
echo "🧪 Running MLX Knife pre-commit checks..."
# 1. Code style
echo "Checking code style..."
ruff check mlx_knife/ --fix
# 2. Type checking
echo "Checking types..."
mypy mlx_knife/
# 3. Quick smoke test
echo "Running quick tests..."
pytest tests/unit/ -v
echo "✅ All checks passed. Safe to commit!"
Local Development Testing
Adding New Tests
- Integration tests go in tests/integration/
- Unit tests go in tests/unit/
- Use existing fixtures from conftest.py
- Follow naming: test_*.py, Test* classes, test_* methods
Test Categories (Markers)
@pytest.mark.integration # Slower system tests
@pytest.mark.unit # Fast isolated tests
@pytest.mark.slow # Tests >30 seconds
@pytest.mark.requires_model # Needs actual MLX model
@pytest.mark.network # Requires internet
@pytest.mark.server # Requires MLX Knife server (excluded from default pytest)
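Custom markers like these generally need to be registered so that pytest recognizes them and `--strict-markers` does not reject them. One possible way is a `pytest_configure` hook in conftest.py, sketched below with descriptions paraphrased from the list above. How MLX Knife actually registers its markers, and how it excludes `server` tests by default, is not shown in this guide, so treat this as an assumption.

```python
# conftest.py (sketch)

# Marker names come from this guide; the descriptions are paraphrased.
MARKERS = {
    "integration": "slower system tests",
    "unit": "fast isolated tests",
    "slow": "tests taking more than 30 seconds",
    "requires_model": "needs an actual MLX model",
    "network": "requires internet access",
    "server": "requires MLX Knife server",
}


def pytest_configure(config):
    """Register each custom marker with pytest at startup."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, `pytest --markers` lists them alongside the built-in ones.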
Mock Utilities
- mock_model_cache(): Creates fake model directories
- mlx_knife_process(): Manages subprocess lifecycle
- process_monitor(): Tracks zombie processes
- temp_cache_dir(): Isolated test environment
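As a rough idea of what subprocess lifecycle management involves, the sketch below wraps `subprocess.Popen` in a context manager that always terminates and reaps the child. The name `managed_process` and the grace period are hypothetical; the real `mlx_knife_process` fixture may differ.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_process(args, grace_seconds=5.0):
    """Start a child process and guarantee it is stopped and reaped on exit."""
    proc = subprocess.Popen(args)
    try:
        yield proc
    finally:
        if proc.poll() is None:
            proc.terminate()  # polite stop first
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()  # force stop if it ignores the request
                proc.wait()  # reap so no zombie is left behind


def demo_cleanup():
    """Run a long sleeper and report its state inside and after the block."""
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    with managed_process(sleeper) as proc:
        running_inside = proc.poll() is None
    return running_inside, proc.poll() is not None
```

The final `wait()` matters as much as the signal: a terminated child that is never waited on stays in the process table as a zombie.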
Test Philosophy
Following the "Process Hygiene over Edge-Case Perfection" principle:
- Process Cleanliness: No zombies, no leaks ✅
- Health Checks: Reliable corruption detection ✅
- Core Operations: Basic functionality works ✅
- Error Handling: Graceful failures ✅
The test suite validates production readiness with real Apple Silicon hardware and actual MLX models.
Troubleshooting
Common Issues
Tests hang forever:
pytest --timeout=60
Import errors:
pip install -e ".[test]"
Process cleanup issues:
ps aux | grep mlx_knife # Check for zombies
Cache conflicts:
export HF_HOME="/tmp/test_cache"
pytest --cache-clear
Test Environment
# Clean test run
rm -rf .pytest_cache __pycache__
pytest tests/ -v --cache-clear
# Debug specific test
pytest tests/integration/test_health_checks.py::TestHealthCheckRobustness::test_healthy_model_detection -v -s
Contributing Test Results
When submitting PRs, please include:
- Your test environment:
  - macOS version
  - Apple Silicon chip (M1/M2/M3)
  - Python version
  - Which model(s) you tested with
- Test results summary:
  Platform: macOS 14.5, M2 Pro
  Python: 3.11.6
  Model: Phi-3-mini-4k-instruct-4bit
  Results: 150/150 tests passed
- Any issues encountered and how you resolved them
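Part of the summary above can be collected automatically. The sketch below uses only the standard `platform` module; the function name is illustrative. Note that `platform.machine()` reports the architecture (for example `arm64`), not the chip name, so the M1/M2/M3 detail, the model, and the results still need to be filled in by hand.

```python
import platform


def environment_summary(model="(fill in)", results="(fill in)"):
    """Return a PR-ready summary of the local test environment."""
    # mac_ver() returns an empty version string on non-macOS systems.
    mac_version = platform.mac_ver()[0] or "unknown (not macOS)"
    lines = [
        f"Platform: macOS {mac_version}, {platform.machine()}",
        f"Python: {platform.python_version()}",
        f"Model: {model}",
        f"Results: {results}",
    ]
    return "\n".join(lines)
```

For example, `print(environment_summary(model="Phi-3-mini-4k-instruct-4bit", results="150/150 tests passed"))` prints a four-line block in the format shown above.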
Summary
MLX Knife 1.1.0 STABLE Testing Status:
✅ Production Ready - 150/150 tests passing
✅ Isolated Test System - User cache stays pristine with temp cache isolation
✅ 3-Category Strategy - Optimized for performance and safety
✅ Multi-Python Support - Python 3.9-3.13 verified
✅ Code Quality - ruff/mypy integration working
✅ Real Model Testing - Server/run commands validated with multiple models
✅ Memory Management - Context managers prevent leaks
✅ Exception Safety - Context managers ensure cleanup
✅ Cache Directory Fix - Issue #21: Empty cache crash resolved
✅ LibreSSL Warning Fix - Issue #22: macOS Python 3.9 warning suppression
✅ Lock Cleanup Fix - Issue #23: Enhanced rm command with lock cleanup
This comprehensive testing framework validates MLX Knife's production readiness through isolated testing with automatic model downloads and separate real MLX validation.
Server-Based Testing (Advanced)
Some tests require a running MLX Knife server with loaded models. These tests are marked with @pytest.mark.server and are not run by default with pytest.
Why Separate Server Tests?
- Test count varies by loaded models (makes CI reporting inconsistent)
- Large memory requirements - need different models for different RAM sizes
- Longer execution time - each model needs to load individually
- Manual setup required - need to download appropriate models first
Prerequisites for Server Tests
| System RAM | Recommended Models | Commands |
|---|---|---|
| 16GB | Small models only | mlxk pull mlx-community/Qwen2.5-0.5B-Instruct-4bit<br>mlxk pull mlx-community/Llama-3.2-1B-Instruct-4bit<br>mlxk pull mlx-community/Llama-3.2-3B-Instruct-4bit |
| 32GB | + Medium models | mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit<br>mlxk pull mlx-community/Mistral-7B-Instruct-v0.2-4bit<br>mlxk pull mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit |
| 64GB | + Large models | mlxk pull mlx-community/Mistral-Small-3.2-24B-Instruct-2506-4bit<br>mlxk pull mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit<br>mlxk pull mlx-community/Llama-3.3-70B-Instruct-4bit |
| 96GB+ | + Huge models | mlxk pull mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit |
Running Server Tests
Issue #14 Regression Tests (Chat Self-Conversation Bug):
# Set environment
export HF_HOME=/path/to/your/cache
# Smoke test first (see which models are available)
python tests/integration/test_issue_14.py
# Run server tests only (excluded from default pytest)
pytest -m server -v
# Run specific Issue #14 tests
pytest tests/integration/test_issue_14.py -m server -v
Expected Output:
🦫 MLX Knife Issue #14 Test - Smoke Test
==================================================
📊 Safe models for this system: 6
💾 System RAM: 64GB total, 40GB available
🎯 mlx-community/Mistral-7B-Instruct-v0.2-4bit
└─ Size: 7B, RAM needed: 8GB
🎯 mlx-community/Llama-3.2-3B-Instruct-4bit
└─ Size: 3B, RAM needed: 4GB
[...]
========== test session starts ==========
tests/integration/test_issue_14.py::test_server_health[mlx_server] PASSED
tests/integration/test_issue_14.py::test_issue_14_self_conversation_regression_original[mlx-community/Mistral-7B-Instruct-v0.2-4bit-7B-8] PASSED
[...6 more model tests...]
========== 7 passed in 45.23s ==========
Additional Server Tests
Issues #15 & #16 - Dynamic Token Limits (Implemented in 1.1.0-beta1):
pytest tests/integration/test_issue_15_16.py -m server -v
Issue #20 - End-Token Filtering (Implemented in 1.1.0-beta2):
pytest tests/integration/test_end_token_issue.py -m server -v
Troubleshooting Server Tests
Permission warnings are normal:
WARNING: ⚠️ Cannot scan network connections (permission denied)
INFO: 🔧 Falling back to process-based cleanup only
This is expected on macOS - the tests continue with process-based cleanup.
Memory issues:
- Tests automatically skip models exceeding 80% available RAM
- Use smaller models if you see consistent memory failures
- Consider external SSD for model cache to reduce memory pressure
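The 80% rule can be pictured as a simple filter over candidate models. This is a sketch only: the `(model_name, size_str, ram_needed)` tuple shape mirrors the server test parameters used in this guide, but the function name, the explicit `available_gb` argument, and the 40GB figure for the 70B model are assumptions for illustration, not the project's `get_safe_models_for_system()` helper.

```python
from typing import List, Tuple

# (model_name, size_str, ram_needed_gb)
ModelSpec = Tuple[str, str, int]


def filter_safe_models(
    candidates: List[ModelSpec], available_gb: float, budget: float = 0.8
) -> List[ModelSpec]:
    """Keep only models whose RAM need fits within budget * available RAM."""
    limit = available_gb * budget
    return [spec for spec in candidates if spec[2] <= limit]


CANDIDATES: List[ModelSpec] = [
    ("mlx-community/Llama-3.2-3B-Instruct-4bit", "3B", 4),
    ("mlx-community/Mistral-7B-Instruct-v0.2-4bit", "7B", 8),
    # RAM figure below is illustrative, not taken from the test suite.
    ("mlx-community/Llama-3.3-70B-Instruct-4bit", "70B", 40),
]
```

With 40GB available, the limit is 32GB, so `filter_safe_models(CANDIDATES, available_gb=40)` keeps the 3B and 7B entries and drops the 70B one. In a test suite, the dropped models would be skipped rather than failed.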
Server startup failures:
# Debug server manually
python -m mlx_knife.cli server --port 8000
# Check model health
mlxk health
# Verify environment
echo $HF_HOME
Adding New Server Tests
When contributing server-based tests:
@pytest.mark.server
def test_new_feature(mlx_server, model_name: str, size_str: str, ram_needed: int):
    """Test new feature with MLX models."""
    # Use mlx_server fixture for automatic server management
    # Test implementation here
- Mark with @pytest.mark.server - excludes from default pytest
- Use mlx_server fixture - automatic server lifecycle management
- Test RAM requirements - use get_safe_models_for_system() helper
- Document in TESTING.md - add to this guide