# MLX Knife Testing Guide ## Overview MLX Knife uses a **3-category test strategy** designed for safety, speed, and reproducibility on Apple Silicon. Most tests run in complete isolation without requiring models or network access. For current test counts, version-specific details, and complete file listings, see [TESTING-DETAILS.md](TESTING-DETAILS.md). ## Test Philosophy **Core Principles:** - **Isolated by default** - User cache stays pristine with sentinel protection - **Opt-in live tests** - Network/model tests require explicit markers/environment - **Mock-heavy** - MLX stubs enable fast testing without model downloads - **Fast feedback** - 500+ tests run in seconds on any Apple Silicon Mac **Cache Architecture:** - **User Cache (Singleton):** ONE permanent cache per system - READ-ONLY in tests - **Isolated Cache (Factory):** NEW temporary cache PER test - full read/write - **Sentinel Safety:** Automatic protection prevents accidental User Cache deletion See [TESTING-DETAILS.md → Fundamental Definitions](TESTING-DETAILS.md#fundamental-definitions-single-source-of-truth) for complete cache architecture and safety mechanisms. **Safety First:** - Tests use temporary caches with `TEST_SENTINEL` protection - Delete operations fail if not in test cache (`MLXK2_STRICT_TEST_DELETE=1`) - Live tests never modify user cache without explicit environment variables **Unit Test Limitations:** MLX Knife has two test categories: 1. **Unit tests** (~500 tests, fast, mocked) - verify code structure 2. **Live E2E tests** (real models, slow) - verify actual functionality **Why both are needed:** When dependencies like `transformers` or `mlx-lm` update their APIs, unit tests (which mock these libraries) continue to pass, but real model loading breaks. Only live E2E tests catch these issues. **Example:** transformers 5.0 changed tokenizer initialization - unit tests passed (mocked API), but vision models failed to load in production. Live E2E tests caught the issue immediately. ## Quick Start ```bash # Install package + development tools (text-only tests) pip install -e ".[dev,test]" # Run default test suite (isolated, no live downloads) pytest -v # Before committing ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v ``` **That's it!** Default tests use isolated caches and MLX stubs - no model downloads required. > **Vision + Audio Tests:** For complete development setup including Vision and Audio, > see **[README.md → Development Installation](README.md#development-installation)**. ## Running All Real Tests **Single command (recommended):** ```bash ./scripts/test-wet-umbrella.sh ``` This runs all real tests in the correct order. For details on test categories, see [TESTING-DETAILS.md](TESTING-DETAILS.md). **Manual execution (advanced):** ```bash # Portfolio-compatible tests pytest -m wet -v # Isolated Cache WRITE tests MLXK2_TEST_RESUMABLE_DOWNLOAD=1 pytest -m live_resumable -v ``` ## Test Categories ### Category 1: Isolated Cache (Default) **User cache stays pristine** - Tests use temporary caches with sentinel protection **What's tested:** - JSON API contracts (list, show, health) - Human output formatting - Model resolution and naming - Push operations (offline: `--check-only`, `--dry-run`) - Clone operations (offline: APFS validation, CoW workflow) - Run command and generation (with MLX stubs) - Server API endpoints (minimal, no real models) - Schema validation and spec compliance **How to run:** ```bash pytest -v # Runs all isolated tests ``` **Technical pattern:** ```python def test_something(isolated_cache): # Complete isolation with sentinel protection assert_is_test_cache(isolated_cache) # Test implementation ``` ### Category 2: Live Tests (Opt-in) **Require explicit environment setup** - Network or user cache dependent **What's tested:** - Real HuggingFace push operations - APFS same-volume clone workflows - Stop token validation with real models - Framework detection with private/org models - Multi-shard model health validation **Markers:** `live_push`, `live_clone`, `live_list`, `live_stop_tokens`, `live_e2e`, `live_run`, `issue27` **How to run:** ```bash # Live stop tokens (requires models in cache or HF_HOME) pytest -m live_stop_tokens -v # Live push (requires credentials + workspace) export MLXK2_ENABLE_ALPHA_FEATURES=1 export MLXK2_LIVE_PUSH=1 export HF_TOKEN=... export MLXK2_LIVE_REPO=org/model export MLXK2_LIVE_WORKSPACE=/path/to/workspace pytest -m live_push -v ``` See [TESTING-DETAILS.md](TESTING-DETAILS.md) for complete environment setup instructions. ### Category 3: Server Tests (Default) **Basic server functionality** - Lightweight API validation **What's tested:** - OpenAI-compatible endpoints - SSE streaming functionality - Model loading and error handling - Token limit enforcement **How to run:** ```bash pytest -k server -v # Optional, included in default suite ``` **Note:** Basic server tests use MLX stubs and run by default. Comprehensive E2E tests with real models are available via `live_e2e` marker (ADR-011). ## Test Structure ``` tests_2.0/ ├── conftest.py # Isolated cache, safety sentinel, core fixtures ├── conftest_runner.py # Runner-specific fixtures/mocks ├── stubs/ # Minimal MLX/MLX-LM stubs for unit tests │ ├── mlx/core.py │ └── mlx_lm/... ├── spec/ # JSON API spec/contract validation │ ├── test_cli_commands_json_flag.py │ ├── test_spec_version_sync.py │ └── ... ├── live/ # Opt-in live tests (markers required) │ ├── test_push_live.py │ ├── test_clone_live.py │ └── test_list_human_live.py ├── test_*.py # Core test files └── test_*.py.disabled # Intentionally disabled (WIP) ``` **Legend:** - `spec/` - API contract validation (stays in sync with `docs/schema`) - `live/` - **User Cache READ only** - Portfolio Discovery tests (parametrized across many models) - `stubs/` - Lightweight MLX replacements for unit tests - `conftest.py` - Isolated HF cache (temp), safety sentinel, fixtures - Parent `conftest.py` applies globally - Subdirectory `conftest.py` (live/, spec/) MUST limit scope to own directory only - See [TESTING-DETAILS.md → conftest.py Scope Rules](TESTING-DETAILS.md#conftestpy-scope-rules) **CRITICAL RULE:** ❌ **NEVER write to User Cache** ❌ **Test organization by cache strategy:** - **User Cache READ** → `tests_2.0/live/` (Portfolio Discovery with many models) - **Isolated Cache WRITE** → `tests_2.0/` (fresh downloads, mock creation) - **Isolated Cache READ** → `tests_2.0/` (safety copies from User Cache) - **Schema validation** → `tests_2.0/spec/` (mocks, fast) - **Workspace operations** → `tmp_path` fixture (Clone/Push tests, separate from cache) **Note:** Workspace is semantically distinct from Cache - see [TESTING-DETAILS.md → Workspace](TESTING-DETAILS.md#workspace-separate-concept---not-a-cache) for details. See [TESTING-DETAILS.md → Truth Table](TESTING-DETAILS.md#truth-table-cache-type--operation) for complete categorization and decision tree. ## MLX Stubs (Fast Testing Without Model Downloads) **Purpose:** Unit tests run without loading real models **How it works:** - `conftest.py` prepends `tests_2.0/stubs/` to `sys.path` - `import mlx` / `import mlx_lm` resolve to minimal stubs - Tests use mock models (~50KB fake files instead of 50GB real models) **Benefits:** - Fast test runs (seconds instead of minutes) - Low RAM usage (default suite: 16GB sufficient) - No model downloads required - Deterministic behavior **Limitations:** - Tests requiring real mlx-lm integration use `@requires_mlx_lm` marker - Production CLI/server still use real packages (stubs not installed) ## Common Test Commands ```bash # Default suite (isolated, fast) pytest -v # Specific categories pytest -m spec -v # Only spec/schema tests pytest -m "not spec" -v # Exclude spec tests pytest -k push -v # Push tests (offline) pytest -k server -v # Server tests # Live tests (opt-in) pytest -m live_stop_tokens -v # Stop token validation pytest -m live_push -v # Real HF push pytest -m live_clone -v # APFS clone workflow # Development pytest --durations=10 # Show slowest tests pytest -k "test_name" -v # Run specific test ``` ## Test Prerequisites ### Required Setup 1. **Apple Silicon Mac (M1/M2/M3)** - Required (MLX uses Metal) 2. **Python 3.9 or newer** 3. **RAM Requirements:** - **Default suite:** 16GB minimum (isolated tests, mock models) - **Live E2E tests:** 32GB minimum (real models, Portfolio Discovery) - **Full suite (wet-umbrella):** **64GB recommended** - Wet umbrella Phase 4 (Vision→Geo pipe): ~29GB peak observed (M2 Max) - Sequential loading: Vision unloads before text model loads (not parallel) - Portfolio Discovery selects largest eligible models for quality - **Tested:** M2 Max 64GB (comfortable headroom) - **Untested:** M1 Max 32GB (theoretically viable but Metal limits unknown) - **Note:** Metal memory limits may vary by chip generation 4. **~10-20MB disk space** for test temp files (default suite) 5. **Test dependencies:** ```bash pip install -e .[test] ``` **Default suite (16GB):** Mock models, fast, no downloads needed. **Full suite (64GB):** Real models, comprehensive validation, recommended for development. ### Optional Setup (Live Tests) Live tests require additional environment setup: **🔍 Show which models would be tested:** ```bash HF_HOME=/path/to/cache pytest -m show_model_portfolio -s ``` This displays all models that would be used in E2E tests (no actual testing). **E2E tests** (ADR-011): ```bash # Full E2E test suite with real models HF_HOME=/path/to/cache pytest -m live_e2e -v ``` **Stop token validation** (ADR-009): ```bash pytest -m live_stop_tokens -v # Uses Portfolio Discovery if models found, else fallback models # See TESTING-DETAILS.md "Required Models for Live Tests" ``` **Push/Clone tests** (alpha features): ```bash # See TESTING-DETAILS.md for complete environment setup ``` ## Environment & Caches **User cache** (persistent): - Real cache for manual operations - Example: `export HF_HOME="/Volumes/SSD/models"` - Safe ops: `list`, `health`, `show` **Test cache** (isolated): - Ephemeral via fixtures - Default tests never touch user cache - Deletion safety: `MLXK2_STRICT_TEST_DELETE=1` **Best practice:** - Use isolated tests for development (default `pytest`) - Use live tests for validation (opt-in with markers) - Set `HF_HOME` to external SSD for live tests ## Python Version Compatibility **Tests validated on Python 3.10-3.12** (Python 3.9 not supported since 2.0.4) Multi-version testing: ```bash # Automated script ./test-multi-python.sh # Manual verification python3.10 -m venv test_310 source test_310/bin/activate pip install -e .[test] && pytest ``` See [TESTING-DETAILS.md](TESTING-DETAILS.md) for version-specific results. ## Code Quality ```bash # Install tools pip install -e .[dev] # Code formatting and linting ruff check mlxk2/ --fix # Type checking mypy mlxk2/ # Complete workflow ruff check mlxk2/ --fix && mypy mlxk2/ && pytest ``` ## Test Markers MLX Knife uses pytest markers to organize tests by category: - **Default suite** (`pytest -v`): Unit tests with mocks (fast, offline, no real models) - **Spec tests** (`-m spec`): API contract/schema validation - **Live tests** (`-m live_*`): Tests with real models or network (opt-in) **Common commands:** ```bash # Default test suite (fast, offline) pytest -v # API spec/contract tests only pytest -m spec -v # Live tests with real models (examples) pytest -m live_stop_tokens -v # Stop token validation (ADR-009) pytest -m live_e2e -v # E2E server/HTTP/CLI tests (ADR-011) ``` **For complete marker reference, environment requirements, and detailed usage, see:** - [TESTING-DETAILS.md → Test Execution Guide](TESTING-DETAILS.md#test-execution-guide) **Symbol Legend:** - 🔒 **Marker-required**: Must use `-m marker` (skipped by default `pytest -v`) - **Skip-unless-env**: Collected but skipped without required environment ## Troubleshooting **Tests hang forever:** ```bash pytest --timeout=60 ``` **Import errors:** ```bash pip install -e .[test] ``` **Cache conflicts:** ```bash export HF_HOME="/tmp/test_cache" pytest --cache-clear ``` **Debug specific test:** ```bash pytest path/to/test.py::test_name -v -s ``` ## Contributing Tests When submitting PRs with test changes, please document in the PR description: 1. **Test environment** (macOS version, Apple Silicon chip, Python version) 2. **Test results** (passed/skipped/failed counts) 3. **Any issues encountered** and resolutions See [TESTING-DETAILS.md](TESTING-DETAILS.md#current-status) for the current official test environment and results as an example. ## Development Workflow Before committing: ```bash # 1. Code style ruff check mlxk2/ --fix # 2. Type checking mypy mlxk2/ # 3. Run tests pytest -v # Or combined ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v ``` ## Summary **MLX Knife Testing:** - ✅ **Isolated by default** - User cache stays pristine - ✅ **Fast feedback** - 500+ tests run in seconds without model downloads - ✅ **Low requirements** - 16GB RAM, ~20MB disk, no HF cache needed - ✅ **Opt-in live tests** - Real models/network when needed - ✅ **Multi-Python support** - Verified on Python 3.9-3.14 For detailed information including current test counts, complete file structure, version history, and implementation specifics, see [TESTING-DETAILS.md](TESTING-DETAILS.md). --- *MLX-Knife 2.0 Testing Framework*