
# MLX Knife Testing Guide

## Overview

MLX Knife uses a **3-category test strategy** designed for safety, speed, and reproducibility on Apple Silicon. Most tests run in complete isolation without requiring models or network access.

For current test counts, version-specific details, and complete file listings, see [TESTING-DETAILS.md](TESTING-DETAILS.md).

## Test Philosophy

**Core Principles:**
- **Isolated by default** - User cache stays pristine with sentinel protection
- **Opt-in live tests** - Network/model tests require explicit markers/environment
- **Mock-heavy** - MLX stubs enable fast testing without model downloads
- **Fast feedback** - 500+ tests run in seconds on any Apple Silicon Mac

**Cache Architecture:**
- **User Cache (Singleton):** ONE permanent cache per system - READ-ONLY in tests
- **Isolated Cache (Factory):** NEW temporary cache PER test - full read/write
- **Sentinel Safety:** Automatic protection prevents accidental User Cache deletion

See [TESTING-DETAILS.md → Fundamental Definitions](TESTING-DETAILS.md#fundamental-definitions-single-source-of-truth) for complete cache architecture and safety mechanisms.

**Safety First:**
- Tests use temporary caches with `TEST_SENTINEL` protection
- Delete operations fail if not in test cache (`MLXK2_STRICT_TEST_DELETE=1`)
- Live tests never modify user cache without explicit environment variables
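
The sentinel idea can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes the sentinel is a marker file inside the cache directory and that `MLXK2_STRICT_TEST_DELETE=1` turns the check on. `make_isolated_cache` and `safe_delete` are hypothetical names; `assert_is_test_cache` borrows the helper name used in the technical pattern later in this guide, with an assumed body.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Assumption: the sentinel is a marker file inside the cache directory.
SENTINEL_NAME = "TEST_SENTINEL"


def make_isolated_cache() -> Path:
    """Create a fresh temporary cache and mark it as a test cache."""
    cache = Path(tempfile.mkdtemp(prefix="mlxk2_test_cache_"))
    (cache / SENTINEL_NAME).touch()
    return cache


def assert_is_test_cache(cache: Path) -> None:
    """Raise if the directory does not carry the sentinel marker."""
    if not (cache / SENTINEL_NAME).is_file():
        raise RuntimeError(f"{cache} is not a test cache; refusing to continue")


def safe_delete(cache: Path) -> None:
    """Delete a cache directory; in strict mode only marked test caches."""
    if os.environ.get("MLXK2_STRICT_TEST_DELETE") == "1":
        assert_is_test_cache(cache)
    shutil.rmtree(cache)
```

With strict mode enabled, `safe_delete` removes a directory created by `make_isolated_cache` but raises `RuntimeError` for any directory that lacks the marker, which is the behavior the bullets above describe.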

**Unit Test Limitations:**

Independent of the three categories described below, tests come in two kinds:
1. **Unit tests** (~500 tests, fast, mocked) - verify code structure
2. **Live E2E tests** (real models, slow) - verify actual functionality

**Why both are needed:**
When dependencies like `transformers` or `mlx-lm` update their APIs, unit tests (which mock these libraries) continue to pass, but real model loading breaks. Only live E2E tests catch these issues.

**Example:** transformers 5.0 changed tokenizer initialization - unit tests passed (mocked API), but vision models failed to load in production. Live E2E tests caught the issue immediately.
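
The effect is easy to reproduce in miniature. The sketch below is illustrative only and does not show the actual transformers change: `load_tokenizer` and `open_model` are hypothetical stand-ins for a library function whose signature changed and for application code still written against the old signature.

```python
from unittest.mock import MagicMock


def load_tokenizer(path, *, trust_remote_code):
    """Hypothetical 'new' library API: the keyword argument is now required."""
    return f"tokenizer:{path}"


def open_model(path, loader):
    """Application code still written against the old one-argument API."""
    return loader(path)


# Unit-test style: a plain mock accepts any call signature, so this passes.
mocked_loader = MagicMock(return_value="tokenizer:fake")
unit_result = open_model("some/model", mocked_loader)

# Live-test style: the real function rejects the outdated call.
try:
    open_model("some/model", load_tokenizer)
    live_error = None
except TypeError as exc:
    live_error = exc

print("unit result:", unit_result)
print("live error:", type(live_error).__name__)
```

The mocked run succeeds while the real call raises `TypeError`, which is why the mocked suite alone cannot detect this class of breakage.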

## Quick Start

```bash
# Install package + development tools (text-only tests)
pip install -e ".[dev,test]"

# Run default test suite (isolated, no live downloads)
pytest -v

# Before committing
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v
```

**That's it!** Default tests use isolated caches and MLX stubs - no model downloads required.

> **Vision + Audio Tests:** For complete development setup including Vision and Audio,
> see **[README.md → Development Installation](README.md#development-installation)**.

## Running All Real Tests

**Single command (recommended):**
```bash
./scripts/test-wet-umbrella.sh
```

This runs all real tests in the correct order. For details on test categories, see [TESTING-DETAILS.md](TESTING-DETAILS.md).

**Manual execution (advanced):**
```bash
# Portfolio-compatible tests
pytest -m wet -v

# Isolated Cache WRITE tests
MLXK2_TEST_RESUMABLE_DOWNLOAD=1 pytest -m live_resumable -v
```

## Test Categories

### Category 1: Isolated Cache (Default)
**User cache stays pristine** - Tests use temporary caches with sentinel protection

**What's tested:**
- JSON API contracts (list, show, health)
- Human output formatting
- Model resolution and naming
- Push operations (offline: `--check-only`, `--dry-run`)
- Clone operations (offline: APFS validation, CoW workflow)
- Run command and generation (with MLX stubs)
- Server API endpoints (minimal, no real models)
- Schema validation and spec compliance

**How to run:**
```bash
pytest -v  # Runs all isolated tests
```

**Technical pattern:**
```python
def test_something(isolated_cache):
    # Complete isolation with sentinel protection
    assert_is_test_cache(isolated_cache)
    # Test implementation
```

### Category 2: Live Tests (Opt-in)
**Require explicit environment setup** - Network or user cache dependent

**What's tested:**
- Real HuggingFace push operations
- APFS same-volume clone workflows
- Stop token validation with real models
- Framework detection with private/org models
- Multi-shard model health validation

**Markers:** `live_push`, `live_clone`, `live_list`, `live_stop_tokens`, `live_e2e`, `live_run`, `issue27`

**How to run:**
```bash
# Live stop tokens (requires models in cache or HF_HOME)
pytest -m live_stop_tokens -v

# Live push (requires credentials + workspace)
export MLXK2_ENABLE_ALPHA_FEATURES=1
export MLXK2_LIVE_PUSH=1
export HF_TOKEN=...
export MLXK2_LIVE_REPO=org/model
export MLXK2_LIVE_WORKSPACE=/path/to/workspace
pytest -m live_push -v
```
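
Before starting a live push run, it can help to check that every variable from the example above is set. The helper below is a sketch under that assumption; `missing_live_push_vars` is a hypothetical name, and the project's own tests may gate on a different subset of these variables.

```python
import os
from typing import List, Mapping

# Variable names taken from the live push example above.
LIVE_PUSH_VARS = (
    "MLXK2_ENABLE_ALPHA_FEATURES",
    "MLXK2_LIVE_PUSH",
    "HF_TOKEN",
    "MLXK2_LIVE_REPO",
    "MLXK2_LIVE_WORKSPACE",
)


def missing_live_push_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in LIVE_PUSH_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_live_push_vars()
    if missing:
        print("Live push tests would be skipped; missing:", ", ".join(missing))
    else:
        print("Live push environment looks complete.")
```

A test module could pass the result to `pytest.mark.skipif(bool(missing_live_push_vars()), reason=...)` so that the tests are collected but skipped when the environment is incomplete.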

See [TESTING-DETAILS.md](TESTING-DETAILS.md) for complete environment setup instructions.

### Category 3: Server Tests (Default)
**Basic server functionality** - Lightweight API validation

**What's tested:**
- OpenAI-compatible endpoints
- SSE streaming functionality
- Model loading and error handling
- Token limit enforcement

**How to run:**
```bash
pytest -k server -v  # Optional, included in default suite
```

**Note:** Basic server tests use MLX stubs and run by default. Comprehensive E2E tests with real models are available via the `live_e2e` marker (ADR-011).

## Test Structure

```
tests_2.0/
├── conftest.py              # Isolated cache, safety sentinel, core fixtures
├── conftest_runner.py       # Runner-specific fixtures/mocks
├── stubs/                   # Minimal MLX/MLX-LM stubs for unit tests
│   ├── mlx/core.py
│   └── mlx_lm/...
├── spec/                    # JSON API spec/contract validation
│   ├── test_cli_commands_json_flag.py
│   ├── test_spec_version_sync.py
│   └── ...
├── live/                    # Opt-in live tests (markers required)
│   ├── test_push_live.py
│   ├── test_clone_live.py
│   └── test_list_human_live.py
├── test_*.py               # Core test files
└── test_*.py.disabled      # Intentionally disabled (WIP)
```

**Legend:**
- `spec/` - API contract validation (stays in sync with `docs/schema`)
- `live/` - **User Cache READ only** - Portfolio Discovery tests (parametrized across many models)
- `stubs/` - Lightweight MLX replacements for unit tests
- `conftest.py` - Isolated HF cache (temp), safety sentinel, fixtures
  - Parent `conftest.py` applies globally
  - Subdirectory `conftest.py` (live/, spec/) MUST limit scope to own directory only
  - See [TESTING-DETAILS.md → conftest.py Scope Rules](TESTING-DETAILS.md#conftestpy-scope-rules)

**CRITICAL RULE:** ❌ **NEVER write to User Cache** ❌

**Test organization by cache strategy:**
- **User Cache READ** → `tests_2.0/live/` (Portfolio Discovery with many models)
- **Isolated Cache WRITE** → `tests_2.0/` (fresh downloads, mock creation)
- **Isolated Cache READ** → `tests_2.0/` (safety copies from User Cache)
- **Schema validation** → `tests_2.0/spec/` (mocks, fast)
- **Workspace operations** → `tmp_path` fixture (Clone/Push tests, separate from cache)

**Note:** Workspace is semantically distinct from Cache - see [TESTING-DETAILS.md → Workspace](TESTING-DETAILS.md#workspace-separate-concept---not-a-cache) for details.

See [TESTING-DETAILS.md → Truth Table](TESTING-DETAILS.md#truth-table-cache-type--operation) for complete categorization and decision tree.
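
The cache rules above can be read as a small lookup table. The sketch below encodes only the three cache combinations listed in this section and rejects User Cache writes; `location_for` is a hypothetical helper for illustration, and the authoritative truth table is the one in TESTING-DETAILS.md.

```python
from typing import Dict, Tuple

# Mapping taken from the list above; user-cache writes are deliberately absent.
LOCATIONS: Dict[Tuple[str, str], str] = {
    ("user", "read"): "tests_2.0/live/",
    ("isolated", "write"): "tests_2.0/",
    ("isolated", "read"): "tests_2.0/",
}


def location_for(cache: str, operation: str) -> str:
    """Return the test directory for a cache type and operation."""
    key = (cache.lower(), operation.lower())
    if key == ("user", "write"):
        raise ValueError("Tests must never write to the User Cache")
    if key not in LOCATIONS:
        raise KeyError(f"Unknown combination: {key}")
    return LOCATIONS[key]


print(location_for("user", "read"))       # directory for Portfolio Discovery tests
print(location_for("isolated", "write"))  # directory for fresh-download tests
```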

## MLX Stubs (Fast Testing Without Model Downloads)

**Purpose:** Unit tests run without loading real models

**How it works:**
- `conftest.py` prepends `tests_2.0/stubs/` to `sys.path`
- `import mlx` / `import mlx_lm` resolve to minimal stubs
- Tests use mock models (~50KB fake files instead of 50GB real models)
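
The path-prepending mechanism can be demonstrated without touching a real MLX installation. The sketch below is an assumption-laden illustration: it writes a throwaway package named `demo_mlx_stub` (a hypothetical name, chosen so that it cannot shadow a real package) into a temporary directory, puts that directory first on `sys.path`, and imports it.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def install_stub(stub_root: Path, package: str) -> None:
    """Write a minimal stub package and put its parent first on sys.path."""
    pkg_dir = stub_root / package
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text("IS_STUB = True\n")
    # Prepending means this directory wins over any later path entry.
    sys.path.insert(0, str(stub_root))
    importlib.invalidate_caches()


stub_root = Path(tempfile.mkdtemp(prefix="stubs_"))
install_stub(stub_root, "demo_mlx_stub")

# The import now resolves to the stub directory created above.
module = importlib.import_module("demo_mlx_stub")
print(module.IS_STUB, Path(module.__file__).parent.parent == stub_root)
```

Because the stub directory is searched first, the same `import` statement in production code picks up the stub during tests and the real package everywhere else.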

**Benefits:**
- Fast test runs (seconds instead of minutes)
- Low RAM usage (default suite: 16GB sufficient)
- No model downloads required
- Deterministic behavior

**Limitations:**
- Tests requiring real mlx-lm integration use `@requires_mlx_lm` marker
- Production CLI/server still use real packages (stubs not installed)

## Common Test Commands

```bash
# Default suite (isolated, fast)
pytest -v

# Specific categories
pytest -m spec -v              # Only spec/schema tests
pytest -m "not spec" -v        # Exclude spec tests
pytest -k push -v              # Push tests (offline)
pytest -k server -v            # Server tests

# Live tests (opt-in)
pytest -m live_stop_tokens -v  # Stop token validation
pytest -m live_push -v         # Real HF push
pytest -m live_clone -v        # APFS clone workflow

# Development
pytest --durations=10          # Show slowest tests
pytest -k "test_name" -v       # Run specific test
```

## Test Prerequisites

### Required Setup
1. **Apple Silicon Mac (M1 or later)** - Required (MLX uses Metal)
2. **Python 3.10 or newer** (Python 3.9 is not supported since 2.0.4; see Python Version Compatibility)
3. **RAM Requirements:**
   - **Default suite:** 16GB minimum (isolated tests, mock models)
   - **Live E2E tests:** 32GB minimum (real models, Portfolio Discovery)
   - **Full suite (wet-umbrella):** **64GB recommended**
     - Wet umbrella Phase 4 (Vision→Geo pipe): ~29GB peak observed (M2 Max)
     - Sequential loading: Vision unloads before text model loads (not parallel)
     - Portfolio Discovery selects largest eligible models for quality
     - **Tested:** M2 Max 64GB (comfortable headroom)
     - **Untested:** M1 Max 32GB (theoretically viable but Metal limits unknown)
     - **Note:** Metal memory limits may vary by chip generation
4. **~10-20MB disk space** for test temp files (default suite)
5. **Test dependencies:**
   ```bash
   pip install -e ".[test]"
   ```

**Default suite (16GB):** Mock models, fast, no downloads needed.
**Full suite (64GB):** Real models, comprehensive validation, recommended for development.
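
The RAM tiers above translate into a simple threshold check. This sketch only restates the figures from this section (16GB, 32GB, 64GB); `largest_suite_for` is a hypothetical helper, and 64GB is a recommendation rather than a hard limit.

```python
def largest_suite_for(ram_gb: float) -> str:
    """Map installed RAM in GB to the largest suite documented for it."""
    if ram_gb >= 64:
        return "full suite (wet-umbrella)"
    if ram_gb >= 32:
        return "live E2E tests"
    if ram_gb >= 16:
        return "default suite"
    return "below documented minimum"


for ram in (8, 16, 32, 64):
    print(f"{ram}GB: {largest_suite_for(ram)}")
```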

### Optional Setup (Live Tests)

Live tests require additional environment setup:

**🔍 Show which models would be tested:**
```bash
HF_HOME=/path/to/cache pytest -m show_model_portfolio -s
```
This displays all models that would be used in E2E tests (no actual testing).

**E2E tests** (ADR-011):
```bash
# Full E2E test suite with real models
HF_HOME=/path/to/cache pytest -m live_e2e -v
```

**Stop token validation** (ADR-009):
```bash
pytest -m live_stop_tokens -v
# Uses Portfolio Discovery if models found, else fallback models
# See TESTING-DETAILS.md "Required Models for Live Tests"
```

**Push/Clone tests** (alpha features): see [TESTING-DETAILS.md](TESTING-DETAILS.md) for complete environment setup.

## Environment & Caches

**User cache** (persistent):
- Real cache for manual operations
- Example: `export HF_HOME="/Volumes/SSD/models"`
- Safe ops: `list`, `health`, `show`

**Test cache** (isolated):
- Ephemeral via fixtures
- Default tests never touch user cache
- Deletion safety: `MLXK2_STRICT_TEST_DELETE=1`

**Best practice:**
- Use isolated tests for development (default `pytest`)
- Use live tests for validation (opt-in with markers)
- Set `HF_HOME` to external SSD for live tests

## Python Version Compatibility

**Tests validated on Python 3.10-3.12** (Python 3.9 not supported since 2.0.4)

Multi-version testing:
```bash
# Automated script
./test-multi-python.sh

# Manual verification
python3.10 -m venv test_310
source test_310/bin/activate
pip install -e ".[test]" && pytest
```

See [TESTING-DETAILS.md](TESTING-DETAILS.md) for version-specific results.

## Code Quality

```bash
# Install tools
pip install -e ".[dev]"

# Code formatting and linting
ruff check mlxk2/ --fix

# Type checking
mypy mlxk2/

# Complete workflow
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest
```

## Test Markers

MLX Knife uses pytest markers to organize tests by category:

- **Default suite** (`pytest -v`): Unit tests with mocks (fast, offline, no real models)
- **Spec tests** (`-m spec`): API contract/schema validation
- **Live tests** (`-m live_*`): Tests with real models or network (opt-in)

**Common commands:**
```bash
# Default test suite (fast, offline)
pytest -v

# API spec/contract tests only
pytest -m spec -v

# Live tests with real models (examples)
pytest -m live_stop_tokens -v  # Stop token validation (ADR-009)
pytest -m live_e2e -v          # E2E server/HTTP/CLI tests (ADR-011)
```

**For complete marker reference, environment requirements, and detailed usage, see:**
- [TESTING-DETAILS.md → Test Execution Guide](TESTING-DETAILS.md#test-execution-guide)

**Symbol Legend:**
- 🔒 **Marker-required**: Must use `-m marker` (skipped by default `pytest -v`)
- **Skip-unless-env**: Collected but skipped without required environment

## Troubleshooting

**Tests hang forever:**
```bash
pytest --timeout=60  # --timeout is provided by the pytest-timeout plugin
```

**Import errors:**
```bash
pip install -e ".[test]"
```

**Cache conflicts:**
```bash
export HF_HOME="/tmp/test_cache"
pytest --cache-clear
```

**Debug specific test:**
```bash
pytest path/to/test.py::test_name -v -s
```

## Contributing Tests

When submitting PRs with test changes, please document in the PR description:

1. **Test environment** (macOS version, Apple Silicon chip, Python version)
2. **Test results** (passed/skipped/failed counts)
3. **Any issues encountered** and resolutions

See [TESTING-DETAILS.md](TESTING-DETAILS.md#current-status) for the current official test environment and results as an example.
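
Most of the environment details for item 1 can be collected with the standard library. The helper below is a convenience sketch, not part of the project; `environment_summary` is a hypothetical name. `platform.machine()` reports the architecture (for example `arm64`), not the chip model, so the chip name still has to be added by hand.

```python
import platform


def environment_summary() -> str:
    """Build a short, paste-ready description of the test environment."""
    # mac_ver() returns an empty release string on non-macOS systems.
    macos_version = platform.mac_ver()[0] or "n/a (not macOS)"
    lines = [
        f"macOS: {macos_version}",
        f"Machine: {platform.machine()}",
        f"Python: {platform.python_version()}",
    ]
    return "\n".join(lines)


print(environment_summary())
```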

## Development Workflow

Before committing:

```bash
# 1. Code style
ruff check mlxk2/ --fix

# 2. Type checking
mypy mlxk2/

# 3. Run tests
pytest -v

# Or combined
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v
```

## Summary

**MLX Knife Testing:**
- ✅ **Isolated by default** - User cache stays pristine
- ✅ **Fast feedback** - 500+ tests run in seconds without model downloads
- ✅ **Low requirements** - 16GB RAM, ~20MB disk, no HF cache needed
- ✅ **Opt-in live tests** - Real models/network when needed
- ✅ **Multi-Python support** - Validated on Python 3.10-3.12

For detailed information including current test counts, complete file structure, version history, and implementation specifics, see [TESTING-DETAILS.md](TESTING-DETAILS.md).

---

*MLX-Knife 2.0 Testing Framework*