MLX Knife Testing - Detailed Documentation
This document contains version-specific details, complete file listings, and implementation specifics for the MLX Knife test suite. For timeless testing philosophy and quick start instructions, see TESTING.md.
Current Status
- ✅ 2.0.4-beta.1 (Release Candidate): Probe/Policy architecture complete; Vision support Phase 1-3 (CLI + Server); Pipes/Memory-Aware; EXIF metadata; Test Portfolio Separation complete.
- ✅ Test Suite: 476-485 unit tests passing (Py3.9: 476 passed/65 skipped, Py3.10+: 485 passed/56 skipped, baseline without HF_HOME); Live E2E: 139 passed/21 skipped
- ✅ Test environment: macOS 14.x, M2 Max, Python 3.9-3.14
- ✅ Production verified & reported: M1, M1 Max, M2 Max in real-world use
- ✅ License: Apache 2.0 (was MIT in 1.x)
- ✅ Isolated test system - user cache stays pristine with temp cache isolation
- ✅ 3-category test strategy - optimized for performance and safety
- ✅ Portfolio Separation - Text and Vision models tested independently with separate RAM formulas
Skipped Tests Breakdown (56 total Python 3.10+, 65 total Python 3.9, standard run without HF_HOME)
- 34 Live E2E tests - Server/HTTP/CLI validation with real models (requires `pytest -m live_e2e`, ADR-011 + Portfolio Separation)
  - 23 Text model tests - Parametrized across text_portfolio (chat completions batch/streaming)
  - 3 Vision model tests - Parametrized across vision_portfolio (multimodal, SSE, text-on-vision)
  - 5 Vision CLI E2E tests - Deterministic vision queries (requires vision model in cache, ADR-012)
  - 3 Non-parametrized tests - Health, models list, vision→text switching
- 4 Live Stop Tokens tests - Stop token validation with real models (requires `pytest -m live_stop_tokens`, ADR-009)
- 3 Live Clone tests - APFS same-volume clone workflow (requires `MLXK2_LIVE_CLONE=1`)
- 2 Issue #37 tests - Private/org model detection (requires `pytest -m live_run`, Issue #37)
- 2 Runtime Compatibility tests - Reason chain validation (requires specific model types)
- 1 Live List test - Tests against user cache (requires HF_HOME with models)
- 1 Live Push test - Real HuggingFace push (requires `MLXK2_LIVE_PUSH=1`)
- 2 Show Portfolio tests - Display text/vision portfolios separately (requires HF_HOME)
- 7 Issue #27 tests - Real-model health validation (requires HF_HOME or MLXK2_USER_HF_HOME setup)
Portfolio Discovery (ADR-009) is implemented in tests_2.0/test_stop_tokens_live.py. When HF_HOME is set, tests auto-discover all MLX chat models in user cache using mlxk list --json (production command). This ensures Issue #32 fix is validated across the full model portfolio. Current validation: 17 models discovered, 15 testable (60% RAM budget), 73/81 tests passing, 0 failures. Portfolio includes: Phi-3, DeepSeek-R1, GPT-oss, Llama, Qwen, Mistral, Mixtral families.
New coverage in 2.0.4-beta.1:
- JSON-mode interactive rejection emits JSON on stdout with exit code 1.
- Pipe stdin semantics for `mlxk run` (`-` reads stdin, non-TTY forces batch) behind `MLXK2_ENABLE_PIPES=1`.
- SIGPIPE handling: Handler set to SIG_DFL for graceful pipe termination (e.g., `mlxk run | head -1`).
- BrokenPipeError handling: Streaming and batch output catch BrokenPipeError for robust pipe chains.
- Vision CLI support (ADR-012 Phase 1-2): Implementation with `--image` flag
  - VisionRunner wraps mlx-vlm backend (non-streaming, batch-only)
  - Auto-routing: Vision models use mlx-vlm; text models use mlx-lm
  - 5 deterministic CLI E2E tests: Chess position reading, OCR text extraction, color recognition, chart label reading, large image support (2.7MB validates 10MB limit)
- Vision Server support (ADR-012 Phase 3): HTTP API for vision requests
  - Backend-aware `get_or_load_model()`: Loads MLXRunner OR VisionRunner based on policy
  - `ChatMessage.content` extended for OpenAI Vision format (`Union[str, List[Dict]]`)
  - Streaming graceful degradation (SSE emulation) for vision models (mlx-vlm doesn't support true streaming)
  - 17 unit tests in `test_server_vision.py` (ChatMessage, image detection, helpers)
  - 3 E2E tests in `test_vision_server_e2e.py` (Base64 image, streaming graceful degradation, text on vision server)
- Test Portfolio Separation (CLAUDE.md): Text and Vision models tested independently
  - Separate discovery functions: `discover_text_models()` and `discover_vision_models()` with vision capability filtering
  - RAM calculation modularization: Text models use 1.2x multiplier; Vision models use 0.70 threshold (ADR-016)
  - New fixtures: `text_portfolio`, `vision_portfolio`, `text_model_info`, `vision_model_info`
  - Parametrized E2E tests: text_XX (23 text models), vision_XX (3 vision models) - deterministic indices
  - 21 new unit tests: 10 portfolio discovery tests, 11 RAM calculation tests
  - Benchmark reporting updated: Dynamically selects correct model_info fixture
  - Diagnostic tool: `show_portfolios.py` displays separated portfolios with RAM estimates
- `mlx-run` wrapper entrypoint argv injection.
- Tests added:
  - `tests_2.0/test_cli_run_exit_codes.py` (pipe/JSON/SIGPIPE/BrokenPipe)
  - `tests_2.0/test_cli_run_wrapper.py`
  - `tests_2.0/live/test_vision_e2e_live.py` (5 vision CLI E2E tests)
  - `tests_2.0/test_server_vision.py` (17 vision server unit tests)
  - `tests_2.0/live/test_vision_server_e2e.py` (3 vision server E2E tests)
  - `tests_2.0/test_portfolio_discovery.py` (10 tests)
  - `tests_2.0/test_ram_calculation.py` (11 tests)
  - `tests_2.0/live/test_portfolio_fixtures.py` (7 validation tests)
  - `tests_2.0/show_portfolios.py` (diagnostic script)
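The pipe-related items above (stdin via `-`, non-TTY batch mode, SIGPIPE, BrokenPipeError) can be illustrated with a minimal sketch. All function names here are hypothetical and do not come from the MLX Knife source; which stream the real CLI checks with `isatty()` is also an assumption.

```python
import signal


def install_sigpipe_default():
    """Restore the default SIGPIPE action so `... | head -1` ends the process quietly."""
    if hasattr(signal, "SIGPIPE"):  # SIGPIPE does not exist on Windows
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)


def resolve_prompt(prompt_arg, stdin):
    """Return the prompt text; a prompt of '-' means 'read all of stdin'."""
    if prompt_arg == "-":
        return stdin.read().strip()
    return prompt_arg


def use_batch_mode(stream):
    """Assumption: a stream that is not a terminal (pipe or file) forces batch mode."""
    return not stream.isatty()


def write_chunks(chunks, out):
    """Write chunks until done or the reader goes away; return the number written."""
    written = 0
    try:
        for chunk in chunks:
            out.write(chunk)
            out.flush()
            written += 1
    except BrokenPipeError:
        pass  # downstream closed the pipe; stop without a traceback
    return written
```

The two mechanisms are complementary: with SIGPIPE at `SIG_DFL` the process is terminated by the signal, while the `BrokenPipeError` path covers the case where the signal is ignored and the write fails instead.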
For complete test file structure, see Appendix.
Test Execution Guide
| Target | How to Run | Markers / Env | Includes | Network |
|---|---|---|---|---|
| Default suite | `pytest -v` | (none) | JSON-API (list/show/health), Human-Output, Model-Resolution, Health-Policy, Push Offline (`--check-only`, `--dry-run`), Spec/Schema checks | No |
| Spec only | `pytest -m spec -v` | `spec` | Schema/contract tests, version sync, docs example validation | No |
| Exclude spec | `pytest -m "not spec" -v` | `not spec` | Everything except spec/schema checks | No |
| Push offline | `pytest -k push -v` | (none) | Push offline tests (tests alpha feature: `--check-only`, `--dry-run`, error handling); no network, no credentials needed | No |
| Live pipe mode | `pytest -m live_e2e tests_2.0/live/test_cli_pipe_live.py -v` | `live_e2e`; Env: `HF_HOME`, `MLXK2_ENABLE_PIPES=1` | Stdin `-`, pipe auto-batch, JSON interactive error path, list→run pipe; first eligible model from portfolio discovery | No (uses local cache) |
| Live push | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m live_push -v` | `live_push` (subset of `wet`) + Env: `MLXK2_ENABLE_ALPHA_FEATURES=1`, `MLXK2_LIVE_PUSH=1`, `HF_TOKEN`, `MLXK2_LIVE_REPO`, `MLXK2_LIVE_WORKSPACE` | JSON push against the real Hub; on errors the test SKIPs (diagnostic) | Yes |
| Live list | `pytest -m live_list -v` | `live_list` (subset of `wet`) + Env: `HF_HOME` (user cache with models) | Tests list/health against user cache models | No (uses local cache) |
| Clone offline | `pytest -k clone -v` | (none) | Clone offline tests (tests alpha feature: APFS validation, temp cache, CoW workflow); no network needed | No |
| Live clone (ADR-007) | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m live_clone -v` | `live_clone` + Env: `MLXK2_ENABLE_ALPHA_FEATURES=1`, `MLXK2_LIVE_CLONE=1`, `HF_TOKEN`, `MLXK2_LIVE_CLONE_MODEL`, `MLXK2_LIVE_CLONE_WORKSPACE` | Real clone workflow: pull→temp cache→APFS same-volume clone→workspace (ADR-007 Phase 1 constraints: same volume + APFS required) | Yes |
| Live stop tokens (ADR-009) | `pytest -m live_stop_tokens -v` | `live_stop_tokens` (required); Optional: `HF_HOME` (enables portfolio discovery) | Issue #32: Validates stop token behavior with real models. With HF_HOME: Portfolio Discovery auto-discovers all MLX chat models (filter: MLX+healthy+runtime+chat), RAM-aware skip, empirical report. Without HF_HOME: Uses 3 predefined models (see "Optional Setup" section for model requirements). | No (uses local cache) |
| Live run | `pytest -m live_run -v` | `live_run` + Env: `MLXK2_USER_HF_HOME` or `HF_HOME` (user cache with `mlx-community/Phi-3-mini-4k-instruct-4bit`) | Regression tests for Issue #37: Validates private/org MLX model framework detection in run command (renames Phi-3 to simulate private-org model) | No (uses local cache) |
| Live E2E (ADR-011) | `HF_HOME=/path/to/cache pytest -m live_e2e -v` | `live_e2e` (required) + Env: `HF_HOME` (optional, enables Portfolio Discovery); Requires: `httpx` installed | ✅ Working: Server/HTTP/CLI validation with real models. Portfolio Discovery auto-discovers all MLX chat models via `mlxk list --json` (filter: MLX+healthy+runtime+chat), parametrized tests (one server per model), RAM-aware skip. | No (uses local cache) |
| Vision CLI E2E (ADR-012) | `HF_HOME=/path/to/cache pytest -m live_e2e tests_2.0/live/test_vision_e2e_live.py -v` | `live_e2e` (required) + Env: `HF_HOME` (vision model in cache, e.g., `pixtral-12b-8bit` or `Llama-3.2-Vision`); Requires: `mlx-vlm` installed (Python 3.10+) | ✅ Working: Deterministic vision queries validate actual image understanding (not hallucination). Tests: chess position reading (e6=black king), OCR text extraction (contract name), color recognition (blue mug), chart label reading (Y-axis), large image support (2.7MB). | No (uses local cache) |
| Vision Server E2E (ADR-012 Phase 3) | `HF_HOME=/path/to/cache pytest -m live_e2e tests_2.0/live/test_vision_server_e2e.py -v` | `live_e2e` (required) + Env: `HF_HOME` (vision model in cache); Requires: `mlx-vlm` installed (Python 3.10+), `httpx` | ✅ Working: Vision API over HTTP. Tests: Base64 image chat completion, streaming graceful degradation (SSE emulation), text request on vision model server. | No (uses local cache) |
| Show E2E portfolios | `HF_HOME=/path/to/cache python tests_2.0/show_portfolios.py` OR `pytest -m show_model_portfolio -s` | Env: `HF_HOME` | Displays TEXT and VISION portfolios separately. Shows model keys (text_XX, vision_XX), RAM requirements, and test/skip status. Diagnostic tool for understanding portfolio separation. Use script for detailed output, or pytest marker for quick check. | No (uses local cache) |
| Manual debug mode | `mlxk run <model> "test prompt" --verbose` | Manual CLI usage with `--verbose` flag | Shows token generation details including multiple EOS token warnings. Use this for manual debugging of model quality issues. Output includes `[DEBUG] Token generation analysis` and `⚠️ WARNING: Multiple EOS tokens detected` for broken models. | No (uses local cache) |
| Issue #27 real-model | `pytest -m issue27 tests_2.0/test_issue_27.py -v` | Marker: `issue27`; Env (required): `MLXK2_USER_HF_HOME` or `HF_HOME` (user cache, read-only). Env (optional): `MLXK2_ISSUE27_MODEL`, `MLXK2_ISSUE27_INDEX_MODEL`, `MLXK2_SUBSET_COUNT=0`. | Copies real models from user cache into isolated test cache; validates strict health policy on index-based models (no network) | No (uses local cache) |
| Server tests | `pytest -k server -v` | (none) | Basic server API tests (minimal, uses MLX stubs) | No |
Useful commands:
# Only Spec
pytest -m spec -v
# Push tests (offline)
pytest -k "push and not live" -v
# Clone tests (offline)
pytest -k "clone and not live" -v
# Exclude Spec
pytest -m "not spec" -v
# Live Push only
MLXK2_ENABLE_ALPHA_FEATURES=1 MLXK2_LIVE_PUSH=1 HF_TOKEN=... MLXK2_LIVE_REPO=... MLXK2_LIVE_WORKSPACE=... pytest -m live_push -v
# Live Clone only
MLXK2_ENABLE_ALPHA_FEATURES=1 MLXK2_LIVE_CLONE=1 HF_TOKEN=... MLXK2_LIVE_CLONE_MODEL=... MLXK2_LIVE_CLONE_WORKSPACE=... pytest -m live_clone -v
# Live List only
HF_HOME=/path/to/user/cache pytest -m live_list -v
# Live Stop Tokens only (ADR-009)
pytest -m live_stop_tokens -v # Optional: HF_HOME=/path/to/cache for portfolio discovery
# Live Run only
HF_HOME=/path/to/user/cache pytest -m live_run -v
# Live E2E only (ADR-011)
HF_HOME=/path/to/user/cache pytest -m live_e2e -v # See model list: pytest tests_2.0/live/test_server_e2e.py::TestChatCompletionsBatch --collect-only -q
# Issue #27 only
MLXK2_USER_HF_HOME=/path/to/user/cache pytest -m issue27 tests_2.0/test_issue_27.py -v
# All live tests (umbrella)
MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m wet -v
⚠️ CRITICAL: Sequential Execution Required for E2E Tests
DO NOT use parallel execution with -m live_e2e tests:
# ✅ SAFE - Sequential execution (one model at a time)
HF_HOME=/path/to/cache pytest -m live_e2e -v
# 🔥 DEADLY - Parallel execution (multiple large models simultaneously)
HF_HOME=/path/to/cache pytest -m live_e2e -n auto # ← NEVER DO THIS!
Why parallel execution is dangerous:
| Risk | Impact | Evidence |
|---|---|---|
| Multiple large models loading simultaneously | System freeze requiring hardware reset | Experienced 2025-11-12 during development |
| RAM budget violation | 8 workers × 20GB models = 160GB peak RAM usage | Even 64GB M2 Max cannot handle this |
| Metal GPU memory exhaustion | MLX Metal cache shared across processes | Leads to GPU hang + system unresponsive |
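The RAM figure in the table is plain multiplication, and a small sketch makes it easy to plug in your own hardware numbers. The function names are illustrative only, and the check ignores everything except model weights (OS overhead, Metal caches), so it is optimistic.

```python
def peak_ram_gb(workers, model_gb):
    """Worst case: every worker holds its own copy of a model at the same time."""
    return workers * model_gb


def fits_in_ram(workers, model_gb, system_gb):
    """Rough feasibility check that counts model weights only."""
    return peak_ram_gb(workers, model_gb) <= system_gb


print(peak_ram_gb(8, 20))      # 8 workers x 20 GB each
print(fits_in_ram(8, 20, 64))  # parallel run on a 64 GB machine
print(fits_in_ram(1, 20, 64))  # sequential run on the same machine
```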
Architecture protections (sequential mode only):
- ✅ One server per test: No parallel inference within a single test
- ✅ Active cleanup polling: Waits for actual process termination (not blind timeout)
- ✅ Explicit garbage collection: Forces Python GC + 2s Metal memory buffer
- ✅ Conservative timeout: 45s max wait for very large models (>40GB), but polls every 500ms
- ⚠️ Large model transitions: Models >20GB may have 10-15s RAM overlap during cleanup
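A minimal sketch of the cleanup pattern described above: poll for real process exit, then collect garbage and pause briefly. This is not the project's fixture code; the function names are hypothetical, and only the 45s, 500ms, and 2s values come from the list above.

```python
import gc
import subprocess
import sys
import time


def wait_for_exit(proc, timeout_s=45.0, poll_interval_s=0.5):
    """Poll until the process has actually exited or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:  # poll() returns the exit code once finished
            return True
        time.sleep(poll_interval_s)
    return proc.poll() is not None


def cleanup_after_server(proc, settle_s=2.0):
    """Wait for exit, force a Python GC pass, then leave a short settling buffer."""
    exited = wait_for_exit(proc)
    gc.collect()
    time.sleep(settle_s)
    return exited


if __name__ == "__main__":
    # Stand-in for a server process: a child that exits immediately.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    print(cleanup_after_server(child, settle_s=0.1))
```

Polling returns as soon as the process is gone, so small models do not pay the full 45s; the timeout only bounds the worst case.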
Safe execution guidelines:
- Always run `pytest -m live_e2e` without `-n auto` or `-n <workers>`
- If using pytest-xdist, ensure it's NOT active for E2E tests
- Monitor system RAM during first run to understand your hardware limits
- Expected duration: ~7-10 minutes for 15 models (sequential, with cleanup)
Note on -n auto (pytest-xdist):
- `-n auto`: Spawns one worker per CPU core (e.g., 8 workers on 8-core M2)
- Each worker loads a separate model instance simultaneously
- Safe for unit tests (mocked, no real models), DEADLY for E2E tests (real models)
Python Version Verification Results
All standard tests validated on Apple Silicon with enhanced isolation
| Python Version | Status | Tests Passing | Skipped | Notes |
|---|---|---|---|---|
| 3.9.6 (macOS) | ✅ Verified | 476/541 | 65 | Vision tests auto-skip (mlx-vlm requires 3.10+) |
| 3.10.x | ✅ Verified | 485/541 | 56 | Full suite including vision tests |
| 3.11.x | ✅ Verified | 485/541 | 56 | Full suite including vision tests |
| 3.12.x | ✅ Verified | 485/541 | 56 | Full suite including vision tests |
| 3.13.x | ✅ Verified | 485/541 | 56 | Full suite including vision tests |
| 3.14.x | ✅ Verified | 485/541 | 56 | Full suite including vision tests |
Note: 56 skipped tests (65 on Python 3.9) are opt-in (live tests, alpha features). Skipped count may vary by environment:
- Without `HF_HOME`: Standard 56 skipped (65 on Py3.9, live E2E tests use fallback parametrization)
- With `HF_HOME`: Live E2E tests run with discovered models across text_portfolio (23) and vision_portfolio (3)
All versions tested with isolated_cache system and MLX stubs for fast execution without model downloads.
Push Testing Details (2.0)
This section summarizes what our test suite covers for the experimental push feature and what still requires live/manual checks.
Reference: Push CLI and JSON
- Usage: `mlxk2 push <local_dir> <org/model> --private [--create] [--branch main] [--commit <msg>] [--check-only] [--json] [--verbose]`
- Args:
  - `--private` (required in alpha): Safety gate to avoid public uploads.
  - `--create`: Create the repository if it does not exist (model repo).
  - `--branch`: Target branch, default `main`. Missing branches are tolerated; with `--create`, the branch is proactively created (and upload retried once if the hub initially rejects the revision).
  - `--commit`: Commit message, default `"mlx-knife push"`.
  - `--check-only`: Analyze workspace locally; no network call; returns `data.workspace_health`.
  - `--dry-run`: Compare local workspace to the remote branch and summarize changes without uploading (requires repo read access).
  - `--json`: Print JSON response; in JSON mode, logs/progress are suppressed by default.
  - `--verbose`: Human mode: append details (e.g., commit URL). In JSON mode, only toggles console log verbosity; the JSON payload is unchanged.
- JSON fields (`data`):
  - `repo_id: string` - target `org/model`.
  - `branch: string` - target branch.
  - `commit_sha: string|null` - commit id; null when `no_changes:true` or on noop.
  - `commit_url: string|null` - link to commit; null when no commit created.
  - `repo_url: string` - `https://huggingface.co/<org/model>`.
  - `uploaded_files_count: int|null` - number of changed files; set to `0` on `no_changes:true`.
  - `local_files_count: int|null` - approximate local file count scanned.
  - `no_changes: boolean` - true when hub reports an empty commit (preferred signal) or no file operations are detected.
  - `created_repo: boolean` - true when repo was created (with `--create`).
  - `change_summary: {added:int, modified:int, deleted:int}` - optional; derived from hub response when available.
  - `message: string|null` - short human hint; mirrors hub on no-op.
  - `hf_logs: string[]` - buffered hub log lines (not printed in JSON mode unless `--verbose`).
  - `experimental: true` and `disclaimer: string` - feature state markers.
  - `workspace_health: {...}` - present only with `--check-only`: `healthy: bool`, `anomalies: []`, `config`, `weights.index`, `weights.pattern_complete`, etc.
  - `dry_run: true` - present only with `--dry-run`.
  - `dry_run_summary: {added:int, modified:int, deleted:int}` - present with `--dry-run`.
  - `would_create_repo: bool` / `would_create_branch: bool` - planning hints when target does not exist.
- Error types (`error.type`):
  - `dependency_missing` - `huggingface-hub` not installed.
  - `auth_error` - missing `HF_TOKEN` (unless `--check-only`).
  - `workspace_not_found` - local_dir missing/not a directory.
  - `repo_not_found` - repo missing without `--create`.
  - `upload_failed` - hub returned an error (e.g., 403/permission).
  - `push_operation_failed` - unexpected internal failure wrapper.
- Exit codes: success → `0`; any `status:error` → `1`.
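As an illustration of how a script might consume the fields listed above, here is a minimal sketch. The top-level envelope (`status`, `data`, `error`) is inferred from the `data`, `error.type`, and `status:error` names in this reference and is an assumption; `docs/json-api-schema.json` remains the authoritative shape. The function names are hypothetical.

```python
def push_exit_code(payload):
    """Mirror the documented rule: any status 'error' maps to 1, otherwise 0."""
    return 1 if payload.get("status") == "error" else 0


def describe_push(payload):
    """Summarize a push response in one line, using only fields from the reference."""
    if payload.get("status") == "error":
        err = payload.get("error") or {}
        return f"error: {err.get('type', 'unknown')}"
    data = payload.get("data") or {}
    health = data.get("workspace_health")
    if health is not None:  # present only with --check-only
        return "check-only: healthy" if health.get("healthy") else "check-only: unhealthy"
    if data.get("dry_run"):  # present only with --dry-run
        s = data.get("dry_run_summary") or {}
        return f"dry-run: +{s.get('added', 0)} ~{s.get('modified', 0)} -{s.get('deleted', 0)}"
    if data.get("no_changes"):
        return "no changes"
    return f"committed {data.get('uploaded_files_count')} file(s) at {data.get('commit_sha')}"
```

Checking `workspace_health` and `dry_run` before `no_changes` keeps the offline and planning modes from being misreported as a no-op push.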
Automated (offline)
- Token/Workspace errors: Missing `HF_TOKEN` and missing workspace produce proper JSON errors.
- CLI args (JSON mode): Missing positional args emit JSON errors rather than usage text.
- Schema shape: Push success/error outputs validate against `docs/json-api-schema.json`.
- No-op push: Detects `no_changes: true`, sets `uploaded_files_count: 0`, carries hub message into JSON (`message`/`hf_logs`), and human output shows "no changes" without duplicate logs.
- Commit path: Extracts `commit_sha`, `commit_url`, `change_summary` (+/~/−), correct `uploaded_files_count`; human `--verbose` includes URL.
- Repo/Branch handling: Missing repo requires `--create`; with `--create` sets `created_repo: true`. Missing branch is tolerated; upload attempts proceed. With `--create`, the branch is proactively created and the upload is retried once if the hub rejects the revision (e.g., "Invalid rev id").
- Ignore rules: `.hfignore` is merged with default ignores and forwarded to the hub.
Files:
- `tests_2.0/test_cli_push_args.py` (CLI errors and JSON outputs)
- `tests_2.0/test_push_extended.py` (no-op vs commit, branch/repo, .hfignore, human; includes retry on invalid revision with `--create`)
- `tests_2.0/spec/test_push_output_matches_schema.py` (schema success path)
Run (venv39):
source venv39/bin/activate && pip install -e .
pytest -q tests_2.0/test_cli_push_args.py tests_2.0/test_push_extended.py
pytest -q tests_2.0/spec/test_push_output_matches_schema.py
pytest -q tests_2.0/test_push_extended.py::test_push_retry_creates_branch_on_upload_revision_error
Live (opt-in / wet)
- Purpose: sanity-check real HF behavior (auth, no-op vs commit, URLs).
- Defaults: Live tests are skipped. Enable with env vars and markers.
- Env:
  - `MLXK2_LIVE_PUSH=1`
  - `HF_TOKEN` (write-enabled)
  - `MLXK2_LIVE_REPO='org/model'`
  - `MLXK2_LIVE_WORKSPACE='/abs/path/to/workspace'`
- Command:
  - `pytest -q -m wet tests_2.0/live/test_push_live.py`
  - or `pytest -q -m live_push`
Pull/Preflight (Issue #30)
Goal: Gated/private/not-found repos must not pollute the cache and should fail fast.
- Behavior (2.0):
  - Preflight uses `huggingface_hub.HfApi.model_info()` (metadata only; no download).
  - Gated/Forbidden/Unauthorized/NotFound → `access_denied` before download; clear hint to set `HF_TOKEN`.
  - Network timeouts/unspecific HTTP errors in preflight → degrade to a warning and let the download layer proceed, so that it can surface meaningful error/timeout paths.
  - Tokens: prefer `HF_TOKEN` (legacy `HUGGINGFACE_HUB_TOKEN` is read, but not promoted).
  - Tests use isolated caches; the user cache is never touched.
- Relevant tests:
  - `tests_2.0/test_issue_30_preflight.py`
    - `test_preflight_private_model_without_token`
    - `test_preflight_nonexistent_model`
    - `test_preflight_integration_in_pull`
    - `test_preflight_prevents_cache_pollution`
- Quick checks:
  - `pytest -q tests_2.0/test_issue_30_preflight.py`
  - CLI: `unset HF_TOKEN HUGGINGFACE_HUB_TOKEN; mlxk-json pull meta-llama/Llama-2-7b-hf --json`
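The preflight decision rule can be sketched as a pure function. This is an illustration only: mapping Unauthorized/Forbidden/NotFound to HTTP 401/403/404 is an assumption (the real code works with `huggingface_hub` exceptions rather than raw status codes), and the function names and return labels other than `access_denied` are hypothetical.

```python
# Assumed mapping: Unauthorized / Forbidden (gated) / NotFound
ACCESS_DENIED_STATUSES = {401, 403, 404}


def classify_preflight(status_code=None, timed_out=False):
    """Decide whether to fail fast or let the download layer try."""
    if timed_out or status_code is None:
        return "warn_and_continue"  # network trouble: do not block the download
    if status_code in ACCESS_DENIED_STATUSES:
        return "access_denied"      # fail before anything reaches the cache
    if 200 <= status_code < 300:
        return "ok"
    return "warn_and_continue"      # unspecific HTTP error


def resolve_token(env):
    """Prefer HF_TOKEN; fall back to the legacy variable if it is the only one set."""
    return env.get("HF_TOKEN") or env.get("HUGGINGFACE_HUB_TOKEN")
```

Failing fast only on the access-related outcomes is what keeps gated or missing repos from leaving partial entries in the cache, while transient network errors still get a real download attempt.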
Runner: Interruption & Recovery
- Semantics (2.0): A new generation resets `_interrupted = False` at the start (recovery behavior). A previous Ctrl-C does not block the next generation.
- Streaming:
  - During an active generation, the runner yields a line `"[Generation interrupted by user]"` and stops.
  - Token diffing in streaming is robust against minimal mocks (no StopIteration due to short `decode` sequences).
- Batch:
  - Resets the flag at the start of a new generation; filters stop tokens; chat stop tokens optional via `use_chat_stop_tokens=True`.
- Relevant tests:
  - `tests_2.0/test_ctrl_c_handling.py` (SIGINT, interruption behavior, interactive)
  - `tests_2.0/test_interruption_recovery.py` (resetting the flag for new generations)
  - `tests_2.0/test_runner_core.py` (consistency/batch/streaming, error handling)
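The interruption semantics above can be shown with a toy runner. `SketchRunner` and `request_interrupt()` are hypothetical stand-ins, not the real MLXRunner API; only the `_interrupted` flag and the interruption line are taken from the description.

```python
class SketchRunner:
    """Toy streaming runner that mimics the reset-on-start recovery behavior."""

    INTERRUPT_LINE = "[Generation interrupted by user]"

    def __init__(self):
        self._interrupted = False

    def request_interrupt(self):
        """Stand-in for what a SIGINT handler would do."""
        self._interrupted = True

    def generate_streaming(self, tokens):
        # Recovery: a previous Ctrl-C must not block this generation.
        self._interrupted = False
        for token in tokens:
            if self._interrupted:
                yield self.INTERRUPT_LINE
                return
            yield token


runner = SketchRunner()
first = []
for i, piece in enumerate(runner.generate_streaming(["a", "b", "c"])):
    first.append(piece)
    if i == 0:
        runner.request_interrupt()  # simulate Ctrl-C after the first token

second = list(runner.generate_streaming(["x", "y"]))
print(first)   # ['a', '[Generation interrupted by user]']
print(second)  # ['x', 'y']
```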
Server Minimal Tests
- Dependencies: `httpx`, `fastapi`, `uvicorn`, `pydantic` (via `[test]`).
- Scope: OpenAI-compatible endpoints (minimal smoke); no real models required.
- Optional for local verification; in CI currently "nice to have" (Backlog, not part of the 2.0 Guide).
Known Warnings
- urllib3 LibreSSL notice on macOS Python 3.9
  - Message: "urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3' …"
  - Status: Harmless for our usage; suppressed in production code (see `mlxk2/__init__.py`, `warnings.filterwarnings(...)`).
  - Tests: May still appear in pytest summary if third-party dependencies import `urllib3` before our package.
  - Optional suppression in tests: add to `pytest.ini`: `filterwarnings = ignore:urllib3 v2 only supports OpenSSL 1.1.1+`
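For reference, a message-based filter of the kind mentioned above might look like the following. This is a sketch, not a copy of `mlxk2/__init__.py`; the function name is hypothetical and the exact arguments used in production are not shown in this document.

```python
import warnings


def suppress_urllib3_libressl_notice():
    """Ignore warnings whose message starts with the urllib3/LibreSSL notice."""
    warnings.filterwarnings(
        "ignore",
        message=r"urllib3 v2 only supports OpenSSL",  # regex, matched at message start
    )
```

The filter has to be installed before `urllib3` is first imported, which is why the notice can still surface when a third-party dependency imports `urllib3` earlier.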
Issue #27 Tests (Real Multi-Shard Model Health)
Quick Start (Minimal)
# Set your HF cache (external SSD recommended)
export HF_HOME=/Volumes/your-ssd/huggingface/cache
# Select a model with index file (upstream repo)
export MLXK2_ISSUE27_MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"
# Optional: Bootstrap index if not in cache
export MLXK2_BOOTSTRAP_INDEX=1
# Run tests
pytest tests_2.0/test_issue_27.py -v
Purpose
These tests validate the strict health policy against real upstream Hugging Face repositories that ship multi-shard safetensors with a model.safetensors.index.json. They complement the deterministic unit tests by exercising real-world layouts.
When to Run
Run them when:
- Your user cache contains at least one upstream PyTorch repo with a safetensors index (not MLX/GGUF conversions). Good candidates:
  - `mistralai/Mistral-7B-Instruct-v0.2` or `-v0.3`
  - `Qwen/Qwen1.5-7B-Chat`, `Qwen/Qwen2-7B-Instruct`
  - `teknium/OpenHermes-2.5-Mistral`
  - Gated: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-3-8B-Instruct`, `google/gemma-7b-it`
- You want to sanity-check index-based completeness, shard deletion/truncation, and LFS pointer detection against real artifacts.
They are not useful when:
- Your cache only has MLX Community models (no `model.safetensors.index.json`) or GGUF models: the index-based tests will skip by design. In that case, rely on `tests_2.0/test_health_multifile.py` for deterministic coverage.
Environment Setup
# Set user cache (EITHER)
export MLXK2_USER_HF_HOME=/absolute/path/to/huggingface/cache
# OR
export HF_HOME=/absolute/path/to/huggingface/cache # Test harness preserves this
# Select model with index file (recommended)
export MLXK2_ISSUE27_MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"
# Optional: Minimize copy size
export MLXK2_SUBSET_COUNT=1 # Default 1
export MLXK2_MIN_FREE_MB=512 # Default 512 MB
# Run tests
PYTHONPATH=. pytest tests_2.0/test_issue_27.py -v
Optional Bootstrap (Opt-in, Minimal Workflow)
# Enable index bootstrap (fetches only index files, never modifies user cache)
export MLXK2_BOOTSTRAP_INDEX=1
# Optional: Separate model for index tests
export MLXK2_ISSUE27_INDEX_MODEL="org/model-with-index"
# Run
pytest tests_2.0/test_issue_27.py -v
Note: Network is only needed if your user cache does not already contain an index file for the chosen repo. If the index exists in your cache, the tests copy it into the isolated cache and no network is required.
Troubleshooting
If you see SKIPs:
- "No safetensors index found" → The chosen model snapshot lacks an index file. Pick a model that has `model.safetensors.index.json` (or `pytorch_model.bin.index.json`).
- "Not enough free space" → Free disk space; tests create a subset copy into an isolated temp cache.
- "User model not found" → Verify your model exists in the user cache and `MLXK2_USER_HF_HOME` points to the `.../huggingface/cache` root.
Quick helper to list index-bearing models in your user cache:
find "$MLXK2_USER_HF_HOME/hub" -type f \
\( -name 'model.safetensors.index.json' -o -name 'pytorch_model.bin.index.json' \) \
| sed 's#.*/hub/models--\(.*\)/snapshots/.*#\1#; s#--#/#g' | sort -u
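If a Python equivalent of the shell helper is more convenient, the following sketch does the same name-based scan. The function name is hypothetical; the `models--<org>--<name>/snapshots/` layout is taken from the `find`/`sed` command above.

```python
from pathlib import Path

INDEX_NAMES = ("model.safetensors.index.json", "pytorch_model.bin.index.json")


def list_index_models(hub_dir):
    """Return sorted 'org/name' ids of cached repos that ship an index file."""
    found = set()
    for repo_dir in Path(hub_dir).glob("models--*"):
        snapshots = repo_dir / "snapshots"
        if not snapshots.is_dir():
            continue
        # Match by file name; this also sees symlinked snapshot entries.
        if any(p.name in INDEX_NAMES for p in snapshots.rglob("*.index.json")):
            found.add(repo_dir.name[len("models--"):].replace("--", "/"))
    return sorted(found)
```

Usage: `list_index_models(Path(os.environ["MLXK2_USER_HF_HOME"]) / "hub")`, with `os` imported. Unlike `find -type f`, the name-based match does not depend on whether snapshot entries are regular files or symlinks.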
Resource Considerations
- Disk: Tests copy a minimal subset of files into an isolated cache (index + 1 smallest shard, or 1 Pattern-Shard).
- Network: If you need to fetch a candidate model first, prefer downloading only `config.json`, `model.safetensors.index.json`, and 1-2 small shards to keep it light.
Copy-on-Write (CoW) Optimization
New in 2.0.4-beta.1: Test model copies use CoW on macOS/APFS for instant, disk-free clones.
How it works:
- Volume detection: `_get_volume_root()` finds mount point, `_is_apfs_volume()` verifies APFS
- External volumes: Creates `.mlxk2_test_isolation/` on volume root for temp dirs
- System volume: Falls back to `/var/folders` (same APFS container, CoW still works!)
- `copy_user_model_to_isolated()` uses `cp -c` (clonefile) for instant CoW copies
- On non-APFS or cross-volume scenarios, it falls back to regular `shutil.copy2()`
Benefits:
- Vision model tests (~24GB) complete in < 1 second instead of minutes
- No disk space consumed for CoW copies (blocks shared until mutation)
- `MLXK2_SUBSET_COUNT=999` is now safe to use (copies all shards instantly)
Requirements for CoW:
- macOS with APFS filesystem
- User cache and test temp dir on the same volume
- If user cache is on external SSD and temp uses system disk, falls back to regular copy
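The clone-then-fallback idea can be sketched for a single file as follows. This is not the project's `copy_user_model_to_isolated()`; the function name and return labels are hypothetical, and the sketch skips the APFS and same-volume checks by simply trying `cp -c` on macOS and falling back if it fails.

```python
import shutil
import subprocess
import sys


def copy_file_prefer_clone(src, dst):
    """Copy src to dst, trying `cp -c` on macOS first. Returns the method used."""
    if sys.platform == "darwin":
        result = subprocess.run(["cp", "-c", str(src), str(dst)], capture_output=True)
        if result.returncode == 0:
            return "cp-c"
    shutil.copy2(src, dst)  # regular copy (data and metadata)
    return "copy2"
```

Gating on `sys.platform` matters because `-c` has a different meaning in GNU `cp`, so the flag should not be passed on Linux.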
Safety Signature Mechanism
CRITICAL: The test infrastructure includes a safety signature mechanism to prevent accidental deletion of user data.
How it works:
- `_create_isolated_temp_dir()`: Atomically creates temp directory + signature file
- Signature contains: Magic string, UUID, path hash, creation timestamp
- `_safe_rmtree()`: REFUSES to delete unless signature verification passes
- If signature creation fails, the directory is immediately cleaned up
Safety checks before deletion:
- Signature file must exist
- Magic string must match (`MLXK2_ISOLATED_TEST_CACHE_V1`)
- UUID must match the one returned at creation
- Path hash must match current path
- Path must contain `mlxk2_test_marker`
Why this matters:
- Temp directories are created on the same volume as user cache (for CoW)
- Without this mechanism, a bug could accidentally delete user model data
- The signature verification provides multiple layers of protection
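A compact sketch of such a signature check is shown below. It is not the project's implementation: the signature file name, the JSON layout, and the SHA-256 path hash are assumptions, and the creation step here is not truly atomic. Only the magic string and the path marker come from the list above.

```python
import hashlib
import json
import os
import shutil
import tempfile
import time
import uuid

MAGIC = "MLXK2_ISOLATED_TEST_CACHE_V1"
MARKER = "mlxk2_test_marker"
SIGNATURE_FILE = ".mlxk2_signature.json"  # hypothetical file name


def _path_hash(path):
    return hashlib.sha256(os.path.realpath(path).encode()).hexdigest()


def create_isolated_temp_dir(base=None):
    """Create a marked temp dir plus its signature; return (path, uuid)."""
    path = tempfile.mkdtemp(prefix=MARKER + "_", dir=base)
    token = str(uuid.uuid4())
    try:
        signature = {"magic": MAGIC, "uuid": token,
                     "path_hash": _path_hash(path), "created": time.time()}
        with open(os.path.join(path, SIGNATURE_FILE), "w") as fh:
            json.dump(signature, fh)
    except OSError:
        shutil.rmtree(path, ignore_errors=True)  # never leave an unsigned dir behind
        raise
    return path, token


def safe_rmtree(path, expected_uuid):
    """Delete path only if every signature check passes; return True if deleted."""
    if MARKER not in path:
        return False
    try:
        with open(os.path.join(path, SIGNATURE_FILE)) as fh:
            signature = json.load(fh)
    except (OSError, ValueError):
        return False  # missing or unreadable signature
    if not isinstance(signature, dict):
        return False
    if (signature.get("magic") != MAGIC
            or signature.get("uuid") != expected_uuid
            or signature.get("path_hash") != _path_hash(path)):
        return False
    shutil.rmtree(path)
    return True
```

Each check guards against a different failure: the marker and magic string reject arbitrary directories, the UUID rejects directories created by another run, and the path hash rejects a signature file that was copied or moved elsewhere.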
Vision Model Health Tests (ADR-012 Phase 2)
New in 2.0.4-beta.1: Real vision model health validation with controlled mutations.
# Set user cache
export MLXK2_USER_HF_HOME=/path/to/huggingface/cache
# Optional: Override default vision model (11B, ~24GB)
export MLXK2_VISION_MODEL="mlx-community/Llama-3.2-11B-Vision-Instruct-4bit"
# Run vision health tests only
pytest tests_2.0/test_issue_27.py::TestIssue27Exploration::test_vision_model_missing_preprocessor_is_unhealthy -v
pytest tests_2.0/test_issue_27.py::TestIssue27Exploration::test_vision_model_invalid_preprocessor_is_unhealthy -v
pytest tests_2.0/test_issue_27.py::TestIssue27Exploration::test_vision_model_missing_tokenizer_json_is_unhealthy -v
pytest tests_2.0/test_issue_27.py::TestIssue27Exploration::test_vision_model_complete_is_healthy -v
# Or run all Issue #27 tests (including vision)
pytest tests_2.0/test_issue_27.py -m issue27 -v
Vision model mutations tested:
- `remove_preprocessor` - Deletes `preprocessor_config.json` (should be unhealthy)
- `inject_invalid_preprocessor` - Creates invalid JSON in `preprocessor_config.json` (should be unhealthy)
- `remove_tokenizer_json` - Deletes `tokenizer.json` while `tokenizer_config.json` exists (should be unhealthy)
Default model: mlx-community/Llama-3.2-11B-Vision-Instruct-4bit (~24GB)
Requirements:
- Vision model must exist in user cache (pull it first if needed)
- Python 3.10+ (mlx-vlm dependency - tests skip on Python 3.9)
- CoW (macOS/APFS, same volume) eliminates disk space concerns; otherwise ~24GB free space needed
Manual MLX Chat Model Smoke Test (2.0)
Goal: Pull a small MLX chat model, verify classification, prepare a local workspace, validate it offline, and push to a private repo while preserving chat intent. This helps issuers validate iOS-focused workflows.
Model choice (example):
- `mlx-community/Qwen2.5-0.5B-Instruct-4bit` (small, chat-oriented)
Steps
- Pull (venv39): `mlxk2 pull mlx-community/Qwen2.5-0.5B-Instruct-4bit`
- Verify in cache: `mlxk2 list --health "Qwen2.5-0.5B-Instruct-4bit"`
  - Expect: Framework MLX, Type chat, capabilities include chat
- Prepare local workspace from cache (dereference symlinks):
  - Ensure `HF_HOME` points to your HF cache
  - Compute cache path: `$HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit`
  - Find latest snapshot hash under `snapshots/`
  - Copy to workspace and dereference symlinks: `rsync -aL "$HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit/snapshots/<HASH>/" ./mymodel_test_workspace/`
- Recommended README front-matter (to preserve intent on push):
  - Include YAML with tags and pipeline tag, e.g.:
    - `tags: [mlx, chat]`
    - `pipeline_tag: text-generation`
    - `base_model: <upstream_base>`
  - Keep model name containing `Instruct` or `chat` to aid chat detection
- Offline validation (no network): `mlxk2 push --check-only ./mymodel_test_workspace <org/model> --json`
  - Expect: `workspace_health.healthy: true`
- Push to private repo: `mlxk2 push --private --create ./mymodel_test_workspace <org/model> --json`
  - Re-push without changes should show `no_changes: true`
- Post-push verification: `mlxk2 list --all --health <org/model>`
  - Current limitation: Framework may show PyTorch for non-mlx-community orgs
  - This does not affect content; future M1 will parse model card tags (mlx)
Real-Model Testing (Implemented in 2.0.1)
Status: ✅ Live in 2.0.1 (Portfolio Discovery, ADR-009)
Portfolio Discovery
Auto-discovers and tests all MLX chat models in user cache.
Location: test_stop_tokens_live.py (Category 2: Live Tests)
Marker: live_stop_tokens
Usage:
# With HF_HOME: Auto-discovers all MLX chat models
export HF_HOME=/path/to/cache
pytest -m live_stop_tokens -v
# Without HF_HOME: Uses 3 predefined models (must exist in cache)
pytest -m live_stop_tokens -v # → Runs if models present, else fails
Features:
- ✅ Model Filtering: MLX + healthy + runtime_compatible + chat only
- ✅ Portfolio Discovery: Uses `mlxk list --json` to discover all qualifying models (refactored: production command, ~70 LOC eliminated)
- ✅ RAM-Aware: Progressive budgets prevent OOM (40%-70% of system RAM)
- ✅ Empirical Report: Generates `stop_token_config_report.json` with findings
- ✅ Fallback: Uses 3 predefined models (MXFP4, Qwen, Llama) if HF_HOME not set - models must exist in HF cache
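The filtering rule in the feature list (MLX + healthy + runtime_compatible + chat) can be illustrated as below. This document does not show the schema of `mlxk list --json`, so every field name in the sketch (`models`, `framework`, `health`, `runtime_compatible`, `model_type`, `name`) is a hypothetical placeholder, and `filter_chat_portfolio` is not a function from the test suite.

```python
import json
from typing import Any, Dict, List


def filter_chat_portfolio(list_json: str) -> List[str]:
    """Keep models that are MLX, healthy, runtime-compatible and chat-capable.

    All JSON keys used here are hypothetical placeholders, not the real schema.
    """
    payload: Dict[str, Any] = json.loads(list_json)
    selected = []
    for model in payload.get("models", []):
        if (
            model.get("framework") == "MLX"
            and model.get("health") == "healthy"
            and model.get("runtime_compatible") is True
            and model.get("model_type") == "chat"
        ):
            selected.append(model["name"])
    # Sorting keeps the resulting portfolio order stable between runs.
    return sorted(selected)
```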
Required models for fallback (without HF_HOME):
mlxk pull mlx-community/gpt-oss-20b-MXFP4-Q8 # ~12GB RAM
mlxk pull mlx-community/Qwen2.5-0.5B-Instruct-4bit # ~1GB RAM
mlxk pull mlx-community/Llama-3.2-3B-Instruct-4bit # ~4GB RAM
E2E Tests with Portfolio Separation (ADR-011 + Portfolio Separation)
Status: ✅ Working (Portfolio Separation complete, CLAUDE.md)
Auto-discovers and validates Server/HTTP/CLI interfaces with real models, separated into text and vision portfolios.
Location: tests_2.0/live/ (test_server_e2e.py, test_vision_server_e2e.py, test_cli_e2e.py, test_streaming_parity.py)
Marker: live_e2e
Usage:
# With HF_HOME: Auto-discovers all MLX chat models (separated into text/vision)
export HF_HOME=/path/to/cache
pytest -m live_e2e -v
# See which TEXT models will be tested
pytest tests_2.0/live/test_server_e2e.py::TestChatCompletionsBatch --collect-only -q
# See which VISION models will be tested
pytest tests_2.0/live/test_vision_server_e2e.py::TestVisionServerE2E --collect-only -q
# Show portfolios before running tests
HF_HOME=/path/to/cache python tests_2.0/show_portfolios.py
# ⚠️ IMPORTANT: Always test collection before release
pytest -m live_e2e --collect-only # Should work without errors
Architecture:
- ✅ Separate Portfolio Discovery: `discover_text_models()` and `discover_vision_models()` filter by capability
- ✅ Production Command: Uses `mlxk list --json` instead of duplicating cache logic (~70 LOC eliminated)
- ✅ Parametrized Tests: text_XX (23 text models), vision_XX (3 vision models) - deterministic indices
- ✅ Independent RAM Formulas: Text uses 1.2x multiplier, Vision uses 0.70 threshold (ADR-016)
- ✅ Clean Lifecycle: Each test gets its own server instance (45s timeout for MLX cleanup)
- ✅ Disjoint Portfolios: No model appears in both text and vision portfolios
- ✅ Current result: 136/136 tests passing (23 text + 3 vision models; deterministic text_XX/vision_XX indices replace discovered_XX)
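The deterministic indices and the disjointness guarantee from the list above can be pictured with a small sketch. `assign_keys` and `assert_disjoint` are hypothetical helpers, and sorting by model name is an assumption about how stable ordering is achieved. The real logic lives in the portfolio fixtures.

```python
from typing import Dict, List


def assign_keys(prefix: str, model_names: List[str]) -> Dict[str, str]:
    """Map zero-padded keys (e.g. text_00) to model names in sorted order."""
    return {f"{prefix}_{i:02d}": name for i, name in enumerate(sorted(model_names))}


def assert_disjoint(text_models: List[str], vision_models: List[str]) -> None:
    """Raise if any model appears in both portfolios."""
    overlap = set(text_models) & set(vision_models)
    if overlap:
        raise ValueError(f"models in both portfolios: {sorted(overlap)}")
```

Because the keys depend only on the sorted names, the same cache contents always produce the same test IDs.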
Tests Covered:
- Text Portfolio: Server health/metadata, chat completions (batch/streaming), text completions, CLI run, streaming parity, stop tokens
- Vision Portfolio: Multimodal chat (Base64 images), SSE graceful degradation, text-only on vision models, Vision→Text switching
RAM-Aware Model Selection (Portfolio Separation)
Implementation: calculate_text_model_ram_gb(), calculate_vision_model_ram_gb(), get_safe_ram_budget_gb(), should_skip_model()
Text Model RAM Calculation:
- Formula: `size_bytes * 1.2` (accounts for KV cache + inference overhead)
- Progressive budgets apply (40%-70% based on system RAM)
Vision Model RAM Calculation:
- Formula: `size_bytes` (no multiplier) with 0.70 threshold gate
- Vision Encoder overhead: Crashes above 70% system RAM (ADR-016)
- Models >70% → `ram_needed_gb = float('inf')` (auto-skip)
Progressive RAM Budgets (Text Models):
| System RAM | Budget | Available for Models |
|---|---|---|
| 16GB | 40% | 6.4GB |
| 32GB | 50% | 16GB |
| 64GB | 60% | 38.4GB |
| 96GB+ | 70% | 67GB+ |
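The budget table and the 1.2x text formula can be sketched as follows. The function names are illustrative rather than the signatures of `calculate_text_model_ram_gb()` or `get_safe_ram_budget_gb()`. Two assumptions are made: systems between two table rows fall into the lower tier, and bytes are converted to GB with 1024^3.

```python
def budget_fraction(system_ram_gb: float) -> float:
    """Return the share of system RAM usable for text models (tiers from the table)."""
    if system_ram_gb >= 96:
        return 0.70
    if system_ram_gb >= 64:
        return 0.60
    if system_ram_gb >= 32:
        return 0.50
    return 0.40  # 16GB tier; also assumed for anything below 32GB


def safe_budget_gb(system_ram_gb: float) -> float:
    """RAM budget in GB available for a text model."""
    return system_ram_gb * budget_fraction(system_ram_gb)


def text_model_ram_gb(size_bytes: int) -> float:
    """Estimated RAM need: on-disk size times 1.2 (KV cache + inference overhead)."""
    return size_bytes * 1.2 / (1024 ** 3)
```

For example, `safe_budget_gb(64)` gives the 38.4GB shown in the table, and a model with 4 GiB of weights is estimated at about 4.8GB.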
Vision Model Threshold (0.70 = 70%):
| System RAM | 70% Threshold | Example Vision Model |
|---|---|---|
| 64GB | 44.8GB | Llama-3.2-11B-Vision (5.6GB) ✅ RUN |
| 64GB | 44.8GB | Llama-3.2-90B-Vision (46.4GB) ⏭️ SKIP |
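The vision gate in the table can be expressed in a few lines. This is a sketch under stated assumptions: `vision_model_ram_gb` is an illustrative stand-in for `calculate_vision_model_ram_gb()`, whose real signature is not shown here, and sizes are passed in GB for simplicity.

```python
def vision_model_ram_gb(size_gb: float, system_ram_gb: float, threshold: float = 0.70) -> float:
    """Return the model size unchanged, or infinity if it exceeds the threshold.

    Returning float('inf') means any later budget comparison skips the model.
    """
    if size_gb > threshold * system_ram_gb:
        return float("inf")
    return size_gb
```

On a 64GB system the gate sits at 44.8GB, so a 5.6GB model passes through unchanged while a 46.4GB model becomes `inf`.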
Rationale:
- Text: OS overhead is ~4-6GB (constant), larger systems have more headroom
- Vision: Vision Encoder overhead is unpredictable, conservative 70% gate prevents OOM crashes
Behavior:
- Models exceeding budget/threshold → Auto-skipped
- Skip reason: "Model requires XGB but only YGB available" or "Vision model exceeds 70% threshold"
- Empirical report tracks skipped models
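The skip behavior can be sketched as a single decision function. `skip_decision` is a hypothetical helper, not the actual `should_skip_model()`. The reason strings follow the two messages quoted above, and the number formatting is an assumption.

```python
import math
from typing import Optional, Tuple


def skip_decision(ram_needed_gb: float, budget_gb: float) -> Tuple[bool, Optional[str]]:
    """Return (skip, reason); a model exactly at the budget still runs."""
    if math.isinf(ram_needed_gb):
        # Vision models over the 70% gate are reported as infinite RAM need.
        return True, "Vision model exceeds 70% threshold"
    if ram_needed_gb > budget_gb:
        return True, (
            f"Model requires {ram_needed_gb:.1f}GB "
            f"but only {budget_gb:.1f}GB available"
        )
    return False, None
```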
Example (64GB system):
# TEXT MODELS (1.2x multiplier, 60% budget = 38.4GB)
# Qwen-0.5B (0.6GB RAM) → ✅ RUN
# Llama-3.2-3B (4.8GB RAM) → ✅ RUN
# Mistral-7B (10GB RAM) → ✅ RUN
# Mixtral-8x7B (38.4GB RAM) → ✅ RUN (exactly at budget)
# VISION MODELS (no multiplier, 70% threshold = 44.8GB)
# Llama-3.2-11B-Vision (5.6GB, 8.75% ratio) → ✅ RUN
# pixtral-12b (12.6GB, 19.6% ratio) → ✅ RUN
# Llama-3.2-90B-Vision (46.4GB, 72.5% ratio) → ⏭️ SKIP (exceeds 70%)
max_tokens Strategy: Vision vs Text (Session 31)
Problem: Vision and text models have fundamentally different context management strategies.
Text Models (MLXRunner):
- Shift-Window Context: Maintain conversation history in context buffer
- Server Default: `context_length / 2` (reserve half for history, half for generation)
- CLI Default: `context_length` (full context, no reservation)
- Example: Llama-3.2-3B (128K context) → Server: 64K max_tokens
- Implementation: `get_effective_max_tokens(runner, requested_max_tokens, server_mode)`
Vision Models (VisionRunner):
- Stateless Processing: Each request is independent (Metal memory limitations prevent context preservation)
- No Shift-Window: History not maintained in model context
- Server/CLI Default: `2048` tokens (conservative, works for all vision models)
- Rationale:
  - No need for `/2` division (no history to reserve)
  - Vision inference is slow → 2048 adequate for image descriptions
  - Prevents accidentally generating 64K+ tokens on large-context models
- Example: Llama-3.2-11B-Vision (128K context) → Default: 2048 max_tokens
- Implementation: `get_effective_max_tokens_vision(runner, requested_max_tokens)`
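The defaults described for both runner types can be summarized in a short sketch. These functions are simplified stand-ins that take a plain `context_length` instead of a runner object. How an explicitly requested value is handled (clamped to the context for text, passed through for vision) is an assumption, since this section only documents the defaults.

```python
from typing import Optional

VISION_DEFAULT_MAX_TOKENS = 2048


def effective_max_tokens_text(
    context_length: int, requested: Optional[int], server_mode: bool
) -> int:
    """Text default: half the context in server mode, full context in the CLI."""
    default = context_length // 2 if server_mode else context_length
    if requested is None:
        return default
    # Assumption: explicit requests are clamped to the model's context length.
    return min(requested, context_length)


def effective_max_tokens_vision(requested: Optional[int]) -> int:
    """Vision default: fixed 2048 tokens, independent of context length."""
    # Assumption: an explicit request is used as given.
    return VISION_DEFAULT_MAX_TOKENS if requested is None else requested
```

With a 128K context (131072 tokens), the text server default comes out at 65536 and the CLI default at 131072, while the vision default stays at 2048.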
Future (Phase 1c - Batch Processing):
- Vision: Processes 24 images → Batched stateless (each image independent)
- Text: Receives ALL vision outputs → Full shift-window context for complex queries
- Example: "Compare Image 1 and Image 15" requires text model with full history
Test Updates (Session 31):
- E2E Vision tests: Updated from `50-100` → `2048` tokens
- Reflects realistic server defaults (no artificial limits)
- Prevents test failures from truncated responses
Text Portfolio E2E Tests
Status: ✅ Complete (Portfolio Separation)
Location: tests_2.0/live/test_server_e2e.py
Fixture: text_portfolio (provides text-only models)
Parametrization: text_model_key (text_00, text_01, ..., text_22)
Test Classes:
- TestServerHealthEndpoints - Basic server functionality
  - `test_health_endpoint` - Server liveness check
  - `test_v1_models_list` - Model metadata endpoint
- TestChatCompletionsBatch - Non-streaming chat (parametrized)
  - `test_chat_completions_batch[text_XX]` - OpenAI-compatible batch responses
  - Validates: Response structure, stop token filtering (Issue #32)
  - 23 tests (one per text model in portfolio)
- TestChatCompletionsStreaming - SSE streaming chat (parametrized)
  - `test_chat_completions_streaming[text_XX]` - Server-Sent Events format
  - Validates: SSE format, chunk structure, stop token filtering, completion
  - 23 tests (one per text model in portfolio)
- TestCompletionsBatch - Non-streaming text completion
  - `test_completions_batch_basic` - Basic `/v1/completions` endpoint
- TestCompletionsStreaming - SSE streaming text completion
  - `test_completions_streaming_basic` - Streaming text completions
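The streaming tests above check SSE format and chunk structure. A minimal sketch of that kind of check is shown below, assuming the OpenAI-style stream layout (`data: {json}` lines terminated by `data: [DONE]`, with text in `choices[].delta.content`). `collect_sse_text` is a hypothetical helper and not the parser in `sse_parser.py`.

```python
import json
from typing import Iterable, List


def collect_sse_text(lines: Iterable[str]) -> str:
    """Concatenate delta content from OpenAI-style chat completion SSE lines."""
    parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

A test can then assert that the collected text is non-empty and contains no stop tokens.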
RAM Gating:
- Uses `calculate_text_model_ram_gb()` (1.2x multiplier)
- Progressive budgets: 40%-70% based on system RAM
- Models exceeding budget auto-skipped with clear reason
Example:
# 64GB system → 38.4GB budget (60%)
# text_00: Qwen-0.5B (0.6GB) → ✅ RUN
# text_10: Mistral-7B (10GB) → ✅ RUN
# text_22: Mixtral-8x7B (38.4GB) → ✅ RUN (at budget limit)
Vision Portfolio E2E Tests
Status: ✅ Complete (Portfolio Separation)
Location: tests_2.0/live/test_vision_server_e2e.py
Fixture: vision_portfolio (provides vision-capable models)
Parametrization: vision_model_key (vision_00, vision_01, vision_02)
Test Class: TestVisionServerE2E
- test_single_image_chat_completion[vision_XX] (parametrized)
  - Multimodal chat with Base64 image data
  - Validates: Vision model describes image correctly
  - OpenAI Vision API format: `content: [{"type": "text"}, {"type": "image_url"}]`
  - 3 tests (one per vision model in portfolio)
- test_streaming_graceful_degradation[vision_XX] (parametrized)
  - Vision request with `stream=True`
  - Validates: SSE emulation (mlx-vlm doesn't support true streaming)
  - Returns HTTP 200 with SSE events (not HTTP 400 rejection)
  - 3 tests (one per vision model in portfolio)
- test_text_request_still_works_on_vision_model[vision_XX] (parametrized)
  - Text-only request on vision model server
  - Validates: Vision models handle pure text requests
  - No image data, just string content
  - 3 tests (one per vision model in portfolio)
- test_vision_to_text_model_switch_filters_images (special integration test)
  - Tests Vision→Text model switching with conversation history
  - Server filters `image_url` content for text models
  - Validates: Multimodal history filtering (Session 26, VISION-MULTIMODAL-HISTORY-ISSUE.md)
  - 1 test (uses both portfolios)
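The request shape used by these tests and the idea behind history filtering can be sketched together. The message layout follows the OpenAI Vision format (a `text` part plus an `image_url` part carrying a Base64 data URL). `image_message` and `strip_images` are hypothetical helpers, and the filtering strategy shown (keep text parts, drop image parts) is an assumption about what the server does when switching to a text model.

```python
import base64
from typing import Any, Dict, List


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> Dict[str, Any]:
    """Build a user message with one text part and one Base64 image part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


def strip_images(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reduce multimodal content lists to plain text for a text-only model."""
    filtered = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, list):
            texts = [part.get("text", "") for part in content if part.get("type") == "text"]
            content = "\n".join(texts)
        filtered.append({**msg, "content": content})
    return filtered
```

Messages whose content is already a plain string pass through unchanged, which matches the text-only request case above.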
RAM Gating:
- Uses `calculate_vision_model_ram_gb()` (0.70 threshold, no multiplier)
- Models >70% system RAM → `ram_needed_gb = float('inf')` (auto-skip)
- Conservative gate prevents Vision Encoder OOM crashes
Example:
# 64GB system → 44.8GB threshold (70%)
# vision_00: Llama-3.2-11B-Vision (5.6GB, 8.75%) → ✅ RUN
# vision_01: pixtral-12b (12.6GB, 19.6%) → ✅ RUN
# vision_02: Llama-3.2-90B-Vision (46.4GB, 72.5%) → ⏭️ SKIP (exceeds 70%)
Why Separate Portfolios:
- Text and Vision models have different RAM characteristics
- Vision models crash unpredictably above 70% (Vision Encoder overhead)
- Deterministic test indices (text_XX, vision_XX) replace flaky discovered_XX
- Tests validate model-type-specific behavior (e.g., text-only on vision models)
Appendix
Test Environment Variables Reference
This section documents environment variables used exclusively for testing and development. For user-facing configuration, see the "Configuration Reference" section in README.md.
Test Control Variables
These variables control test behavior and should only be set when running the test suite:
| Variable | Description | Usage |
|---|---|---|
| `MLXK2_DEBUG` | Enable debug logging in server internals | Set to `1` for verbose debugging during test failures |
| `MLXK2_STRICT_TEST_DELETE` | Enforce strict deletion checks in `rm` tests | Set to `1` in test suite to validate error handling |
| `HF_HUB_DISABLE_PROGRESS_BARS` | Disable HuggingFace download progress bars | Set automatically by MLX Knife; don't set manually |
Live Test Environment Variables
These variables enable optional live tests that interact with real models or external services:
| Variable | Description | Required For |
|---|---|---|
| `MLXK2_LIVE_PUSH` | Enable live push tests (requires HF credentials) | `pytest -m live_push` |
| `MLXK2_LIVE_REPO` | Target repository for push tests | `pytest -m live_push` |
| `MLXK2_LIVE_WORKSPACE` | Workspace path for push tests | `pytest -m live_push` |
| `MLXK2_LIVE_CLONE` | Enable live clone tests | `pytest -m live_clone` |
| `MLXK2_LIVE_CLONE_MODEL` | Model to clone in live tests | `pytest -m live_clone` |
| `MLXK2_LIVE_CLONE_WORKSPACE` | Clone destination path | `pytest -m live_clone` |
| `MLXK2_USER_HF_HOME` | User cache path for read-only tests | `pytest -m issue27` or `pytest -m live_run` |
| `MLXK2_ISSUE27_MODEL` | Specific model for Issue #27 tests | `pytest -m issue27` |
| `MLXK2_ISSUE27_INDEX_MODEL` | Index-based model for Issue #27 | `pytest -m issue27` |
| `MLXK2_SUBSET_COUNT` | Limit Issue #27 test count | `pytest -m issue27` |
| `MLXK2_BOOTSTRAP_INDEX` | Auto-download model for Issue #27 | `pytest -m issue27` |
Example:
# Enable debug logging for troubleshooting
MLXK2_DEBUG=1 pytest tests_2.0/test_server_base.py -v
# Run live push tests with credentials
MLXK2_LIVE_PUSH=1 \
MLXK2_LIVE_REPO="test-org/test-model" \
MLXK2_LIVE_WORKSPACE="/tmp/workspace" \
HF_TOKEN=hf_... \
pytest -m live_push -v
Important: Never set these variables in production environments or user-facing documentation. They are intended exclusively for the test suite.
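How a live test might gate itself on these variables can be sketched as follows. `live_push_settings` is a hypothetical helper, not code from the test suite. Treating `1` as the enabling value and requiring `HF_TOKEN` follow the example above but are otherwise assumptions.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_PUSH_VARS = ("MLXK2_LIVE_REPO", "MLXK2_LIVE_WORKSPACE", "HF_TOKEN")


def live_push_settings(env: Mapping[str, str] = os.environ) -> Optional[Dict[str, str]]:
    """Return the push test settings, or None if live push is not fully configured."""
    if env.get("MLXK2_LIVE_PUSH") != "1":
        return None  # live push tests are opt-in
    if any(not env.get(name) for name in REQUIRED_PUSH_VARS):
        return None  # enabled but incomplete: treat as not runnable
    return {name: env[name] for name in REQUIRED_PUSH_VARS}
```

A test could call this helper and skip itself when it returns `None`.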
Complete Test File Structure (2.0.4-beta.1)
tests_2.0/
├── __init__.py
├── conftest.py # Isolated test cache (HF_HOME override), safety sentinel, core fixtures
├── conftest_runner.py # Runner-specific fixtures/mocks
├── show_portfolios.py # Diagnostic tool: Display text/vision portfolios with RAM estimates
├── stubs/ # Minimal mlx/mlx_lm stubs for unit/spec tests
│ ├── mlx/
│ │ └── core.py
│ └── mlx_lm/
│ ├── __init__.py
│ ├── generate.py
│ └── sample_utils.py
├── spec/ # JSON API spec/contract validation
│ ├── test_cli_commands_json_flag.py # CLI JSON flag behavior
│ ├── test_cli_version_output.py # Version command JSON shape
│ ├── test_code_outputs_validate_against_schema.py # Code outputs validate against schema
│ ├── test_push_error_matches_schema.py # Push error output matches schema
│ ├── test_push_output_matches_schema.py # Push success output matches schema
│ ├── test_spec_doc_examples_validate.py # Docs examples validate against JSON schema
│ └── test_spec_version_sync.py # Code/docs version consistency check
├── live/ # Opt-in live tests (markers)
│ ├── __init__.py
│ ├── conftest.py # Shared fixtures for live E2E tests (text_portfolio, vision_portfolio, pytest_generate_tests hook)
│ ├── server_context.py # LocalServer context manager for E2E testing (45s timeout for MLX cleanup)
│ ├── sse_parser.py # SSE parsing utilities for streaming validation
│ ├── test_utils.py # Portfolio Discovery (text/vision separation), RAM calculation modularization, RAM gating utilities
│ ├── test_cli_e2e.py # CLI integration E2E tests (ADR-011, parametrized)
│ ├── test_cli_pipe_live.py # Pipe-mode E2E (stdin '-', JSON interactive error, list→run pipe) using first eligible model
│ ├── test_clone_live.py # Live clone flow (requires MLXK2_LIVE_CLONE, HF_TOKEN)
│ ├── test_list_human_live.py # Live list/health against user cache (requires HF_HOME)
│ ├── test_portfolio_fixtures.py # Portfolio separation validation tests (7 tests: fixture behavior, disjoint check)
│ ├── test_push_live.py # Live push flow (requires MLXK2_LIVE_PUSH, HF_TOKEN)
│ ├── test_server_e2e.py # Server E2E tests with TEXT models (ADR-011 + Portfolio Separation, parametrized: text_XX)
│ ├── test_streaming_parity.py # Streaming vs batch parity tests (Issue #20, ADR-011, parametrized)
│ ├── test_vision_e2e_live.py # Vision CLI E2E tests with real models (ADR-012, 5 deterministic vision queries)
│ └── test_vision_server_e2e.py # Vision Server E2E tests with VISION models (ADR-012 Phase 3 + Portfolio Separation, parametrized: vision_XX)
├── test_adr004_error_logging.py # ADR-004 error logging and redaction (tokens, paths)
├── test_capabilities.py # Probe/Policy architecture (ADR-012, ADR-016, Session 18-19, 45 tests)
├── test_cli_log_json_flag.py # CLI --log-json flag behavior and JSON log format
├── test_cli_push_args.py # Push CLI args and JSON error/output handling (offline)
├── test_cli_run_exit_codes.py # CLI exit codes + pipe/JSON regressions, stdin '-', non-TTY batch, interactive JSON error, SIGPIPE, BrokenPipeError
├── test_cli_run_wrapper.py # mlx-run wrapper argv injection
├── test_clone_operation.py # Clone operations with APFS optimization
├── test_ctrl_c_handling.py # SIGINT handling during run/interactive flows
├── test_detection_readme_tokenizer.py # README/tokenizer-based framework detection
├── test_edge_cases_adr002.py # Naming/health edge cases (ADR-002)
├── test_health_multifile.py # Multi-file health completeness (index vs pattern)
├── test_health_vision.py # Vision model health checks (ADR-012 Phase 2, preprocessor_config.json validation)
├── test_human_output.py # Human rendering of list/health views
├── test_integration.py # Model resolution and health integration
├── test_interactive_mode.py # Interactive CLI mode prompts/history/streaming
├── test_interruption_recovery.py # Recovery semantics after interruption (flag reset)
├── test_issue_27.py # Health policy exploration with real models (marker: issue27)
├── test_issue_30_preflight.py # Preflight for gated/private/not-found repos (Issue #30)
├── test_issue_37_private_org_regression.py # Issue #37 private/org MLX model detection (marker: live_run)
├── test_json_api_list.py # JSON API list contract (shape/fields)
├── test_json_api_show.py # JSON API show contract (base/files/config)
├── test_legacy_formats.py # Legacy model format detection (Issue #37)
├── test_model_naming.py # Conversion rules, bijection, parsing
├── test_multimodal_filtering.py # Multimodal history filtering (Vision→Text model switching, Session 27)
├── test_portfolio_discovery.py # Portfolio separation discovery tests (10 tests: text/vision filtering, RAM formulas)
├── test_push_dry_run.py # Push dry-run diff planning (added/modified/deleted)
├── test_push_extended.py # Extended push: no-op vs commit, branch/retry, .hfignore
├── test_push_minimal.py # Minimal push scenarios (offline)
├── test_push_workspace_check.py # Push check-only: workspace validation without network
├── test_ram_calculation.py # RAM calculation unit tests (11 tests: text 1.2x, vision 0.70 threshold, system memory)
├── test_robustness.py # Robustness for rm/pull/disk/timeout/concurrency
├── test_run_complete.py # End-to-end run command (stream/batch/params)
├── test_run_vision.py # Vision runner unit tests (ADR-012 Phase 1b, VisionRunner routing, default prompt)
├── test_runner_core.py # MLXRunner core generation/memory/stop tokens
├── test_runtime_compatibility_reason_chain.py # Runtime compatibility reason field decision chain (Issue #36)
├── test_server_api_minimal.py # Minimal OpenAI-compatible server endpoints (SSE, JSON)
├── test_server_api.py.disabled # Disabled server API tests (WIP/expanded scenarios)
├── test_server_models_and_errors.py # Server model loading and error handling
├── test_server_streaming_minimal.py # Server SSE streaming functionality
├── test_server_token_limits_api.py # Server token limit enforcement
├── test_server_vision.py # Vision server unit tests (ADR-012 Phase 3, 17 tests: ChatMessage, image detection, helpers)
├── test_stop_tokens_live.py # Stop token validation with real models (marker: live_stop_tokens, ADR-009)
├── test_token_limits.py # Dynamic token calculation; server vs run policies
├── test_vision_adapter.py # Vision HTTP adapter unit tests (46 tests: Base64 decoding, OpenAI format parsing, sequential images, image ID persistence)
└── test_vision_exif.py # EXIF extraction tests (ADR-017 Phase 1, 8 tests: GPS, DateTime, Camera, collapsible table, privacy controls)
Known Model Quality Issues
Models with documented quality issues discovered during testing. Tests will fail when these issues occur (no workarounds).
Multiple EOS Token Generation
| Model | Token IDs | Status | Evidence Date |
|---|---|---|---|
| Phi-3-mini-4k-instruct-4bit | `32007=<\|end\|>`, `32000=<\|endoftext\|>` | Fixed 2.0.2 | 2025-11-13 |
Issue: Model generates multiple EOS tokens instead of stopping at first.
Detection: Use mlxk run --verbose to see token generation details.
Fix: MLX-Knife 2.0.2+ filters by earliest position in text (not list order).
Example usage:
# Manual debugging with verbose mode
mlxk run mlx-community/Phi-3-mini-4k-instruct-4bit "Write one sentence about cats." --verbose
# Look for:
# [DEBUG] Token generation analysis:
# [DEBUG] Last 3 tokens: ["29889='.'", "32007='<|end|>'", "32000='<|endoftext|>'"]
# [DEBUG] ⚠️ WARNING: Multiple EOS tokens detected (2) - model quality issue
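The "earliest position, not list order" rule from the fix can be illustrated with a small sketch. `truncate_at_earliest_stop` is a hypothetical function written for this explanation. It shows the technique only and is not the MLX Knife implementation.

```python
from typing import Iterable


def truncate_at_earliest_stop(text: str, stop_tokens: Iterable[str]) -> str:
    """Cut text at whichever stop token occurs first, regardless of list order."""
    earliest = len(text)
    for token in stop_tokens:
        if not token:
            continue  # ignore empty strings, which would match at position 0
        pos = text.find(token)
        if pos != -1 and pos < earliest:
            earliest = pos
    return text[:earliest]
```

With the output `Cats purr.<|end|><|endoftext|>` and the stop list `["<|endoftext|>", "<|end|>"]`, cutting at the first list entry would leave `<|end|>` in the result, while cutting at the earliest position returns `Cats purr.`.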
Version History
2.0.4-beta.1 (Release Candidate, 2025-12-16)
- ✅ Vision Server integration (ADR-012 Phase 3, Session 24):
  - Backend-aware `get_or_load_model()`: Loads MLXRunner OR VisionRunner based on policy
  - `ChatMessage.content` extended: `Union[str, List[Dict]]` for OpenAI Vision format
  - `_handle_vision_chat_completion()`: Vision request handler with cache reuse
  - `_handle_text_chat_completion()`: Supports both runner types
  - Streaming rejection (HTTP 400) for vision models
  - 17 new unit tests in `test_server_vision.py`
  - 3 new E2E tests in `test_vision_server_e2e.py`
- ✅ Test count: 430 passed, 41 skipped (17 new tests from Session 24)
- ✅ Live E2E: 105 passed, 5 failed, 20 skipped
- Known failures: discovered_04/25 (Vision models) HTTP 404 in generic E2E tests (not vision-specific tests)
- Fixed: test_chess_position_e6 (increased max_tokens, relaxed assertion for "black" OR "king")
2.0.3 (2025-11-17)
- ✅ Test updates for stderr separation: 4 test files modified to verify errors go to stderr (human mode)
  - `test_interactive_mode.py`: 2 tests patching stderr for ERROR messages
  - `test_run_complete.py`: 2 tests validating stderr error handling
  - `test_cli_run_exit_codes.py`: 3 tests checking stdout/stderr separation in JSON mode
  - `test_cli_push_args.py`: 2 tests verifying push stdout/stderr returns
- ✅ Benchmark reporting infrastructure: 4 live test files updated with benchmark fixtures
  - `live/conftest.py`: +225 lines - `report_benchmark()` fixture, `_parse_model_family()` helper
  - `live/test_cli_e2e.py`: Benchmark metadata reporting (family, variant, stop_tokens)
  - `live/test_server_e2e.py`: Benchmark metadata + performance (usage) data
  - `live/test_streaming_parity.py`: Portfolio Discovery refactoring (uses `mlxk list --json`)
- ✅ Interactive mode reasoning control: 2 new tests added (review_report.md)
  - `test_interactive_mode.py`: 1 test for `hide_reasoning` parameter passing
  - `test_run_complete.py`: 1 test for `TestRunReasoningControl` class
- ✅ Test count: 308 passed, 35 skipped (+2 tests from review_report.md fixes)
2.0.2 (2025-11-14)
- ✅ Test infrastructure hardening (TOKENIZERS_PARALLELISM, active polling, gc.collect())
- ✅ Portfolio Discovery validation complete (73/81 E2E tests, 17 models discovered)
- ✅ Sequential execution warning added (TESTING-DETAILS.md CRITICAL section)
- ✅ RAM logging infrastructure added (server_context.py, for future benchmark tooling)
2.0.2-dev (2025-11-13)
- ✅ Stop token ordering bugs fixed (batch + streaming modes)
- ✅ E2E test suite completed (ADR-011)
- ✅ 72/80 E2E tests passing baseline (15 models tested)
- ✅ TESTING.md restructuring (timeless policy + version-specific details split)
2.0.1 (2025-11-28)
- ✅ Portfolio Discovery (Issue #32, ADR-009)
- ✅ CLI exit code propagation (Issue #38)
- ✅ 306/306 tests passing
2.0.0 (2025-11-06)
- ✅ Complete rewrite with Apache 2.0 license
- ✅ JSON-first architecture
- ✅ Isolated test system with sentinel protection
- ✅ MLX stubs for fast testing without model downloads
- ✅ 3-category test strategy
MLX-Knife 2.0.4-beta.1 Testing Details