mirror of https://github.com/cloudstack-llc/mlx-knife.git synced 2026-07-01 20:44:14 -04:00

Local Test eedb91b75c Feat: add experimental push (2.0.0-alpha.2)

- Push (upload-only): quiet JSON by default; capture hub logs in data.hf_logs
  - No-op detection aligned to hub signal; clear commit fields; uploaded_files_count=0
  - Add --dry-run (plan vs remote) and --check-only (offline preflight); merge .hfignore; extend
  default ignores
  - Human output: concise; --verbose shows commit URL; JSON shape unchanged
  - Tests: add offline dry-run cases; live push remains opt-in (wet/live_push)
  - Docs: README push section updated; TESTING.md reference + mini-matrix;
  - Changelog: add 2.0.0-alpha.2; note Issue #31 under 1.1.1 pending
  - Spec: keep schema stable (0.1.3); CLI/version docs consistent

2025-09-05 22:42:39 +02:00

MLX Knife Testing Guide

Current Status

✅ 150/150 tests passing (August 2025) - STABLE RELEASE 🚀
✅ Apple Silicon verified (M1/M2/M3)
✅ Python 3.9-3.13 compatible
✅ Production ready - comprehensive testing with real model execution
✅ Isolated test system - user cache stays pristine with temp cache isolation
✅ 3-category test strategy - optimized for performance and safety

Quick Start (2.0 Default)

# Install package + tests
pip install -e ".[test]"   # quoted so zsh (the macOS default shell) does not glob the brackets

# Download test model (optional - most tests use isolated cache)
mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit

# Run 2.0 tests (default: tests_2.0/)
pytest -v

# Run legacy 1.x suite explicitly (not maintained here)
pytest tests/ -v

# Fast unit tests only
pytest tests/unit/

# Before committing
ruff check mlx_knife/ --fix && mypy mlx_knife/ && pytest

Why Local Testing?

MLX Knife requires Apple Silicon hardware and real MLX models for comprehensive testing:

Hardware Requirement: MLX framework only runs on Apple Silicon (M1/M2/M3)
Model Requirement: Tests use actual models (4GB+) for realistic validation
Industry Standard: Local testing is normal for MLX projects
Quality Assurance: Real hardware testing ensures actual functionality

This approach ensures our tests reflect real-world usage, not mocked behavior.

Test Structure

2.0 Test Structure (default)

tests_2.0/
├── __init__.py
├── conftest.py                      # Isolated test cache, fixtures
├── test_edge_cases_adr002.py        # Edge-case naming, ADR-002
├── test_health_multifile.py         # Multi-file health completeness
├── test_integration.py              # Model resolution, health integration
├── test_issue_27.py                 # Health policy consistency
├── test_model_naming.py             # Pattern/@hash parsing and resolution
├── test_robustness.py               # General robustness tests
├── test_json_api_list.py            # JSON API v0.1.2 (list contract)
├── test_json_api_show.py            # JSON API v0.1.2 (show contract)
└── spec/
    ├── test_cli_version_output.py   # version command JSON shape
    ├── test_spec_doc_examples_validate.py # docs examples vs schema (jsonschema)
    └── test_spec_version_sync.py    # docs version == code constant

Note: This tree is illustrative (not exhaustive). Push-related tests are documented in the dedicated "Push Testing (2.0)" section below to avoid drift.

Push Testing (2.0)

This section summarizes what our test suite covers for the experimental push feature and what still requires live/manual checks.

Reference: Push CLI and JSON

Usage: mlxk2 push <local_dir> <org/model> --private [--create] [--branch main] [--commit <msg>] [--check-only] [--dry-run] [--json] [--verbose]
Args:
- --private (required in alpha): Safety gate to avoid public uploads.
- --create: Create the repository if it does not exist (model repo).
- --branch: Target branch, default main.
- --commit: Commit message, default "mlx-knife push".
- --check-only: Analyze workspace locally; no network call; returns data.workspace_health.
- --dry-run: Compare local workspace to the remote branch and summarize changes without uploading (requires repo read access).
- --json: Print JSON response; in JSON mode, logs/progress are suppressed by default.
- --verbose: Human mode — append details (e.g., commit URL). In JSON mode, only toggles console log verbosity; the JSON payload is unchanged.
JSON fields (data):
- repo_id: string — target org/model.
- branch: string — target branch.
- commit_sha: string|null — commit id; null when no_changes:true or on noop.
- commit_url: string|null — link to commit; null when no commit created.
- repo_url: string — https://huggingface.co/<org/model>.
- uploaded_files_count: int|null — number of changed files; set to 0 on no_changes:true.
- local_files_count: int|null — approximate local file count scanned.
- no_changes: boolean — true when hub reports an empty commit (preferred signal) or no file operations are detected.
- created_repo: boolean — true when repo was created (with --create).
- change_summary: {added:int, modified:int, deleted:int} — optional; derived from hub response when available.
- message: string|null — short human hint; mirrors hub on no‑op.
- hf_logs: string[] — buffered hub log lines (not printed in JSON mode unless --verbose).
- experimental: true and disclaimer: string — feature state markers.
- workspace_health: {...} — present only with --check-only:
  - healthy: bool, anomalies: [], config, weights.index, weights.pattern_complete, etc.
- dry_run: true — present only with --dry-run.
- dry_run_summary: {added:int, modified:int, deleted:int} — present with --dry-run.
- would_create_repo: bool / would_create_branch: bool — planning hints when target does not exist.
Error types (error.type):
- dependency_missing — huggingface-hub not installed.
- auth_error — missing HF_TOKEN (unless --check-only).
- workspace_not_found — local_dir missing/not a directory.
- repo_not_found — repo missing without --create.
- upload_failed — hub returned an error (e.g., 403/permission).
- push_operation_failed — unexpected internal failure wrapper.
Exit codes: success → 0; any status:error → 1.
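
The exit-code rule and the error types above can be sketched as a small helper for test assertions. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the sample payload assumes the error object carries a `message` next to `type` and that `data` is null on errors.

```python
import json

# Error types as listed in the reference above.
KNOWN_ERROR_TYPES = {
    "dependency_missing",
    "auth_error",
    "workspace_not_found",
    "repo_not_found",
    "upload_failed",
    "push_operation_failed",
}


def exit_code_for(response: dict) -> int:
    """Return 0 for a success response and 1 for any status:error response."""
    return 1 if response.get("status") == "error" else 0


def error_type_of(response: dict):
    """Return error.type for an error response, or None when error is null."""
    error = response.get("error")
    if not error:
        return None
    return error.get("type")


# Hypothetical error payload; the "message" field and "data": null are assumptions.
raw = (
    '{"status": "error", "command": "push", "data": null, '
    '"error": {"type": "auth_error", "message": "HF_TOKEN not set"}}'
)
resp = json.loads(raw)
print(exit_code_for(resp), error_type_of(resp))
```

A test can then assert on `error_type_of(...)` instead of matching message text, which tends to be less brittle.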

Notes on output verbosity and behavior

JSON is quiet by default: only the final JSON object is printed. Use --verbose to allow hub logs/progress to reach the console (the JSON payload remains unchanged). For assertions, prefer data.hf_logs.
Human mode is chatty by default: progress + one‑liner summary. --verbose appends the commit URL when present.
No‑changes detection: If the hub reports “No files have been modified… Skipping to prevent empty commit.”, JSON sets no_changes: true, uploaded_files_count: 0, and nulls commit_sha/commit_url. Human shows “— no changes”. This hub signal is preferred over inferring from file lists.
--dry-run human output: prints a concise plan line dry-run: +A ~M -D (modifications are an approximation and may be ~? in rare cases).

Examples (expected)

No‑op re‑push (JSON): commit_sha: null, commit_url: null, uploaded_files_count: 0, no_changes: true, message mirrors hub text, hf_logs contains hub lines.
Commit (JSON): commit_sha/commit_url populated; uploaded_files_count == sum(change_summary.values()); message summarizes counts.
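
The two expectations above can be written down as a consistency check over the `data` object. The sketch below is an assumption-laden illustration (the helper name `check_push_invariants` is hypothetical and not part of the test suite); it only encodes the rules stated in this guide and skips the commit checks for dry-runs, which never upload.

```python
def check_push_invariants(data: dict) -> list:
    """Return a list of violations; an empty list means the payload is consistent."""
    problems = []
    if data.get("no_changes"):
        # No-op: commit fields are null and nothing was uploaded.
        if data.get("commit_sha") is not None:
            problems.append("no_changes is true but commit_sha is set")
        if data.get("commit_url") is not None:
            problems.append("no_changes is true but commit_url is set")
        if data.get("uploaded_files_count") != 0:
            problems.append("no_changes is true but uploaded_files_count is not 0")
        return problems
    if data.get("dry_run"):
        # Dry-runs never upload, so the commit checks below do not apply.
        return problems
    if not data.get("commit_sha") or not data.get("commit_url"):
        problems.append("commit created but commit_sha/commit_url missing")
    summary = data.get("change_summary")
    if summary is not None:
        expected = sum(summary.values())
        if data.get("uploaded_files_count") != expected:
            problems.append("uploaded_files_count does not match change_summary")
    return problems


noop = {
    "no_changes": True,
    "commit_sha": None,
    "commit_url": None,
    "uploaded_files_count": 0,
}
print(check_push_invariants(noop))
```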

Dry-run (existing repo/branch, no changes) — JSON:

{
  "status": "success",
  "command": "push",
  "error": null,
  "data": {
    "repo_id": "org/model",
    "branch": "main",
    "commit_sha": null,
    "commit_url": null,
    "repo_url": "https://huggingface.co/org/model",
    "uploaded_files_count": 0,
    "local_files_count": 11,
    "no_changes": true,
    "created_repo": false,
    "message": "Dry-run: no changes",
    "experimental": true,
    "disclaimer": "Experimental feature (M0: upload only). No validation/filters; review results on the Hub.",
    "dry_run": true,
    "dry_run_summary": {"added": 0, "modified": null, "deleted": 0},
    "change_summary": {"added": 0, "modified": 0, "deleted": 0},
    "would_create_repo": false,
    "would_create_branch": false,
    "added_files": [],
    "deleted_files": []
  }
}

Dry-run (existing repo/branch, changes present) — JSON:

{
  "status": "success",
  "command": "push",
  "error": null,
  "data": {
    "repo_id": "org/model",
    "branch": "main",
    "commit_sha": null,
    "commit_url": null,
    "repo_url": "https://huggingface.co/org/model",
    "uploaded_files_count": 0,
    "local_files_count": 11,
    "no_changes": false,
    "created_repo": false,
    "message": "Dry-run: +2 ~? -1",
    "experimental": true,
    "disclaimer": "Experimental feature (M0: upload only). No validation/filters; review results on the Hub.",
    "dry_run": true,
    "dry_run_summary": {"added": 2, "modified": null, "deleted": 1},
    "change_summary": {"added": 2, "modified": 0, "deleted": 1},
    "would_create_repo": false,
    "would_create_branch": false,
    "added_files": ["new.txt", "weights/model.safetensors"],
    "deleted_files": ["old.txt"]
  }
}

Dry-run — Human output:

push (experimental): org/model@main — dry-run: no changes
push (experimental): org/model@main — dry-run: +2 ~? -1
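
As a rough sketch of how the plan part of these lines relates to `dry_run_summary`, the function below renders `+A ~M -D` and falls back to `?` when a count is null. The function name is hypothetical and the real formatter may differ; only the output shape is taken from the examples above.

```python
def format_dry_run_plan(summary: dict, no_changes: bool) -> str:
    """Render the plan fragment of the human dry-run line."""
    if no_changes:
        return "dry-run: no changes"

    def count(key: str) -> str:
        # A null count (e.g. modified) is shown as "?" because it is unknown.
        value = summary.get(key)
        return "?" if value is None else str(value)

    return f"dry-run: +{count('added')} ~{count('modified')} -{count('deleted')}"


print(format_dry_run_plan({"added": 2, "modified": None, "deleted": 1}, no_changes=False))
```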

Spec/Schema

The JSON API spec version and schema live in mlxk2/spec.py and docs/json-api-specification.md. The docs schema includes support for command: "push" and its fields. Keep tests in sync with those sources of truth.

Automated (offline)

Token/Workspace errors: Missing HF_TOKEN and missing workspace produce proper JSON errors.
CLI args (JSON mode): Missing positional args emit JSON errors rather than usage text.
Schema shape: Push success/error outputs validate against docs/json-api-schema.json.
No-op push: Detects no_changes: true, sets uploaded_files_count: 0, carries hub message into JSON (message/hf_logs), and human output shows "no changes" without duplicate logs.
Commit path: Extracts commit_sha, commit_url, change_summary (+/~/−), correct uploaded_files_count; human --verbose includes URL.
Repo/Branch handling: Missing repo requires --create; with --create sets created_repo: true. Missing branch is tolerated; upload creates it.
Ignore rules: .hfignore is merged with default ignores and forwarded to the hub.
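
The ignore-rule merge in the last item can be pictured with a short sketch. Everything here is illustrative: the default pattern list is made up, the helper names are hypothetical, and matching uses Python's `fnmatch` as an approximation of what the hub does with the forwarded patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical defaults for illustration; the project's real default list may differ.
DEFAULT_IGNORES = [".git/*", ".DS_Store", "__pycache__/*"]


def parse_hfignore(text: str) -> list:
    """Return the patterns from .hfignore content, skipping blanks and comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def merge_ignores(defaults, extra) -> list:
    """Combine both lists, keeping first-seen order and dropping duplicates."""
    merged = []
    for pattern in list(defaults) + list(extra):
        if pattern not in merged:
            merged.append(pattern)
    return merged


def is_ignored(rel_path: str, patterns) -> bool:
    """Approximate check: does any glob pattern match the relative path?"""
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)


hfignore_text = "# editor files\n.idea/*\n\n*.ipynb\n.DS_Store\n"
patterns = merge_ignores(DEFAULT_IGNORES, parse_hfignore(hfignore_text))
print(patterns)
```

With plain `fnmatch` semantics a directory entry needs a trailing `*` (for example `.idea/*`) to match the files below it.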

Files:

tests_2.0/test_cli_push_args.py (CLI errors and JSON outputs)
tests_2.0/test_push_extended.py (no-op vs commit, branch/repo, .hfignore, human)
tests_2.0/spec/test_push_output_matches_schema.py (schema success path)

Run (venv39):

source venv39/bin/activate && pip install -e .
pytest -q tests_2.0/test_cli_push_args.py tests_2.0/test_push_extended.py
pytest -q tests_2.0/spec/test_push_output_matches_schema.py

Live (opt-in / wet)

Purpose: sanity-check real HF behavior (auth, no-op vs commit, URLs).
Defaults: Live tests are skipped. Enable with env vars and markers.
Env:
- MLXK2_LIVE_PUSH=1
- HF_TOKEN (write-enabled)
- MLXK2_LIVE_REPO='org/model'
- MLXK2_LIVE_WORKSPACE='/abs/path/to/workspace'
Command:
- pytest -q -m wet tests_2.0/live/test_push_live.py
- or pytest -q -m live_push
Notes:
- Live test does not use --create (safety). If the repo does not exist, create it once manually.
- Manual create example: mlxk2 push --private --create "$MLXK2_LIVE_WORKSPACE" "$MLXK2_LIVE_REPO" --json
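
A minimal sketch of the opt-in gate described above, assuming the live tests skip unless `MLXK2_LIVE_PUSH=1` and the other three variables are set. The helper name is hypothetical; the actual skip logic lives in the test suite and may be structured differently.

```python
import os

# Variables that must be non-empty in addition to MLXK2_LIVE_PUSH=1.
REQUIRED_LIVE_VARS = ("HF_TOKEN", "MLXK2_LIVE_REPO", "MLXK2_LIVE_WORKSPACE")


def live_push_skip_reason(env) -> "str | None":
    """Return None when live push tests may run, otherwise a reason to skip."""
    if env.get("MLXK2_LIVE_PUSH") != "1":
        return "MLXK2_LIVE_PUSH is not set to 1"
    missing = [name for name in REQUIRED_LIVE_VARS if not env.get(name)]
    if missing:
        return "missing env vars: " + ", ".join(missing)
    return None


reason = live_push_skip_reason(os.environ)
print("run live tests" if reason is None else f"skip: {reason}")
```

Running this before a live session is a quick way to see why the live tests would be reported as SKIPPED.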

Manual Checklist (Live)

Create repo (first time): --private --create → expect created_repo: true, private repo on HF.
No-op re-push: identical workspace → no_changes: true, uploaded_files_count: 0, concise human "no changes".
Commit after change: edit a small file → push shows commit_sha, commit_url, change_summary matches expectations.
.hfignore behavior: add ignores (e.g., .idea/, .vscode/, *.ipynb) → verify excluded on HF.
Optional errors: invalid token or missing rights → JSON error (upload_failed / auth_error), clear message.

Human vs JSON:

Human output is derived from JSON only; hub logs are not printed directly.
Use --verbose with human output to append the commit URL or short message; JSON content stays the same structurally.

Manual MLX Chat Model Smoke Test (2.0)

Goal: Pull a small MLX chat model, verify classification, prepare a local workspace, validate it offline, and push to a private repo while preserving chat intent. This helps issuers validate iOS‑focused workflows.

Model choice (example)

mlx-community/Qwen2.5-0.5B-Instruct-4bit (small, chat‑oriented)

Steps

Pull (venv39):
- mlxk2 pull mlx-community/Qwen2.5-0.5B-Instruct-4bit
Verify in cache:
- mlxk2 list --health "Qwen2.5-0.5B-Instruct-4bit"
- Expect: Framework MLX, Type chat, capabilities include chat
Prepare local workspace from cache (dereference symlinks):
- Ensure HF_HOME points to your HF cache (optional, but recommended)
- Compute cache path: $HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit
- Find latest snapshot hash under snapshots/
- Copy to workspace and dereference symlinks:
  - rsync -aL "$HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit/snapshots/<HASH>/" ./mymodel_test_workspace/
Recommended README front‑matter (to preserve intent on push):
- Include YAML with tags and pipeline tag, e.g.
  - tags: [mlx, chat]
  - pipeline_tag: text-generation
  - base_model: <upstream_base>
- Keep model name containing Instruct or chat to aid chat detection
Offline validation (no network):
- mlxk2 push --check-only ./mymodel_test_workspace <org/model> --json
- Expect: workspace_health.healthy: true; ensure tokenizer present (tokenizer.json or tokenizer.model) and at least one non‑empty weight file
Push to private repo:
- mlxk2 push --private --create ./mymodel_test_workspace <org/model> --json
- Re‑push without changes should show no_changes: true
Post‑push verification:
- mlxk2 list --all --health <org/model>
- Current limitation: Framework may show PyTorch for non‑mlx-community orgs due to conservative detection. This does not affect the uploaded content; a future milestone (M1) is planned to parse model card tags (mlx) so that MLX models are classified correctly across orgs.
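
The "compute cache path" and "find latest snapshot" steps can be sketched in a few lines. This assumes the cache layout exactly as written in the steps above (model directories directly under `$HF_HOME`) and treats "latest" as most recently modified, which is an assumption; both function names are hypothetical.

```python
from pathlib import Path


def cache_dir_for(hf_home, repo_id: str) -> Path:
    """Map 'org/name' to the models--org--name directory under hf_home."""
    org, name = repo_id.split("/", 1)
    return Path(hf_home) / f"models--{org}--{name}"


def latest_snapshot(model_dir):
    """Return the most recently modified snapshot directory, or None if there is none."""
    snapshots = Path(model_dir) / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        return None
    # Assumption: "latest" means newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


model_dir = cache_dir_for("/path/to/cache", "mlx-community/Qwen2.5-0.5B-Instruct-4bit")
print(model_dir.name)
```

The resulting snapshot path is what the `rsync -aL` step copies from.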

Notes

Ensure tokenizer files exist (tokenizer.json/tokenizer.model) and optional generation_config.json for runnable chat contexts.
Avoid pushing unwanted files; use .hfignore for project‑specific filters.

1.x Legacy Test Suite (separate)

Location: tests/ (stable 1.x release on main).
Not part of the 2.0 default run; execute explicitly with pytest tests/ -v.
Contains extensive integration/server tests unrelated to the 2.0 JSON CLI.

3-Category Test Strategy (MLX Knife 1.1.0+)

MLX Knife uses a 3-category test strategy to balance test isolation, performance, and user cache protection:

🏠 CATEGORY 1: ISOLATED CACHE (Most Tests)

✅ User cache stays pristine - Tests use temporary isolated caches with automatic cleanup

Implemented Tests (78 tests):

✅ test_real_model_lifecycle.py - Full model lifecycle with tiny-random-gpt2 (~12MB download)
✅ test_core_functionality.py - Basic CLI operations with patch_model_cache isolation
✅ test_process_lifecycle.py - Process management with isolated cache + MODEL_CACHE patching
✅ test_run_command_advanced.py - Run command edge cases with mock_model_cache in isolation
✅ test_lock_cleanup_bug.py - Lock cleanup testing with temporary MODEL_CACHE override
✅ test_health_checks.py - Mock corruption testing with isolated temp_cache_dir

Technical Pattern:

import pytest

@pytest.mark.usefixtures("temp_cache_dir")
class TestBasicLifecycle:
    def test_something(self, temp_cache_dir, patch_model_cache):
        with patch_model_cache(temp_cache_dir / "hub"):
            # Test operates in complete isolation
            # User cache never touched, automatic cleanup
            ...  # test body goes here

Benefits:

✅ Clean User Cache: No test artifacts or broken models ever
✅ Parallel Testing: No cache conflicts between test runs
✅ Reproducible: No dependency on existing models in user cache
✅ Fast CI: Small models (12MB vs 4GB) for most tests
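
To make the isolation idea concrete, here is a minimal sketch of a temporary cache that is created, exposed through `HF_HOME`, and cleaned up again. It is not the project's `temp_cache_dir` or `patch_model_cache` fixture (those patch the model cache directly); the name `isolated_hf_home` and the use of `HF_HOME` are assumptions for illustration.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_hf_home():
    """Point HF_HOME at a throwaway directory and restore the old value afterwards."""
    previous = os.environ.get("HF_HOME")
    with tempfile.TemporaryDirectory(prefix="mlxk_test_cache_") as tmp:
        os.environ["HF_HOME"] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore the caller's environment even if the test body raised.
            if previous is None:
                os.environ.pop("HF_HOME", None)
            else:
                os.environ["HF_HOME"] = previous


with isolated_hf_home() as cache:
    print("isolated cache at", cache)
```

Because the directory is removed on exit, nothing a test writes there can leak into the user cache.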

🏥 CATEGORY 2: USER CACHE (Framework Diversity)

📋 Reserved for future - Real model diversity that cannot be mocked

Future Framework Validation Tests:

Multiple framework detection (MLX + PyTorch + Tokenizer-only models)
Health check diversity testing with naturally corrupted models
Cross-framework model compatibility validation

Currently: All health/framework tests use mock_model_cache and are Category 1 (isolated)

🖥️ CATEGORY 3: SERVER CACHE (Performance Tests)

🔒 Large models, user cache expected - Marked with @pytest.mark.server

Server Tests (Excluded from default pytest):

🔒 test_issue_14.py - Chat self-conversation regression tests
🔒 test_issue_15_16.py - Dynamic token limit validation
🔒 test_end_token_issue.py - End-token filtering consistency
🔒 test_server_functionality.py - OpenAI API compliance (basic tests only)

Technical Pattern:

import pytest

@pytest.mark.server  # Excluded from default pytest
def test_server_feature(mlx_server, model_name: str):
    # Uses real models in user cache
    # Requires significant RAM and time
    ...  # test body goes here

Characteristics:

🔒 Not run by default - Must use pytest -m server
💾 RAM-aware - Auto-skip models exceeding available memory
⏱️ Longer execution - 20-40 minutes for full suite
🎯 Model diversity - Tests across different model sizes/architectures

Test Prerequisites

Required Setup

Apple Silicon Mac (M1/M2/M3)
Python 3.9 or newer
Test dependencies installed (includes jsonschema for Spec tests):
```
pip install -e ".[test]"
```

Notes:

Spec validation requires jsonschema. Installing .[test] ensures it is available.
Without jsonschema, Spec example validation is skipped (you will see one extra SKIPPED test).
With jsonschema installed, expect one additional PASS in the -m spec and tests_2.0/ totals.

That's it! Most tests (Category 1) use isolated caches and download small test models automatically (~12MB).

Enabling Issue #27 Tests (optional)

By default, several Issue #27 tests are skipped because they require a real multi‑shard safetensors model (with model.safetensors.index.json) in your user cache and enough free disk space to create an isolated copy.

Set your user cache: export MLXK2_USER_HF_HOME=/absolute/path/to/your/huggingface/cache
Ensure the cache contains a model with a safetensors index (common for larger Llama/Mistral models).
Run the focused tests: PYTHONPATH=. pytest tests_2.0/test_issue_27.py -v
If you see skips:
- “No safetensors index found” → pick a model that has model.safetensors.index.json.
- “Not enough free space” → free disk space; tests create a subset copy into an isolated temp cache.
- “User model not found” → verify the exact HF path in your cache and env var points to its .../huggingface/cache root.

With a suitable model present and MLXK2_USER_HF_HOME set, the Issue #27 tests should run without SKIPs.
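
To check up front whether a cache contains a suitable model, something like the following sketch can list repos that ship `model.safetensors.index.json`. It assumes `models--*` directories sit directly under the given root (pass the `hub` subdirectory instead if your cache uses one); the function name is hypothetical.

```python
from pathlib import Path

INDEX_NAME = "model.safetensors.index.json"


def find_indexed_models(cache_root) -> list:
    """Return 'org/name' ids of cached models that have a safetensors index."""
    root = Path(cache_root)
    found = set()
    for index_file in root.glob(f"models--*/snapshots/*/{INDEX_NAME}"):
        # parents[0] is the snapshot dir, parents[1] is snapshots/, parents[2] is models--org--name
        model_dir = index_file.parents[2]
        repo_id = model_dir.name[len("models--"):].replace("--", "/", 1)
        found.add(repo_id)
    return sorted(found)


print(find_indexed_models("/path/to/your/huggingface/cache"))
```

An empty result suggests the index-based tests would skip with "No safetensors index found".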

When Issue #27 real‑model tests make sense

Purpose

These tests validate the strict health policy against real upstream Hugging Face repositories that ship multi‑shard safetensors with a model.safetensors.index.json. They complement the deterministic unit tests by exercising real‑world layouts.

Run them when

Your user cache contains at least one upstream PyTorch repo with a safetensors index (not MLX/GGUF conversions). Good candidates:
- mistralai/Mistral-7B-Instruct-v0.2 or -v0.3
- Qwen/Qwen1.5-7B-Chat, Qwen/Qwen2-7B-Instruct
- teknium/OpenHermes-2.5-Mistral
- Gated: meta-llama/Llama-2-7b-chat-hf, meta-llama/Llama-3-8B-Instruct, google/gemma-7b-it
You want to sanity‑check index‑based completeness, shard deletion/truncation, and LFS pointer detection against real artifacts.

They are not useful when

Your cache only has MLX Community models (no model.safetensors.index.json) or GGUF models — the index‑based tests will skip by design. In that case, rely on tests_2.0/test_health_multifile.py for deterministic coverage.

Resource considerations

Disk: tests copy a subset of files into an isolated cache. Tune size/speed with:
- export MLXK2_COPY_STRATEGY="index_subset"
- export MLXK2_SUBSET_COUNT="1"
- export MLXK2_MIN_FREE_MB="512" (or higher)
Network: if you need to fetch a candidate model first, prefer downloading only config.json, model.safetensors.index.json, and 1–2 small shards to keep it light.
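
The free-space precondition can be approximated with the standard library. This is a sketch under assumptions: the helper name is hypothetical, and the fallback of 512 MB when `MLXK2_MIN_FREE_MB` is unset is only borrowed from the example value above, not a documented default.

```python
import os
import shutil
import tempfile


def has_enough_free_space(path, min_free_mb: int) -> bool:
    """Return True when the filesystem holding path has at least min_free_mb free."""
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    return free_mb >= min_free_mb


# Assumed fallback of 512 MB; the real default may differ.
min_free_mb = int(os.environ.get("MLXK2_MIN_FREE_MB", "512"))
print(has_enough_free_space(tempfile.gettempdir(), min_free_mb))
```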

Summary

If you have a suitable upstream PyTorch chat/instruct model with an index in your user cache, enable the env vars above and run tests_2.0/test_issue_27.py for an extra layer of real‑model assurance. Otherwise, the deterministic tests already validate the policy thoroughly.

Optional Setup (Server Tests Only)

For server tests (@pytest.mark.server - excluded by default):

# Medium model for server testing
mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit

# Different architecture for variety  
mlxk pull mlx-community/Mistral-7B-Instruct-v0.3-4bit

Note: Server tests are excluded from default pytest and require manual execution with pytest -m server.

Environment & Caches

To keep results reproducible and caches safe on Apple Silicon:

Preferred Python/venv: Apple‑native 3.9 in a dedicated env
- Example: python3.9 -m venv venv39 && source venv39/bin/activate && pip install -e ".[test]"
User cache (persistent): shared, real cache for manual ops and certain advanced/server tests
- Example (external SSD): export HF_HOME="/Volumes/SomeExternalSSD/models"
- Safe ops: list, health, show; Coordinate pull/rm (maintenance window)
Test cache (isolated/default): ephemeral via fixtures; default pytest runs must not force the user cache
- Category 1 tests use temporary caches and should not depend on HF_HOME
- Only server/advanced tests may require user cache and are excluded by default (-m server)
- Deletion safety: tests set MLXK2_STRICT_TEST_DELETE=1 so delete ops fail if not in test cache
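
The deletion-safety rule can be pictured as a guard that refuses to delete anything outside the test cache while `MLXK2_STRICT_TEST_DELETE=1` is set. The sketch below is a hypothetical illustration of that idea, not the code the project uses; the function and exception names are invented.

```python
from pathlib import Path


class UnsafeDeleteError(RuntimeError):
    """Raised when a delete targets a path outside the test cache."""


def ensure_delete_allowed(target, test_cache_root, env) -> None:
    """Raise UnsafeDeleteError if strict mode is on and target is not inside the test cache."""
    if env.get("MLXK2_STRICT_TEST_DELETE") != "1":
        return  # strict mode off: no restriction applied by this guard
    target_path = Path(target).resolve()
    root = Path(test_cache_root).resolve()
    if root not in target_path.parents:
        raise UnsafeDeleteError(f"refusing to delete {target_path}: not under {root}")


try:
    ensure_delete_allowed(
        "/Volumes/SomeExternalSSD/models/models--org--model",
        "/tmp/test_cache",
        {"MLXK2_STRICT_TEST_DELETE": "1"},
    )
except UnsafeDeleteError as exc:
    print(exc)
```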

In PRs, please state your Python version and whether you used the user cache or isolated test caches.

Test Commands

Basic Test Execution

# All tests (recommended before commits)
pytest

# Only integration tests (system-level)
pytest tests/integration/

# Only unit tests (fast)
pytest tests/unit/

# Verbose output
pytest -v

# Show test coverage
pytest --cov=mlx_knife --cov-report=html

Specific Test Categories

# Process lifecycle tests (critical for production)
pytest tests/integration/test_process_lifecycle.py -v

# Health check robustness (model corruption detection)
pytest tests/integration/test_health_checks.py -v

# Core functionality (basic CLI commands)
pytest tests/integration/test_core_functionality.py -v

# Issue #20: End-token filtering consistency (new in 1.1.0-beta2)
pytest tests/integration/test_end_token_issue.py -v

# Advanced run command tests
pytest tests/integration/test_run_command_advanced.py -v

# Server functionality tests
pytest tests/integration/test_server_functionality.py -v

# Lock cleanup bug tests (Issue #23 - new in 1.1.0-beta3)
pytest tests/integration/test_lock_cleanup_bug.py -v

Test Filtering

# Run only basic operations tests
pytest -k "TestBasicOperations" -v

# Server tests are excluded by default (marked with @pytest.mark.server)
# Run server tests manually (requires large models in user cache)
pytest -m server -v

# Skip server tests explicitly (default behavior)
pytest -m "not server" -v

# Run only process lifecycle tests
pytest -k "process_lifecycle or zombie" -v

# Run health check tests only
pytest -k "health" -v

# Only JSON API contract/spec tests
pytest -m spec -v

Timeout and Performance

# Set custom timeout (default: 300s; requires the pytest-timeout plugin)
pytest --timeout=60

# Show slowest tests
pytest --durations=10

# Parallel execution (if pytest-xdist installed)
pytest -n auto

Server Tests (Advanced)

⚠️ Warning: Server tests require significant system resources and time.

# Run comprehensive Issue #20 server tests (48 tests, ~30 minutes)
pytest tests/integration/test_end_token_issue.py -m server -v

# All server-marked tests (includes above + server functionality)
pytest -m server -v

# Quick server functionality test only
pytest tests/integration/test_server_functionality.py -v

# Server tests are RAM-aware - automatically skip models that don't fit

Server Test Requirements:

RAM: 8GB+ recommended (16GB+ for large models)
Time: 20-40 minutes for full suite
Models: Multiple 4-bit quantized models (1B-30B parameters)
Coverage: Streaming vs non-streaming consistency, token limits, API compliance

Python Version Compatibility

Verification Results (August 2025)

✅ 150/150 tests passing - All standard tests validated on Apple Silicon with isolated cache system

Python Version	Status	Tests Passing
3.9.6 (macOS)	✅ Verified	150/150
3.10.x	✅ Verified	150/150
3.11.x	✅ Verified	150/150
3.12.x	✅ Verified	150/150
3.13.x	✅ Verified	150/150

All versions tested with isolated cache system. Real MLX execution verified separately with server/run commands.

Manual Multi-Python Testing

If you have multiple Python versions installed, you can verify compatibility:

# Run the multi-Python verification script
./test-multi-python.sh

# Or manually test specific versions
python3.9 -m venv test_39
source test_39/bin/activate
pip install -e . && pip install pytest
pytest
deactivate && rm -rf test_39

Code Quality & Development

Code Quality Tools

MLX Knife includes comprehensive code quality tools:

# Install development tools
pip install ruff mypy

# Automatic code formatting and linting
ruff check mlx_knife/ --fix

# Type checking with mypy
mypy mlx_knife/

# Complete development workflow
ruff check mlx_knife/ --fix && mypy mlx_knife/ && pytest

Mini‑Matrix: What runs by default vs markers

Target	How to Run	Markers / Env	Includes	Network
Default 2.0 suite	`pytest -v`	—	JSON‑API (list/show/health), Human‑Output, Model‑Resolution, Health‑Policy, Push Offline (`--check-only`, `--dry-run`), Spec/Schema checks	No
Spec‑only	`pytest -m spec -v`	`spec`	Schema/contract tests, version sync, docs example validation	No
Exclude Spec	`pytest -m "not spec" -v`	`not spec`	Everything except spec/schema checks	No
Live Push (opt‑in)	`pytest -m live_push -v` (or all live tests: `pytest -m wet -v`)	`live_push` (subset of `wet`) + Env: `MLXK2_LIVE_PUSH=1`, `HF_TOKEN`, `MLXK2_LIVE_REPO`, `MLXK2_LIVE_WORKSPACE`	JSON push against the real Hub; on errors the test SKIPs (diagnostic)	Yes
Issue #27 real‑model (opt‑in)	`pytest tests_2.0/test_issue_27.py -v`	Env: `MLXK2_USER_HF_HOME` (user cache with multi‑shard models)	Strict health policy on real index‑based models	No (uses local cache)
Server/run (separate)	`pytest tests/integration -m server -v`	`server`	Heavy server/run tests, RAM‑dependent, longer duration	No (models local)

Useful commands

Only Spec: pytest -m spec -v
Offline Push only: pytest -k "push and not live" -v
Exclude Spec: pytest -m "not spec" -v
Live Push only: MLXK2_LIVE_PUSH=1 HF_TOKEN=... MLXK2_LIVE_REPO=... MLXK2_LIVE_WORKSPACE=... pytest -m live_push -v
All live tests (umbrella): pytest -m wet -v (may include future live tests beyond push)

Markers: wet vs live_push

wet: umbrella marker for any opt‑in “live” test that may require network, credentials, or user environment. Use to run all live tests.
live_push: narrow marker for push‑specific live tests only. Use to target push live checks without running other live suites.

Note: Without the required env vars, live tests remain SKIPPED.

Development Workflow

Before committing changes:

#!/bin/bash
# pre-commit-check.sh - Run before committing
set -e

echo "🧪 Running MLX Knife pre-commit checks..."

# 1. Code style
echo "Checking code style..."
ruff check mlx_knife/ --fix

# 2. Type checking
echo "Checking types..."
mypy mlx_knife/

# 3. Quick smoke test
echo "Running quick tests..."
pytest tests/unit/ -v

echo "✅ All checks passed. Safe to commit!"

Local Development Testing

Adding New Tests

Integration tests go in tests/integration/
Unit tests go in tests/unit/
Use existing fixtures from conftest.py
Follow naming: test_*.py, Test* classes, test_* methods

Test Categories (Markers)

@pytest.mark.integration  # Slower system tests
@pytest.mark.unit         # Fast isolated tests  
@pytest.mark.slow         # Tests >30 seconds
@pytest.mark.requires_model  # Needs actual MLX model
@pytest.mark.network      # Requires internet
@pytest.mark.server       # Requires MLX Knife server (excluded from default pytest)

Mock Utilities

mock_model_cache(): Creates fake model directories
mlx_knife_process(): Manages subprocess lifecycle
process_monitor(): Tracks zombie processes
temp_cache_dir(): Isolated test environment
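
For orientation, a fake model directory of the kind `mock_model_cache()` is described as creating might look like the sketch below. This is an assumption about the general shape only (a snapshot folder with a config and a weight file); the helper name, file contents, and layout details are hypothetical and not taken from `conftest.py`.

```python
import json
from pathlib import Path


def make_fake_model(cache_root, repo_id: str, snapshot: str = "0" * 40, healthy: bool = True) -> Path:
    """Create a minimal models--org--name/snapshots/<hash>/ tree and return the snapshot dir."""
    org, name = repo_id.split("/", 1)
    snap = Path(cache_root) / f"models--{org}--{name}" / "snapshots" / snapshot
    snap.mkdir(parents=True, exist_ok=True)
    (snap / "config.json").write_text(json.dumps({"model_type": "test"}))
    # An empty weight file stands in for a corrupted model.
    (snap / "model.safetensors").write_bytes(b"\x00" * 16 if healthy else b"")
    return snap
```

Combined with a temporary directory as the cache root, this gives health-check tests something to inspect without downloading anything.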

Test Philosophy

Following the "Process Hygiene over Edge-Case Perfection" principle:

Process Cleanliness: No zombies, no leaks ✅
Health Checks: Reliable corruption detection ✅
Core Operations: Basic functionality works ✅
Error Handling: Graceful failures ✅

The test suite validates production readiness with real Apple Silicon hardware and actual MLX models.

Troubleshooting

Common Issues

Tests hang forever:

pytest --timeout=60

Import errors:

pip install -e . && pip install pytest

Process cleanup issues:

ps aux | grep mlx_knife  # Check for zombies

Cache conflicts:

export HF_HOME="/tmp/test_cache"
pytest --cache-clear

Test Environment

# Clean test run
rm -rf .pytest_cache __pycache__
pytest tests/ -v --cache-clear

# Debug specific test
pytest tests/integration/test_health_checks.py::TestHealthCheckRobustness::test_healthy_model_detection -v -s

Contributing Test Results

When submitting PRs, please include:

Your test environment:
- macOS version
- Apple Silicon chip (M1/M2/M3)
- Python version
- Which model(s) you tested with

Test results summary:

Platform: macOS 14.5, M2 Pro
Python: 3.11.6
Model: Phi-3-mini-4k-instruct-4bit
Results: 150/150 tests passed

Any issues encountered and how you resolved them

Summary

MLX Knife 1.1.0 STABLE Testing Status:

✅ Production Ready - 150/150 tests passing
✅ Isolated Test System - User cache stays pristine with temp cache isolation
✅ 3-Category Strategy - Optimized for performance and safety
✅ Multi-Python Support - Python 3.9-3.13 verified
✅ Code Quality - ruff/mypy integration working
✅ Real Model Testing - Server/run commands validated with multiple models
✅ Memory Management - Context managers prevent leaks
✅ Exception Safety - Context managers ensure cleanup
✅ Cache Directory Fix - Issue #21: Empty cache crash resolved
✅ LibreSSL Warning Fix - Issue #22: macOS Python 3.9 warning suppression
✅ Lock Cleanup Fix - Issue #23: Enhanced rm command with lock cleanup

This comprehensive testing framework validates MLX Knife's production readiness through isolated testing with automatic model downloads and separate real MLX validation.

Server-Based Testing (Advanced)

Some tests require a running MLX Knife server with loaded models. These tests are marked with @pytest.mark.server and are not run by default with pytest.

Why Separate Server Tests?

Test count varies by loaded models (makes CI reporting inconsistent)
Large memory requirements - need different models for different RAM sizes
Longer execution time - each model needs to load individually
Manual setup required - need to download appropriate models first

Prerequisites for Server Tests

System RAM	Recommended Models	Commands
16GB	Small models only	`mlxk pull mlx-community/Qwen2.5-0.5B-Instruct-4bit` `mlxk pull mlx-community/Llama-3.2-1B-Instruct-4bit` `mlxk pull mlx-community/Llama-3.2-3B-Instruct-4bit`
32GB	+ Medium models	`mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit` `mlxk pull mlx-community/Mistral-7B-Instruct-v0.2-4bit` `mlxk pull mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit`
64GB	+ Large models	`mlxk pull mlx-community/Mistral-Small-3.2-24B-Instruct-2506-4bit` `mlxk pull mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit` `mlxk pull mlx-community/Llama-3.3-70B-Instruct-4bit`
96GB+	+ Huge models	`mlxk pull mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit`

Running Server Tests

Issue #14 Regression Tests (Chat Self-Conversation Bug):

# Set environment
export HF_HOME=/path/to/your/cache

# Smoke test first (see which models are available)
python tests/integration/test_issue_14.py

# Run server tests only (excluded from default pytest)
pytest -m server -v

# Run specific Issue #14 tests
pytest tests/integration/test_issue_14.py -m server -v

Expected Output:

🦫 MLX Knife Issue #14 Test - Smoke Test
==================================================
📊 Safe models for this system: 6
💾 System RAM: 64GB total, 40GB available

  🎯 mlx-community/Mistral-7B-Instruct-v0.2-4bit
     └─ Size: 7B, RAM needed: 8GB
  🎯 mlx-community/Llama-3.2-3B-Instruct-4bit  
     └─ Size: 3B, RAM needed: 4GB
  [...]

========== test session starts ==========
tests/integration/test_issue_14.py::test_server_health[mlx_server] PASSED
tests/integration/test_issue_14.py::test_issue_14_self_conversation_regression_original[mlx-community/Mistral-7B-Instruct-v0.2-4bit-7B-8] PASSED
[...6 more model tests...]
========== 7 passed in 45.23s ==========

Additional Server Tests

Issues #15 & #16 - Dynamic Token Limits (Implemented in 1.1.0-beta1):

pytest tests/integration/test_issue_15_16.py -v

Issue #20 - End-Token Filtering (Implemented in 1.1.0-beta2):

pytest tests/integration/test_end_token_issue.py -m server -v

Troubleshooting Server Tests

Permission warnings are normal:

WARNING: ⚠️  Cannot scan network connections (permission denied)
INFO: 🔧 Falling back to process-based cleanup only

This is expected on macOS - the tests continue with process-based cleanup.

Memory issues:

Tests automatically skip models exceeding 80% available RAM
Use smaller models if you see consistent memory failures
Consider external SSD for model cache to reduce memory pressure
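
The 80% rule from the first item can be expressed as a one-line predicate. The sketch below is illustrative and is not the `get_safe_models_for_system()` helper; the function names are hypothetical, and the sample numbers are the RAM figures from the smoke-test output shown earlier.

```python
def fits_in_ram(ram_needed_gb: float, available_gb: float, max_fraction: float = 0.8) -> bool:
    """Return True when the model needs no more than max_fraction of available RAM."""
    return ram_needed_gb <= available_gb * max_fraction


def select_safe_models(models, available_gb: float) -> list:
    """Keep the names of (name, ram_needed_gb) pairs that fit on this system."""
    return [name for name, needed in models if fits_in_ram(needed, available_gb)]


candidates = [
    ("mlx-community/Mistral-7B-Instruct-v0.2-4bit", 8),
    ("mlx-community/Llama-3.2-3B-Instruct-4bit", 4),
]
print(select_safe_models(candidates, available_gb=40))
```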

Server startup failures:

# Debug server manually
python -m mlx_knife.cli server --port 8000

# Check model health  
mlxk health

# Verify environment
echo $HF_HOME

Adding New Server Tests

When contributing server-based tests:

@pytest.mark.server
def test_new_feature(mlx_server, model_name: str, size_str: str, ram_needed: int):
    """Test new feature with MLX models.""" 
    # Use mlx_server fixture for automatic server management
    # Test implementation here

Mark with @pytest.mark.server - excludes from default pytest
Use mlx_server fixture - automatic server lifecycle management
Test RAM requirements - use get_safe_models_for_system() helper
Document in TESTING.md - add to this guide
