Release 2.0.4-beta.5: Community repair tool + OS-agnostic benchmarking

Closes #49 (Mistral Tokenizer Bug)

Major features:
- Workspace Infrastructure (ADR-018 Phase 0a): Managed workspace detection,
  provenance metadata, backward compatible with unmanaged workspaces
- Convert Operation (ADR-018 Phase 1): `mlxk convert --repair-index` fixes
  mlx-vlm #624 affected models (7+ models including Qwen2.5-VL, gemma-3)
- Resumable Pull: Auto-detect partial downloads with `--force-resume`
- Wet Umbrella Test Integration: Single entry point for all real model tests

Fixes:
- #49: BPE space markers now correctly converted (Mistral-family models)
- Vision Portfolio Discovery: Filter by capabilities instead of model_type
- Memory Cleanup Hook: Triggers for both live_e2e and wet markers

Test suite: 528 passed, 60 skipped (Python 3.9-3.14)
This commit is contained in:
The BROKE Cluster Team
2025-12-31 16:05:18 +01:00
parent c51ca1b10e
commit 25609e4dcb
31 changed files with 4355 additions and 590 deletions
+46 -7
View File
@@ -12,7 +12,14 @@ For current test counts, version-specific details, and complete file listings, s
- **Isolated by default** - User cache stays pristine with sentinel protection
- **Opt-in live tests** - Network/model tests require explicit markers/environment
- **Mock-heavy** - MLX stubs enable fast testing without model downloads
- **Fast feedback** - 300+ tests run in seconds on any Apple Silicon Mac
- **Fast feedback** - 500+ tests run in seconds on any Apple Silicon Mac
**Cache Architecture:**
- **User Cache (Singleton):** ONE permanent cache per system - READ-ONLY in tests
- **Isolated Cache (Factory):** NEW temporary cache PER test - full read/write
- **Sentinel Safety:** Automatic protection prevents accidental User Cache deletion
See [TESTING-DETAILS.md → Fundamental Definitions](TESTING-DETAILS.md#fundamental-definitions-single-source-of-truth) for complete cache architecture and safety mechanisms.
**Safety First:**
- Tests use temporary caches with `TEST_SENTINEL` protection
@@ -45,6 +52,24 @@ ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v
**That's it!** Default tests use isolated caches and MLX stubs - no model downloads required.
## Running All Real Tests
**Single command (recommended):**
```bash
./scripts/test-wet-umbrella.sh
```
This runs all real tests in the correct order. For details on test categories, see [TESTING-DETAILS.md](TESTING-DETAILS.md).
**Manual execution (advanced):**
```bash
# Portfolio-compatible tests
pytest -m wet -v
# Isolated Cache WRITE tests
MLXK2_TEST_RESUMABLE_DOWNLOAD=1 pytest -m live_resumable -v
```
## Test Categories
### Category 1: Isolated Cache (Default)
@@ -140,11 +165,25 @@ tests_2.0/
**Legend:**
- `spec/` - API contract validation (stays in sync with `docs/schema`)
- `live/` - Opt-in tests requiring environment (markers: `live_*`)
- `live/` - **User Cache READ only** - Portfolio Discovery tests (parametrized across many models)
- `stubs/` - Lightweight MLX replacements for unit tests
- `conftest.py` - Isolated HF cache (temp), safety sentinel, fixtures
- Parent `conftest.py` applies globally
- Subdirectory `conftest.py` (live/, spec/) MUST limit scope to own directory only
- See [TESTING-DETAILS.md → conftest.py Scope Rules](TESTING-DETAILS.md#conftestpy-scope-rules)
See [TESTING-DETAILS.md](TESTING-DETAILS.md) for complete file listing with descriptions.
**CRITICAL RULE:****NEVER write to User Cache**
**Test organization by cache strategy:**
- **User Cache READ** → `tests_2.0/live/` (Portfolio Discovery with many models)
- **Isolated Cache WRITE** → `tests_2.0/` (fresh downloads, mock creation)
- **Isolated Cache READ** → `tests_2.0/` (safety copies from User Cache)
- **Schema validation** → `tests_2.0/spec/` (mocks, fast)
- **Workspace operations** → `tmp_path` fixture (Clone/Push tests, separate from cache)
**Note:** Workspace is semantically distinct from Cache - see [TESTING-DETAILS.md → Workspace](TESTING-DETAILS.md#workspace-separate-concept---not-a-cache) for details.
See [TESTING-DETAILS.md → Truth Table](TESTING-DETAILS.md#truth-table-cache-type--operation) for complete categorization and decision tree.
## MLX Stubs (Fast Testing Without Model Downloads)
@@ -344,9 +383,9 @@ When submitting PRs with test changes, please include:
2. **Test results** (example):
```
Platform: macOS 14.6, M2 Max
Python: 3.9.6
Results: 476 passed, 65 skipped
Platform: macOS 26.2 (Tahoe), M2 Max
Python: 3.10.x
Results: 528 passed, 60 skipped
```
3. **Any issues encountered** and resolutions
@@ -373,7 +412,7 @@ ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v
**MLX Knife Testing:**
-**Isolated by default** - User cache stays pristine
-**Fast feedback** - 400+ tests run in seconds without model downloads
-**Fast feedback** - 500+ tests run in seconds without model downloads
-**Low requirements** - 16GB RAM, ~20MB disk, no HF cache needed
-**Opt-in live tests** - Real models/network when needed
-**Multi-Python support** - Verified on Python 3.9-3.14