Release 2.0.4 stable - see CHANGELOG.md for details

2026-07-01 20:44:14 -04:00 · 2026-02-11 15:05:09 +01:00
parent 97b832b568
commit d4cd89fab0
11 changed files with 364 additions and 197 deletions
@@ -1,6 +1,6 @@
 # Changelog

-## [2.0.4] - 2026-XX-XX (WIP)
+## [2.0.4] - 2026-02-11

 > **First stable release with Audio/STT support.** This release consolidates beta.9 and beta.10 improvements into a production-ready package.

@@ -18,8 +18,12 @@
  - `test_whisper_tokenizer.py`: 47 tests for tiktoken workaround (get_encoding, get_tokenizer, Tokenizer class)
  - `TestEmbeddingGate`: 3 tests for embedding model runtime blocking

+- **Test Infrastructure:** Test suite now runs consistently with `unset HF_HOME` (uses fallback models)
+
 ### Fixed (since beta.10)

+- **`serve --model` Pre-Validation:** Model ambiguity and not-found errors now detected before server starts. Previously showed stacktrace; now shows clean error message with suggestions.
+
 - **Run Preflight Consistency:** `run.py` now passes `probe` and `framework` to `audio_runtime_compatibility()`. STT model_type and tekken.json gates now work in CLI (previously only in list/health).

 - **STT model_type Gate:** Extended to accept `vibevoice` and `audio` model types (was only `whisper`/`voxtral`). VibeVoice-ASR and future STT models no longer blocked incorrectly.
@@ -4,9 +4,9 @@
  <img src="https://github.com/mzau/mlx-knife/raw/main/mlxk-demo.gif" alt="MLX Knife Demo" width="900">
 </p>

-**Current Version: 2.0.4-beta.10** (Stable: 2.0.3)
+**Current Version: 2.0.4** (Stable)

-[![GitHub Release](https://img.shields.io/badge/version-2.0.4--beta.10-blue.svg)](https://github.com/mzau/mlx-knife/releases)
+[![GitHub Release](https://img.shields.io/badge/version-2.0.4-blue.svg)](https://github.com/mzau/mlx-knife/releases)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
 [![Python 3.10-3.12](https://img.shields.io/badge/python-3.10--3.12-blue.svg)](https://www.python.org/downloads/)
 [![Apple Silicon](https://img.shields.io/badge/Apple%20Silicon-green.svg)](https://support.apple.com/en-us/HT211814)
@@ -17,10 +17,8 @@

 ## Features

-> **⚠️ Beta.9 Audio Bug:** If you installed `mlx-knife[audio]==2.0.4b9` from PyPI, audio transcription fails with "Processor not found". Upgrade to beta.10: `pip install mlx-knife[all]==2.0.4b10`
-
-### What's New in 2.0.4 (Coming Soon - Currently Beta)
-- **Audio Transcription (STT)** - Whisper speech-to-text (`--audio` flag, `pip install mlx-knife[audio]`)
+### What's New in 2.0.4
+- **Audio Transcription (STT)** - Whisper speech-to-text (`--audio` flag)
 - **Vision Models with EXIF Metadata** - Image analysis + automatic GPS/date/camera extraction visible to the model
 - **Unix Pipe Integration** - Chain models without temp files (`vision → text` workflows)
 - **Local Development Workflow** - Clone → Repair → Test models without HuggingFace round-trips
@@ -79,42 +77,33 @@ This license applies **only** to the `mlx-knife` code and **does not extend** to
 ### Python Compatibility

 ✅ **Python 3.10 - 3.12** - Full support (Text + Vision + Audio)
-❌ **Python 3.9** - Not supported (MLX 0.30+ requires 3.10+)
+❌ **Python 3.9** - Use version 2.0.3 (text + cache management only)
 ❌ **Python 3.13+** - Not supported (miniaudio lacks pre-built wheels)

-**Note:** Vision/Audio features require Python 3.10+. Recommended: **Python 3.10 or 3.11** for best compatibility.
+**Recommended:** Python 3.10 or 3.11 for best compatibility.



 ## Installation

-### 1. PyPI Stable (2.0.3 - Text models only)
+### 1. PyPI (Recommended)

 ```bash
 pip install mlx-knife
-mlxk --version  # → mlxk 2.0.3
-```
-
-**Requirements:** macOS Apple Silicon, Python 3.9-3.12
-
-### 2. PyPI Beta (2.0.4-beta.10 - Text + Vision + Audio)
-
-```bash
-pip install mlx-knife[all]==2.0.4b10
-mlxk --version  # → mlxk 2.0.4b10
+mlxk --version  # → mlxk 2.0.4
 ```

 **Requirements:** macOS Apple Silicon, Python 3.10-3.12
-**Features:** Audio STT (Whisper), Vision with EXIF metadata, full tiktoken workaround
+**Includes:** Text, Vision, Audio (Whisper STT), EXIF metadata, Unix pipes

-### 3. Developer Installation
+### 2. Developer Installation

 ```bash
 git clone https://github.com/mzau/mlx-knife.git
 cd mlx-knife
-pip install -e ".[all,dev,test]"
+pip install -e ".[dev,test]"

-mlxk --version  # → mlxk 2.0.4b10
+mlxk --version  # → mlxk 2.0.4
 pytest -v
 ```

@@ -285,8 +274,7 @@ Image analysis via the `--image` flag (CLI and server). Requires Python 3.10+.
 #### Requirements

 - **Python 3.10+** (mlx-vlm dependency)
-- **Installation:** `pip install mlx-knife[vision]`
-- **Backend:** mlx-vlm 0.3.10 (auto-installed from PyPI)
+- **Backend:** mlx-vlm 0.3.10 (included in base install)

 #### Usage

@@ -424,7 +412,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso

 **⚠️ Important:** Vision support relies on mlx-vlm (upstream), which has known stability issues. While `mlxk health` verifies file integrity, **runtime failures may occur** with certain model architectures due to upstream bugs.

-**✅ Tested & Working Models** (mlx-knife v2.0.4-beta.6):
+**✅ Tested & Working Models** (mlx-knife v2.0.4):

 | Model | Size | Notes |
 |-------|------|-------|
@@ -452,14 +440,13 @@ mlxk convert <model> <output> --repair-index

 ### Audio Transcription (Speech-to-Text)

-> **🎙️ New in beta.9/10:** Professional STT via dedicated Whisper models (mlx-audio backend). Beta.10 fixes PyPI install (no Git workaround needed). Backward compatible with Gemma-3n multimodal audio (mlx-vlm).
+> **🎙️ Audio Transcription:** Speech-to-text via Whisper models (mlx-audio backend). Works out-of-the-box with PyPI install. Backward compatible with Gemma-3n multimodal audio (mlx-vlm).

 **Requirements:**
-- **Python 3.10+** (mlx-audio dependency)
-- **Installation:** `pip install mlx-knife[audio]` (tiktoken workaround bundled)
+- **Python 3.10+** (mlx-audio dependency, included in base install)
 - **No system dependencies:** MP3/WAV decoding via embedded libsndfile (no ffmpeg or Homebrew required)

-**✅ Recommended Models** (mlx-knife v2.0.4-beta.10):
+**✅ Recommended Models** (mlx-knife v2.0.4):

 | Model | Backend | Size | Duration | Notes |
 |-------|---------|------|----------|-------|
@@ -1240,7 +1227,7 @@ Apache License 2.0 — see `LICENSE` (root) and `mlxk2/NOTICE`.

 <p align="center">
  <b>Made with ❤️ by The BROKE team <img src="broke-logo.png" alt="BROKE Logo" width="30" align="middle"></b><br>
-  <i>Version 2.0.4-beta.10 | February 2026</i><br>
+  <i>Version 2.0.4 | February 2026</i><br>
  <a href="https://github.com/mzau/broke-nchat">💬 Web UI: nChat - lightweight chat interface</a> •
  <a href="https://github.com/mzau/broke-cluster">🔮 Multi-node: BROKE Cluster</a>
 </p>
@@ -4,24 +4,24 @@ This document contains version-specific details, complete file listings, and imp

 ## Current Status

-✅ **2.0.4-beta.10** — **Audio PyPI Fix** (tiktoken workaround complete); Runtime compatibility accuracy; Audio transcription (Whisper via mlx-audio); Server `/v1/audio/transcriptions` endpoint; Probe/Policy architecture complete; Vision support Phase 1-3 (CLI + Server); Pipes/Memory-Aware; EXIF metadata; **Test Portfolio Separation complete**; Workspace Infrastructure (ADR-018 Phase 0a+0b+0c); Convert Operation (ADR-018 Phase 1); Resumable Clone; **Benchmark Schema v0.2.2** (Precise test timing).
+✅ **2.0.4** — **First stable release with Vision + Audio.** Vision support (CLI + Server, EXIF metadata); Audio transcription (Whisper via mlx-audio); Runtime compatibility accuracy; Server `/v1/audio/transcriptions` endpoint; Probe/Policy architecture; Pipes/Memory-Aware; **Test Portfolio Separation complete**; Workspace Infrastructure (ADR-018 Phase 0a+0b+0c); Convert Operation (ADR-018 Phase 1); Resumable Clone; **Benchmark Schema v0.2.2**.

 ### Test Results (Official Reference)

 **Standard Unit Tests (Multi-Python):**
 ```
 Platform: macOS 26.2 (Tahoe), M2 Max, 64GB RAM
-Python 3.10: 647 passed, 11 skipped in 19.78s
-Python 3.11: 647 passed, 11 skipped in 19.91s
-Python 3.12: 647 passed, 11 skipped in 20.94s
-Note: Default suite works on 16GB. Wet-umbrella: 64GB recommended (M1 Max 32GB untested)
+Python 3.10: 697 passed, 13 skipped
+Python 3.11: 697 passed, 13 skipped
+Python 3.12: 697 passed, 13 skipped
+Note: Default suite works on 16GB. Full integration tests: 64GB recommended
 ```

-**Wet Umbrella (4-Phase Integration):**
+**Full Integration Tests (`./scripts/test-wet-umbrella.sh`):**
 ```
-Phase 1 (wet marker):        168 passed, 73 skipped, 680 deselected (Schema v0.2.2)
-Phase 2-4 (live_pull/clone/pipe): 3 passed, 742 deselected
-Total:                       171 passed across all phases
+Phase 1 (portfolio tests):   179 passed, 73 skipped, 732 deselected
+Phase 2-4 (live operations): 3 passed
+Total:                       182 passed across all phases
 ```

 ✅ **Production verified & reported:** M1, M1 Max, M2 Max in real-world use
@@ -300,7 +300,7 @@ pytest -m live_stop_tokens -v

 ## Python Version Compatibility

-**All tests validated on Python 3.9-3.14**
+**Tests validated on Python 3.10-3.12** (Python 3.9 not supported since 2.0.4)

 Multi-version testing:
 ```bash
@@ -308,8 +308,8 @@ Multi-version testing:
 ./test-multi-python.sh

 # Manual verification
-python3.9 -m venv test_39
-source test_39/bin/activate
+python3.10 -m venv test_310
+source test_310/bin/activate
 pip install -e .[test] && pytest
 ```

@@ -15,7 +15,7 @@ This directory contains benchmark infrastructure for mlx-knife:
 benchmarks/
 ├── reports/                    # JSONL test reports + Markdown analyses
 │   ├── 2025-12-20-v2.0.4b3.jsonl   # Raw data (one file per test run)
-│   └── BENCHMARK-v1.0-*.md         # Generated analysis reports
+│   └── BENCHMARK-*.md               # Generated analysis reports
 ├── schemas/                    # JSON Schema definitions
 │   ├── report-v0.1.schema.json      # legacy schema
 │   ├── report-v0.2.2.schema.json    # Current schema
@@ -23,7 +23,7 @@ benchmarks/
 ├── tools/                      # Standalone tools
 │   ├── memmon.py                   # Memory monitor (background sampling)
 │   └── memplot.py                  # Memory timeline visualizer
-├── generate_benchmark_report.py    # Report generator (Template v1.0)
+├── generate_benchmark_report.py    # Report generator (Template v1.1)
 ├── validate_reports.py             # Schema validation
 ├── README.md                       # ← You are here
 └── TESTING.md                      # Benchmark handbook (How-To)
@@ -33,7 +33,7 @@ benchmarks/

 | Tool | Purpose |
 |------|---------|
-| `generate_benchmark_report.py` | JSONL → Markdown report (Template v1.0) |
+| `generate_benchmark_report.py` | JSONL → Markdown report (Template v1.1) |
 | `validate_reports.py` | Schema validation of JSONL files |
 | `tools/memmon.py` | Memory + CPU + GPU monitoring (200ms sampling) |
 | `tools/memplot.py` | Interactive 3-row timeline (Memory/CPU/GPU, HTML) |
@@ -59,7 +59,7 @@ See `schemas/LEARNINGS-FOR-v1.0.md` for details.
 ## Recent Reports

 Latest baseline reports are in `reports/` directory:
-- Pattern: `BENCHMARK-v1.0-<version>-<date>-*.md`
+- Pattern: `BENCHMARK-<template>-<version>-<date>-*.md`
 - Hardware: Mac14,13 (M2 Max, 64 GB)
 - Test suite: ~167 tests (Vision + Text + Audio E2E)
 - Quality target: 100% clean (0 MB swap, 0 zombies)
@@ -13,7 +13,7 @@ pytest -m live_e2e tests_2.0/live/ \
 python benchmarks/generate_benchmark_report.py

 # 3. View results
-cat benchmarks/reports/BENCHMARK-v1.0-2.0.4b3-*.md
+cat benchmarks/reports/BENCHMARK-*.md
 ```

 ---
@@ -55,7 +55,7 @@ python benchmarks/tools/memmon.py \
 ```bash
 python benchmarks/generate_benchmark_report.py
 # → Finds most recent .jsonl in benchmarks/reports/
-# → Outputs: BENCHMARK-v1.0-<version>-<date>.md
+# → Outputs: BENCHMARK-<template>-<version>-<date>.md
 ```

 ### Explicit Input File
@@ -456,6 +456,20 @@ def calculate_statistics(data: List[dict]) -> Dict:
            hw_profile = entry["system"]["hardware_profile"]
            break

+    # Build test_id → model_id lookup for GPU analysis (v1.1)
+    # Maps full pytest test ID to model short name
+    test_model_map = {}
+    for entry in benchmark_data:
+        if "test" in entry and "model" in entry:
+            test_id = entry["test"]
+            model_id = entry["model"]["id"]
+            # Short name: remove org/ prefix (mlx-community/, BrokeC/, etc.)
+            if "/" in model_id:
+                model_short = model_id.split("/", 1)[1]
+            else:
+                model_short = model_id
+            test_model_map[test_id] = model_short
+
    return {
        "total_tests": len(benchmark_data),
        "passed": len(passed_tests),
@@ -488,6 +502,7 @@ def calculate_statistics(data: List[dict]) -> Dict:
        "hardware": hw_profile,
        "models": model_stats,
        "tests": test_stats,
+        "test_model_map": test_model_map,
    }
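
Reviewer sketch (not part of the commit): the lookup added in this hunk can be read as two small steps, shortening a model ID and then keying it by pytest test ID. The helper names `short_model_name` and `build_test_model_map` and the sample entries below are hypothetical; only the `"test"` and `"model"` → `"id"` keys come from the diff.

```python
from typing import Dict, List


def short_model_name(model_id: str) -> str:
    """Drop the leading 'org/' segment, if there is one."""
    return model_id.split("/", 1)[1] if "/" in model_id else model_id


def build_test_model_map(entries: List[dict]) -> Dict[str, str]:
    """Map each pytest test ID to the short name of the model it ran against.

    Entries without a "test" or "model" key (e.g. infrastructure tests)
    are skipped, mirroring the condition in the hunk above.
    """
    return {
        entry["test"]: short_model_name(entry["model"]["id"])
        for entry in entries
        if "test" in entry and "model" in entry
    }


# Hypothetical sample entries, not taken from a real report file.
sample = [
    {"test": "tests/live/test_a.py::test_run[model-0]",
     "model": {"id": "mlx-community/Example-4bit"}},
    {"test": "tests/live/test_b.py::test_infra"},  # no model: skipped
    {"test": "tests/live/test_c.py::test_run[model-1]",
     "model": {"id": "local-model"}},
]
test_model_map = build_test_model_map(sample)
```

Splitting on the first `/` only means any org prefix is removed, where the replaced code in the table section stripped just `mlx-community/`.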


@@ -606,30 +621,26 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
        compare_models = {m['id']: m for m in compare_stats['models'].values()}

    if compare_stats:
-        md += f"""```
-{'Model':<40} {'Size':<7} {'Mode':<6} {'Tests':<5} {'Time':<8} {'Old':<8} {'Δ':<8} {'Change':<10} {'RAM (GB)':<12}
-{'='*40} {'='*7} {'='*6} {'='*5} {'='*8} {'='*8} {'='*8} {'='*10} {'='*12}
-"""
+        md += "| Model | Size | Mode | Tests | Time | Old | Δ | Change | RAM (GB) |\n"
+        md += "|-------|-----:|:----:|------:|-----:|----:|--:|-------:|---------:|\n"
    else:
-        md += f"""```
-{'Model':<50} {'Size':<8} {'Mode':<6} {'Tests':<6} {'Time':<10} {'RAM (GB)':<20}
-{'='*50} {'='*8} {'='*6} {'='*6} {'='*10} {'='*20}
-"""
+        md += "| Model | Size | Mode | Tests | Time | RAM (GB) |\n"
+        md += "|-------|-----:|:----:|------:|-----:|---------:|\n"
+
+    def format_model_short(model_id):
+        """Remove org prefix from model ID."""
+        if "/" in model_id:
+            return model_id.split("/", 1)[1]
+        return model_id

    for model in sorted_models:
-        # Shorten model ID (remove mlx-community/ prefix)
-        model_short = model['id'].replace('mlx-community/', '')
-        max_len = 38 if compare_stats else 48
-        if len(model_short) > max_len:
-            model_short = model_short[:max_len-3] + "..."
+        model_short = format_model_short(model['id'])

        # Global RAM range (for backward compat / fallback)
        ram_range = f"{model['ram_min']:.1f}-{model['ram_max']:.1f}"

        if compare_stats:
            old_model = compare_models.get(model['id'])
-
-            # Separate rows per modality (same as non-comparison mode)
            rows_written = 0

            # Vision modality
@@ -643,21 +654,14 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
                else:
                    v_ram_range = f"{v_ram_min:.1f}-{v_ram_max:.1f}"

-                # Get old vision stats (if available)
                if old_model and old_model.get('vision_count', 0) > 0:
                    old_time = old_model['vision_time']
                    delta = model['vision_time'] - old_time
                    change_pct = (delta / old_time * 100) if old_time > 0 else 0
-                    if change_pct > 5:
-                        status = "⚠️"
-                    elif change_pct < -1:
-                        status = "✅"
-                    else:
-                        status = ""
-                    change_str = f"{change_pct:+.1f}% {status}"
-                    md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Vision':<6} {model['vision_count']:<5} {model['vision_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {v_ram_range:<12}\n"
+                    status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
+                    md += f"| {model_short} | {model['size_gb']:.1f} GB | Vision | {model['vision_count']} | {model['vision_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {v_ram_range} |\n"
                else:
-                    md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Vision':<6} {model['vision_count']:<5} {model['vision_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {v_ram_range:<12}\n"
+                    md += f"| {model_short} | {model['size_gb']:.1f} GB | Vision | {model['vision_count']} | {model['vision_time']:.1f}s | N/A | N/A | NEW | {v_ram_range} |\n"
                rows_written += 1

            # Text modality
@@ -671,24 +675,17 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
                else:
                    t_ram_range = f"{t_ram_min:.1f}-{t_ram_max:.1f}"

-                # Get old text stats (if available)
                if old_model and old_model.get('text_count', 0) > 0:
                    old_time = old_model['text_time']
                    delta = model['text_time'] - old_time
                    change_pct = (delta / old_time * 100) if old_time > 0 else 0
-                    if change_pct > 5:
-                        status = "⚠️"
-                    elif change_pct < -1:
-                        status = "✅"
-                    else:
-                        status = ""
-                    change_str = f"{change_pct:+.1f}% {status}"
-                    md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Text':<6} {model['text_count']:<5} {model['text_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {t_ram_range:<12}\n"
+                    status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
+                    md += f"| {model_short} | {model['size_gb']:.1f} GB | Text | {model['text_count']} | {model['text_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {t_ram_range} |\n"
                else:
-                    md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Text':<6} {model['text_count']:<5} {model['text_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {t_ram_range:<12}\n"
+                    md += f"| {model_short} | {model['size_gb']:.1f} GB | Text | {model['text_count']} | {model['text_time']:.1f}s | N/A | N/A | NEW | {t_ram_range} |\n"
                rows_written += 1

-            # Audio modality (NEW in v0.2.2)
+            # Audio modality
            if model['audio_count'] > 0:
                a_ram_min = model['audio_ram_min']
                a_ram_max = model['audio_ram_max']
@@ -699,46 +696,30 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
                else:
                    a_ram_range = f"{a_ram_min:.1f}-{a_ram_max:.1f}"

-                # Get old audio stats (if available)
                if old_model and old_model.get('audio_count', 0) > 0:
                    old_time = old_model['audio_time']
                    delta = model['audio_time'] - old_time
                    change_pct = (delta / old_time * 100) if old_time > 0 else 0
-                    if change_pct > 5:
-                        status = "⚠️"
-                    elif change_pct < -1:
-                        status = "✅"
-                    else:
-                        status = ""
-                    change_str = f"{change_pct:+.1f}% {status}"
-                    md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Audio':<6} {model['audio_count']:<5} {model['audio_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {a_ram_range:<12}\n"
+                    status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
+                    md += f"| {model_short} | {model['size_gb']:.1f} GB | Audio | {model['audio_count']} | {model['audio_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {a_ram_range} |\n"
                else:
-                    md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Audio':<6} {model['audio_count']:<5} {model['audio_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {a_ram_range:<12}\n"
+                    md += f"| {model_short} | {model['size_gb']:.1f} GB | Audio | {model['audio_count']} | {model['audio_time']:.1f}s | N/A | N/A | NEW | {a_ram_range} |\n"
                rows_written += 1

-            # Fallback for legacy data (no modality info) - rare in comparison mode
+            # Fallback for legacy data
            if rows_written == 0 and old_model:
                old_time = old_model['total_time']
                delta = model['total_time'] - old_time
                change_pct = (delta / old_time * 100) if old_time > 0 else 0
-                if change_pct > 5:
-                    status = "⚠️"
-                elif change_pct < -1:
-                    status = "✅"
-                else:
-                    status = ""
-                change_str = f"{change_pct:+.1f}% {status}"
-                md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'-':<6} {model['count']:<5} {model['total_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {ram_range:<12}\n"
+                status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
+                md += f"| {model_short} | {model['size_gb']:.1f} GB | - | {model['count']} | {model['total_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {ram_range} |\n"
            elif rows_written == 0:
-                # New model with no modality info
-                md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'-':<6} {model['count']:<5} {model['total_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {ram_range:<12}\n"
+                md += f"| {model_short} | {model['size_gb']:.1f} GB | - | {model['count']} | {model['total_time']:.1f}s | N/A | N/A | NEW | {ram_range} |\n"
        else:
-            # Separate rows per modality (no "Mixed" ambiguity)
-            # Each modality gets its own line with specific stats + RAM
+            # Without comparison
            rows_written = 0

            if model['vision_count'] > 0:
-                # Use modality-specific RAM range (single value if min==max)
                v_ram_min = model['vision_ram_min']
                v_ram_max = model['vision_ram_max']
                if v_ram_min == float('inf'):
@@ -747,11 +728,10 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
                    v_ram_range = f"{v_ram_min:.1f}"
                else:
                    v_ram_range = f"{v_ram_min:.1f}-{v_ram_max:.1f}"
-                md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'Vision':<6} {model['vision_count']:<6} {model['vision_time']:>8.1f}s  {v_ram_range:<20}\n"
+                md += f"| {model_short} | {model['size_gb']:.1f} GB | Vision | {model['vision_count']} | {model['vision_time']:.1f}s | {v_ram_range} |\n"
                rows_written += 1

            if model['text_count'] > 0:
-                # Use modality-specific RAM range (single value if min==max)
                t_ram_min = model['text_ram_min']
                t_ram_max = model['text_ram_max']
                if t_ram_min == float('inf'):
@@ -760,11 +740,10 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
                    t_ram_range = f"{t_ram_min:.1f}"
                else:
                    t_ram_range = f"{t_ram_min:.1f}-{t_ram_max:.1f}"
-                md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'Text':<6} {model['text_count']:<6} {model['text_time']:>8.1f}s  {t_ram_range:<20}\n"
+                md += f"| {model_short} | {model['size_gb']:.1f} GB | Text | {model['text_count']} | {model['text_time']:.1f}s | {t_ram_range} |\n"
                rows_written += 1

            if model['audio_count'] > 0:
-                # Use modality-specific RAM range (single value if min==max)
                a_ram_min = model['audio_ram_min']
                a_ram_max = model['audio_ram_max']
                if a_ram_min == float('inf'):
@@ -773,14 +752,14 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
                    a_ram_range = f"{a_ram_min:.1f}"
                else:
                    a_ram_range = f"{a_ram_min:.1f}-{a_ram_max:.1f}"
-                md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'Audio':<6} {model['audio_count']:<6} {model['audio_time']:>8.1f}s  {a_ram_range:<20}\n"
+                md += f"| {model_short} | {model['size_gb']:.1f} GB | Audio | {model['audio_count']} | {model['audio_time']:.1f}s | {a_ram_range} |\n"
                rows_written += 1

-            # Fallback for legacy data (no modality info)
+            # Fallback for legacy data
            if rows_written == 0:
-                md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'-':<6} {model['count']:<6} {model['total_time']:>8.1f}s  {ram_range:<20}\n"
+                md += f"| {model_short} | {model['size_gb']:.1f} GB | - | {model['count']} | {model['total_time']:.1f}s | {ram_range} |\n"

-    md += "```\n\n"
+    md += "\n"

    # Model Categories (with modality differentiation)
    large_models = [m for m in sorted_models if m['size_gb'] >= 20]
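
Reviewer sketch (not part of the commit): the comparison rows in this hunk repeat the same delta, percentage, and status logic for each modality. A minimal way to express that once, assuming hypothetical helper names `change_status` and `comparison_row`, could look like this. The thresholds (+5% and -1%) and the row layout are taken from the diff; the sample inputs are the rounded values shown in the first row of the generated report.

```python
def change_status(change_pct: float) -> str:
    """Flag a slowdown above +5% or a speedup below -1%; otherwise no flag."""
    if change_pct > 5:
        return "⚠️"
    if change_pct < -1:
        return "✅"
    return ""


def comparison_row(model_short: str, size_gb: float, mode: str, count: int,
                   new_time: float, old_time: float, ram_range: str) -> str:
    """Build one Markdown table row in the comparison layout used above."""
    delta = new_time - old_time
    # Guard against division by zero, as the generator does.
    change_pct = (delta / old_time * 100) if old_time > 0 else 0
    status = change_status(change_pct)
    return (
        f"| {model_short} | {size_gb:.1f} GB | {mode} | {count} "
        f"| {new_time:.1f}s | {old_time:.1f}s | {delta:+.1f}s "
        f"| {change_pct:+.1f}% {status} | {ram_range} |"
    )


row = comparison_row(
    "Mistral-Small-3.1-24B-Instruct-2503-4bit", 13.2, "Vision", 2,
    new_time=42.3, old_time=42.8, ram_range="18.8-19.1",
)
print(row)
```

Because the report prints rounded times, feeding rounded values back in will not always reproduce the published delta to the last digit.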
@@ -891,22 +870,14 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
        compare_tests = {(t['name'], t.get('modality', 'unknown')): t for t in compare_stats['tests'].values()}

    if compare_stats:
-        md += f"""```
-{'Test Name':<38} {'Mode':<6} {'Models':<7} {'Fastest':<18} {'Slowest':<18} {'Med':<6} {'Old':<6} {'Δ Med':<8}
-{'='*38} {'='*6} {'='*7} {'='*18} {'='*18} {'='*6} {'='*6} {'='*8}
-"""
+        md += "| Test Name | Mode | Models | Fastest | Slowest | Med | Old | Δ Med |\n"
+        md += "|-----------|:----:|-------:|---------|---------|----:|----:|------:|\n"
    else:
-        md += f"""```
-{'Test Name':<44} {'Mode':<6} {'Models':<7} {'Fastest':<22} {'Slowest':<22} {'Med Time'}
-{'='*44} {'='*6} {'='*7} {'='*22} {'='*22} {'='*8}
-"""
+        md += "| Test Name | Mode | Models | Fastest | Slowest | Med Time |\n"
+        md += "|-----------|:----:|-------:|---------|---------|--------:|\n"

    for test in sorted_tests:
-        # Shorten test name if needed
-        max_test_len = 36 if compare_stats else 42
-        test_short = test['name']
-        if len(test_short) > max_test_len:
-            test_short = test_short[:max_test_len-3] + "..."
+        test_name = test['name']

        # Format modality (Vision/Text/Audio/- for unknown)
        modality = test.get('modality', 'unknown')
@@ -924,14 +895,8 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
        slowest = test['slowest']

        if fastest and slowest:
-            max_model_len = 16 if compare_stats else 20
            fastest_str = f"{fastest['model_short']} ({fastest['duration']:.1f}s)"
            slowest_str = f"{slowest['model_short']} ({slowest['duration']:.1f}s)"
-            if len(fastest_str) > max_model_len:
-                fastest_str = fastest_str[:max_model_len-3] + "..."
-            if len(slowest_str) > max_model_len:
-                slowest_str = slowest_str[:max_model_len-3] + "..."
-
            med_time = test['median_time']

            if compare_stats:
@@ -939,18 +904,18 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
                if old_test:
                    old_med = old_test['median_time']
                    delta_pct = ((med_time - old_med) / old_med * 100) if old_med > 0 else 0
-                    delta_str = f"{delta_pct:+.1f}%"
-                    md += f"{test_short:<38} {mode_str:<6} {test['model_count']:<7} {fastest_str:<18} {slowest_str:<18} {med_time:<5.1f}s {old_med:<5.1f}s {delta_str:<8}\n"
+                    md += f"| {test_name} | {mode_str} | {test['model_count']} | {fastest_str} | {slowest_str} | {med_time:.1f}s | {old_med:.1f}s | {delta_pct:+.1f}% |\n"
                else:
-                    md += f"{test_short:<38} {mode_str:<6} {test['model_count']:<7} {fastest_str:<18} {slowest_str:<18} {med_time:<5.1f}s {'N/A':<6} {'NEW':<8}\n"
+                    md += f"| {test_name} | {mode_str} | {test['model_count']} | {fastest_str} | {slowest_str} | {med_time:.1f}s | N/A | NEW |\n"
            else:
-                md += f"{test_short:<44} {mode_str:<6} {test['model_count']:<7} {fastest_str:<22} {slowest_str:<22} {med_time:.1f}s\n"
+                md += f"| {test_name} | {mode_str} | {test['model_count']} | {fastest_str} | {slowest_str} | {med_time:.1f}s |\n"

-    md += "```\n\n"
+    md += "\n"

    # GPU Analysis Section (v1.1 - only if memory data provided)
    gpu_correlations = stats.get("gpu_correlations", {})
    cold_starts = stats.get("cold_starts", {})
+    test_model_map = stats.get("test_model_map", {})

    if gpu_correlations:
        md += "\n---\n\n"
@@ -988,20 +953,23 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
        if cold_start_tests:
            md += "### Cold-Start Tests\n\n"
            md += "First test per model (includes model loading overhead).\n\n"
-            md += f"```\n"
-            md += f"{'Test':<60} {'GPU Dev':<8} {'GPU Rnd':<8} {'Samples':<8}\n"
-            md += f"{'='*60} {'='*8} {'='*8} {'='*8}\n"
+            md += "| Test | Model | GPU Dev | GPU Rnd | Samples |\n"
+            md += "|------|-------|--------:|--------:|--------:|\n"

            for test_id, gpu_stats in cold_start_tests[:15]:  # Limit to 15
-                test_short = test_id.split("::")[-1][:58]
+                # Extract test name (without parametrization)
+                test_full = test_id.split("::")[-1]
+                test_name = test_full.split("[")[0]
+                # Look up model name
+                model_name = test_model_map.get(test_id, "unknown")
                if gpu_stats:
-                    md += f"{test_short:<60} {gpu_stats['gpu_device_avg']:>6.1f}% {gpu_stats['gpu_renderer_avg']:>6.1f}% {gpu_stats['sample_count']:>8}\n"
+                    md += f"| {test_name} | {model_name} | {gpu_stats['gpu_device_avg']:.1f}% | {gpu_stats['gpu_renderer_avg']:.1f}% | {gpu_stats['sample_count']} |\n"
                else:
-                    md += f"{test_short:<60} {'N/A':>8} {'N/A':>8} {'N/A':>8}\n"
+                    md += f"| {test_name} | {model_name} | N/A | N/A | N/A |\n"

            if len(cold_start_tests) > 15:
-                md += f"... and {len(cold_start_tests) - 15} more cold-start tests\n"
-            md += f"```\n\n"
+                md += f"\n*... and {len(cold_start_tests) - 15} more cold-start tests*\n"
+            md += "\n"

        # High GPU utilization tests
        high_gpu_tests = [(tid, gs) for tid, gs in gpu_correlations.items()
@@ -1011,17 +979,20 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
        if high_gpu_tests:
            md += "### High GPU Utilization Tests\n\n"
            md += "Tests with >50% average GPU device utilization.\n\n"
-            md += f"```\n"
-            md += f"{'Test':<55} {'GPU Dev':<10} {'GPU Rnd':<10} {'Pressure':<10}\n"
-            md += f"{'='*55} {'='*10} {'='*10} {'='*10}\n"
+            md += "| Test | Model | GPU Dev | GPU Rnd | Pressure |\n"
+            md += "|------|-------|--------:|--------:|:--------:|\n"

            for test_id, gpu_stats in high_gpu_tests[:10]:
-                test_short = test_id.split("::")[-1][:53]
+                # Extract test name (without parametrization)
+                test_full = test_id.split("::")[-1]
+                test_name = test_full.split("[")[0]
+                # Look up model name
+                model_name = test_model_map.get(test_id, "unknown")
                pressure = gpu_stats.get("memory_pressure_max", 1)
-                pressure_str = "WARN" if pressure > 1 else "OK"
-                md += f"{test_short:<55} {gpu_stats['gpu_device_avg']:>8.1f}% {gpu_stats['gpu_renderer_avg']:>8.1f}% {pressure_str:>10}\n"
+                pressure_str = "⚠️ WARN" if pressure > 1 else "OK"
+                md += f"| {test_name} | {model_name} | {gpu_stats['gpu_device_avg']:.1f}% | {gpu_stats['gpu_renderer_avg']:.1f}% | {pressure_str} |\n"

-            md += f"```\n\n"
+            md += "\n"

        # GPU utilization summary
        all_gpu_device = [gs["gpu_device_avg"] for gs in gpu_correlations.values()]
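
The row-building logic in the hunk above (strip the pytest parametrization suffix, look up the model, render one Markdown table row) can be sketched in isolation as follows. This is a minimal illustration, not the generator's actual code: the helper name `format_gpu_row`, the sample test ID, and the sample statistics are hypothetical. Only the dictionary keys and the string handling are taken from the diff.

```python
def format_gpu_row(test_id: str, gpu_stats: dict, test_model_map: dict) -> str:
    """Render one row of the 'High GPU Utilization Tests' Markdown table.

    Hypothetical helper; it mirrors the inline loop body shown in the diff.
    """
    # Last "::" component of the pytest node ID, e.g. "test_run_command[model_03]"
    test_full = test_id.split("::")[-1]
    # Drop the parametrization suffix so the table shows the bare test name
    test_name = test_full.split("[")[0]
    # Fall back to "unknown" when the test was not mapped to a model
    model_name = test_model_map.get(test_id, "unknown")
    # A missing pressure value defaults to 1, which is reported as OK
    pressure = gpu_stats.get("memory_pressure_max", 1)
    pressure_str = "⚠️ WARN" if pressure > 1 else "OK"
    return (
        f"| {test_name} | {model_name} "
        f"| {gpu_stats['gpu_device_avg']:.1f}% "
        f"| {gpu_stats['gpu_renderer_avg']:.1f}% "
        f"| {pressure_str} |"
    )


# Hypothetical inputs, for illustration only
sample_id = "tests/live/test_cli.py::test_run_command[model_03]"
sample_map = {sample_id: "Phi-3.5-mini-instruct-4bit"}
sample_stats = {"gpu_device_avg": 67.3, "gpu_renderer_avg": 15.0}
sample_row = format_gpu_row(sample_id, sample_stats, sample_map)
print(sample_row)
```

Emitting table rows instead of fixed-width columns means long test names no longer need truncating (the old code cut them at 53 characters), and the model name gets its own column.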
@@ -0,0 +1,235 @@
+# Benchmark Report v1.1: 2.0.4
+
+**Date:** 2026-02-11-wet-benchmark-2.0.4-stable
+**Generated:** 2026-02-11T13:13:56 UTC
+**Generator:** generate_benchmark_report.py v1.1
+**Hardware:** Mac14,13, 12 cores
+
+---
+
+## Input Files
+
+- **Primary:** `benchmarks/reports/2026-02-11-wet-benchmark-2.0.4-stable.jsonl`
+- **Schema:** v0.2.2
+- **Comparison:** `benchmarks/reports/2026-02-06-wet-benchmark-1.jsonl`
+
+---
+
+## Executive Summary
+
+**Tests:** 251 total (179 passed, 72 skipped)
+**Duration:** 1157.1s (19.3 min)
+**Quality:** 100.0% clean (251/251)
+**Models:** 25 tested
+
+### Comparison
+
+**vs:** `2026-02-06-wet-benchmark-1.jsonl`
+**Duration:** 19.6 min → 19.3 min (-1.7%) ✅
+**Models:** 16/25 slower (64%), 9/25 faster (36%)
+
+✅ **System Health:** All tests clean (RAM >5 GB free, 0 zombies)
+
+---
+
+## Test Summary
+
+```
+Total tests:       251
+Passed:            179
+  With model:      122
+  Infrastructure:  57
+Skipped:           72
+Duration:          1157.1s (19.3 min)
+```
+
+---
+
+## System Health
+
+```
+Swap (MB):         min=1524, max=1524, avg=1524.0
+RAM free (GB):     min=8.5, max=38.8, avg=22.8
+Zombies:           min=0, max=0
+
+Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
+  Clean:           251/251 (100.0%)
+  Degraded (RAM):  0
+  Degraded (zombies): 0
+```
+
+---
+
+## Per-Model Statistics
+
+| Model | Size | Mode | Tests | Time | Old | Δ | Change | RAM (GB) |
+|-------|-----:|:----:|------:|-----:|----:|--:|-------:|---------:|
+| Mistral-Small-3.1-24B-Instruct-2503-4bit | 13.2 GB | Vision | 2 | 42.3s | 42.8s | -0.5s | -1.2% ✅ | 18.8-19.1 |
+| Mistral-Small-3.1-24B-Instruct-2503-4bit | 13.2 GB | Text | 1 | 5.8s | 6.9s | -1.0s | -15.2% ✅ | 15.1 |
+| DeepHermes-3-Mistral-24B-Preview-8bit | 23.3 GB | Text | 7 | 134.0s | 138.4s | -4.5s | -3.2% ✅ | 23.8-28.1 |
+| DeepSeek-R1-Distill-Llama-8B-4bit | 4.2 GB | Text | 4 | 30.5s | 30.3s | +0.3s | +0.8%  | 19.9-20.3 |
+| EuroLLM-22B-Instruct-2512-mlx-8bit | 22.4 GB | Text | 4 | 55.8s | 55.5s | +0.3s | +0.6%  | 23.6-23.7 |
+| Gabliterated-Qwen3-0.6B-float32 | 2.2 GB | Text | 4 | 28.6s | 28.1s | +0.5s | +1.7%  | 21.0-21.4 |
+| gpt-oss-20b-MXFP4-Q8 | 11.3 GB | Text | 4 | 51.5s | 50.9s | +0.6s | +1.2%  | 14.4-14.7 |
+| Llama-3.2-3B-Instruct-4bit | 1.7 GB | Text | 4 | 15.6s | 14.9s | +0.8s | +5.1% ⚠️ | 19.2-19.7 |
+| Mistral-7B-Instruct-v0.2-4bit | 4.0 GB | Text | 4 | 14.3s | 14.2s | +0.2s | +1.2%  | 15.2-15.7 |
+| Mistral-Small-3.2-24B-Instruct-2506-4bit | 12.4 GB | Text | 4 | 36.9s | 35.7s | +1.2s | +3.4%  | 13.4-13.5 |
+| Mistral-Small-3.2-24B-Instruct-2506-8bit | 23.3 GB | Text | 4 | 62.3s | 66.3s | -4.0s | -6.0% ✅ | 24.4-25.3 |
+| Mistral-Small-Instruct-2409-4bit | 11.7 GB | Text | 4 | 28.0s | 27.7s | +0.3s | +1.0%  | 12.6-13.4 |
+| Mixtral-8x7B-Instruct-v0.1-4bit | 24.5 GB | Text | 4 | 51.9s | 53.1s | -1.2s | -2.2% ✅ | 27.3-27.4 |
+| OpenCodeInterpreter-DS-33B-hf-4bit-mlx | 17.8 GB | Text | 4 | 37.9s | 38.1s | -0.2s | -0.5%  | 18.7-18.8 |
+| Phi-3-mini-4k-instruct-4bit | 2.0 GB | Text | 4 | 10.9s | 12.3s | -1.3s | -11.0% ✅ | 16.7-16.8 |
+| Phi-3.5-mini-instruct-4bit | 2.0 GB | Text | 4 | 13.2s | 13.1s | +0.1s | +0.4%  | 14.6-14.8 |
+| pixtral-12b-4bit | 7.0 GB | Vision | 7 | 79.7s | 25.0s | +54.7s | +218.4% ⚠️ | 12.6-18.8 |
+| pixtral-12b-4bit | 7.0 GB | Text | 1 | 3.7s | 4.2s | -0.5s | -11.2% ✅ | 12.8 |
+| pixtral-12b-8bit | 12.6 GB | Vision | 2 | 29.5s | 25.9s | +3.6s | +13.9% ⚠️ | 16.4-16.6 |
+| pixtral-12b-8bit | 12.6 GB | Text | 1 | 3.9s | 3.4s | +0.5s | +14.6% ⚠️ | 14.3 |
+| Qwen2.5-0.5B-Instruct-4bit | 0.3 GB | Text | 4 | 16.3s | 22.9s | -6.6s | -28.7% ✅ | 14.3-14.5 |
+| Qwen2.5-Coder-7B-Instruct-8bit | 7.5 GB | Text | 4 | 22.5s | 21.5s | +1.0s | +4.9%  | 8.5-8.7 |
+| Qwen3-30B-A3B-Instruct-2507-4bit | 16.0 GB | Text | 4 | 34.5s | 34.4s | +0.2s | +0.5%  | 17.1-17.3 |
+| Qwen3-32B-4bit | 17.2 GB | Text | 4 | 63.6s | 62.7s | +0.9s | +1.4%  | 18.4-18.7 |
+| Qwen3-Coder-30B-A3B-Instruct-4bit | 16.0 GB | Text | 4 | 36.5s | 37.7s | -1.2s | -3.3% ✅ | 17.1-17.2 |
+| Qwen3-Coder-30B-A3B-Instruct-6bit-DWQ-lr9e-8 | 24.9 GB | Text | 4 | 56.6s | 56.7s | -0.2s | -0.3%  | 26.0-26.1 |
+| whisper-large-v3-turbo-4bit | 0.4 GB | Audio | 11 | 25.5s | 25.4s | +0.1s | +0.5%  | 33.7-34.9 |
+| whisper-large-v3-turbo-8bit | 0.8 GB | Audio | 11 | 24.6s | 24.5s | +0.1s | +0.4%  | 33.7-34.4 |
+
+### Model Categories
+
+```
+LARGE MODELS (≥20 GB): 5 models
+  Avg size:               23.7 GB
+  Text Tests:
+    Models tested:        5
+    Avg test time:        15.2s
+    RAM range:            23.6-28.1 GB
+
+MEDIUM MODELS (10-20 GB): 9 models
+  Avg size:               14.2 GB
+  Vision Tests:
+    Models tested:        2
+    Avg test time:        18.0s
+    RAM range:            16.4-19.1 GB
+  Text Tests:
+    Models tested:        9
+    Avg test time:        9.1s
+    RAM range:            12.6-18.8 GB
+
+SMALL MODELS (<10 GB): 11 models
+  Avg size:               2.9 GB
+  Vision Tests:
+    Models tested:        1
+    Avg test time:        11.4s
+    RAM range:            12.6-18.8 GB
+  Text Tests:
+    Models tested:        9
+    Avg test time:        4.6s
+    RAM range:            8.5-21.4 GB
+  Audio Tests:
+    Models tested:        2
+    Avg test time:        2.3s
+    RAM range:            33.7-34.9 GB
+```
+
+---
+
+## Per-Test Statistics
+
+Shows performance range across models for each test.
+
+| Test Name | Mode | Models | Fastest | Slowest | Med | Old | Δ Med |
+|-----------|:----:|-------:|---------|---------|----:|----:|------:|
+| test_audio_output_not_empty | Audio | 2 | whisper (1.9s) | whisper (1.9s) | 1.9s | 1.9s | +0.8% |
+| test_chart_axis_label_reading | Vision | 1 | pixtral (12.4s) | pixtral (12.4s) | 12.4s | 12.8s | -2.9% |
+| test_chat_completions_batch | Text | 20 | Qwen2.5 (2.1s) | Qwen3 (12.4s) | 7.9s | 7.9s | -0.3% |
+| test_chat_completions_streaming | Text | 20 | Phi (4.3s) | Qwen3 (30.7s) | 14.2s | 15.2s | -6.3% |
+| test_chess_position_e6 | Vision | 1 | pixtral (10.9s) | pixtral (10.9s) | 10.9s | 13.0s | -16.3% |
+| test_contract_name_extraction | Vision | 1 | pixtral (12.4s) | pixtral (12.4s) | 12.4s | 12.6s | -2.3% |
+| test_json_interactive_error_path | Text | 1 | DeepHermes (3.8s) | DeepHermes (3.8s) | 3.8s | 5.2s | -25.7% |
+| test_large_image_support | Vision | 1 | pixtral (9.3s) | pixtral (9.3s) | 9.3s | 9.9s | -5.7% |
+| test_mug_color_identification | Vision | 1 | pixtral (12.4s) | pixtral (12.4s) | 12.4s | 12.7s | -2.6% |
+| test_pipe_from_list_json | Text | 1 | DeepHermes (65.3s) | DeepHermes (65.3s) | 65.3s | 66.3s | -1.6% |
+| test_run_command | Text | 20 | Qwen2.5 (1.4s) | Mistral (13.9s) | 6.9s | 6.8s | +0.5% |
+| test_run_json_output | Text | 20 | Qwen2.5 (1.4s) | Mistral (13.9s) | 6.9s | 6.8s | +0.6% |
+| test_segment_metadata_optional | Audio | 2 | whisper (2.0s) | whisper (2.0s) | 2.0s | 1.9s | +1.8% |
+| test_single_image_chat_completion | Vision | 3 | pixtral (10.2s) | BrokeC/Mistral (20.8s) | 16.0s | 12.4s | +29.0% |
+| test_stdin_dash_appends_trailing_text | Text | 1 | DeepHermes (12.1s) | DeepHermes (12.1s) | 12.1s | 12.2s | -1.1% |
+| test_streaming_graceful_degradation | Vision | 3 | pixtral (12.2s) | BrokeC/Mistral (21.5s) | 13.5s | 13.5s | +0.0% |
+| test_text_request_still_works_on_vision_model | Text | 3 | pixtral (3.7s) | BrokeC/Mistral (5.8s) | 3.9s | 4.2s | -7.5% |
+| test_transcribe_longer_audio_wav | Audio | 2 | whisper (2.0s) | whisper (2.0s) | 2.0s | 2.0s | +1.8% |
+| test_transcribe_mp3_format | Audio | 2 | whisper (1.9s) | whisper (2.0s) | 1.9s | 1.9s | +0.5% |
+| test_transcribe_short_audio_wav | Audio | 2 | whisper (2.2s) | whisper (3.1s) | 2.7s | 2.7s | -0.6% |
+| test_transcription_endpoint_json | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | +0.0% |
+| test_transcription_endpoint_mp3 | Audio | 2 | whisper (2.5s) | whisper (2.6s) | 2.5s | 2.5s | +0.5% |
+| test_transcription_endpoint_rejects_oversized_audio | Audio | 2 | whisper (1.8s) | whisper (1.9s) | 1.9s | 1.8s | +1.1% |
+| test_transcription_endpoint_text_format | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | -0.3% |
+| test_transcription_endpoint_verbose_json | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | +0.4% |
+| test_transcription_endpoint_with_language | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | -0.2% |
+
+
+---
+
+## GPU Analysis (v1.1)
+
+Per-test GPU utilization correlated from memory monitoring data.
+
+### Cold-Start Tests
+
+First test per model (includes model loading overhead).
+
+| Test | Model | GPU Dev | GPU Rnd | Samples |
+|------|-------|--------:|--------:|--------:|
+| test_transcribe_short_audio_wav | whisper-large-v3-turbo-4bit | 48.0% | 16.2% | 6 |
+| test_transcribe_short_audio_wav | whisper-large-v3-turbo-8bit | 48.5% | 18.8% | 4 |
+| test_run_command | DeepHermes-3-Mistral-24B-Preview-8bit | 14.4% | 12.5% | 20 |
+| test_run_command | DeepSeek-R1-Distill-Llama-8B-4bit | 14.3% | 14.3% | 7 |
+| test_run_command | EuroLLM-22B-Instruct-2512-mlx-8bit | 18.7% | 18.2% | 21 |
+| test_run_command | Gabliterated-Qwen3-0.6B-float32 | 38.8% | 38.2% | 5 |
+| test_run_command | Llama-3.2-3B-Instruct-4bit | 23.5% | 22.8% | 4 |
+| test_run_command | Mistral-7B-Instruct-v0.2-4bit | 20.0% | 20.0% | 5 |
+| test_run_command | Mistral-Small-3.2-24B-Instruct-2506-4bit | 0.0% | 0.0% | 12 |
+| test_run_command | Mistral-Small-3.2-24B-Instruct-2506-8bit | 16.0% | 15.3% | 24 |
+| test_run_command | Mistral-Small-Instruct-2409-4bit | 9.6% | 8.7% | 11 |
+| test_run_command | Mixtral-8x7B-Instruct-v0.1-4bit | 22.2% | 12.9% | 22 |
+| test_run_command | OpenCodeInterpreter-DS-33B-hf-4bit-mlx | 36.7% | 14.6% | 16 |
+| test_run_command | Phi-3-mini-4k-instruct-4bit | 49.0% | 22.2% | 4 |
+| test_run_command | Phi-3.5-mini-instruct-4bit | 67.3% | 15.0% | 3 |
+
+*... and 10 more cold-start tests*
+
+### High GPU Utilization Tests
+
+Tests with >50% average GPU device utilization.
+
+| Test | Model | GPU Dev | GPU Rnd | Pressure |
+|------|-------|--------:|--------:|:--------:|
+| test_chess_position_e6 | pixtral-12b-4bit | 83.6% | 82.5% | OK |
+| test_contract_name_extraction | pixtral-12b-4bit | 81.6% | 80.1% | OK |
+| test_mug_color_identification | pixtral-12b-4bit | 80.8% | 77.0% | OK |
+| test_single_image_chat_completion | pixtral-12b-8bit | 71.1% | 67.1% | OK |
+| test_streaming_graceful_degradation | pixtral-12b-4bit | 70.3% | 56.0% | OK |
+| test_audio_output_not_empty | whisper-large-v3-turbo-4bit | 68.8% | 29.8% | OK |
+| test_transcribe_longer_audio_wav | whisper-large-v3-turbo-8bit | 68.0% | 12.2% | OK |
+| test_run_command | Phi-3.5-mini-instruct-4bit | 67.3% | 15.0% | OK |
+| test_streaming_graceful_degradation | Mistral-Small-3.1-24B-Instruct-2503-4bit | 62.1% | 50.5% | OK |
+| test_chart_axis_label_reading | pixtral-12b-4bit | 62.0% | 55.1% | OK |
+
+### GPU Utilization Summary
+
+```
+Sample interval:      200ms
+Correlated tests:     134
+Cold-start tests:     25
+GPU Device avg:        22.7%  (was  20.6%, Δ +2.1%)
+GPU Device max:        83.6%  (was  84.0%, Δ -0.4%)
+GPU Renderer avg:      15.6%  (was  17.6%, Δ -2.0%)
+GPU Renderer max:      82.5%  (was  81.4%, Δ +1.1%)
+```
+
+
+---
+
+## Files
+
+- **Benchmark report:** `benchmarks/reports/2026-02-11-wet-benchmark-2.0.4-stable.jsonl`
+- **Schema:** `benchmarks/schemas/report-v0.2.2.schema.json`
+- **Memory data:** correlated
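
The `Δ` and `Change` columns in the report above compare each timing with the earlier run. A minimal sketch of that arithmetic follows. The flag thresholds (warn above +5%, mark as improved below -1%) are assumptions inferred from the visible rows, not values read from `generate_benchmark_report.py`, and the function names are hypothetical.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means faster)."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


def change_flag(pct: float, warn_above: float = 5.0, good_below: float = -1.0) -> str:
    """Pick a marker for a percentage change.

    Both thresholds are assumed; they are merely consistent with the
    rows shown in the report.
    """
    if pct > warn_above:
        return "⚠️"
    if pct < good_below:
        return "✅"
    return ""


def format_change(old: float, new: float) -> str:
    """Format a change the way the table does, e.g. '+5.0% ⚠️'."""
    pct = pct_change(old, new)
    return f"{pct:+.1f}% {change_flag(pct)}".rstrip()
```

The report presumably computes its percentages from unrounded timings, so feeding the rounded values printed in the table into this sketch can differ from the printed percentage in the last digit.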
@@ -7,4 +7,4 @@ import warnings
 # Issue parity with 1.1.0 (Issue #22)
 warnings.filterwarnings('ignore', message='urllib3 v2 only supports OpenSSL 1.1.1+')

-__version__ = "2.0.4b10"
+__version__ = "2.0.4"
@@ -13,29 +13,32 @@ authors = [
    {name = "The BROKE team", email = "broke@gmx.eu"},
 ]
 classifiers = [
-    "Development Status :: 4 - Beta",
+    "Development Status :: 5 - Production/Stable",
    "Intended Audience :: Developers",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
-    # Python 3.9: MLX 0.30+ requires 3.10+
-    # Python 3.13+: miniaudio lacks pre-built wheels
    "Operating System :: MacOS",
    "Environment :: Console",
    "License :: OSI Approved :: Apache Software License",
 ]
 dependencies = [
+    # Core
    "huggingface-hub>=1.0.0",
    "requests>=2.32.0",
-    "mlx-lm>=0.30.5",  # Vision/Audio extras require 0.30.5+
-    "mlx>=0.30.0",
    "fastapi>=0.116.0",
    "uvicorn>=0.35.0",
    "pydantic>=2.11.0",
    "httpx>=0.27.0",
-    "python-multipart>=0.0.9",  # For audio file uploads
+    "python-multipart>=0.0.9",
+    # MLX stack (Text + Vision + Audio)
+    "mlx>=0.30.0",
+    "mlx-lm>=0.30.5",
+    "mlx-vlm==0.3.10",
+    "mlx-audio==0.3.1",
+    "tiktoken>=0.7.0",
 ]

 [project.scripts]
@@ -65,20 +68,6 @@ dev = [
    "ruff>=0.1.0",
    "mypy>=1.5.0",
 ]
-vision = [
-    "mlx-vlm==0.3.10",  # Pinned: stable, tested version (ADR-012)
-]
-audio = [
-    # mlx-audio pinned: 0.3.1 has tiktoken regression (Issue #479)
-    # We bundle complete tiktoken workaround in mlxk2/audio/whisper_tokenizer.py
-    "mlx-audio==0.3.1",
-    "tiktoken>=0.7.0",  # Required for Whisper tiktoken fallback
-]
-all = [
-    "mlx-vlm==0.3.10",  # Pinned: stable, tested version
-    "mlx-audio==0.3.1",  # Pinned: tiktoken regression patched by mlxk2
-    "tiktoken>=0.7.0",
-]

 [tool.setuptools]
 license-files = [
@@ -1,19 +0,0 @@
-# mlx_knife requirements
-# Core dependencies for HuggingFace model management
-
-huggingface-hub>=1.3.0
-requests>=2.32.0
-mlx-lm>=0.30.5          # Vision/Audio extras require 0.30.5+
-mlx>=0.30.0             # Core MLX library
-# Note: transformers pinned by mlx-lm (5.0.0rc3 as of 0.30.5)
-# Known issue: trust_remote_code dialog affects some models (Klear-46B)
-
-# API Server dependencies (for 'mlxk server' command)
-fastapi>=0.116.0
-uvicorn>=0.35.0
-pydantic>=2.11.0
-
-# Test dependencies (for FastAPI TestClient)
-httpx>=0.27.0
-
-# Note: Python 3.9-3.14 supported, tested on Apple Silicon M1/M2/M3/M4
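
The `pyproject.toml` hunk above moves `mlx-vlm`, `mlx-audio`, and `tiktoken` from the optional `vision`/`audio`/`all` extras into the core dependencies, so a plain install now pulls in the exact pins. One way to confirm that an environment matches those pins is sketched below using the standard-library `importlib.metadata`. The helper `check_pins` and the `PINNED` mapping are hypothetical and not part of the project. Only the two exact-pinned versions come from the diff.

```python
from importlib import metadata

# Exact pins copied from the pyproject.toml hunk; the mapping itself is illustrative
PINNED = {"mlx-vlm": "0.3.10", "mlx-audio": "0.3.1"}


def check_pins(pins: dict, get_version=metadata.version) -> list:
    """Return a list of human-readable mismatches (empty if all pins match).

    get_version is injectable so the check can be exercised without the
    packages installed; by default it queries the current environment.
    """
    problems = []
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems
```

Calling `check_pins(PINNED)` in the target environment returns an empty list when both packages are installed at the pinned versions. Range requirements such as `tiktoken>=0.7.0` would need a real version comparison and are left out of this sketch.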