Mirror of https://github.com/cloudstack-llc/mlx-knife.git (synced 2026-07-01 20:44:14 -04:00)
Release 2.0.4 stable - see CHANGELOG.md for details
+5
-1
@@ -1,6 +1,6 @@
|
||||
# Changelog
|
||||
|
||||
## [2.0.4] - 2026-XX-XX (WIP)
|
||||
## [2.0.4] - 2026-02-11
|
||||
|
||||
> **First stable release with Audio/STT support.** This release consolidates beta.9 and beta.10 improvements into a production-ready package.
|
||||
|
||||
@@ -18,8 +18,12 @@
|
||||
- `test_whisper_tokenizer.py`: 47 tests for tiktoken workaround (get_encoding, get_tokenizer, Tokenizer class)
|
||||
- `TestEmbeddingGate`: 3 tests for embedding model runtime blocking
|
||||
|
||||
- **Test Infrastructure:** Test suite now runs consistently with `unset HF_HOME` (uses fallback models)
|
||||
|
||||
### Fixed (since beta.10)
|
||||
|
||||
- **`serve --model` Pre-Validation:** Model ambiguity and not-found errors now detected before server starts. Previously showed stacktrace; now shows clean error message with suggestions.
|
||||
|
||||
- **Run Preflight Consistency:** `run.py` now passes `probe` and `framework` to `audio_runtime_compatibility()`. STT model_type and tekken.json gates now work in CLI (previously only in list/health).
|
||||
|
||||
- **STT model_type Gate:** Extended to accept `vibevoice` and `audio` model types (was only `whisper`/`voxtral`). VibeVoice-ASR and future STT models no longer blocked incorrectly.
|
||||
|
||||
@@ -4,9 +4,9 @@
|
||||
<img src="https://github.com/mzau/mlx-knife/raw/main/mlxk-demo.gif" alt="MLX Knife Demo" width="900">
|
||||
</p>
|
||||
|
||||
**Current Version: 2.0.4-beta.10** (Stable: 2.0.3)
|
||||
**Current Version: 2.0.4** (Stable)
|
||||
|
||||
[](https://github.com/mzau/mlx-knife/releases)
|
||||
[](https://github.com/mzau/mlx-knife/releases)
|
||||
[](https://www.apache.org/licenses/LICENSE-2.0)
|
||||
[](https://www.python.org/downloads/)
|
||||
[](https://support.apple.com/en-us/HT211814)
|
||||
@@ -17,10 +17,8 @@
|
||||
|
||||
## Features
|
||||
|
||||
> **⚠️ Beta.9 Audio Bug:** If you installed `mlx-knife[audio]==2.0.4b9` from PyPI, audio transcription fails with "Processor not found". Upgrade to beta.10: `pip install mlx-knife[all]==2.0.4b10`
|
||||
|
||||
### What's New in 2.0.4 (Coming Soon - Currently Beta)
|
||||
- **Audio Transcription (STT)** - Whisper speech-to-text (`--audio` flag, `pip install mlx-knife[audio]`)
|
||||
### What's New in 2.0.4
|
||||
- **Audio Transcription (STT)** - Whisper speech-to-text (`--audio` flag)
|
||||
- **Vision Models with EXIF Metadata** - Image analysis + automatic GPS/date/camera extraction visible to the model
|
||||
- **Unix Pipe Integration** - Chain models without temp files (`vision → text` workflows)
|
||||
- **Local Development Workflow** - Clone → Repair → Test models without HuggingFace round-trips
|
||||
@@ -79,42 +77,33 @@ This license applies **only** to the `mlx-knife` code and **does not extend** to
|
||||
### Python Compatibility
|
||||
|
||||
✅ **Python 3.10 - 3.12** - Full support (Text + Vision + Audio)
|
||||
❌ **Python 3.9** - Not supported (MLX 0.30+ requires 3.10+)
|
||||
❌ **Python 3.9** - Use version 2.0.3 (text + cache management only)
|
||||
❌ **Python 3.13+** - Not supported (miniaudio lacks pre-built wheels)
|
||||
|
||||
**Note:** Vision/Audio features require Python 3.10+. Recommended: **Python 3.10 or 3.11** for best compatibility.
|
||||
**Recommended:** Python 3.10 or 3.11 for best compatibility.
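
The support matrix above can be checked before installing. A minimal sketch, assuming only the version bounds stated in this README (3.10 through 3.12); the function name `is_supported_python` is illustrative and not part of mlx-knife:

```python
import sys

# Bounds taken from the compatibility list above; everything else is illustrative.
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 12)


def is_supported_python(version=None):
    """Return True if (major, minor) falls within the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_SUPPORTED <= (major, minor) <= MAX_SUPPORTED


if __name__ == "__main__":
    if not is_supported_python():
        found = ".".join(str(part) for part in sys.version_info[:2])
        print(f"Python {found} is outside the supported 3.10-3.12 range")
```

Comparing `(major, minor)` tuples avoids string comparisons such as `"3.9" < "3.10"`, which sort incorrectly.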
|
||||
|
||||
|
||||
|
||||
## Installation
|
||||
|
||||
### 1. PyPI Stable (2.0.3 - Text models only)
|
||||
### 1. PyPI (Recommended)
|
||||
|
||||
```bash
|
||||
pip install mlx-knife
|
||||
mlxk --version # → mlxk 2.0.3
|
||||
```
|
||||
|
||||
**Requirements:** macOS Apple Silicon, Python 3.9-3.12
|
||||
|
||||
### 2. PyPI Beta (2.0.4-beta.10 - Text + Vision + Audio)
|
||||
|
||||
```bash
|
||||
pip install mlx-knife[all]==2.0.4b10
|
||||
mlxk --version # → mlxk 2.0.4b10
|
||||
mlxk --version # → mlxk 2.0.4
|
||||
```
|
||||
|
||||
**Requirements:** macOS Apple Silicon, Python 3.10-3.12
|
||||
**Features:** Audio STT (Whisper), Vision with EXIF metadata, full tiktoken workaround
|
||||
**Includes:** Text, Vision, Audio (Whisper STT), EXIF metadata, Unix pipes
|
||||
|
||||
### 3. Developer Installation
|
||||
### 2. Developer Installation
|
||||
|
||||
```bash
|
||||
git clone https://github.com/mzau/mlx-knife.git
|
||||
cd mlx-knife
|
||||
pip install -e ".[all,dev,test]"
|
||||
pip install -e ".[dev,test]"
|
||||
|
||||
mlxk --version # → mlxk 2.0.4b10
|
||||
mlxk --version # → mlxk 2.0.4
|
||||
pytest -v
|
||||
```
|
||||
|
||||
@@ -285,8 +274,7 @@ Image analysis via the `--image` flag (CLI and server). Requires Python 3.10+.
|
||||
#### Requirements
|
||||
|
||||
- **Python 3.10+** (mlx-vlm dependency)
|
||||
- **Installation:** `pip install mlx-knife[vision]`
|
||||
- **Backend:** mlx-vlm 0.3.10 (auto-installed from PyPI)
|
||||
- **Backend:** mlx-vlm 0.3.10 (included in base install)
|
||||
|
||||
#### Usage
|
||||
|
||||
@@ -424,7 +412,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
|
||||
|
||||
**⚠️ Important:** Vision support relies on mlx-vlm (upstream), which has known stability issues. While `mlxk health` verifies file integrity, **runtime failures may occur** with certain model architectures due to upstream bugs.
|
||||
|
||||
**✅ Tested & Working Models** (mlx-knife v2.0.4-beta.6):
|
||||
**✅ Tested & Working Models** (mlx-knife v2.0.4):
|
||||
|
||||
| Model | Size | Notes |
|
||||
|-------|------|-------|
|
||||
@@ -452,14 +440,13 @@ mlxk convert <model> <output> --repair-index
|
||||
|
||||
### Audio Transcription (Speech-to-Text)
|
||||
|
||||
> **🎙️ New in beta.9/10:** Professional STT via dedicated Whisper models (mlx-audio backend). Beta.10 fixes PyPI install (no Git workaround needed). Backward compatible with Gemma-3n multimodal audio (mlx-vlm).
|
||||
> **🎙️ Audio Transcription:** Speech-to-text via Whisper models (mlx-audio backend). Works out-of-the-box with PyPI install. Backward compatible with Gemma-3n multimodal audio (mlx-vlm).
|
||||
|
||||
**Requirements:**
|
||||
- **Python 3.10+** (mlx-audio dependency)
|
||||
- **Installation:** `pip install mlx-knife[audio]` (tiktoken workaround bundled)
|
||||
- **Python 3.10+** (mlx-audio dependency, included in base install)
|
||||
- **No system dependencies:** MP3/WAV decoding via embedded libsndfile (no ffmpeg or Homebrew required)
|
||||
|
||||
**✅ Recommended Models** (mlx-knife v2.0.4-beta.10):
|
||||
**✅ Recommended Models** (mlx-knife v2.0.4):
|
||||
|
||||
| Model | Backend | Size | Duration | Notes |
|
||||
|-------|---------|------|----------|-------|
|
||||
@@ -1240,7 +1227,7 @@ Apache License 2.0 — see `LICENSE` (root) and `mlxk2/NOTICE`.
|
||||
|
||||
<p align="center">
|
||||
<b>Made with ❤️ by The BROKE team <img src="broke-logo.png" alt="BROKE Logo" width="30" align="middle"></b><br>
|
||||
<i>Version 2.0.4-beta.10 | February 2026</i><br>
|
||||
<i>Version 2.0.4 | February 2026</i><br>
|
||||
<a href="https://github.com/mzau/broke-nchat">💬 Web UI: nChat - lightweight chat interface</a> •
|
||||
<a href="https://github.com/mzau/broke-cluster">🔮 Multi-node: BROKE Cluster</a>
|
||||
</p>
|
||||
|
||||
+9
-9
@@ -4,24 +4,24 @@ This document contains version-specific details, complete file listings, and imp
|
||||
|
||||
## Current Status
|
||||
|
||||
✅ **2.0.4-beta.10** — **Audio PyPI Fix** (tiktoken workaround complete); Runtime compatibility accuracy; Audio transcription (Whisper via mlx-audio); Server `/v1/audio/transcriptions` endpoint; Probe/Policy architecture complete; Vision support Phase 1-3 (CLI + Server); Pipes/Memory-Aware; EXIF metadata; **Test Portfolio Separation complete**; Workspace Infrastructure (ADR-018 Phase 0a+0b+0c); Convert Operation (ADR-018 Phase 1); Resumable Clone; **Benchmark Schema v0.2.2** (Precise test timing).
|
||||
✅ **2.0.4** — **First stable release with Vision + Audio.** Vision support (CLI + Server, EXIF metadata); Audio transcription (Whisper via mlx-audio); Runtime compatibility accuracy; Server `/v1/audio/transcriptions` endpoint; Probe/Policy architecture; Pipes/Memory-Aware; **Test Portfolio Separation complete**; Workspace Infrastructure (ADR-018 Phase 0a+0b+0c); Convert Operation (ADR-018 Phase 1); Resumable Clone; **Benchmark Schema v0.2.2**.
|
||||
|
||||
### Test Results (Official Reference)
|
||||
|
||||
**Standard Unit Tests (Multi-Python):**
|
||||
```
|
||||
Platform: macOS 26.2 (Tahoe), M2 Max, 64GB RAM
|
||||
Python 3.10: 647 passed, 11 skipped in 19.78s
|
||||
Python 3.11: 647 passed, 11 skipped in 19.91s
|
||||
Python 3.12: 647 passed, 11 skipped in 20.94s
|
||||
Note: Default suite works on 16GB. Wet-umbrella: 64GB recommended (M1 Max 32GB untested)
|
||||
Python 3.10: 697 passed, 13 skipped
|
||||
Python 3.11: 697 passed, 13 skipped
|
||||
Python 3.12: 697 passed, 13 skipped
|
||||
Note: Default suite works on 16GB. Full integration tests: 64GB recommended
|
||||
```
|
||||
|
||||
**Wet Umbrella (4-Phase Integration):**
|
||||
**Full Integration Tests (`./scripts/test-wet-umbrella.sh`):**
|
||||
```
|
||||
Phase 1 (wet marker): 168 passed, 73 skipped, 680 deselected (Schema v0.2.2)
|
||||
Phase 2-4 (live_pull/clone/pipe): 3 passed, 742 deselected
|
||||
Total: 171 passed across all phases
|
||||
Phase 1 (portfolio tests): 179 passed, 73 skipped, 732 deselected
|
||||
Phase 2-4 (live operations): 3 passed
|
||||
Total: 182 passed across all phases
|
||||
```
|
||||
|
||||
✅ **Production verified & reported:** M1, M1 Max, M2 Max in real-world use
|
||||
|
||||
+3
-3
@@ -300,7 +300,7 @@ pytest -m live_stop_tokens -v
|
||||
|
||||
## Python Version Compatibility
|
||||
|
||||
**All tests validated on Python 3.9-3.14**
|
||||
**Tests validated on Python 3.10-3.12** (Python 3.9 not supported since 2.0.4)
|
||||
|
||||
Multi-version testing:
|
||||
```bash
|
||||
@@ -308,8 +308,8 @@ Multi-version testing:
|
||||
./test-multi-python.sh
|
||||
|
||||
# Manual verification
|
||||
python3.9 -m venv test_39
|
||||
source test_39/bin/activate
|
||||
python3.10 -m venv test_310
|
||||
source test_310/bin/activate
|
||||
pip install -e .[test] && pytest
|
||||
```
|
||||
|
||||
|
||||
@@ -15,7 +15,7 @@ This directory contains benchmark infrastructure for mlx-knife:
|
||||
benchmarks/
|
||||
├── reports/ # JSONL test reports + Markdown analyses
|
||||
│ ├── 2025-12-20-v2.0.4b3.jsonl # Raw data (one file per test run)
|
||||
│ └── BENCHMARK-v1.0-*.md # Generated analysis reports
|
||||
│ └── BENCHMARK-*.md # Generated analysis reports
|
||||
├── schemas/ # JSON Schema definitions
|
||||
│ ├── report-v0.1.schema.json # legacy schema
|
||||
│ ├── report-v0.2.2.schema.json # Current schema
|
||||
@@ -23,7 +23,7 @@ benchmarks/
|
||||
├── tools/ # Standalone tools
|
||||
│ ├── memmon.py # Memory monitor (background sampling)
|
||||
│ └── memplot.py # Memory timeline visualizer
|
||||
├── generate_benchmark_report.py # Report generator (Template v1.0)
|
||||
├── generate_benchmark_report.py # Report generator (Template v1.1)
|
||||
├── validate_reports.py # Schema validation
|
||||
├── README.md # ← You are here
|
||||
└── TESTING.md # Benchmark handbook (How-To)
|
||||
@@ -33,7 +33,7 @@ benchmarks/
|
||||
|
||||
| Tool | Purpose |
|
||||
|------|---------|
|
||||
| `generate_benchmark_report.py` | JSONL → Markdown report (Template v1.0) |
|
||||
| `generate_benchmark_report.py` | JSONL → Markdown report (Template v1.1) |
|
||||
| `validate_reports.py` | Schema validation of JSONL files |
|
||||
| `tools/memmon.py` | Memory + CPU + GPU monitoring (200ms sampling) |
|
||||
| `tools/memplot.py` | Interactive 3-row timeline (Memory/CPU/GPU, HTML) |
|
||||
@@ -59,7 +59,7 @@ See `schemas/LEARNINGS-FOR-v1.0.md` for details.
|
||||
## Recent Reports
|
||||
|
||||
Latest baseline reports are in `reports/` directory:
|
||||
- Pattern: `BENCHMARK-v1.0-<version>-<date>-*.md`
|
||||
- Pattern: `BENCHMARK-<template>-<version>-<date>-*.md`
|
||||
- Hardware: Mac14,13 (M2 Max, 64 GB)
|
||||
- Test suite: ~167 tests (Vision + Text + Audio E2E)
|
||||
- Quality target: 100% clean (0 MB swap, 0 zombies)
|
||||
|
||||
@@ -13,7 +13,7 @@ pytest -m live_e2e tests_2.0/live/ \
|
||||
python benchmarks/generate_benchmark_report.py
|
||||
|
||||
# 3. View results
|
||||
cat benchmarks/reports/BENCHMARK-v1.0-2.0.4b3-*.md
|
||||
cat benchmarks/reports/BENCHMARK-*.md
|
||||
```
|
||||
|
||||
---
|
||||
@@ -55,7 +55,7 @@ python benchmarks/tools/memmon.py \
|
||||
```bash
|
||||
python benchmarks/generate_benchmark_report.py
|
||||
# → Finds most recent .jsonl in benchmarks/reports/
|
||||
# → Outputs: BENCHMARK-v1.0-<version>-<date>.md
|
||||
# → Outputs: BENCHMARK-<template>-<version>-<date>.md
|
||||
```
|
||||
|
||||
### Explicit Input File
|
||||
|
||||
@@ -456,6 +456,20 @@ def calculate_statistics(data: List[dict]) -> Dict:
|
||||
hw_profile = entry["system"]["hardware_profile"]
|
||||
break
|
||||
|
||||
# Build test_id → model_id lookup for GPU analysis (v1.1)
|
||||
# Maps full pytest test ID to model short name
|
||||
test_model_map = {}
|
||||
for entry in benchmark_data:
|
||||
if "test" in entry and "model" in entry:
|
||||
test_id = entry["test"]
|
||||
model_id = entry["model"]["id"]
|
||||
# Short name: remove org/ prefix (mlx-community/, BrokeC/, etc.)
|
||||
if "/" in model_id:
|
||||
model_short = model_id.split("/", 1)[1]
|
||||
else:
|
||||
model_short = model_id
|
||||
test_model_map[test_id] = model_short
|
||||
|
||||
return {
|
||||
"total_tests": len(benchmark_data),
|
||||
"passed": len(passed_tests),
|
||||
@@ -488,6 +502,7 @@ def calculate_statistics(data: List[dict]) -> Dict:
|
||||
"hardware": hw_profile,
|
||||
"models": model_stats,
|
||||
"tests": test_stats,
|
||||
"test_model_map": test_model_map,
|
||||
}
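
The `test_model_map` lookup added in these hunks can be exercised on its own. A minimal sketch, assuming benchmark entries are dicts with a `test` string and a `model` dict holding an `id`, as the hunk reads them; the function names and the sample entries are illustrative, not taken from the repository:

```python
def short_model_name(model_id):
    """Drop the org prefix (the text before the first '/'), if there is one."""
    return model_id.split("/", 1)[1] if "/" in model_id else model_id


def build_test_model_map(benchmark_data):
    """Map each pytest test ID to the short name of the model it ran against."""
    mapping = {}
    for entry in benchmark_data:
        # Entries without a model (assumed here to be infrastructure tests) are skipped.
        if "test" in entry and "model" in entry:
            mapping[entry["test"]] = short_model_name(entry["model"]["id"])
    return mapping


# Made-up sample entries for illustration only.
sample = [
    {
        "test": "tests/live/test_a.py::test_run[m0]",
        "model": {"id": "mlx-community/Llama-3.2-3B-Instruct-4bit"},
    },
    {"test": "tests/live/test_b.py::test_infra"},
]
print(build_test_model_map(sample))
```

Splitting on the first `/` only (`split("/", 1)`) keeps any later slashes in the model name intact.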
|
||||
|
||||
|
||||
@@ -606,30 +621,26 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
compare_models = {m['id']: m for m in compare_stats['models'].values()}
|
||||
|
||||
if compare_stats:
|
||||
md += f"""```
|
||||
{'Model':<40} {'Size':<7} {'Mode':<6} {'Tests':<5} {'Time':<8} {'Old':<8} {'Δ':<8} {'Change':<10} {'RAM (GB)':<12}
|
||||
{'='*40} {'='*7} {'='*6} {'='*5} {'='*8} {'='*8} {'='*8} {'='*10} {'='*12}
|
||||
"""
|
||||
md += "| Model | Size | Mode | Tests | Time | Old | Δ | Change | RAM (GB) |\n"
|
||||
md += "|-------|-----:|:----:|------:|-----:|----:|--:|-------:|---------:|\n"
|
||||
else:
|
||||
md += f"""```
|
||||
{'Model':<50} {'Size':<8} {'Mode':<6} {'Tests':<6} {'Time':<10} {'RAM (GB)':<20}
|
||||
{'='*50} {'='*8} {'='*6} {'='*6} {'='*10} {'='*20}
|
||||
"""
|
||||
md += "| Model | Size | Mode | Tests | Time | RAM (GB) |\n"
|
||||
md += "|-------|-----:|:----:|------:|-----:|---------:|\n"
|
||||
|
||||
def format_model_short(model_id):
|
||||
"""Remove org prefix from model ID."""
|
||||
if "/" in model_id:
|
||||
return model_id.split("/", 1)[1]
|
||||
return model_id
|
||||
|
||||
for model in sorted_models:
|
||||
# Shorten model ID (remove mlx-community/ prefix)
|
||||
model_short = model['id'].replace('mlx-community/', '')
|
||||
max_len = 38 if compare_stats else 48
|
||||
if len(model_short) > max_len:
|
||||
model_short = model_short[:max_len-3] + "..."
|
||||
model_short = format_model_short(model['id'])
|
||||
|
||||
# Global RAM range (for backward compat / fallback)
|
||||
ram_range = f"{model['ram_min']:.1f}-{model['ram_max']:.1f}"
|
||||
|
||||
if compare_stats:
|
||||
old_model = compare_models.get(model['id'])
|
||||
|
||||
# Separate rows per modality (same as non-comparison mode)
|
||||
rows_written = 0
|
||||
|
||||
# Vision modality
|
||||
@@ -643,21 +654,14 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
else:
|
||||
v_ram_range = f"{v_ram_min:.1f}-{v_ram_max:.1f}"
|
||||
|
||||
# Get old vision stats (if available)
|
||||
if old_model and old_model.get('vision_count', 0) > 0:
|
||||
old_time = old_model['vision_time']
|
||||
delta = model['vision_time'] - old_time
|
||||
change_pct = (delta / old_time * 100) if old_time > 0 else 0
|
||||
if change_pct > 5:
|
||||
status = "⚠️"
|
||||
elif change_pct < -1:
|
||||
status = "✅"
|
||||
else:
|
||||
status = ""
|
||||
change_str = f"{change_pct:+.1f}% {status}"
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Vision':<6} {model['vision_count']:<5} {model['vision_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {v_ram_range:<12}\n"
|
||||
status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Vision | {model['vision_count']} | {model['vision_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {v_ram_range} |\n"
|
||||
else:
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Vision':<6} {model['vision_count']:<5} {model['vision_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {v_ram_range:<12}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Vision | {model['vision_count']} | {model['vision_time']:.1f}s | N/A | N/A | NEW | {v_ram_range} |\n"
|
||||
rows_written += 1
|
||||
|
||||
# Text modality
|
||||
@@ -671,24 +675,17 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
else:
|
||||
t_ram_range = f"{t_ram_min:.1f}-{t_ram_max:.1f}"
|
||||
|
||||
# Get old text stats (if available)
|
||||
if old_model and old_model.get('text_count', 0) > 0:
|
||||
old_time = old_model['text_time']
|
||||
delta = model['text_time'] - old_time
|
||||
change_pct = (delta / old_time * 100) if old_time > 0 else 0
|
||||
if change_pct > 5:
|
||||
status = "⚠️"
|
||||
elif change_pct < -1:
|
||||
status = "✅"
|
||||
else:
|
||||
status = ""
|
||||
change_str = f"{change_pct:+.1f}% {status}"
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Text':<6} {model['text_count']:<5} {model['text_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {t_ram_range:<12}\n"
|
||||
status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Text | {model['text_count']} | {model['text_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {t_ram_range} |\n"
|
||||
else:
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Text':<6} {model['text_count']:<5} {model['text_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {t_ram_range:<12}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Text | {model['text_count']} | {model['text_time']:.1f}s | N/A | N/A | NEW | {t_ram_range} |\n"
|
||||
rows_written += 1
|
||||
|
||||
# Audio modality (NEW in v0.2.2)
|
||||
# Audio modality
|
||||
if model['audio_count'] > 0:
|
||||
a_ram_min = model['audio_ram_min']
|
||||
a_ram_max = model['audio_ram_max']
|
||||
@@ -699,46 +696,30 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
else:
|
||||
a_ram_range = f"{a_ram_min:.1f}-{a_ram_max:.1f}"
|
||||
|
||||
# Get old audio stats (if available)
|
||||
if old_model and old_model.get('audio_count', 0) > 0:
|
||||
old_time = old_model['audio_time']
|
||||
delta = model['audio_time'] - old_time
|
||||
change_pct = (delta / old_time * 100) if old_time > 0 else 0
|
||||
if change_pct > 5:
|
||||
status = "⚠️"
|
||||
elif change_pct < -1:
|
||||
status = "✅"
|
||||
else:
|
||||
status = ""
|
||||
change_str = f"{change_pct:+.1f}% {status}"
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Audio':<6} {model['audio_count']:<5} {model['audio_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {a_ram_range:<12}\n"
|
||||
status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Audio | {model['audio_count']} | {model['audio_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {a_ram_range} |\n"
|
||||
else:
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'Audio':<6} {model['audio_count']:<5} {model['audio_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {a_ram_range:<12}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Audio | {model['audio_count']} | {model['audio_time']:.1f}s | N/A | N/A | NEW | {a_ram_range} |\n"
|
||||
rows_written += 1
|
||||
|
||||
# Fallback for legacy data (no modality info) - rare in comparison mode
|
||||
# Fallback for legacy data
|
||||
if rows_written == 0 and old_model:
|
||||
old_time = old_model['total_time']
|
||||
delta = model['total_time'] - old_time
|
||||
change_pct = (delta / old_time * 100) if old_time > 0 else 0
|
||||
if change_pct > 5:
|
||||
status = "⚠️"
|
||||
elif change_pct < -1:
|
||||
status = "✅"
|
||||
else:
|
||||
status = ""
|
||||
change_str = f"{change_pct:+.1f}% {status}"
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'-':<6} {model['count']:<5} {model['total_time']:>6.1f}s {old_time:>6.1f}s {delta:>+6.1f}s {change_str:<10} {ram_range:<12}\n"
|
||||
status = "⚠️" if change_pct > 5 else ("✅" if change_pct < -1 else "")
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | - | {model['count']} | {model['total_time']:.1f}s | {old_time:.1f}s | {delta:+.1f}s | {change_pct:+.1f}% {status} | {ram_range} |\n"
|
||||
elif rows_written == 0:
|
||||
# New model with no modality info
|
||||
md += f"{model_short:<40} {model['size_gb']:>5.1f}GB {'-':<6} {model['count']:<5} {model['total_time']:>6.1f}s {'N/A':<8} {'N/A':<8} {'NEW':<10} {ram_range:<12}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | - | {model['count']} | {model['total_time']:.1f}s | N/A | N/A | NEW | {ram_range} |\n"
|
||||
else:
|
||||
# Separate rows per modality (no "Mixed" ambiguity)
|
||||
# Each modality gets its own line with specific stats + RAM
|
||||
# Without comparison
|
||||
rows_written = 0
|
||||
|
||||
if model['vision_count'] > 0:
|
||||
# Use modality-specific RAM range (single value if min==max)
|
||||
v_ram_min = model['vision_ram_min']
|
||||
v_ram_max = model['vision_ram_max']
|
||||
if v_ram_min == float('inf'):
|
||||
@@ -747,11 +728,10 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
v_ram_range = f"{v_ram_min:.1f}"
|
||||
else:
|
||||
v_ram_range = f"{v_ram_min:.1f}-{v_ram_max:.1f}"
|
||||
md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'Vision':<6} {model['vision_count']:<6} {model['vision_time']:>8.1f}s {v_ram_range:<20}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Vision | {model['vision_count']} | {model['vision_time']:.1f}s | {v_ram_range} |\n"
|
||||
rows_written += 1
|
||||
|
||||
if model['text_count'] > 0:
|
||||
# Use modality-specific RAM range (single value if min==max)
|
||||
t_ram_min = model['text_ram_min']
|
||||
t_ram_max = model['text_ram_max']
|
||||
if t_ram_min == float('inf'):
|
||||
@@ -760,11 +740,10 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
t_ram_range = f"{t_ram_min:.1f}"
|
||||
else:
|
||||
t_ram_range = f"{t_ram_min:.1f}-{t_ram_max:.1f}"
|
||||
md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'Text':<6} {model['text_count']:<6} {model['text_time']:>8.1f}s {t_ram_range:<20}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Text | {model['text_count']} | {model['text_time']:.1f}s | {t_ram_range} |\n"
|
||||
rows_written += 1
|
||||
|
||||
if model['audio_count'] > 0:
|
||||
# Use modality-specific RAM range (single value if min==max)
|
||||
a_ram_min = model['audio_ram_min']
|
||||
a_ram_max = model['audio_ram_max']
|
||||
if a_ram_min == float('inf'):
|
||||
@@ -773,14 +752,14 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
a_ram_range = f"{a_ram_min:.1f}"
|
||||
else:
|
||||
a_ram_range = f"{a_ram_min:.1f}-{a_ram_max:.1f}"
|
||||
md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'Audio':<6} {model['audio_count']:<6} {model['audio_time']:>8.1f}s {a_ram_range:<20}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | Audio | {model['audio_count']} | {model['audio_time']:.1f}s | {a_ram_range} |\n"
|
||||
rows_written += 1
|
||||
|
||||
# Fallback for legacy data (no modality info)
|
||||
# Fallback for legacy data
|
||||
if rows_written == 0:
|
||||
md += f"{model_short:<50} {model['size_gb']:>6.1f}GB {'-':<6} {model['count']:<6} {model['total_time']:>8.1f}s {ram_range:<20}\n"
|
||||
md += f"| {model_short} | {model['size_gb']:.1f} GB | - | {model['count']} | {model['total_time']:.1f}s | {ram_range} |\n"
|
||||
|
||||
md += "```\n\n"
|
||||
md += "\n"
|
||||
|
||||
# Model Categories (with modality differentiation)
|
||||
large_models = [m for m in sorted_models if m['size_gb'] >= 20]
|
||||
@@ -891,22 +870,14 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
compare_tests = {(t['name'], t.get('modality', 'unknown')): t for t in compare_stats['tests'].values()}
|
||||
|
||||
if compare_stats:
|
||||
md += f"""```
|
||||
{'Test Name':<38} {'Mode':<6} {'Models':<7} {'Fastest':<18} {'Slowest':<18} {'Med':<6} {'Old':<6} {'Δ Med':<8}
|
||||
{'='*38} {'='*6} {'='*7} {'='*18} {'='*18} {'='*6} {'='*6} {'='*8}
|
||||
"""
|
||||
md += "| Test Name | Mode | Models | Fastest | Slowest | Med | Old | Δ Med |\n"
|
||||
md += "|-----------|:----:|-------:|---------|---------|----:|----:|------:|\n"
|
||||
else:
|
||||
md += f"""```
|
||||
{'Test Name':<44} {'Mode':<6} {'Models':<7} {'Fastest':<22} {'Slowest':<22} {'Med Time'}
|
||||
{'='*44} {'='*6} {'='*7} {'='*22} {'='*22} {'='*8}
|
||||
"""
|
||||
md += "| Test Name | Mode | Models | Fastest | Slowest | Med Time |\n"
|
||||
md += "|-----------|:----:|-------:|---------|---------|--------:|\n"
|
||||
|
||||
for test in sorted_tests:
|
||||
# Shorten test name if needed
|
||||
max_test_len = 36 if compare_stats else 42
|
||||
test_short = test['name']
|
||||
if len(test_short) > max_test_len:
|
||||
test_short = test_short[:max_test_len-3] + "..."
|
||||
test_name = test['name']
|
||||
|
||||
# Format modality (Vision/Text/Audio/- for unknown)
|
||||
modality = test.get('modality', 'unknown')
|
||||
@@ -924,14 +895,8 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
slowest = test['slowest']
|
||||
|
||||
if fastest and slowest:
|
||||
max_model_len = 16 if compare_stats else 20
|
||||
fastest_str = f"{fastest['model_short']} ({fastest['duration']:.1f}s)"
|
||||
slowest_str = f"{slowest['model_short']} ({slowest['duration']:.1f}s)"
|
||||
if len(fastest_str) > max_model_len:
|
||||
fastest_str = fastest_str[:max_model_len-3] + "..."
|
||||
if len(slowest_str) > max_model_len:
|
||||
slowest_str = slowest_str[:max_model_len-3] + "..."
|
||||
|
||||
med_time = test['median_time']
|
||||
|
||||
if compare_stats:
|
||||
@@ -939,18 +904,18 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
if old_test:
|
||||
old_med = old_test['median_time']
|
||||
delta_pct = ((med_time - old_med) / old_med * 100) if old_med > 0 else 0
|
||||
delta_str = f"{delta_pct:+.1f}%"
|
||||
md += f"{test_short:<38} {mode_str:<6} {test['model_count']:<7} {fastest_str:<18} {slowest_str:<18} {med_time:<5.1f}s {old_med:<5.1f}s {delta_str:<8}\n"
|
||||
md += f"| {test_name} | {mode_str} | {test['model_count']} | {fastest_str} | {slowest_str} | {med_time:.1f}s | {old_med:.1f}s | {delta_pct:+.1f}% |\n"
|
||||
else:
|
||||
md += f"{test_short:<38} {mode_str:<6} {test['model_count']:<7} {fastest_str:<18} {slowest_str:<18} {med_time:<5.1f}s {'N/A':<6} {'NEW':<8}\n"
|
||||
md += f"| {test_name} | {mode_str} | {test['model_count']} | {fastest_str} | {slowest_str} | {med_time:.1f}s | N/A | NEW |\n"
|
||||
else:
|
||||
md += f"{test_short:<44} {mode_str:<6} {test['model_count']:<7} {fastest_str:<22} {slowest_str:<22} {med_time:.1f}s\n"
|
||||
md += f"| {test_name} | {mode_str} | {test['model_count']} | {fastest_str} | {slowest_str} | {med_time:.1f}s |\n"
|
||||
|
||||
md += "```\n\n"
|
||||
md += "\n"
|
||||
|
||||
# GPU Analysis Section (v1.1 - only if memory data provided)
|
||||
gpu_correlations = stats.get("gpu_correlations", {})
|
||||
cold_starts = stats.get("cold_starts", {})
|
||||
test_model_map = stats.get("test_model_map", {})
|
||||
|
||||
if gpu_correlations:
|
||||
md += "\n---\n\n"
|
||||
@@ -988,20 +953,23 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
if cold_start_tests:
|
||||
md += "### Cold-Start Tests\n\n"
|
||||
md += "First test per model (includes model loading overhead).\n\n"
|
||||
md += f"```\n"
|
||||
md += f"{'Test':<60} {'GPU Dev':<8} {'GPU Rnd':<8} {'Samples':<8}\n"
|
||||
md += f"{'='*60} {'='*8} {'='*8} {'='*8}\n"
|
||||
md += "| Test | Model | GPU Dev | GPU Rnd | Samples |\n"
|
||||
md += "|------|-------|--------:|--------:|--------:|\n"
|
||||
|
||||
for test_id, gpu_stats in cold_start_tests[:15]: # Limit to 15
|
||||
test_short = test_id.split("::")[-1][:58]
|
||||
# Extract test name (without parametrization)
|
||||
test_full = test_id.split("::")[-1]
|
||||
test_name = test_full.split("[")[0]
|
||||
# Look up model name
|
||||
model_name = test_model_map.get(test_id, "unknown")
|
||||
if gpu_stats:
|
||||
md += f"{test_short:<60} {gpu_stats['gpu_device_avg']:>6.1f}% {gpu_stats['gpu_renderer_avg']:>6.1f}% {gpu_stats['sample_count']:>8}\n"
|
||||
md += f"| {test_name} | {model_name} | {gpu_stats['gpu_device_avg']:.1f}% | {gpu_stats['gpu_renderer_avg']:.1f}% | {gpu_stats['sample_count']} |\n"
|
||||
else:
|
||||
md += f"{test_short:<60} {'N/A':>8} {'N/A':>8} {'N/A':>8}\n"
|
||||
md += f"| {test_name} | {model_name} | N/A | N/A | N/A |\n"
|
||||
|
||||
if len(cold_start_tests) > 15:
|
||||
md += f"... and {len(cold_start_tests) - 15} more cold-start tests\n"
|
||||
md += f"```\n\n"
|
||||
md += f"\n*... and {len(cold_start_tests) - 15} more cold-start tests*\n"
|
||||
md += "\n"
|
||||
|
||||
# High GPU utilization tests
|
||||
high_gpu_tests = [(tid, gs) for tid, gs in gpu_correlations.items()
|
||||
@@ -1011,17 +979,20 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
|
||||
if high_gpu_tests:
|
||||
md += "### High GPU Utilization Tests\n\n"
|
||||
md += "Tests with >50% average GPU device utilization.\n\n"
|
||||
md += f"```\n"
|
||||
md += f"{'Test':<55} {'GPU Dev':<10} {'GPU Rnd':<10} {'Pressure':<10}\n"
|
||||
md += f"{'='*55} {'='*10} {'='*10} {'='*10}\n"
|
||||
md += "| Test | Model | GPU Dev | GPU Rnd | Pressure |\n"
|
||||
md += "|------|-------|--------:|--------:|:--------:|\n"
|
||||
|
||||
for test_id, gpu_stats in high_gpu_tests[:10]:
|
||||
test_short = test_id.split("::")[-1][:53]
|
||||
# Extract test name (without parametrization)
|
||||
test_full = test_id.split("::")[-1]
|
||||
test_name = test_full.split("[")[0]
|
||||
# Look up model name
|
||||
model_name = test_model_map.get(test_id, "unknown")
|
||||
pressure = gpu_stats.get("memory_pressure_max", 1)
|
||||
pressure_str = "WARN" if pressure > 1 else "OK"
|
||||
md += f"{test_short:<55} {gpu_stats['gpu_device_avg']:>8.1f}% {gpu_stats['gpu_renderer_avg']:>8.1f}% {pressure_str:>10}\n"
|
||||
pressure_str = "⚠️ WARN" if pressure > 1 else "OK"
|
||||
md += f"| {test_name} | {model_name} | {gpu_stats['gpu_device_avg']:.1f}% | {gpu_stats['gpu_renderer_avg']:.1f}% | {pressure_str} |\n"
|
||||
|
||||
md += f"```\n\n"
|
||||
md += "\n"
|
||||
|
||||
# GPU utilization summary
|
||||
all_gpu_device = [gs["gpu_device_avg"] for gs in gpu_correlations.values()]
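
The GPU tables in the hunks above derive their Test and Model columns from the pytest test ID. A minimal sketch of that row formatting, assuming the `gpu_stats` keys shown in the diff (`gpu_device_avg`, `gpu_renderer_avg`, `sample_count`); the function name `cold_start_row` and the sample ID are hypothetical:

```python
def cold_start_row(test_id, gpu_stats, test_model_map):
    """Render one Markdown table row for a cold-start test."""
    # "path/test_x.py::TestClass::test_name[param]" -> "test_name"
    test_name = test_id.split("::")[-1].split("[")[0]
    # Fall back to "unknown" when the test ID has no model entry.
    model_name = test_model_map.get(test_id, "unknown")
    if gpu_stats:
        return (
            f"| {test_name} | {model_name} | {gpu_stats['gpu_device_avg']:.1f}% | "
            f"{gpu_stats['gpu_renderer_avg']:.1f}% | {gpu_stats['sample_count']} |"
        )
    return f"| {test_name} | {model_name} | N/A | N/A | N/A |"


# Made-up test ID and stats for illustration only.
test_id = "tests/live/test_vision.py::TestVision::test_describe[model-3]"
stats = {"gpu_device_avg": 61.23, "gpu_renderer_avg": 58.0, "sample_count": 42}
print(cold_start_row(test_id, stats, {test_id: "Llama-3.2-3B-Instruct-4bit"}))
```

Dropping the `[param]` suffix keeps the Test column short, and the separate Model column restores the information the parametrization carried.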
|
||||
|
||||
@@ -0,0 +1,235 @@
# Benchmark Report v1.1: 2.0.4

**Date:** 2026-02-11-wet-benchmark-2.0.4-stable
**Generated:** 2026-02-11T13:13:56 UTC
**Generator:** generate_benchmark_report.py v1.1
**Hardware:** Mac14,13, 12 cores

---

## Input Files

- **Primary:** `benchmarks/reports/2026-02-11-wet-benchmark-2.0.4-stable.jsonl`
- **Schema:** v0.2.2
- **Comparison:** `benchmarks/reports/2026-02-06-wet-benchmark-1.jsonl`

---

## Executive Summary

**Tests:** 251 total (179 passed, 72 skipped)
**Duration:** 1157.1s (19.3 min)
**Quality:** 100.0% clean (251/251)
**Models:** 25 tested

### Comparison

**vs:** `2026-02-06-wet-benchmark-1.jsonl`
**Duration:** 19.6 min → 19.3 min (-1.7%) ✅
**Models:** 16/25 slower (64%), 9/25 faster (36%)

✅ **System Health:** All tests clean (RAM >5 GB free, 0 zombies)
|
||||
|
||||
---

## Test Summary

```
Total tests: 251
Passed: 179
  With model: 122
  Infrastructure: 57
Skipped: 72
Duration: 1157.1s (19.3 min)
```

---

## System Health

```
Swap (MB): min=1524, max=1524, avg=1524.0
RAM free (GB): min=8.5, max=38.8, avg=22.8
Zombies: min=0, max=0

Quality Flags (Thresholds: RAM <5 GB free, zombies >0):
  Clean: 251/251 (100.0%)
  Degraded (RAM): 0
  Degraded (zombies): 0
```

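The quality flags follow from the two thresholds stated in the block above. A rough sketch of that classification, assuming hypothetical field and flag names (`ram_free_gb`, `zombies`, `degraded_ram`, `degraded_zombies`) rather than the report generator's actual schema:

```python
RAM_FREE_MIN_GB = 5.0  # from the report: degraded when RAM free < 5 GB
ZOMBIES_MAX = 0        # from the report: degraded when zombies > 0


def quality_flags(ram_free_gb: float, zombies: int) -> list[str]:
    """Return the degradation flags for one test sample (empty list = clean)."""
    flags = []
    if ram_free_gb < RAM_FREE_MIN_GB:
        flags.append("degraded_ram")
    if zombies > ZOMBIES_MAX:
        flags.append("degraded_zombies")
    return flags


# The run's worst-case sample (8.5 GB free, 0 zombies) is still clean.
print(quality_flags(8.5, 0))
```
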
---

## Per-Model Statistics

| Model | Size | Mode | Tests | Time | Old | Δ | Change | RAM (GB) |
|-------|-----:|:----:|------:|-----:|----:|--:|-------:|---------:|
| Mistral-Small-3.1-24B-Instruct-2503-4bit | 13.2 GB | Vision | 2 | 42.3s | 42.8s | -0.5s | -1.2% ✅ | 18.8-19.1 |
| Mistral-Small-3.1-24B-Instruct-2503-4bit | 13.2 GB | Text | 1 | 5.8s | 6.9s | -1.0s | -15.2% ✅ | 15.1 |
| DeepHermes-3-Mistral-24B-Preview-8bit | 23.3 GB | Text | 7 | 134.0s | 138.4s | -4.5s | -3.2% ✅ | 23.8-28.1 |
| DeepSeek-R1-Distill-Llama-8B-4bit | 4.2 GB | Text | 4 | 30.5s | 30.3s | +0.3s | +0.8% | 19.9-20.3 |
| EuroLLM-22B-Instruct-2512-mlx-8bit | 22.4 GB | Text | 4 | 55.8s | 55.5s | +0.3s | +0.6% | 23.6-23.7 |
| Gabliterated-Qwen3-0.6B-float32 | 2.2 GB | Text | 4 | 28.6s | 28.1s | +0.5s | +1.7% | 21.0-21.4 |
| gpt-oss-20b-MXFP4-Q8 | 11.3 GB | Text | 4 | 51.5s | 50.9s | +0.6s | +1.2% | 14.4-14.7 |
| Llama-3.2-3B-Instruct-4bit | 1.7 GB | Text | 4 | 15.6s | 14.9s | +0.8s | +5.1% ⚠️ | 19.2-19.7 |
| Mistral-7B-Instruct-v0.2-4bit | 4.0 GB | Text | 4 | 14.3s | 14.2s | +0.2s | +1.2% | 15.2-15.7 |
| Mistral-Small-3.2-24B-Instruct-2506-4bit | 12.4 GB | Text | 4 | 36.9s | 35.7s | +1.2s | +3.4% | 13.4-13.5 |
| Mistral-Small-3.2-24B-Instruct-2506-8bit | 23.3 GB | Text | 4 | 62.3s | 66.3s | -4.0s | -6.0% ✅ | 24.4-25.3 |
| Mistral-Small-Instruct-2409-4bit | 11.7 GB | Text | 4 | 28.0s | 27.7s | +0.3s | +1.0% | 12.6-13.4 |
| Mixtral-8x7B-Instruct-v0.1-4bit | 24.5 GB | Text | 4 | 51.9s | 53.1s | -1.2s | -2.2% ✅ | 27.3-27.4 |
| OpenCodeInterpreter-DS-33B-hf-4bit-mlx | 17.8 GB | Text | 4 | 37.9s | 38.1s | -0.2s | -0.5% | 18.7-18.8 |
| Phi-3-mini-4k-instruct-4bit | 2.0 GB | Text | 4 | 10.9s | 12.3s | -1.3s | -11.0% ✅ | 16.7-16.8 |
| Phi-3.5-mini-instruct-4bit | 2.0 GB | Text | 4 | 13.2s | 13.1s | +0.1s | +0.4% | 14.6-14.8 |
| pixtral-12b-4bit | 7.0 GB | Vision | 7 | 79.7s | 25.0s | +54.7s | +218.4% ⚠️ | 12.6-18.8 |
| pixtral-12b-4bit | 7.0 GB | Text | 1 | 3.7s | 4.2s | -0.5s | -11.2% ✅ | 12.8 |
| pixtral-12b-8bit | 12.6 GB | Vision | 2 | 29.5s | 25.9s | +3.6s | +13.9% ⚠️ | 16.4-16.6 |
| pixtral-12b-8bit | 12.6 GB | Text | 1 | 3.9s | 3.4s | +0.5s | +14.6% ⚠️ | 14.3 |
| Qwen2.5-0.5B-Instruct-4bit | 0.3 GB | Text | 4 | 16.3s | 22.9s | -6.6s | -28.7% ✅ | 14.3-14.5 |
| Qwen2.5-Coder-7B-Instruct-8bit | 7.5 GB | Text | 4 | 22.5s | 21.5s | +1.0s | +4.9% | 8.5-8.7 |
| Qwen3-30B-A3B-Instruct-2507-4bit | 16.0 GB | Text | 4 | 34.5s | 34.4s | +0.2s | +0.5% | 17.1-17.3 |
| Qwen3-32B-4bit | 17.2 GB | Text | 4 | 63.6s | 62.7s | +0.9s | +1.4% | 18.4-18.7 |
| Qwen3-Coder-30B-A3B-Instruct-4bit | 16.0 GB | Text | 4 | 36.5s | 37.7s | -1.2s | -3.3% ✅ | 17.1-17.2 |
| Qwen3-Coder-30B-A3B-Instruct-6bit-DWQ-lr9e-8 | 24.9 GB | Text | 4 | 56.6s | 56.7s | -0.2s | -0.3% | 26.0-26.1 |
| whisper-large-v3-turbo-4bit | 0.4 GB | Audio | 11 | 25.5s | 25.4s | +0.1s | +0.5% | 33.7-34.9 |
| whisper-large-v3-turbo-8bit | 0.8 GB | Audio | 11 | 24.6s | 24.5s | +0.1s | +0.4% | 33.7-34.4 |

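The Δ and Change columns compare each row's total time against the comparison run. A minimal sketch of the arithmetic, assuming Change is the difference relative to the old time; `time_change` is a hypothetical helper, not a function from the report generator:

```python
def time_change(new_s: float, old_s: float) -> tuple[float, float]:
    """Return (delta in seconds, change in percent) of new_s relative to old_s."""
    delta = new_s - old_s
    return delta, delta / old_s * 100.0


# First table row: 42.3s now vs 42.8s in the comparison run.
delta, pct = time_change(42.3, 42.8)
print(f"{delta:+.1f}s, {pct:+.1f}%")
```

The report appears to compute these columns from unrounded timings, so recomputing them from the rounded table values can differ in the last digit for some rows.
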
### Model Categories

```
LARGE MODELS (≥20 GB): 5 models
  Avg size: 23.7 GB
  Text Tests:
    Models tested: 5
    Avg test time: 15.2s
    RAM range: 23.6-28.1 GB

MEDIUM MODELS (10-20 GB): 9 models
  Avg size: 14.2 GB
  Vision Tests:
    Models tested: 2
    Avg test time: 18.0s
    RAM range: 16.4-19.1 GB
  Text Tests:
    Models tested: 9
    Avg test time: 9.1s
    RAM range: 12.6-18.8 GB

SMALL MODELS (<10 GB): 11 models
  Avg size: 2.9 GB
  Vision Tests:
    Models tested: 1
    Avg test time: 11.4s
    RAM range: 12.6-18.8 GB
  Text Tests:
    Models tested: 9
    Avg test time: 4.6s
    RAM range: 8.5-21.4 GB
  Audio Tests:
    Models tested: 2
    Avg test time: 2.3s
    RAM range: 33.7-34.9 GB
```

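The three size buckets can be reproduced from the boundaries printed in the headings above. A small sketch, assuming the 10 GB boundary belongs to the medium bucket (the small bucket is stated as `<10 GB`); `size_category` is a hypothetical name:

```python
def size_category(size_gb: float) -> str:
    """Bucket a model by on-disk size, using the report's boundaries."""
    if size_gb >= 20:
        return "large"   # LARGE MODELS (≥20 GB)
    if size_gb >= 10:
        return "medium"  # MEDIUM MODELS (10-20 GB)
    return "small"       # SMALL MODELS (<10 GB)


# Sizes taken from the Per-Model Statistics table.
for name, size in [("Mixtral-8x7B-Instruct-v0.1-4bit", 24.5),
                   ("pixtral-12b-8bit", 12.6),
                   ("pixtral-12b-4bit", 7.0)]:
    print(f"{name}: {size_category(size)}")
```
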
---

## Per-Test Statistics

Shows performance range across models for each test.

| Test Name | Mode | Models | Fastest | Slowest | Med | Old | Δ Med |
|-----------|:----:|-------:|---------|---------|----:|----:|------:|
| test_audio_output_not_empty | Audio | 2 | whisper (1.9s) | whisper (1.9s) | 1.9s | 1.9s | +0.8% |
| test_chart_axis_label_reading | Vision | 1 | pixtral (12.4s) | pixtral (12.4s) | 12.4s | 12.8s | -2.9% |
| test_chat_completions_batch | Text | 20 | Qwen2.5 (2.1s) | Qwen3 (12.4s) | 7.9s | 7.9s | -0.3% |
| test_chat_completions_streaming | Text | 20 | Phi (4.3s) | Qwen3 (30.7s) | 14.2s | 15.2s | -6.3% |
| test_chess_position_e6 | Vision | 1 | pixtral (10.9s) | pixtral (10.9s) | 10.9s | 13.0s | -16.3% |
| test_contract_name_extraction | Vision | 1 | pixtral (12.4s) | pixtral (12.4s) | 12.4s | 12.6s | -2.3% |
| test_json_interactive_error_path | Text | 1 | DeepHermes (3.8s) | DeepHermes (3.8s) | 3.8s | 5.2s | -25.7% |
| test_large_image_support | Vision | 1 | pixtral (9.3s) | pixtral (9.3s) | 9.3s | 9.9s | -5.7% |
| test_mug_color_identification | Vision | 1 | pixtral (12.4s) | pixtral (12.4s) | 12.4s | 12.7s | -2.6% |
| test_pipe_from_list_json | Text | 1 | DeepHermes (65.3s) | DeepHermes (65.3s) | 65.3s | 66.3s | -1.6% |
| test_run_command | Text | 20 | Qwen2.5 (1.4s) | Mistral (13.9s) | 6.9s | 6.8s | +0.5% |
| test_run_json_output | Text | 20 | Qwen2.5 (1.4s) | Mistral (13.9s) | 6.9s | 6.8s | +0.6% |
| test_segment_metadata_optional | Audio | 2 | whisper (2.0s) | whisper (2.0s) | 2.0s | 1.9s | +1.8% |
| test_single_image_chat_completion | Vision | 3 | pixtral (10.2s) | BrokeC/Mistral (20.8s) | 16.0s | 12.4s | +29.0% |
| test_stdin_dash_appends_trailing_text | Text | 1 | DeepHermes (12.1s) | DeepHermes (12.1s) | 12.1s | 12.2s | -1.1% |
| test_streaming_graceful_degradation | Vision | 3 | pixtral (12.2s) | BrokeC/Mistral (21.5s) | 13.5s | 13.5s | +0.0% |
| test_text_request_still_works_on_vision_model | Text | 3 | pixtral (3.7s) | BrokeC/Mistral (5.8s) | 3.9s | 4.2s | -7.5% |
| test_transcribe_longer_audio_wav | Audio | 2 | whisper (2.0s) | whisper (2.0s) | 2.0s | 2.0s | +1.8% |
| test_transcribe_mp3_format | Audio | 2 | whisper (1.9s) | whisper (2.0s) | 1.9s | 1.9s | +0.5% |
| test_transcribe_short_audio_wav | Audio | 2 | whisper (2.2s) | whisper (3.1s) | 2.7s | 2.7s | -0.6% |
| test_transcription_endpoint_json | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | +0.0% |
| test_transcription_endpoint_mp3 | Audio | 2 | whisper (2.5s) | whisper (2.6s) | 2.5s | 2.5s | +0.5% |
| test_transcription_endpoint_rejects_oversized_audio | Audio | 2 | whisper (1.8s) | whisper (1.9s) | 1.9s | 1.8s | +1.1% |
| test_transcription_endpoint_text_format | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | -0.3% |
| test_transcription_endpoint_verbose_json | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | +0.4% |
| test_transcription_endpoint_with_language | Audio | 2 | whisper (2.5s) | whisper (2.5s) | 2.5s | 2.5s | -0.2% |

---

## GPU Analysis (v1.1)

Per-test GPU utilization correlated from memory monitoring data.

### Cold-Start Tests

First test per model (includes model loading overhead).

| Test | Model | GPU Dev | GPU Rnd | Samples |
|------|-------|--------:|--------:|--------:|
| test_transcribe_short_audio_wav | whisper-large-v3-turbo-4bit | 48.0% | 16.2% | 6 |
| test_transcribe_short_audio_wav | whisper-large-v3-turbo-8bit | 48.5% | 18.8% | 4 |
| test_run_command | DeepHermes-3-Mistral-24B-Preview-8bit | 14.4% | 12.5% | 20 |
| test_run_command | DeepSeek-R1-Distill-Llama-8B-4bit | 14.3% | 14.3% | 7 |
| test_run_command | EuroLLM-22B-Instruct-2512-mlx-8bit | 18.7% | 18.2% | 21 |
| test_run_command | Gabliterated-Qwen3-0.6B-float32 | 38.8% | 38.2% | 5 |
| test_run_command | Llama-3.2-3B-Instruct-4bit | 23.5% | 22.8% | 4 |
| test_run_command | Mistral-7B-Instruct-v0.2-4bit | 20.0% | 20.0% | 5 |
| test_run_command | Mistral-Small-3.2-24B-Instruct-2506-4bit | 0.0% | 0.0% | 12 |
| test_run_command | Mistral-Small-3.2-24B-Instruct-2506-8bit | 16.0% | 15.3% | 24 |
| test_run_command | Mistral-Small-Instruct-2409-4bit | 9.6% | 8.7% | 11 |
| test_run_command | Mixtral-8x7B-Instruct-v0.1-4bit | 22.2% | 12.9% | 22 |
| test_run_command | OpenCodeInterpreter-DS-33B-hf-4bit-mlx | 36.7% | 14.6% | 16 |
| test_run_command | Phi-3-mini-4k-instruct-4bit | 49.0% | 22.2% | 4 |
| test_run_command | Phi-3.5-mini-instruct-4bit | 67.3% | 15.0% | 3 |

*... and 10 more cold-start tests*

### High GPU Utilization Tests

Tests with >50% average GPU device utilization.

| Test | Model | GPU Dev | GPU Rnd | Pressure |
|------|-------|--------:|--------:|:--------:|
| test_chess_position_e6 | pixtral-12b-4bit | 83.6% | 82.5% | OK |
| test_contract_name_extraction | pixtral-12b-4bit | 81.6% | 80.1% | OK |
| test_mug_color_identification | pixtral-12b-4bit | 80.8% | 77.0% | OK |
| test_single_image_chat_completion | pixtral-12b-8bit | 71.1% | 67.1% | OK |
| test_streaming_graceful_degradation | pixtral-12b-4bit | 70.3% | 56.0% | OK |
| test_audio_output_not_empty | whisper-large-v3-turbo-4bit | 68.8% | 29.8% | OK |
| test_transcribe_longer_audio_wav | whisper-large-v3-turbo-8bit | 68.0% | 12.2% | OK |
| test_run_command | Phi-3.5-mini-instruct-4bit | 67.3% | 15.0% | OK |
| test_streaming_graceful_degradation | Mistral-Small-3.1-24B-Instruct-2503-4bit | 62.1% | 50.5% | OK |
| test_chart_axis_label_reading | pixtral-12b-4bit | 62.0% | 55.1% | OK |

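The selection behind this table can be sketched as a filter plus a sort. The `gpu_device_avg` key and the top-10 cut appear in the generator diff; the descending sort is an assumption inferred from the row order, and `select_high_gpu_tests` and the sample test IDs are hypothetical:

```python
def select_high_gpu_tests(gpu_correlations: dict, threshold: float = 50.0,
                          limit: int = 10) -> list:
    """Return (test_id, stats) pairs above the threshold, highest first."""
    hits = [(test_id, stats) for test_id, stats in gpu_correlations.items()
            if stats["gpu_device_avg"] > threshold]
    # Assumed ordering: highest average GPU device utilization first.
    hits.sort(key=lambda item: item[1]["gpu_device_avg"], reverse=True)
    return hits[:limit]


# Hypothetical input using three values from the tables above.
sample = {
    "t::test_run_command[phi]": {"gpu_device_avg": 67.3},
    "t::test_run_command[deephermes]": {"gpu_device_avg": 14.4},
    "t::test_chess_position_e6[pixtral]": {"gpu_device_avg": 83.6},
}
for test_id, stats in select_high_gpu_tests(sample):
    print(test_id, stats["gpu_device_avg"])
```
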
### GPU Utilization Summary

```
Sample interval: 200ms
Correlated tests: 134
Cold-start tests: 25
GPU Device avg: 22.7% (was 20.6%, Δ +2.1%)
GPU Device max: 83.6% (was 84.0%, Δ -0.4%)
GPU Renderer avg: 15.6% (was 17.6%, Δ -2.0%)
GPU Renderer max: 82.5% (was 81.4%, Δ +1.1%)
```

---

## Files

- **Benchmark report:** `benchmarks/reports/2026-02-11-wet-benchmark-2.0.4-stable.jsonl`
- **Schema:** `benchmarks/schemas/report-v0.2.2.schema.json`
- **Memory data:** correlated

+1
-1
@@ -7,4 +7,4 @@ import warnings
# Issue parity with 1.1.0 (Issue #22)
warnings.filterwarnings('ignore', message='urllib3 v2 only supports OpenSSL 1.1.1+')

__version__ = "2.0.4b10"
__version__ = "2.0.4"

+9
-20
@@ -13,29 +13,32 @@ authors = [
    {name = "The BROKE team", email = "broke@gmx.eu"},
]
classifiers = [
    "Development Status :: 4 - Beta",
    "Development Status :: 5 - Production/Stable",
    "Intended Audience :: Developers",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    # Python 3.9: MLX 0.30+ requires 3.10+
    # Python 3.13+: miniaudio lacks pre-built wheels
    "Operating System :: MacOS",
    "Environment :: Console",
    "License :: OSI Approved :: Apache Software License",
]
dependencies = [
    # Core
    "huggingface-hub>=1.0.0",
    "requests>=2.32.0",
    "mlx-lm>=0.30.5", # Vision/Audio extras require 0.30.5+
    "mlx>=0.30.0",
    "fastapi>=0.116.0",
    "uvicorn>=0.35.0",
    "pydantic>=2.11.0",
    "httpx>=0.27.0",
    "python-multipart>=0.0.9", # For audio file uploads
    "python-multipart>=0.0.9",
    # MLX stack (Text + Vision + Audio)
    "mlx>=0.30.0",
    "mlx-lm>=0.30.5",
    "mlx-vlm==0.3.10",
    "mlx-audio==0.3.1",
    "tiktoken>=0.7.0",
]

[project.scripts]
@@ -65,20 +68,6 @@ dev = [
    "ruff>=0.1.0",
    "mypy>=1.5.0",
]
vision = [
    "mlx-vlm==0.3.10", # Pinned: stable, tested version (ADR-012)
]
audio = [
    # mlx-audio pinned: 0.3.1 has tiktoken regression (Issue #479)
    # We bundle complete tiktoken workaround in mlxk2/audio/whisper_tokenizer.py
    "mlx-audio==0.3.1",
    "tiktoken>=0.7.0", # Required for Whisper tiktoken fallback
]
all = [
    "mlx-vlm==0.3.10", # Pinned: stable, tested version
    "mlx-audio==0.3.1", # Pinned: tiktoken regression patched by mlxk2
    "tiktoken>=0.7.0",
]

[tool.setuptools]
license-files = [

@@ -1,19 +0,0 @@
# mlx_knife requirements
# Core dependencies for HuggingFace model management

huggingface-hub>=1.3.0
requests>=2.32.0
mlx-lm>=0.30.5 # Vision/Audio extras require 0.30.5+
mlx>=0.30.0 # Core MLX library
# Note: transformers pinned by mlx-lm (5.0.0rc3 as of 0.30.5)
# Known issue: trust_remote_code dialog affects some models (Klear-46B)

# API Server dependencies (for 'mlxk server' command)
fastapi>=0.116.0
uvicorn>=0.35.0
pydantic>=2.11.0

# Test dependencies (for FastAPI TestClient)
httpx>=0.27.0

# Note: Python 3.9-3.14 supported, tested on Apple Silicon M1/M2/M3/M4