diff --git a/CHANGELOG.md b/CHANGELOG.md index 0c96e75..bf1afea 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,47 @@ # Changelog +## [2.0.4] - 2026-XX-XX (WIP) + +> **First stable release with Audio/STT support.** This release consolidates beta.9 and beta.10 improvements into a production-ready package. + +### Highlights + +**Audio Transcription Production-Ready:** Complete Whisper support via mlx-audio backend. Works out-of-the-box with `pip install mlx-knife[audio]` - no ffmpeg, no system dependencies. Supports >10 minute audio files with excellent accuracy. + +**All Runtime Gates Working:** `mlxk run` preflight checks now match `mlxk list --health` output. Embedding models, unsupported STT types, and Voxtral tekken.json are properly blocked with helpful error messages. + +**Comprehensive Test Coverage:** 697 unit tests covering whisper tokenizer, embedding gates, audio CLI, and runtime compatibility. + +### Added (since beta.10) + +- **Unit Tests:** + - `test_whisper_tokenizer.py`: 47 tests for tiktoken workaround (get_encoding, get_tokenizer, Tokenizer class) + - `TestEmbeddingGate`: 3 tests for embedding model runtime blocking + +### Fixed (since beta.10) + +- **Run Preflight Consistency:** `run.py` now passes `probe` and `framework` to `audio_runtime_compatibility()`. STT model_type and tekken.json gates now work in CLI (previously only in list/health). + +- **STT model_type Gate:** Extended to accept `vibevoice` and `audio` model types (was only `whisper`/`voxtral`). VibeVoice-ASR and future STT models no longer blocked incorrectly. + +- **MLX 0.30.x Compatibility:** `whisper_tokenizer.py` now catches all exceptions (not just ImportError) when importing from mlx-audio. LANGUAGES fallback works when MLX API is incompatible. 
+ +### Changed (since beta.10) + +- **Code Cleanup:** + - Removed ~45 LOC duplication in `audio_runner.py` (get_encoding now imported from whisper_tokenizer) + - `STT_MODEL_TYPES` in capabilities.py synchronized with runtime gate + +### Upgrade from beta.10 + +```bash +pip install --upgrade mlx-knife[all] +``` + +No breaking changes. All beta.10 functionality preserved. + +--- + ## [2.0.4-beta.10] - 2026-02-05 > **⚠️ Upgrade Notice:** If you installed beta.9 from PyPI, audio transcription does not work due to an incomplete tiktoken patch. Please upgrade to beta.10: `pip install mlx-knife[all]==2.0.4b10` diff --git a/README.md b/README.md index ad5f4af..e1638b4 100644 --- a/README.md +++ b/README.md @@ -28,6 +28,7 @@ - **Resumable Downloads** - Interrupted clone/pull operations continue automatically - **Safe Vision Batch Processing** - Automatic chunking prevents Metal OOM crashes - **Workspace Path Support** - Run/show/server/list commands work with local directories +- **Updated [SECURITY.md](SECURITY.md)** - Clarifies mlx-knife vs. upstream library behavior; important for offline/air-gapped use ### Core Functionality - **List & Manage Models**: Browse your HuggingFace cache with MLX-specific filtering diff --git a/SECURITY.md b/SECURITY.md index 144797a..dcdd166 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -2,25 +2,36 @@ ## Overview -MLX Knife is designed to run locally on your Apple Silicon Mac. It prioritizes user privacy and security by keeping all model execution local. Network activity is limited to explicit interactions with Hugging Face: downloading models (pull) and, in 2.0 alpha, an opt‑in alpha upload (push) when you run it explicitly. No background network traffic. +MLX Knife is designed to run locally on your Apple Silicon Mac. It prioritizes user privacy and security by keeping all model execution local. 
+ +**Important distinction:** MLX Knife integrates upstream libraries (mlx-lm, mlx-vlm, mlx-audio, transformers) whose behavior is outside our direct control. This document describes what **mlx-knife itself** does; upstream libraries may behave differently. ## Security Model -### What MLX Knife Does +### What MLX Knife Itself Does - ✅ Runs models locally on your device -- ✅ Downloads models only from HuggingFace (trusted repository) +- ✅ Downloads models only from HuggingFace (via `pull`, `clone`) +- ✅ Uploads only when you explicitly run `push` (opt-in, requires credentials) - ✅ API server binds to localhost by default - ✅ No telemetry or usage tracking -- ✅ No external API calls (except explicit Hugging Face interactions: downloads via pull; optional upload via experimental push) -- ✅ Can upload a local workspace to Hugging Face only when you explicitly run `mlxk2 push` (alpha feature, opt‑in) +- ✅ No automatic updates or phone-home features -### What MLX Knife Doesn't Do -- ❌ No data is sent to external servers automatically or in the background +### What MLX Knife Itself Doesn't Do - ❌ No model outputs are logged or transmitted - ❌ No user tracking or analytics -- ❌ No automatic updates or phone-home features - - Note: The alpha `push` command will upload files from a user‑selected local folder to Hugging Face only when you run it explicitly and provide credentials. It never runs implicitly. +- ❌ mlx-knife code does not initiate network requests during `run`, `server`, or `show` + +### Third-Party Libraries + +MLX Knife uses external libraries to load and run models. These libraries may download additional files when a model is first used - this is outside mlx-knife's control. 
+ +**What this means:** +- Downloading a model with `pull` does not guarantee fully offline use +- Some models may need additional downloads when first run +- We recommend models from `mlx-community/*` but cannot guarantee third-party behavior + +**For offline environments:** +Test each model while online before relying on offline use. Use `mlxk clone` to create a local workspace for better isolation. ## Reporting Security Vulnerabilities @@ -142,11 +153,9 @@ We provide security updates for these versions: | Version | Security Support | | ------- | ------------------ | -| 2.0.3 | :white_check_mark: Current stable | -| 2.0.2 | :white_check_mark: Supported | -| 2.0.1 | :white_check_mark: Supported | -| 2.0.0 | :white_check_mark: Supported | -| < 2.0.0 | :x: Upgrade recommended | +| 2.0.4 | :white_check_mark: Current stable | +| 2.0.3 | :white_check_mark: Supported | +| < 2.0.3 | :x: Upgrade recommended | ## Additional Resources diff --git a/benchmarks/README.md b/benchmarks/README.md index d0ea81a..930c421 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -17,8 +17,8 @@ benchmarks/ │ ├── 2025-12-20-v2.0.4b3.jsonl # Raw data (one file per test run) │ └── BENCHMARK-v1.0-*.md # Generated analysis reports ├── schemas/ # JSON Schema definitions -│ ├── report-v0.1.0.schema.json # lecacy schema -│ ├── report-v0.2.0.schema.json # Current schema +│ ├── report-v0.1.schema.json # legacy schema +│ ├── report-v0.2.2.schema.json # Current schema │ └── report-current.schema.json # Symlink → current schema ├── tools/ # Standalone tools │ ├── memmon.py # Memory monitor (background sampling) @@ -94,24 +94,22 @@ python benchmarks/tools/memplot.py memory.jsonl benchmark.jsonl -o timeline.html **Blue line:** RAM free over time (GB) -**Diamond markers (vm_pressure):** -- 🟢 **Green (0):** Normal - no memory pressure -- 🟡 **Yellow (1-2):** Warning - system preparing to swap -- 🔴 **Red (4):** Critical - system actively swapping +**RAM line marker colors** (per-point coloring 
based on available RAM): +- 🟢 **Green:** ≥32 GB free - healthy +- 🟠 **Orange:** 16-32 GB free - warning zone +- 🔴 **Red:** <16 GB free - critical -**Dashed threshold lines:** -- **Green line (32 GB):** 50% threshold (64 GB system) -- **Orange line (16 GB):** 25% threshold +**Memory pressure background** (semi-transparent overlays): +- 🟡 **Yellow `rgba(255, 204, 0, 0.15)`:** WARN level - system preparing to swap +- 🔴 **Red `rgba(255, 59, 48, 0.15)`:** CRITICAL level - system actively swapping **Red line (right axis):** Swap Used (MB) - only visible when > 0 #### Row 2: CPU Load -**User (cyan):** User space CPU % -**System (orange):** Kernel CPU % -**Idle (green fill):** Idle CPU % - -**Load Average (purple dashed):** 1-minute load (right axis) +**Load Average (purple):** 1-minute load average +**User (green fill):** User space CPU % +**System (red fill):** Kernel CPU % (stacked on User) #### Row 3: GPU Utilization (Apple Silicon) @@ -212,17 +210,17 @@ Example: 57 GB used in test_text_request_still_works_on_vision_model **RAM Free:** - Source: `vm_stat` (macOS native) - Calculation: `(free + inactive + purgeable + speculative) * page_size / 1e9` -- Sample rate: 500ms (2 samples/second) +- Sample rate: 200ms (5 samples/second) **Memory Pressure:** - Source: `sysctl kern.memorystatus_vm_pressure_level` - Values: 1=NORMAL, 2=WARN, 4=CRITICAL -- Sample rate: 500ms (synchronized with RAM) +- Sample rate: 200ms (synchronized with RAM) **Swap Used:** - Source: `sysctl vm.swapusage` - Unit: MB -- Sample rate: 500ms +- Sample rate: 200ms **Test Metadata:** - Source: Benchmark JSONL (pytest-json-report format) @@ -244,7 +242,7 @@ Example: 57 GB used in test_text_request_still_works_on_vision_model - **Fix planned:** Log parsing in v0.3.0/v1.0 3. 
**Dense test sequences** - - Tests shorter than 500ms sample rate → no coloring + - Tests shorter than 200ms sample rate → no coloring - Typical: Fast infrastructure tests (<100ms) - **Workaround:** Test labels show all tests diff --git a/benchmarks/TESTING.md b/benchmarks/TESTING.md index 6d8d063..a686305 100644 --- a/benchmarks/TESTING.md +++ b/benchmarks/TESTING.md @@ -152,12 +152,12 @@ python benchmarks/validate_reports.py benchmarks/reports/2025-12-20-v2.0.4b3.jso ## Schema Reference -### Current Schema: v0.2.0 +### Current Schema: v0.2.2 Required fields: ```json { - "schema_version": "0.2.0", + "schema_version": "0.2.2", "timestamp": "2025-12-20T02:26:10.722510+00:00", "mlx_knife_version": "2.0.4-beta.3", "test": "tests_2.0/live/test_cli_e2e.py::test_run_command[discovered_00]", @@ -249,6 +249,62 @@ Model not found in comparison file. Check: --- +## Planned: Canonical Paths (v0.3.0) + +**Status:** Planned for Schema v0.3.0 + +### Directory Structure + +``` +benchmarks/ +├── runs/ # Raw JSONL data (new) +│ ├── 2026-02-06_wet-benchmark.jsonl +│ └── 2026-02-06_wet-memory.jsonl +├── reports/ # Generated Markdown only +│ └── BENCHMARK-*.md +├── schemas/ +└── tools/ +``` + +### Naming Convention + +``` +runs/_.jsonl +``` + +Examples: +- `runs/2026-02-06_wet-benchmark.jsonl` +- `runs/2026-02-06_wet-memory.jsonl` + +### Tool Defaults + +| Tool | Input Default | Output Default | +|------|---------------|----------------| +| `generate_report.py` | `runs/*_benchmark.jsonl` (latest) | `reports/BENCHMARK-*.md` | +| `memmon.py` | — | `runs/_-memory.jsonl` | +| `memplot.py` | `runs/*_memory.jsonl` (latest) | `reports/*-timeline.html` | + +### New Flags + +```bash +# Compare with previous run (auto-detected) +python benchmarks/tools/generate_report.py --compare-with-previous wet-benchmark + +# Explicit paths (unchanged) +python benchmarks/tools/generate_report.py \ + runs/2026-02-06_wet-benchmark.jsonl \ + --compare runs/2026-02-05_wet-benchmark.jsonl +``` + +### Migration + 
+1. Create `runs/` directory +2. Move existing `reports/*.jsonl` → `runs/` +3. Rename to canonical format +4. Update tool defaults + +--- + ## Future: Phase 1 (mlxk-benchmark) Phase 1 will introduce a standalone benchmark package: diff --git a/benchmarks/generate_benchmark_report.py b/benchmarks/generate_benchmark_report.py index c3f4a10..3c7e9a3 100644 --- a/benchmarks/generate_benchmark_report.py +++ b/benchmarks/generate_benchmark_report.py @@ -12,6 +12,17 @@ Usage: # With comparison python benchmarks/generate_benchmark_report.py new.jsonl --compare old.jsonl + + # With GPU/memory correlation (v1.1) + python benchmarks/generate_benchmark_report.py benchmark.jsonl --memory memory.jsonl + + # Full comparison with GPU analysis + python benchmarks/generate_benchmark_report.py new-benchmark.jsonl \\ + --memory new-memory.jsonl \\ + --compare old-benchmark.jsonl + + Note: When using --compare, the memory file for the comparison run is + auto-detected if it follows the naming convention (benchmark → memory). """ import argparse @@ -29,11 +40,144 @@ except ImportError: # Template version -TEMPLATE_VERSION = "1.0" +TEMPLATE_VERSION = "1.1" REPORTS_DIR = Path("benchmarks/reports") SCHEMA_PATH = Path("benchmarks/schemas/report-current.schema.json") +def correlate_memory_to_tests( + benchmark_data: List[dict], + memory_data: List[dict] +) -> Dict[str, Dict]: + """Correlate memory samples to individual tests by timestamp. + + For each test, finds all memory samples that fall within [test_start, test_end] + and aggregates GPU/memory statistics. 
+ + Args: + benchmark_data: List of benchmark test results (with timestamp, duration) + memory_data: List of memory samples (with ts Unix timestamp) + + Returns: + Dict mapping test identifier to aggregated GPU/memory stats + """ + # Filter out non-test entries (e.g., summary lines) + tests = [e for e in benchmark_data if "timestamp" in e and "duration" in e] + + # Sort memory samples by timestamp + sorted_memory = sorted(memory_data, key=lambda m: m.get("ts", 0)) + + correlations = {} + + for test in tests: + # Parse test timestamp to Unix + try: + test_dt = datetime.fromisoformat(test["timestamp"]) + test_start = test_dt.timestamp() + test_end = test_start + test["duration"] + except (ValueError, KeyError): + continue + + # Find memory samples within test window + # Using binary search would be faster, but linear is fine for ~3000 samples + samples_in_window = [ + m for m in sorted_memory + if test_start <= m.get("ts", 0) <= test_end + ] + + if not samples_in_window: + continue + + # Aggregate GPU stats + gpu_device = [m.get("gpu_device_util", 0) for m in samples_in_window] + gpu_renderer = [m.get("gpu_renderer_util", 0) for m in samples_in_window] + gpu_tiler = [m.get("gpu_tiler_util", 0) for m in samples_in_window] + cpu_user = [m.get("cpu_user", 0) for m in samples_in_window] + cpu_sys = [m.get("cpu_sys", 0) for m in samples_in_window] + mem_pressure = [m.get("memory_pressure", 1) for m in samples_in_window] + + # Create unique test identifier + test_id = test.get("test", "unknown") + + correlations[test_id] = { + "sample_count": len(samples_in_window), + "gpu_device_avg": round(sum(gpu_device) / len(gpu_device), 1) if gpu_device else 0, + "gpu_device_max": max(gpu_device) if gpu_device else 0, + "gpu_renderer_avg": round(sum(gpu_renderer) / len(gpu_renderer), 1) if gpu_renderer else 0, + "gpu_renderer_max": max(gpu_renderer) if gpu_renderer else 0, + "gpu_tiler_avg": round(sum(gpu_tiler) / len(gpu_tiler), 1) if gpu_tiler else 0, + "cpu_user_avg": 
round(sum(cpu_user) / len(cpu_user), 1) if cpu_user else 0, + "cpu_sys_avg": round(sum(cpu_sys) / len(cpu_sys), 1) if cpu_sys else 0, + "memory_pressure_max": max(mem_pressure) if mem_pressure else 1, + } + + return correlations + + +def find_memory_file_for_benchmark(benchmark_file: Path) -> Optional[Path]: + """Auto-detect memory JSONL file for a benchmark file. + + Convention: benchmark file "2026-02-06-wet-benchmark-1.jsonl" + → memory file "2026-02-06-wet-memory-1.jsonl" + + Args: + benchmark_file: Path to benchmark JSONL file + + Returns: + Path to memory file if found, None otherwise + """ + # Replace "benchmark" with "memory" in filename + memory_name = benchmark_file.name.replace("benchmark", "memory") + memory_path = benchmark_file.parent / memory_name + + if memory_path.exists(): + return memory_path + return None + + +def detect_cold_starts(benchmark_data: List[dict]) -> Dict[str, bool]: + """Detect which tests are cold starts (first test for a given model). + + Cold starts typically have higher latency due to model loading, + cache warming, and JIT compilation. + + A cold start is the FIRST test for a specific model, regardless of + test name or parametrization. Subsequent tests on the same model + benefit from cached weights and warm GPU state. 
+ + Args: + benchmark_data: List of benchmark test results + + Returns: + Dict mapping test identifier to is_cold_start boolean + """ + # Track first occurrence of each model + seen_models = set() + cold_starts = {} + + # Sort by timestamp to ensure correct ordering + sorted_tests = sorted( + [e for e in benchmark_data if "timestamp" in e and "model" in e], + key=lambda e: e.get("timestamp", "") + ) + + for test in sorted_tests: + test_id = test.get("test", "unknown") + model_id = test.get("model", {}).get("id", "unknown") + + # Normalize model_id: some tests report "mlx-community/foo", others just "foo" + # This is a data inconsistency in test fixtures, not different models + normalized_model_id = model_id.replace("mlx-community/", "") + + if normalized_model_id not in seen_models: + cold_starts[test_id] = True + seen_models.add(normalized_model_id) + else: + cold_starts[test_id] = False + + return cold_starts + + def load_schema() -> dict: """Load current JSON schema.""" if not SCHEMA_PATH.exists(): @@ -811,10 +955,132 @@ Quality Flags (Thresholds: RAM <5 GB free, zombies >0): md += "```\n\n" + # GPU Analysis Section (v1.1 - only if memory data provided) + gpu_correlations = stats.get("gpu_correlations", {}) + cold_starts = stats.get("cold_starts", {}) + + if gpu_correlations: + md += "\n---\n\n" + md += "## GPU Analysis (v1.1)\n\n" + md += "Per-test GPU utilization correlated from memory monitoring data.\n\n" + + # Aggregate GPU stats by model + model_gpu_stats = {} + for test_id, gpu_stats in gpu_correlations.items(): + # Find the corresponding test entry to get model info + test_entry = next((t for t in stats.get("tests", {}).values() + if any(r["model"] in test_id for r in t.get("runs", []))), None) + if not test_entry: + continue + + for run in test_entry.get("runs", []): + model_id = run["model"] + if model_id not in model_gpu_stats: + model_gpu_stats[model_id] = { + "gpu_device_samples": [], + "gpu_renderer_samples": [], + "cold_start_count": 0, + 
"total_count": 0, + } + model_gpu_stats[model_id]["gpu_device_samples"].append(gpu_stats["gpu_device_avg"]) + model_gpu_stats[model_id]["gpu_renderer_samples"].append(gpu_stats["gpu_renderer_avg"]) + model_gpu_stats[model_id]["total_count"] += 1 + if cold_starts.get(test_id, False): + model_gpu_stats[model_id]["cold_start_count"] += 1 + + # Cold-start analysis + cold_start_tests = [(tid, gpu_correlations.get(tid, {})) + for tid, is_cold in cold_starts.items() if is_cold] + + if cold_start_tests: + md += "### Cold-Start Tests\n\n" + md += "First test per model (includes model loading overhead).\n\n" + md += f"```\n" + md += f"{'Test':<60} {'GPU Dev':<8} {'GPU Rnd':<8} {'Samples':<8}\n" + md += f"{'='*60} {'='*8} {'='*8} {'='*8}\n" + + for test_id, gpu_stats in cold_start_tests[:15]: # Limit to 15 + test_short = test_id.split("::")[-1][:58] + if gpu_stats: + md += f"{test_short:<60} {gpu_stats['gpu_device_avg']:>6.1f}% {gpu_stats['gpu_renderer_avg']:>6.1f}% {gpu_stats['sample_count']:>8}\n" + else: + md += f"{test_short:<60} {'N/A':>8} {'N/A':>8} {'N/A':>8}\n" + + if len(cold_start_tests) > 15: + md += f"... 
and {len(cold_start_tests) - 15} more cold-start tests\n" + md += f"```\n\n" + + # High GPU utilization tests + high_gpu_tests = [(tid, gs) for tid, gs in gpu_correlations.items() + if gs.get("gpu_device_avg", 0) > 50] + high_gpu_tests.sort(key=lambda x: x[1]["gpu_device_avg"], reverse=True) + + if high_gpu_tests: + md += "### High GPU Utilization Tests\n\n" + md += "Tests with >50% average GPU device utilization.\n\n" + md += f"```\n" + md += f"{'Test':<55} {'GPU Dev':<10} {'GPU Rnd':<10} {'Pressure':<10}\n" + md += f"{'='*55} {'='*10} {'='*10} {'='*10}\n" + + for test_id, gpu_stats in high_gpu_tests[:10]: + test_short = test_id.split("::")[-1][:53] + pressure = gpu_stats.get("memory_pressure_max", 1) + pressure_str = "WARN" if pressure > 1 else "OK" + md += f"{test_short:<55} {gpu_stats['gpu_device_avg']:>8.1f}% {gpu_stats['gpu_renderer_avg']:>8.1f}% {pressure_str:>10}\n" + + md += f"```\n\n" + + # GPU utilization summary + all_gpu_device = [gs["gpu_device_avg"] for gs in gpu_correlations.values()] + all_gpu_renderer = [gs["gpu_renderer_avg"] for gs in gpu_correlations.values()] + + # Get comparison GPU data if available + compare_gpu = compare_stats.get("gpu_correlations", {}) if compare_stats else {} + compare_gpu_device = [gs["gpu_device_avg"] for gs in compare_gpu.values()] if compare_gpu else [] + compare_gpu_renderer = [gs["gpu_renderer_avg"] for gs in compare_gpu.values()] if compare_gpu else [] + + if all_gpu_device: + md += "### GPU Utilization Summary\n\n" + md += f"```\n" + sample_interval = stats.get("sample_interval_ms", 200) + md += f"Sample interval: {sample_interval}ms\n" + md += f"Correlated tests: {len(gpu_correlations)}\n" + md += f"Cold-start tests: {len(cold_start_tests)}\n" + + gpu_dev_avg = sum(all_gpu_device)/len(all_gpu_device) + gpu_dev_max = max(all_gpu_device) + gpu_rnd_avg = sum(all_gpu_renderer)/len(all_gpu_renderer) + gpu_rnd_max = max(all_gpu_renderer) + + if compare_gpu_device: + old_dev_avg = 
sum(compare_gpu_device)/len(compare_gpu_device) + old_dev_max = max(compare_gpu_device) + old_rnd_avg = sum(compare_gpu_renderer)/len(compare_gpu_renderer) + old_rnd_max = max(compare_gpu_renderer) + + dev_avg_delta = gpu_dev_avg - old_dev_avg + dev_max_delta = gpu_dev_max - old_dev_max + rnd_avg_delta = gpu_rnd_avg - old_rnd_avg + rnd_max_delta = gpu_rnd_max - old_rnd_max + + md += f"GPU Device avg: {gpu_dev_avg:>5.1f}% (was {old_dev_avg:>5.1f}%, Δ {dev_avg_delta:+.1f}%)\n" + md += f"GPU Device max: {gpu_dev_max:>5.1f}% (was {old_dev_max:>5.1f}%, Δ {dev_max_delta:+.1f}%)\n" + md += f"GPU Renderer avg: {gpu_rnd_avg:>5.1f}% (was {old_rnd_avg:>5.1f}%, Δ {rnd_avg_delta:+.1f}%)\n" + md += f"GPU Renderer max: {gpu_rnd_max:>5.1f}% (was {old_rnd_max:>5.1f}%, Δ {rnd_max_delta:+.1f}%)\n" + else: + md += f"GPU Device avg: {gpu_dev_avg:.1f}%\n" + md += f"GPU Device max: {gpu_dev_max:.1f}%\n" + md += f"GPU Renderer avg: {gpu_rnd_avg:.1f}%\n" + md += f"GPU Renderer max: {gpu_rnd_max:.1f}%\n" + + md += f"```\n\n" + md += "\n---\n\n" md += "## Files\n\n" md += f"- **Benchmark report:** `{input_file}`\n" md += f"- **Schema:** `benchmarks/schemas/report-v{stats['schema_version']}.schema.json`\n" + if gpu_correlations: + md += f"- **Memory data:** correlated\n" return md @@ -841,6 +1107,11 @@ def main(): type=Path, help='Output markdown file (default: auto-generated in benchmarks/reports/)' ) + parser.add_argument( + '--memory', + type=Path, + help='Memory monitoring JSONL file for GPU/memory correlation (v1.1 feature)' + ) args = parser.parse_args() @@ -874,6 +1145,36 @@ def main(): # Calculate statistics stats = calculate_statistics(data) + # Load memory data and correlate (v1.1 feature) + gpu_correlations = {} + cold_starts = {} + if args.memory: + if not args.memory.exists(): + print(f"❌ Memory file not found: {args.memory}") + sys.exit(1) + print(f"🔬 Loading memory data: {args.memory}") + memory_data = load_jsonl(args.memory) + # Extract summary entry (contains interval_ms, 
duration, etc.) + memory_summary = next((m["summary"] for m in memory_data if "summary" in m), {}) + # Filter out summary entry for sample processing + memory_samples = [m for m in memory_data if "summary" not in m] + sample_interval_ms = memory_summary.get("interval_ms", 200) + print(f"✓ Loaded {len(memory_samples)} memory samples (interval: {sample_interval_ms}ms)") + + # Correlate memory samples to tests + gpu_correlations = correlate_memory_to_tests(data, memory_samples) + print(f"✓ Correlated GPU data for {len(gpu_correlations)} tests") + + # Detect cold starts + cold_starts = detect_cold_starts(data) + cold_count = sum(1 for v in cold_starts.values() if v) + print(f"✓ Detected {cold_count} cold-start tests") + + # Store in stats for report generation + stats["gpu_correlations"] = gpu_correlations + stats["cold_starts"] = cold_starts + stats["sample_interval_ms"] = sample_interval_ms + # Load and calculate comparison statistics if requested compare_stats = None if args.compare: @@ -887,6 +1188,16 @@ def main(): compare_stats = calculate_statistics(compare_data) print(f"✓ Loaded {len(compare_data)} comparison entries") + # Auto-detect memory file for comparison run (v1.1) + compare_memory_file = find_memory_file_for_benchmark(args.compare) + if compare_memory_file: + print(f"🔬 Auto-detected comparison memory: {compare_memory_file.name}") + compare_memory_data = load_jsonl(compare_memory_file) + compare_memory_samples = [m for m in compare_memory_data if "summary" not in m] + compare_gpu_correlations = correlate_memory_to_tests(compare_data, compare_memory_samples) + print(f"✓ Correlated GPU data for {len(compare_gpu_correlations)} comparison tests") + compare_stats["gpu_correlations"] = compare_gpu_correlations + # Generate report markdown = generate_markdown(stats, input_file, args.compare, compare_stats) diff --git a/docs/ADR/ADR-016-Memory-Aware-Model-Loading.md b/docs/ADR/ADR-016-Memory-Aware-Model-Loading.md index bb82672..d34bb52 100644 --- 
a/docs/ADR/ADR-016-Memory-Aware-Model-Loading.md +++ b/docs/ADR/ADR-016-Memory-Aware-Model-Loading.md @@ -96,9 +96,53 @@ This is a **hardware fact** (from `sysctl -n hw.memsize`), not a heuristic. **Phase 2b:** ✅ Complete (2.0.4-beta.9) - Model Switching Memory Gate **Phase 3 (Future):** Issue #46 -- [ ] Configurable threshold (env var or CLI flag) -- [ ] Vision overhead estimation based on model architecture -- [ ] KV-Cache size estimation based on context length + +### Phase 3a: User-Configurable Limits +- [ ] `MLXK_MAX_IMAGES` env var +- [ ] `--max-images N` CLI flag +- [ ] `MLXK_MEMORY_THRESHOLD` env var (override 70%) + +### Phase 3b: Benchmark-Based Heuristics + +**Approach:** Empirical data instead of theoretical calculation. + +**Collect data:** +``` +For each vision model: + - Config: image_size, patch_size, hidden_size, vision_config + - Test: 1, 2, 4, 8, 16 images (various sizes) + - Measure: peak memory (memmon.py), OOM point + - Hardware: M1/M2/M3, 16/32/64/96 GB +``` + +**Find the correlation:** +``` +Peak_Memory ≈ f(image_size², patch_size, num_images, model_size) +``` + +**Derive a heuristic formula:** +```python +def estimate_vision_memory(config, num_images): + base = model_size_bytes + image_size = config.get("vision_config", {}).get("image_size", 384) + patch_size = config.get("vision_config", {}).get("patch_size", 14) + hidden_size = config.get("hidden_size", 4096) + + # Empirically calibrated constants from benchmarks + patches_per_image = (image_size / patch_size) ** 2 + per_image_overhead = patches_per_image * hidden_size * BYTES_PER_ACTIVATION + + return base + (per_image_overhead * num_images * SAFETY_FACTOR) +``` + +**Infrastructure:** `memmon.py` and `memplot.py` from beta.9 already exist.
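The `estimate_vision_memory` sketch in the ADR leaves `model_size_bytes`, `BYTES_PER_ACTIVATION`, and `SAFETY_FACTOR` undefined. A minimal runnable variant might look like the following. The constant values here are placeholder assumptions chosen only for illustration (16-bit activations, 50% headroom), and passing `model_size_bytes` as a parameter is also an assumption; Phase 3b would replace these with values calibrated from benchmark data.

```python
# Hypothetical placeholder constants; real values would be calibrated from benchmarks.
BYTES_PER_ACTIVATION = 2  # assumes 16-bit activations
SAFETY_FACTOR = 1.5       # assumed headroom multiplier


def estimate_vision_memory(config: dict, num_images: int, model_size_bytes: int) -> float:
    """Estimate peak memory in bytes for a vision model processing num_images."""
    vision_config = config.get("vision_config", {})
    image_size = vision_config.get("image_size", 384)
    patch_size = vision_config.get("patch_size", 14)
    hidden_size = config.get("hidden_size", 4096)

    # Fixed square patch grid: (image_size / patch_size)^2 patches per image.
    patches_per_image = (image_size / patch_size) ** 2
    per_image_overhead = patches_per_image * hidden_size * BYTES_PER_ACTIVATION

    return model_size_bytes + per_image_overhead * num_images * SAFETY_FACTOR


if __name__ == "__main__":
    # Illustrative config: 448 / 14 = 32 patches per side, 1024 patches per image.
    config = {"vision_config": {"image_size": 448, "patch_size": 14}, "hidden_size": 4096}
    for n in (1, 4, 16):
        estimate = estimate_vision_memory(config, n, model_size_bytes=8 * 1024**3)
        print(f"{n:>2} images: {estimate / 1024**3:.2f} GiB")
```

With these placeholder constants the per-image term grows linearly in `num_images`, so the estimate only holds for models with a fixed patch grid; models with dynamic tiling would need a different patch count per image.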
+ +**Multimodal complexity:** +- Dynamic tiling (Qwen2-VL, MiMo-VL) → variable patches per image +- Audio + Vision (Gemma-3n) → additional audio encoder +- Mixed input → combination overhead + +**Pragmatic fallback:** If the heuristic is uncertain → use the user limit (Phase 3a). --- diff --git a/mlxk2/audio/whisper_tokenizer.py b/mlxk2/audio/whisper_tokenizer.py index 7bafe44..a87ac98 100644 --- a/mlxk2/audio/whisper_tokenizer.py +++ b/mlxk2/audio/whisper_tokenizer.py @@ -24,9 +24,11 @@ import tiktoken # Try to import LANGUAGES from mlx-audio (still present in 0.3.1) # Fall back to our own definition if import fails +# Note: Catch Exception (not just ImportError) because mlx-audio imports MLX +# which may fail with AttributeError on incompatible systems try: from mlx_audio.stt.models.whisper.tokenizer import LANGUAGES, TO_LANGUAGE_CODE -except ImportError: +except Exception: # Fallback: Full LANGUAGES dict from mlx-audio LANGUAGES = { "en": "english", diff --git a/mlxk2/core/audio_runner.py b/mlxk2/core/audio_runner.py index 566bbc7..01e31de 100644 --- a/mlxk2/core/audio_runner.py +++ b/mlxk2/core/audio_runner.py @@ -30,71 +30,23 @@ from ..operations.workspace import is_workspace_path # ============================================================================ def _apply_tiktoken_patch(): - """Apply tiktoken asset patch globally at module import time.""" + """Apply tiktoken asset patch globally at module import time. + + Patches mlx-audio's get_encoding() to use our bundled tiktoken files + from mlxk2/assets/whisper/ instead of the removed upstream assets. + + The actual implementation lives in mlxk2.audio.whisper_tokenizer to avoid + code duplication. This function just wires it up as a monkey-patch.
+ """ try: - import base64 - import tiktoken - from functools import lru_cache + # Import mlx-audio's tokenizer module (target of the patch) + import mlx_audio.stt.models.whisper.tokenizer as mlx_whisper_tokenizer - # Import tokenizer module first (but don't trigger get_encoding yet) - import mlx_audio.stt.models.whisper.tokenizer as whisper_tokenizer - - # Get path to our bundled tiktoken assets - assets_dir = Path(__file__).parent.parent / "assets" / "whisper" - if not assets_dir.exists(): - # Assets not found - skip patching (will fall back to HF WhisperProcessor) - return - - @lru_cache(maxsize=None) - def patched_get_encoding(name: str = "gpt2", num_languages: int = 99): - """Patched get_encoding using mlxk2's bundled tiktoken files.""" - vocab_path = assets_dir / f"{name}.tiktoken" - - if not vocab_path.exists(): - raise FileNotFoundError( - f"Tiktoken vocabulary file not found: {vocab_path}\n" - f"mlx-audio Issue #479: assets were removed in f7328a4" - ) - - with open(vocab_path) as fid: - ranks = { - base64.b64decode(token): int(rank) - for token, rank in (line.split() for line in fid if line) - } - - n_vocab = len(ranks) - special_tokens = {} - - # Build special tokens (from mlx-audio tokenizer.py:343-358) - from mlx_audio.stt.models.whisper.tokenizer import LANGUAGES - - specials = [ - "<|endoftext|>", - "<|startoftranscript|>", - *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]], - "<|translate|>", - "<|transcribe|>", - "<|startoflm|>", - "<|startofprev|>", - "<|nospeech|>", - "<|notimestamps|>", - *[f"<|{i * 0.02:.2f}|>" for i in range(1501)], - ] - - for token in specials: - special_tokens[token] = n_vocab - n_vocab += 1 - - return tiktoken.Encoding( - name=vocab_path.name, - explicit_n_vocab=n_vocab, - pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""", - mergeable_ranks=ranks, - special_tokens=special_tokens, - ) + # Import our bundled implementation + from mlxk2.audio.whisper_tokenizer import 
get_encoding # Patch the module globally - whisper_tokenizer.get_encoding = patched_get_encoding + mlx_whisper_tokenizer.get_encoding = get_encoding except ImportError: # mlx-audio not installed - skip patching diff --git a/mlxk2/core/capabilities.py b/mlxk2/core/capabilities.py index 59fa165..f20d8f0 100644 --- a/mlxk2/core/capabilities.py +++ b/mlxk2/core/capabilities.py @@ -78,18 +78,25 @@ VISION_MODEL_TYPES = frozenset({ # STT (Speech-to-Text) model types - Audio ONLY models (no text generation/chat) # These models transcribe audio to text, they cannot generate text or have conversations +# Note: audio_runtime_compatibility() uses substring matching for flexibility +# For 2.0.4 stable: Only Whisper is production-ready +# - Voxtral: tekken.json tokenizer bug (mlx-audio#450) +# - VibeVoice: Missing tokenizer, upstream bug (docs/ISSUES/vibevoice-missing-tokenizer.md) +# - Qwen3-ASR: Not yet released in mlx-audio STT_MODEL_TYPES = frozenset({ - "voxtral", # Voxtral (Audio → Text) - mlx-audio backend - "whisper", # OpenAI Whisper variants - mlx-audio backend + "whisper", # OpenAI Whisper variants - mlx-audio backend (only production-ready) }) # Audio model types (ADR-019, ADR-020) - All audio-capable models # Includes both STT models AND multimodal chat models with audio +# Note: detect_audio_capability() uses substring matching for flexibility AUDIO_MODEL_TYPES = frozenset({ "gemma3n", # Google Gemma 3n (Vision + Audio + Text) - mlx-vlm backend "gemma3n_audio", # Audio encoder subcomponent "voxtral", # Voxtral (Audio → Text) - mlx-audio backend "whisper", # OpenAI Whisper variants - mlx-audio backend + "vibevoice", # VibeVoice-ASR (Whisper-based with diarization) - mlx-audio backend + "audio", # Generic audio/STT models - mlx-audio backend }) diff --git a/mlxk2/operations/common.py b/mlxk2/operations/common.py index 7c05eb9..3a636c3 100644 --- a/mlxk2/operations/common.py +++ b/mlxk2/operations/common.py @@ -294,10 +294,13 @@ def detect_audio_capability(probe: 
Path, config: Optional[Dict[str, Any]]) -> bo if "audio_config" in config: return True - # Check model_type (Whisper, Voxtral, Gemma-3n) + # Check model_type (Whisper, Voxtral, VibeVoice, Gemma-3n) + # Uses substring matching for flexibility (e.g., "vibevoice_asr" matches "vibevoice") mt = config.get("model_type") - if isinstance(mt, str) and mt.lower() in AUDIO_MODEL_TYPES: - return True + if isinstance(mt, str): + mt_lower = mt.lower() + if any(audio_type in mt_lower for audio_type in AUDIO_MODEL_TYPES): + return True # Check preprocessor_config.json for WhisperFeatureExtractor (Whisper variants) preprocessor_config_path = probe / "preprocessor_config.json" @@ -370,8 +373,11 @@ def detect_audio_backend(probe: Path, config: Optional[Dict[str, Any]]) -> Optio if isinstance(vision_config, dict) and len(vision_config) > 0: return Backend.MLX_VLM - # Priority 3: Whisper model_type = mlx-audio STT - if "whisper" in model_type_lower: + # Priority 3: STT model_type = mlx-audio STT (Whisper only for 2.0.4 stable) + # Uses substring matching for flexibility + # Note: Voxtral/VibeVoice/Qwen3-ASR excluded - see docs/ISSUES/ + stt_model_types = ["whisper"] + if any(stt in model_type_lower for stt in stt_model_types): return Backend.MLX_AUDIO # Priority 4: WhisperFeatureExtractor in preprocessor = mlx-audio STT @@ -488,9 +494,11 @@ def audio_runtime_compatibility( if spec is None: return False, "mlx-audio not installed (pip install mlx-knife[audio])" - # Gate 2: model_type must be supported by mlx-audio (Whisper, Voxtral only) + # Gate 2: model_type must be supported by mlx-audio # This catches mis-routed models like Qwen3-Omni that have WhisperFeatureExtractor # but are NOT STT models (they're multimodal with unsupported architecture) + # Known STT types: whisper, voxtral, vibevoice, audio (generic); only whisper passes this gate in 2.0.4 stable + # Note: Must stay in sync with stt_model_types in detect_audio_backend() if probe is not None: config_path = probe / "config.json" if config_path.exists(): @@ -499,9 +507,11 @@ def
audio_runtime_compatibility( model_type = config.get("model_type", "") if isinstance(model_type, str): model_type_lower = model_type.lower() - # Check if model_type is a known STT type - if not any(stt in model_type_lower for stt in ["whisper", "voxtral"]): - return False, f"Model type '{model_type}' not supported by mlx-audio (only Whisper/Voxtral)" + # Check if model_type is a known STT type (substring match) + # For 2.0.4 stable: Only Whisper is production-ready + stt_model_types = ["whisper"] + if not any(stt in model_type_lower for stt in stt_model_types): + return False, f"Model type '{model_type}' not supported (only Whisper is production-ready in 2.0.4)" except Exception: pass # If config can't be read, proceed (health check will catch it) diff --git a/mlxk2/operations/run.py b/mlxk2/operations/run.py index 7fa3fcf..6e819ce 100644 --- a/mlxk2/operations/run.py +++ b/mlxk2/operations/run.py @@ -418,7 +418,8 @@ def run_model( # Audio models: check based on backend (ADR-020) audio_backend = detect_audio_backend(model_path, config) if audio_backend: - compatible, reason = audio_runtime_compatibility(audio_backend) + # Pass probe and framework for model_type/tekken.json gates + compatible, reason = audio_runtime_compatibility(audio_backend, model_path, framework) else: # Fallback: unknown audio model compatible, reason = False, "Unknown audio backend" diff --git a/tests_2.0/test_capabilities.py b/tests_2.0/test_capabilities.py index 7ec409d..1e203f9 100644 --- a/tests_2.0/test_capabilities.py +++ b/tests_2.0/test_capabilities.py @@ -575,3 +575,66 @@ class TestMemoryThreshold: # Allow small rounding difference assert abs(threshold - expected) < 1024**2 # Within 1MB + + +class TestEmbeddingGate: + """Tests for embedding model runtime compatibility gate (common.py:636-639). + + Embedding models should be detected but blocked from `mlxk run` with a + helpful message pointing to `mlxk embed`. 
+ """ + + def test_detect_capabilities_embedding_model_type(self, tmp_path): + """model_type='embedding' should return only EMBEDDINGS capability.""" + from mlxk2.operations.common import detect_capabilities + + caps = detect_capabilities( + model_type="embedding", + hf_name="test/embedding-model", + tok_hints={}, + config={}, + probe=tmp_path, + ) + + assert caps == ["embeddings"] + assert "text-generation" not in caps + + def test_embedding_model_runtime_incompatible(self, tmp_path): + """Embedding models should have runtime_compatible=False with helpful reason.""" + from mlxk2.operations.common import build_model_object + + # Create minimal embedding model structure with workspace sentinel + config = {"model_type": "embedding"} + (tmp_path / "config.json").write_text(json.dumps(config)) + (tmp_path / "model.safetensors").write_bytes(b"x" * 100) + # Add workspace sentinel so it's treated as workspace path (ADR-018) + sentinel = {"managed_by": "mlxk", "source": "test"} + (tmp_path / ".mlxk-workspace.json").write_text(json.dumps(sentinel)) + # Add README with MLX library tag so framework is detected as MLX + readme = """--- +library_name: mlx +--- +# Test Embedding Model +""" + (tmp_path / "README.md").write_text(readme) + + # Use absolute path so it's treated as workspace path + result = build_model_object( + hf_name=str(tmp_path), # Absolute path = workspace + model_root=tmp_path, + selected_path=tmp_path, + ) + + assert result["runtime_compatible"] is False + assert "mlxk embed" in result["reason"] + assert "embeddings" in result["capabilities"] + + def test_embedding_model_detected_by_name_heuristic(self, tmp_path): + """Models with 'embedding' in name should be detected as embedding models.""" + config = {"model_type": "bert"} # Not explicitly "embedding" type + (tmp_path / "config.json").write_text(json.dumps(config)) + + caps = probe_model_capabilities(tmp_path, "test/text-embedding-3-small") + + assert caps.is_embedding is True + assert "embeddings" in 
caps.capabilities_list
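The substring-based STT gate that this change introduces in `detect_audio_backend()` and `audio_runtime_compatibility()` can be sketched in isolation as below. This is a minimal illustration, not the project's code: `STT_KEYWORDS` and `stt_gate` are hypothetical names, and the sketch assumes that a non-string `model_type` simply falls through to later gates, as the `isinstance` check in the diff suggests.

```python
from typing import Tuple

# Hypothetical stand-in for the production-ready keyword list
# (the diff calls it stt_model_types and sets it to ["whisper"]).
STT_KEYWORDS = ("whisper",)


def stt_gate(model_type: object) -> Tuple[bool, str]:
    """Return (compatible, reason) using case-insensitive substring matching."""
    if not isinstance(model_type, str):
        # Assumption: non-string values are not rejected by this gate.
        return True, ""
    lowered = model_type.lower()
    if any(keyword in lowered for keyword in STT_KEYWORDS):
        return True, ""
    return False, (
        f"Model type '{model_type}' not supported "
        "(only Whisper is production-ready in 2.0.4)"
    )


if __name__ == "__main__":
    for candidate in ("whisper", "Whisper_Large_V3", "voxtral", "vibevoice_asr"):
        print(candidate, stt_gate(candidate))
```

Substring matching lets variant names that contain `whisper` pass without listing each one, at the cost of also admitting any unrelated type whose name happens to contain the keyword.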