Release 2.0.4-beta.8: Audio transcription support (experimental)

Audio input via --audio flag (CLI) and input_audio content type (Server API). Uses mlx-vlm native audio processing. ~30s duration limit (model constraint). Currently only Gemma-3n tested (requires --repair-index fix). Also includes: - SERVER-HANDBOOK compliance (image limits, validation error envelopes) - Dependency updates: mlx>=0.30.0, mlx-lm>=0.30.0, huggingface-hub>=1.0.0 - Audio E2E test suite + ADR-019
2026-07-01 20:44:14 -04:00 · 2026-01-23 20:20:59 +01:00
parent 5751545b8b
commit e8b10ea10b
33 changed files with 2346 additions and 197 deletions
@@ -4,9 +4,9 @@
  <img src="https://github.com/mzau/mlx-knife/raw/main/mlxk-demo.gif" alt="MLX Knife Demo" width="900">
 </p>

-**Current Version: 2.0.4-beta.7** (Stable: 2.0.3)
+**Current Version: 2.0.4-beta.8** (Stable: 2.0.3)

-[![GitHub Release](https://img.shields.io/badge/version-2.0.4--beta.7-blue.svg)](https://github.com/mzau/mlx-knife/releases)
+[![GitHub Release](https://img.shields.io/badge/version-2.0.4--beta.8-blue.svg)](https://github.com/mzau/mlx-knife/releases)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
 [![Python 3.9+](https://img.shields.io/badge/python-3.10+(3.9)-blue.svg)](https://www.python.org/downloads/)
 [![Apple Silicon](https://img.shields.io/badge/Apple%20Silicon-green.svg)](https://support.apple.com/en-us/HT211814)
@@ -18,6 +18,7 @@
 ## Features

 ### What's New in 2.0.4 (Coming Soon - Currently Beta)
+- **Audio Transcription (Experimental)** - Speech-to-text via `--audio` flag (CLI + Server API)
 - **Vision Models with EXIF Metadata** - Image analysis + automatic GPS/date/camera extraction visible to the model
 - **Unix Pipe Integration** - Chain models without temp files (`vision → text` workflows)
 - **Local Development Workflow** - Clone → Repair → Test models without HuggingFace round-trips
@@ -76,7 +77,7 @@ This license applies **only** to the `mlx-knife` code and **does not extend** to
 MLX Knife has been comprehensively tested and verified on:

 ✅ **Python 3.9.6 - 3.14** - Text LLMs fully supported (mlx-lm 0.28.4+)
-✅ **Python 3.10 - 3.14** - Vision models supported (mlx-vlm 0.3.9+; beta.7 uses commit fc8c92e31983a52761f37d503f903ec40bebbd62 with MXFP4 support)
+✅ **Python 3.10 - 3.14** - Vision models supported (mlx-vlm 0.3.9+; beta.8 uses commit 5812270 with audio + MXFP4 support)

 **Note:** Vision features require Python 3.10+. Native macOS Python 3.9.6 users need to upgrade (e.g., via Homebrew).

@@ -106,17 +107,17 @@ mlxk --version  # → mlxk 2.0.3 (latest stable on PyPI)
 ### Via GitHub (Latest Beta)

 ```bash
-# Install 2.0.4-beta.7 (Workspace discovery + Server robustness)
-pip install "git+https://github.com/mzau/mlx-knife.git@v2.0.4-beta.7"
+# Install 2.0.4-beta.8 (Audio transcription + Server enhancements)
+pip install "git+https://github.com/mzau/mlx-knife.git@v2.0.4-beta.8"

 # With Vision support (Python 3.10+ required)
-pip install "git+https://github.com/mzau/mlx-knife.git@v2.0.4-beta.7#egg=mlx-knife[vision]"
+pip install "git+https://github.com/mzau/mlx-knife.git@v2.0.4-beta.8#egg=mlx-knife[vision]"

 # Verify installation
-mlxk --version  # → mlxk 2.0.4b7
+mlxk --version  # → mlxk 2.0.4b8
 ```

-**Beta.7 note:** Uses mlx-vlm commit fc8c92e (includes MXFP4 quantization support). The `[vision]` extra automatically installs the correct version.
+**Beta.8 note:** Uses mlx-vlm commit 5812270 (includes audio support + MXFP4 quantization). The `[vision]` extra automatically installs the correct version.

 **For production use:** Wait for 2.0.4 stable on PyPI (requires mlx-vlm 0.3.10 release).

@@ -134,7 +135,7 @@ pip install -e ".[dev,test]"
 pip install -e ".[dev,test,vision]"

 # Verify installation
-mlxk --version  # → mlxk 2.0.4b7
+mlxk --version  # → mlxk 2.0.4b8

 # Run tests and quality checks (before committing)
 pytest -v
@@ -210,7 +211,7 @@ mlxk run "Phi-4" "Hello"                    # Fuzzy match
 mlxk show "Qwen3@e96" --json                # Specific version
 ```

-### Local Paths (2.0.4-beta.7+)
+### Local Paths (2.0.4-beta.6+)

 | Format | Example |
 |--------|---------|
@@ -240,7 +241,7 @@ mlxk run ./fixed "Test"

 ---

-## Workspace Development Workflow (2.0.4-beta.7+)
+## Workspace Development Workflow (2.0.4-beta.6+)

 **Complete local development cycle** for model experimentation, repair, and testing without HuggingFace round-trips:

@@ -310,8 +311,8 @@ Image analysis via the `--image` flag (CLI and server). Requires Python 3.10+.

 - **Python 3.10+** (mlx-vlm dependency)
 - **Installation:** `pip install mlx-knife[vision]`
- **Backend:** mlx-vlm commit c536165df2b3b4aece3a795b2e414349f935e750 (auto-installed)
- **Beta.4 note:** The `[vision]` extra automatically installs mlx-vlm from git with the Pixtral pad_token fix. Will switch to PyPI v0.3.10 when released.
+- **Backend:** mlx-vlm commit 5812270 (audio + MXFP4 support, auto-installed)
+- **Beta.8 note:** The `[vision]` extra automatically installs mlx-vlm from git with audio support. Will switch to PyPI v0.3.10 when released.

 #### Usage

@@ -475,6 +476,54 @@ mlxk convert <model> <output> --repair-index

 **Reporting Issues:** If you encounter vision model failures, please report with model name and error message to help improve compatibility tracking.

+### Audio Model Compatibility
+
+> **🧪 Experimental:** Audio support is new in v2.0.4-beta.8. Currently only Gemma-3n tested.
+
+**✅ Tested & Working Models** (mlx-knife v2.0.4-beta.8):
+
+| Model | Size | Notes |
+|-------|------|-------|
+| `gemma-3n-E2B-it-4bit` | ~2.1GB | Requires workspace repair workflow (see below) |
+
+**⚠️ Setup Required:** The mlx-community model has an index mismatch (mlx-vlm #624). Prepare once:
+
+```bash
+MLXK2_ENABLE_ALPHA_FEATURES=1
+mlxk clone mlx-community/gemma-3n-E2B-it-4bit ./gemma-3n-audio
+mlxk convert ./gemma-3n-audio ./gemma-3n-audio-FIXED --repair-index
+mlxk run ./gemma-3n-audio-FIXED --audio test.wav  # Now works
+```
+
+**⚙️ Audio-Specific Defaults:**
+
+| Setting | Audio Default | Text/Vision Default | Reason |
+|---------|---------------|---------------------|--------|
+| Temperature | 0.2 | 0.7 | Reduces multilingual drift on 4-bit models |
+| Default Prompt | "Transcribe this audio." | - | Simple prompt reduces multilingual drift |
+
+**⚠️ Known Limitations:**
+
+| Limitation | Details | Workaround |
+|------------|---------|------------|
+| **Duration limit** | ~30 seconds max (Gemma-3n model constraint: 188 tokens at 6.25 tokens/sec) | Split longer recordings |
+| File size | 5MB limit | Split larger files |
+| Multi-audio | Not supported (mlx-vlm token mismatch bug) | Process one file at a time |
+| Audio+Vision combined | Audio silently ignored when images present | Use audio-only or vision-only |
+| Phonetic errors | Mishearing observed (e.g., "A man" → "Amen") | Try `--temperature 0` for consistency |
+| Format support | WAV confirmed; other formats untested | Convert to WAV |
+
+**💡 Tips for Best Results:**
+
+```bash
+# After workspace setup (see above):
+# Explicit transcription with greedy sampling for consistency
+mlxk run ./gemma-3n-audio-FIXED --audio speech.wav --temperature 0
+
+# Default settings (temperature 0.2, simple prompt)
+mlxk run ./gemma-3n-audio-FIXED --audio speech.wav
+```
+

 ## JSON API

@@ -1182,7 +1231,7 @@ Apache License 2.0 — see `LICENSE` (root) and `mlxk2/NOTICE`.

 <p align="center">
  <b>Made with ❤️ by The BROKE team <img src="broke-logo.png" alt="BROKE Logo" width="30" align="middle"></b><br>
-  <i>Version 2.0.4-beta.7 | January 2026</i><br>
+  <i>Version 2.0.4-beta.8 | January 2026</i><br>
  <a href="https://github.com/mzau/broke-nchat">💬 Web UI: nChat - lightweight chat interface</a> •
  <a href="https://github.com/mzau/broke-cluster">🔮 Multi-node: BROKE Cluster</a>
 </p>