Release 2.0.4-beta.10: Audio PyPI fix (tiktoken workaround complete)

Audio/Whisper works with pip install - no Git workaround needed.
See CHANGELOG.md for details.

Tested: 647 passed, 11 skipped (Python 3.10-3.12)
This commit is contained in:
The BROKE Cluster Team
2026-02-05 10:42:50 +01:00
parent 40479b0247
commit e021fb32cd
13 changed files with 819 additions and 38 deletions
+2 -1
View File
@@ -4,7 +4,7 @@ venv31?/
venv_*/
venv-*/
test_env*/
test_results*.log
test-install-venv/
mypy_*.log
ruff_*.log
*/__pycache__/*
@@ -29,6 +29,7 @@ ML-workspaces/
# Test artifacts (generated reports)
*_report.json
test_results_3_*.log
test-img-collection/
small-img-collection
benchmarks/reports/*.html
+40
View File
@@ -1,5 +1,45 @@
# Changelog
## [2.0.4-beta.10] - 2026-02-05
> **⚠️ Upgrade Notice:** If you installed beta.9 from PyPI, audio transcription does not work due to an incomplete tiktoken patch. Please upgrade to beta.10: `pip install mlx-knife[all]==2.0.4b10`
### Highlights
**Audio Works Out-of-the-Box:** Complete tiktoken workaround for mlx-audio Issue #479. PyPI installation (`pip install mlx-knife[audio]`) now works without any manual Git installs. We bundle the full Whisper tokenizer (~340 LOC) from mlx-audio commit 9349644 and patch `Model.get_tokenizer()` at runtime to fallback to tiktoken when HuggingFace processor is unavailable.
**Beta.9 Audio Bug:** The beta.9 release on PyPI had an incomplete tiktoken patch - it bundled the assets but didn't patch `Model.get_tokenizer()` (the class was incorrectly named `Whisper` instead of `Model`). This caused "Processor not found" errors with Whisper models.
**Runtime Compatibility Accuracy:** Fixed `runtime_compatible` field in `mlxk list --health` showing incorrect values. Now properly gates embedding models, mis-routed audio models (Qwen3-Omni), transformers 5.x video_processor bugs, and unsupported tokenizers (Voxtral tekken.json).
### Added
- **Whisper Tokenizer Patch (mlx-audio Issue #479):**
- New `mlxk2/audio/` module with `whisper_tokenizer.py` (~340 LOC)
- Complete `Tokenizer` class and `get_tokenizer()` from mlx-audio commit 9349644
- `audio_runner.py`: `_apply_whisper_tokenizer_patch()` patches `Model.get_tokenizer()`
- tiktoken>=0.7.0 dependency (OpenAI core library, stable API)
- Bundled tiktoken assets: `gpt2.tiktoken`, `multilingual.tiktoken`
### Fixed
- **Audio/Whisper with PyPI mlx-audio:** `pip install mlx-knife[audio]` now works without Git install workaround. The tiktoken regression in mlx-audio 0.3.1 (Issue #479) is fully patched.
- **`runtime_compatible` Accuracy:**
- Embedding models (Qwen3-Embedding) → Gate 5: "not supported by mlxk run"
- Mis-routed audio models (Qwen3-Omni) → Gate 3a: "model_type not supported by mlx-audio"
- transformers 5.x bugs (Qwen2-VL, MiMo-VL) → Gate 4a: "Video processor bug"
- Voxtral tekken.json → Gate 3a: "tekken.json tokenizer not supported"
### Changed
- **Documentation:**
- README.md: Audio installation simplified (no more Git install instructions)
- ARCHITECTURE.md: Added "Runtime Compatibility Decision Tree" and Probe concept
- ADR-020: Qwen3-Omni clarification (routes to mlx-vlm, not mlx-audio)
---
## [2.0.4-beta.9] - 2026-02-04
### Highlights
+13 -15
View File
@@ -4,9 +4,9 @@
<img src="https://github.com/mzau/mlx-knife/raw/main/mlxk-demo.gif" alt="MLX Knife Demo" width="900">
</p>
**Current Version: 2.0.4-beta.9** (Stable: 2.0.3)
**Current Version: 2.0.4-beta.10** (Stable: 2.0.3)
[![GitHub Release](https://img.shields.io/badge/version-2.0.4--beta.9-blue.svg)](https://github.com/mzau/mlx-knife/releases)
[![GitHub Release](https://img.shields.io/badge/version-2.0.4--beta.10-blue.svg)](https://github.com/mzau/mlx-knife/releases)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Python 3.10-3.12](https://img.shields.io/badge/python-3.10--3.12-blue.svg)](https://www.python.org/downloads/)
[![Apple Silicon](https://img.shields.io/badge/Apple%20Silicon-green.svg)](https://support.apple.com/en-us/HT211814)
@@ -17,6 +17,8 @@
## Features
> **⚠️ Beta.9 Audio Bug:** If you installed `mlx-knife[audio]==2.0.4b9` from PyPI, audio transcription fails with "Processor not found". Upgrade to beta.10: `pip install mlx-knife[all]==2.0.4b10`
### What's New in 2.0.4 (Coming Soon - Currently Beta)
- **Audio Transcription (STT)** - Whisper speech-to-text (`--audio` flag, `pip install mlx-knife[audio]`)
- **Vision Models with EXIF Metadata** - Image analysis + automatic GPS/date/camera extraction visible to the model
@@ -94,15 +96,15 @@ mlxk --version # → mlxk 2.0.3
**Requirements:** macOS Apple Silicon, Python 3.9-3.12
### 2. PyPI Beta (2.0.4-beta.9 - Text + Vision + Audio)
### 2. PyPI Beta (2.0.4-beta.10 - Text + Vision + Audio)
```bash
pip install mlx-knife[all]==2.0.4b9
mlxk --version # → mlxk 2.0.4b9
pip install mlx-knife[all]==2.0.4b10
mlxk --version # → mlxk 2.0.4b10
```
**Requirements:** macOS Apple Silicon, Python 3.10-3.12
**Features:** Audio STT (Whisper), Vision with EXIF metadata, tiktoken workaround bundled
**Features:** Audio STT (Whisper), Vision with EXIF metadata, full tiktoken workaround
### 3. Developer Installation
@@ -111,7 +113,7 @@ git clone https://github.com/mzau/mlx-knife.git
cd mlx-knife
pip install -e ".[all,dev,test]"
mlxk --version # → mlxk 2.0.4b9
mlxk --version # → mlxk 2.0.4b10
pytest -v
```
@@ -449,18 +451,14 @@ mlxk convert <model> <output> --repair-index
### Audio Transcription (Speech-to-Text)
> **🎙️ New in beta.9:** Professional STT via dedicated Whisper models (mlx-audio backend). Backward compatible with Gemma-3n multimodal audio (mlx-vlm).
> **🎙️ New in beta.9/10:** Professional STT via dedicated Whisper models (mlx-audio backend). Beta.10 fixes PyPI install (no Git workaround needed). Backward compatible with Gemma-3n multimodal audio (mlx-vlm).
**Requirements:**
- **Python 3.10+** (mlx-audio dependency)
- **Installation:** (mlx-audio 0.3.1 PyPI has regression - install from Git):
```bash
pip install -e "git+https://github.com/Blaizzy/mlx-audio.git@9349644#egg=mlx-audio"
pip install tiktoken
```
- **Installation:** `pip install mlx-knife[audio]` (tiktoken workaround bundled)
- **No system dependencies:** MP3/WAV decoding via embedded libsndfile (no ffmpeg or Homebrew required)
**✅ Recommended Models** (mlx-knife v2.0.4-beta.9):
**✅ Recommended Models** (mlx-knife v2.0.4-beta.10):
| Model | Backend | Size | Duration | Notes |
|-------|---------|------|----------|-------|
@@ -1241,7 +1239,7 @@ Apache License 2.0 — see `LICENSE` (root) and `mlxk2/NOTICE`.
<p align="center">
<b>Made with ❤️ by The BROKE team <img src="broke-logo.png" alt="BROKE Logo" width="30" align="middle"></b><br>
<i>Version 2.0.4-beta.9 | February 2026</i><br>
<i>Version 2.0.4-beta.10 | February 2026</i><br>
<a href="https://github.com/mzau/broke-nchat">💬 Web UI: nChat - lightweight chat interface</a> •
<a href="https://github.com/mzau/broke-cluster">🔮 Multi-node: BROKE Cluster</a>
</p>
+2 -2
View File
@@ -4,7 +4,7 @@ This document contains version-specific details, complete file listings, and imp
## Current Status
**2.0.4-beta.9** — Audio transcription (Whisper via mlx-audio); Server `/v1/audio/transcriptions` endpoint; Probe/Policy architecture complete; Vision support Phase 1-3 (CLI + Server); Pipes/Memory-Aware; EXIF metadata; **Test Portfolio Separation complete**; Workspace Infrastructure (ADR-018 Phase 0a+0b+0c); Convert Operation (ADR-018 Phase 1); Resumable Clone; **Benchmark Schema v0.2.2** (Precise test timing).
**2.0.4-beta.10** **Audio PyPI Fix** (tiktoken workaround complete); Runtime compatibility accuracy; Audio transcription (Whisper via mlx-audio); Server `/v1/audio/transcriptions` endpoint; Probe/Policy architecture complete; Vision support Phase 1-3 (CLI + Server); Pipes/Memory-Aware; EXIF metadata; **Test Portfolio Separation complete**; Workspace Infrastructure (ADR-018 Phase 0a+0b+0c); Convert Operation (ADR-018 Phase 1); Resumable Clone; **Benchmark Schema v0.2.2** (Precise test timing).
### Test Results (Official Reference)
@@ -1614,7 +1614,7 @@ MLXK2_LIVE_PUSH=1 \
---
### A5. Complete Test File Structure (2.0.4-beta.9)
### A5. Complete Test File Structure (2.0.4-beta.10)
```
scripts/
+31 -1
View File
@@ -235,10 +235,40 @@ def detect_audio_backend(probe: Path, config: Optional[Dict]) -> Optional[Backen
| Whisper-* | `model_type: whisper*` | - | WhisperFeatureExtractor | MLX_AUDIO |
| VibeVoice-ASR | Name heuristic | - | WhisperFeatureExtractor | MLX_AUDIO |
| Gemma-3n | audio_config | vision_config (populated) | - | MLX_VLM |
| Qwen3-Omni | audio_config | vision_config (populated) | - | MLX_VLM |
**Note on Voxtral:** Config has `audio_config` but empty `vision_config: {}`. Priority 1 ensures it routes to mlx-audio (not mlx-vlm) per blaizzy's guidance. Works for both Original Mistral and mlx-knife converted variants.
#### Qwen3-Omni Model Family: Special Considerations
**Model Config Reality (mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit):**
```json
{
"model_type": "qwen3_omni_moe",
"audio_config": /* NOT PRESENT */,
"vision_config": {} /* EMPTY dict */
}
```
Plus `preprocessor_config.json` contains `"feature_extractor_type": "WhisperFeatureExtractor"`.
**Current Detection Result:**
- Priority 2 (audio_config + vision_config): **Skipped** (no audio_config)
- Priority 4 (WhisperFeatureExtractor): **Matched**`Backend.MLX_AUDIO`
**Runtime Compatibility Issue:**
- `model_type: "qwen3_omni_moe"` is NOT supported by mlx-lm
- Therefore `runtime_compatible: False` with reason: "Model type qwen3_omni_moe not supported."
**Status:** Qwen3-Omni is currently **not runnable** via mlx-knife because:
1. mlx-audio doesn't support `qwen3_omni_moe` model architecture
2. mlx-vlm doesn't support `qwen3_omni_moe` model architecture
3. The model lacks `audio_config` so it doesn't route to MLX_VLM anyway
**Future:** When mlx-vlm or mlx-audio adds Qwen3-Omni support, the detection logic may need adjustment to:
- Add Priority 1.5: `model_type == "qwen3_omni*"` → appropriate backend
- Or: Model converter creates `audio_config` during conversion
### Complete Routing Hierarchy
**Three-tier routing logic (run.py:452-620):**
+124
View File
@@ -22,6 +22,130 @@ All code paths follow this sequence:
**Rationale:** Consistent probing and policy enforcement prevents silent fallbacks and ensures errors are visible at the earliest possible stage.
#### The Probe Concept
**User Perspective (UX):**
"Probe" is the step where mlx-knife reads a model's metadata files to understand:
- What the model can do (text, vision, audio, embeddings)
- Whether it's healthy (files present, formats correct)
- Whether it can run on this system (backend availability, memory)
This happens automatically during `mlxk list`, `mlxk show`, `mlxk run`, and server model loading.
**Implementation:**
A "probe" is a `Path` object pointing to the model's snapshot directory containing:
```
probe/
├── config.json # Model architecture (model_type, vision_config, audio_config)
├── tokenizer_config.json # Chat template, tokenizer settings
├── tokenizer.json # Vocabulary (optional, may be sentencepiece)
├── preprocessor_config.json # Vision/audio processor info (optional)
└── *.safetensors # Weight files (checked for naming convention)
```
**Key Probe Functions:**
| Function | Location | Purpose |
|----------|----------|---------|
| `detect_framework()` | `common.py` | MLX vs PyTorch vs GGUF |
| `detect_model_type()` | `common.py` | chat, base, audio, embedding |
| `detect_capabilities()` | `common.py` | text-generation, vision, audio, embeddings |
| `detect_vision_capability()` | `common.py` | vision_config, preprocessor_config |
| `detect_audio_capability()` | `common.py` | audio_config, WhisperFeatureExtractor |
| `detect_audio_backend()` | `common.py` | MLX_AUDIO (STT) vs MLX_VLM (multimodal) |
| `check_runtime_compatibility()` | `health.py` | Backend supports model_type |
**Probe vs Config:**
- `probe`: The `Path` to the snapshot directory (filesystem location)
- `config`: The parsed `dict` from `config.json` (model metadata)
Both are passed to detection functions because some signals come from config fields, others from file presence.
#### Runtime Compatibility Decision Tree
The `runtime_compatible` field in `mlxk list --json` follows this decision tree:
```
runtime_compatible?
├─[1] healthy == False?
│ └─→ False (reason from health check)
├─[2] framework != "MLX"?
│ └─→ False ("Incompatible framework: {framework}")
├─[3] has_audio AND audio_backend != None?
│ │
│ ├─[3a] audio_backend == MLX_AUDIO?
│ │ │
│ │ ├─ mlx-audio not installed?
│ │ │ └─→ False ("mlx-audio not installed")
│ │ │
│ │ ├─ model_type NOT in [whisper*, voxtral]?
│ │ │ └─→ False ("Model type '{x}' not supported by mlx-audio")
│ │ │
│ │ └─ tekken.json exists WITHOUT tokenizer.json?
│ │ └─→ False ("Voxtral tekken.json tokenizer not supported")
│ │
│ └─[3b] audio_backend == MLX_VLM?
│ │
│ ├─ vision_runtime_compatibility(probe) fails?
│ │ └─→ False (vision reason)
│ │
│ └─ check_runtime_compatibility(probe) fails?
│ └─→ False ("Model type '{x}' not supported")
├─[4] has_vision?
│ │
│ ├─[4a] vision_runtime_compatibility(probe):
│ │ │
│ │ ├─ Python < 3.10?
│ │ │ └─→ False ("Vision requires Python 3.10+")
│ │ │
│ │ ├─ mlx-vlm not installed?
│ │ │ └─→ False ("mlx-vlm not installed")
│ │ │
│ │ └─ transformers 5.x + temporal_patch_size?
│ │ └─→ False ("Video processor bug in transformers 5.x")
│ │
│ └─[4b] check_runtime_compatibility(probe):
│ │
│ └─ mlx-lm doesn't support model_type?
│ └─→ False ("Model type '{x}' not supported")
├─[5] "embeddings" in capabilities?
│ └─→ False ("Embedding models not supported by mlxk run")
└─[6] else (text-only models):
└─ check_runtime_compatibility(probe):
├─ Legacy weight format (weights.*.safetensors)?
│ └─→ False ("Legacy format not supported by mlx-lm")
└─ mlx-lm doesn't support model_type?
└─→ False ("Model type '{x}' not supported")
If all gates pass → True (runtime_compatible)
```
**Gate Priority:**
| Priority | Gate | Checked For |
|----------|------|-------------|
| 1 | Health | All models |
| 2 | Framework | All models |
| 3 | Audio Backend | Audio-capable models |
| 4 | Vision Backend | Vision-capable models (non-audio) |
| 5 | Embeddings | Embedding models |
| 6 | Text/LLM | Text-only models |
**Implementation:** `build_model_object()` in `common.py:599-634`
### 2. No Silent Fallbacks
If a model requires a specific capability but the corresponding backend is unavailable, the system **must fail explicitly**. Do not degrade to a lower-capability mode.
+1 -1
View File
@@ -7,4 +7,4 @@ import warnings
# Issue parity with 1.1.0 (Issue #22)
warnings.filterwarnings('ignore', message='urllib3 v2 only supports OpenSSL 1.1.1+')
__version__ = "2.0.4b9"
__version__ = "2.0.4b10"
+9
View File
@@ -0,0 +1,9 @@
"""Audio support module for mlxk2.
Contains workarounds for mlx-audio regressions, specifically:
- Whisper tokenizer (tiktoken-based) for Issue #479
"""
from .whisper_tokenizer import Tokenizer, get_tokenizer
__all__ = ["Tokenizer", "get_tokenizer"]
+442
View File
@@ -0,0 +1,442 @@
# Copyright 2023 Apple Inc.
# SPDX-License-Identifier: MIT
#
# Whisper tokenizer implementation from mlx-audio commit 9349644 (2026-01-26).
# This code was removed in PyPI release 0.3.1 (2026-01-29), breaking Whisper
# transcription for models without HuggingFace processor.
#
# Bundled here as workaround for mlx-audio Issue #479.
# See: https://github.com/Blaizzy/mlx-audio/issues/479
#
# Original source:
# https://github.com/Blaizzy/mlx-audio/blob/9349644/mlx_audio/stt/models/whisper/tokenizer.py
from __future__ import annotations
import base64
import string
from dataclasses import dataclass, field
from functools import cached_property, lru_cache
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import tiktoken
# Try to import LANGUAGES from mlx-audio (still present in 0.3.1)
# Fall back to our own definition if import fails
try:
from mlx_audio.stt.models.whisper.tokenizer import LANGUAGES, TO_LANGUAGE_CODE
except ImportError:
# Fallback: Full LANGUAGES dict from mlx-audio
LANGUAGES = {
"en": "english",
"zh": "chinese",
"de": "german",
"es": "spanish",
"ru": "russian",
"ko": "korean",
"fr": "french",
"ja": "japanese",
"pt": "portuguese",
"tr": "turkish",
"pl": "polish",
"ca": "catalan",
"nl": "dutch",
"ar": "arabic",
"sv": "swedish",
"it": "italian",
"id": "indonesian",
"hi": "hindi",
"fi": "finnish",
"vi": "vietnamese",
"he": "hebrew",
"uk": "ukrainian",
"el": "greek",
"ms": "malay",
"cs": "czech",
"ro": "romanian",
"da": "danish",
"hu": "hungarian",
"ta": "tamil",
"no": "norwegian",
"th": "thai",
"ur": "urdu",
"hr": "croatian",
"bg": "bulgarian",
"lt": "lithuanian",
"la": "latin",
"mi": "maori",
"ml": "malayalam",
"cy": "welsh",
"sk": "slovak",
"te": "telugu",
"fa": "persian",
"lv": "latvian",
"bn": "bengali",
"sr": "serbian",
"az": "azerbaijani",
"sl": "slovenian",
"kn": "kannada",
"et": "estonian",
"mk": "macedonian",
"br": "breton",
"eu": "basque",
"is": "icelandic",
"hy": "armenian",
"ne": "nepali",
"mn": "mongolian",
"bs": "bosnian",
"kk": "kazakh",
"sq": "albanian",
"sw": "swahili",
"gl": "galician",
"mr": "marathi",
"pa": "punjabi",
"si": "sinhala",
"km": "khmer",
"sn": "shona",
"yo": "yoruba",
"so": "somali",
"af": "afrikaans",
"oc": "occitan",
"ka": "georgian",
"be": "belarusian",
"tg": "tajik",
"sd": "sindhi",
"gu": "gujarati",
"am": "amharic",
"yi": "yiddish",
"lo": "lao",
"uz": "uzbek",
"fo": "faroese",
"ht": "haitian creole",
"ps": "pashto",
"tk": "turkmen",
"nn": "nynorsk",
"mt": "maltese",
"sa": "sanskrit",
"lb": "luxembourgish",
"my": "myanmar",
"bo": "tibetan",
"tl": "tagalog",
"mg": "malagasy",
"as": "assamese",
"tt": "tatar",
"haw": "hawaiian",
"ln": "lingala",
"ha": "hausa",
"ba": "bashkir",
"jw": "javanese",
"su": "sundanese",
"yue": "cantonese",
}
TO_LANGUAGE_CODE = {
**{language: code for code, language in LANGUAGES.items()},
"burmese": "my",
"valencian": "ca",
"flemish": "nl",
"haitian": "ht",
"letzeburgesch": "lb",
"pushto": "ps",
"panjabi": "pa",
"moldavian": "ro",
"moldovan": "ro",
"sinhalese": "si",
"castilian": "es",
"mandarin": "zh",
}
# Path to bundled tiktoken assets in mlxk2
_ASSETS_DIR = Path(__file__).parent.parent / "assets" / "whisper"
@dataclass
class Tokenizer:
"""A thin wrapper around `tiktoken` providing quick access to special tokens."""
encoding: tiktoken.Encoding
num_languages: int
language: Optional[str] = None
task: Optional[str] = None
sot_sequence: Tuple[int, ...] = ()
special_tokens: Dict[str, int] = field(default_factory=dict)
def __post_init__(self):
for special in self.encoding.special_tokens_set:
special_token = self.encoding.encode_single_token(special)
self.special_tokens[special] = special_token
sot: int = self.special_tokens["<|startoftranscript|>"]
translate: int = self.special_tokens["<|translate|>"]
transcribe: int = self.special_tokens["<|transcribe|>"]
langs = tuple(LANGUAGES.keys())[: self.num_languages]
sot_sequence = [sot]
if self.language is not None:
sot_sequence.append(sot + 1 + langs.index(self.language))
if self.task is not None:
task_token: int = transcribe if self.task == "transcribe" else translate
sot_sequence.append(task_token)
self.sot_sequence = tuple(sot_sequence)
def encode(self, text, **kwargs):
return self.encoding.encode(text, **kwargs)
def decode(self, token_ids: List[int], **kwargs) -> str:
token_ids = [t for t in token_ids if t < self.timestamp_begin]
return self.encoding.decode(token_ids, **kwargs)
def decode_with_timestamps(self, token_ids: List[int], **kwargs) -> str:
"""Timestamp tokens are above other special tokens' id range and are ignored by decode().
This method decodes given tokens with timestamps tokens annotated, e.g. "<|1.08|>".
"""
return self.encoding.decode(token_ids, **kwargs)
@cached_property
def eot(self) -> int:
return self.encoding.eot_token
@cached_property
def transcribe(self) -> int:
return self.special_tokens["<|transcribe|>"]
@cached_property
def translate(self) -> int:
return self.special_tokens["<|translate|>"]
@cached_property
def sot(self) -> int:
return self.special_tokens["<|startoftranscript|>"]
@cached_property
def sot_lm(self) -> int:
return self.special_tokens["<|startoflm|>"]
@cached_property
def sot_prev(self) -> int:
return self.special_tokens["<|startofprev|>"]
@cached_property
def no_speech(self) -> int:
return self.special_tokens["<|nospeech|>"]
@cached_property
def no_timestamps(self) -> int:
return self.special_tokens["<|notimestamps|>"]
@cached_property
def timestamp_begin(self) -> int:
return self.special_tokens["<|0.00|>"]
@cached_property
def language_token(self) -> int:
"""Returns the token id corresponding to the value of the `language` field."""
if self.language is None:
raise ValueError("This tokenizer does not have language token configured")
return self.to_language_token(self.language)
def to_language_token(self, language):
if token := self.special_tokens.get(f"<|{language}|>", None):
return token
raise KeyError(f"Language {language} not found in tokenizer.")
@cached_property
def all_language_tokens(self) -> Tuple[int, ...]:
result = []
for token, token_id in self.special_tokens.items():
if token.strip("<|>") in LANGUAGES:
result.append(token_id)
return tuple(result)[: self.num_languages]
@cached_property
def all_language_codes(self) -> Tuple[str, ...]:
return tuple(self.decode([_l]).strip("<|>") for _l in self.all_language_tokens)
@cached_property
def sot_sequence_including_notimestamps(self) -> Tuple[int, ...]:
return tuple(list(self.sot_sequence) + [self.no_timestamps])
@cached_property
def non_speech_tokens(self) -> Tuple[int, ...]:
"""Returns the list of tokens to suppress in order to avoid any speaker tags or
non-speech annotations, to prevent sampling texts that are not actually spoken
in the audio, e.g.
- (SPEAKING FOREIGN LANGUAGE)
- [DAVID] Hey there,
keeping basic punctuations like commas, periods, question marks, etc.
"""
symbols = list('"#()*+/:;<=>@[\\]^_`{|}~')
symbols += (
"<< >> <<< >>> -- --- -( -[ (' (\" (( )) ((( ))) [[ ]] {{ }} ".split()
)
# symbols that may be a single token or multiple tokens depending on the tokenizer.
# In case they're multiple tokens, suppress the first token, which is safe because:
# These are between U+2640 and U+267F miscellaneous symbols that are okay to suppress
# in generations, and in the 3-byte UTF-8 representation they share the first two bytes.
miscellaneous = set("\u2669\u266a\u266b\u266c\u266d\u266e\u266f")
assert all(0x2640 <= ord(c) <= 0x267F for c in miscellaneous)
# allow hyphens "-" and single quotes "'" between words, but not at the beginning
result = {self.encoding.encode(" -")[0], self.encoding.encode(" '")[0]}
for symbol in symbols + list(miscellaneous):
for tokens in [
self.encoding.encode(symbol),
self.encoding.encode(" " + symbol),
]:
if len(tokens) == 1 or symbol in miscellaneous:
result.add(tokens[0])
return tuple(sorted(result))
def split_to_word_tokens(self, tokens: List[int]):
if self.language in {"zh", "ja", "th", "lo", "my", "yue"}:
# These languages don't typically use spaces, so it is difficult to split words
# without morpheme analysis. Here, we instead split words at any
# position where the tokens are decoded as valid unicode points
return self.split_tokens_on_unicode(tokens)
return self.split_tokens_on_spaces(tokens)
def split_tokens_on_unicode(self, tokens: List[int]):
decoded_full = self.decode_with_timestamps(tokens)
replacement_char = "\ufffd"
words = []
word_tokens = []
current_tokens = []
unicode_offset = 0
for token in tokens:
current_tokens.append(token)
decoded = self.decode_with_timestamps(current_tokens)
if (
replacement_char not in decoded
or decoded_full[unicode_offset + decoded.index(replacement_char)]
== replacement_char
):
words.append(decoded)
word_tokens.append(current_tokens)
current_tokens = []
unicode_offset += len(decoded)
return words, word_tokens
def split_tokens_on_spaces(self, tokens: List[int]):
subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
words = []
word_tokens = []
for subword, subword_tokens in zip(subwords, subword_tokens_list):
special = subword_tokens[0] >= self.eot
with_space = subword.startswith(" ")
punctuation = subword.strip() in string.punctuation
if special or with_space or punctuation or len(words) == 0:
words.append(subword)
word_tokens.append(subword_tokens)
else:
words[-1] = words[-1] + subword
word_tokens[-1].extend(subword_tokens)
return words, word_tokens
@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99) -> tiktoken.Encoding:
"""Load tiktoken encoding from bundled assets.
Uses assets from mlxk2/assets/whisper/ instead of mlx-audio's removed assets.
"""
vocab_path = _ASSETS_DIR / f"{name}.tiktoken"
if not vocab_path.exists():
raise FileNotFoundError(
f"Tiktoken vocabulary file not found: {vocab_path}\n"
f"This is an mlx-audio Issue #479 workaround.\n"
f"Expected assets in: {_ASSETS_DIR}"
)
with open(vocab_path) as fid:
ranks = {
base64.b64decode(token): int(rank)
for token, rank in (line.split() for line in fid if line)
}
n_vocab = len(ranks)
special_tokens = {}
specials = [
"<|endoftext|>",
"<|startoftranscript|>",
*[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
"<|translate|>",
"<|transcribe|>",
"<|startoflm|>",
"<|startofprev|>",
"<|nospeech|>",
"<|notimestamps|>",
*[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
]
for token in specials:
special_tokens[token] = n_vocab
n_vocab += 1
return tiktoken.Encoding(
name=vocab_path.name,
explicit_n_vocab=n_vocab,
pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
mergeable_ranks=ranks,
special_tokens=special_tokens,
)
@lru_cache(maxsize=None)
def get_tokenizer(
multilingual: bool,
*,
num_languages: int = 99,
language: Optional[str] = None,
task: Optional[str] = None, # Literal["transcribe", "translate", None]
) -> Tokenizer:
"""Get a Whisper tokenizer.
Args:
multilingual: Whether to use multilingual tokenizer (True for most models)
num_languages: Number of languages supported (default 99)
language: Language code (e.g., 'en', 'de') or None for auto-detect
task: 'transcribe' or 'translate' or None
Returns:
Tokenizer instance
"""
if language is not None:
language = language.lower()
if language not in LANGUAGES:
if language in TO_LANGUAGE_CODE:
language = TO_LANGUAGE_CODE[language]
else:
raise ValueError(f"Unsupported language: {language}")
if multilingual:
encoding_name = "multilingual"
language = language or "en"
task = task or "transcribe"
else:
encoding_name = "gpt2"
language = None
task = None
encoding = get_encoding(name=encoding_name, num_languages=num_languages)
return Tokenizer(
encoding=encoding, num_languages=num_languages, language=language, task=task
)
+61
View File
@@ -105,6 +105,67 @@ def _apply_tiktoken_patch():
_apply_tiktoken_patch()
# ============================================================================
# CRITICAL: Patch Model.get_tokenizer() to use our tiktoken tokenizer
# ============================================================================
# mlx-audio 0.3.1 (PyPI) removed the get_tokenizer() function that creates
# tiktoken-based tokenizers. The Model.get_tokenizer() method now throws
# ValueError if no HuggingFace processor is available.
#
# We patch Model.get_tokenizer() to fall back to our bundled tokenizer
# (mlxk2.audio.whisper_tokenizer) when no processor is available.
#
# This MUST happen at module import time, before any Whisper model is loaded!
# ============================================================================
def _apply_whisper_tokenizer_patch():
"""Patch Model.get_tokenizer to fall back to our tiktoken tokenizer."""
try:
from mlx_audio.stt.models.whisper.whisper import Model as WhisperModel
from mlxk2.audio.whisper_tokenizer import get_tokenizer
# Store original method (if it exists and isn't already patched)
if hasattr(WhisperModel, "_mlxk_original_get_tokenizer"):
# Already patched
return
original_get_tokenizer = WhisperModel.get_tokenizer
def patched_get_tokenizer(self, language=None, task="transcribe"):
"""Patched get_tokenizer with tiktoken fallback.
First tries the original method (uses HF Processor if available).
Falls back to our bundled tiktoken-based tokenizer on failure.
"""
# Try original first (uses HF Processor if available)
if hasattr(self, "_processor") and self._processor is not None:
try:
return original_get_tokenizer(self, language, task)
except Exception:
# HF Processor failed, fall through to tiktoken
pass
# Fallback to our tiktoken-based tokenizer
return get_tokenizer(
self.is_multilingual,
num_languages=getattr(self, "num_languages", 99),
language=language,
task=task,
)
# Apply patch
WhisperModel.get_tokenizer = patched_get_tokenizer
WhisperModel._mlxk_original_get_tokenizer = original_get_tokenizer
except ImportError:
# mlx-audio not installed - skip patching
pass
# Apply Whisper tokenizer patch immediately at module import
_apply_whisper_tokenizer_patch()
class AudioRunner:
"""Wrapper around mlx-audio STT API for dedicated transcription models.
+84 -8
View File
@@ -426,25 +426,59 @@ def detect_capabilities(
return caps
def vision_runtime_compatibility() -> tuple[bool, Optional[str]]:
"""Vision uses mlx-vlm backend; mark compatible only if available."""
def vision_runtime_compatibility(probe: Optional[Path] = None) -> tuple[bool, Optional[str]]:
"""Vision uses mlx-vlm backend; mark compatible only if available.
Args:
probe: Optional path to model snapshot for video processor detection
Returns:
(is_compatible, reason): reason is None if compatible
"""
if sys.version_info < (3, 10):
return False, "Vision requires Python 3.10+ (mlx-vlm dependency)"
spec = importlib.util.find_spec("mlx_vlm")
if spec is None:
return False, "mlx-vlm not installed (install extras: vision)"
# Gate 3: Check for transformers 5.x video_processor bug
# transformers 5.0.x RC has a bug where video_processor_class_from_name()
# fails with "argument of type 'NoneType' is not iterable" for models
# with temporal_patch_size (video-capable models like Qwen2-VL)
if probe is not None:
try:
import transformers
tf_version = getattr(transformers, "__version__", "0.0.0")
# Check if transformers 5.x (RC or early release with potential bugs)
if tf_version.startswith("5."):
preprocessor_path = probe / "preprocessor_config.json"
if preprocessor_path.exists():
preproc_data = _json.loads(preprocessor_path.read_text(encoding="utf-8", errors="ignore"))
if isinstance(preproc_data, dict) and "temporal_patch_size" in preproc_data:
return False, f"Video processor bug in transformers {tf_version} (use transformers<5.0 or wait for fix)"
except Exception:
pass # If check fails, proceed (may still work)
return True, None
def audio_runtime_compatibility(backend: Backend) -> tuple[bool, Optional[str]]:
def audio_runtime_compatibility(
backend: Backend,
probe: Optional[Path] = None,
framework: str = "MLX"
) -> tuple[bool, Optional[str]]:
"""Audio runtime check based on backend (ADR-020).
Args:
backend: Backend.MLX_AUDIO (Whisper/Voxtral) or Backend.MLX_VLM (Gemma-3n)
backend: Backend.MLX_AUDIO (Whisper/Voxtral) or Backend.MLX_VLM (Gemma-3n, Qwen3-Omni)
probe: Path to model snapshot (required for MLX_VLM model_type check)
framework: Framework string (default "MLX")
Returns:
(is_compatible, reason): reason is None if compatible
"""
from .health import check_runtime_compatibility
if sys.version_info < (3, 10):
return False, "Audio requires Python 3.10+"
@@ -453,10 +487,47 @@ def audio_runtime_compatibility(backend: Backend) -> tuple[bool, Optional[str]]:
spec = importlib.util.find_spec("mlx_audio")
if spec is None:
return False, "mlx-audio not installed (pip install mlx-knife[audio])"
# Gate 2: model_type must be supported by mlx-audio (Whisper, Voxtral only)
# This catches mis-routed models like Qwen3-Omni that have WhisperFeatureExtractor
# but are NOT STT models (they're multimodal with unsupported architecture)
if probe is not None:
config_path = probe / "config.json"
if config_path.exists():
try:
config = _json.loads(config_path.read_text(encoding="utf-8", errors="ignore"))
model_type = config.get("model_type", "")
if isinstance(model_type, str):
model_type_lower = model_type.lower()
# Check if model_type is a known STT type
if not any(stt in model_type_lower for stt in ["whisper", "voxtral"]):
return False, f"Model type '{model_type}' not supported by mlx-audio (only Whisper/Voxtral)"
except Exception:
pass # If config can't be read, proceed (health check will catch it)
# Gate 3: Check for Voxtral tekken.json tokenizer bug (mlx-audio#450)
# Voxtral uses tekken.json (Mistral tokenizer format) which mlx-audio can't convert properly
tekken_path = probe / "tekken.json"
tokenizer_path = probe / "tokenizer.json"
if tekken_path.exists() and not tokenizer_path.exists():
return False, "Voxtral tekken.json tokenizer not supported (mlx-audio#450, upstream fix pending)"
return True, None
elif backend == Backend.MLX_VLM:
# Multimodal audio (Gemma-3n) needs mlx-vlm
return vision_runtime_compatibility()
# Multimodal audio (Gemma-3n, Qwen3-Omni) needs mlx-vlm
# Gate 1: mlx-vlm must be available (pass probe for video_processor bug check)
vlm_ok, vlm_reason = vision_runtime_compatibility(probe)
if not vlm_ok:
return vlm_ok, vlm_reason
# Gate 2: mlx-lm must support model_type (text-only fallback mode)
# This catches unsupported model_types like "qwen3_omni_moe"
if probe is not None:
text_ok, text_reason = check_runtime_compatibility(probe, framework)
if not text_ok:
return text_ok, text_reason
return True, None
else:
return False, "Unknown audio backend"
@@ -546,11 +617,12 @@ def build_model_object(hf_name: str, model_root: Path, selected_path: Optional[P
runtime_reason = f"Incompatible framework: {framework}"
elif has_audio and audio_backend is not None:
# Audio models: check based on backend (ADR-020)
runtime_compatible, runtime_reason = audio_runtime_compatibility(audio_backend)
runtime_compatible, runtime_reason = audio_runtime_compatibility(audio_backend, probe, framework)
elif has_vision:
# Vision models: check BOTH backends for full chat+vision support
# 1. mlx-vlm must be available (vision mode with images)
vision_ok, vision_reason = vision_runtime_compatibility()
# Pass probe for transformers 5.x video_processor bug detection
vision_ok, vision_reason = vision_runtime_compatibility(probe)
# 2. mlx-lm must support model_type (text-only mode without images)
text_ok, text_reason = check_runtime_compatibility(probe, framework)
@@ -561,6 +633,10 @@ def build_model_object(hf_name: str, model_root: Path, selected_path: Optional[P
runtime_compatible = False
# Prefer text_reason as it's more specific (model_type not supported)
runtime_reason = text_reason or vision_reason
elif Capability.EMBEDDINGS.value in capabilities:
# Embedding models: mlxk run doesn't support embeddings (future: mlxk embed)
runtime_compatible = False
runtime_reason = "Embedding models not supported by mlxk run (use mlxk embed)"
else:
runtime_compatible, runtime_reason = check_runtime_compatibility(probe, framework)
+2 -1
View File
@@ -382,7 +382,8 @@ def run_model(
return error_result
if is_vision_model:
compat, reason = vision_runtime_compatibility()
# Pass model_path for transformers 5.x video_processor bug detection
compat, reason = vision_runtime_compatibility(model_path)
if not compat:
error_msg = f"Model '{resolved_name}' is vision-capable but not runnable: {reason}"
error_result = f"Error: {error_msg}"
+8 -9
View File
@@ -66,19 +66,18 @@ dev = [
"mypy>=1.5.0",
]
vision = [
"mlx-vlm>=0.3.10", # Vision support (ADR-012)
"mlx-vlm==0.3.10", # Pinned: stable, tested version (ADR-012)
]
audio = [
# mlx-audio 0.3.1+ (tiktoken assets workaround bundled in mlxk2/assets/whisper/)
# See Issue #479: https://github.com/Blaizzy/mlx-audio/issues/479
"mlx-audio>=0.3.1",
"tiktoken>=0.7.0", # Required by bundled tiktoken assets
# mlx-audio pinned: 0.3.1 has tiktoken regression (Issue #479)
# We bundle complete tiktoken workaround in mlxk2/audio/whisper_tokenizer.py
"mlx-audio==0.3.1",
"tiktoken>=0.7.0", # Required for Whisper tiktoken fallback
]
all = [
"mlx-vlm>=0.3.10",
# mlx-audio 0.3.1+ (tiktoken assets workaround bundled in mlxk2/assets/whisper/)
"mlx-audio>=0.3.1",
"tiktoken>=0.7.0", # Required by bundled tiktoken assets
"mlx-vlm==0.3.10", # Pinned: stable, tested version
"mlx-audio==0.3.1", # Pinned: tiktoken regression patched by mlxk2
"tiktoken>=0.7.0",
]
[tool.setuptools]