ADR-019 — Audio Input Support (via mlx-vlm) [BETA.8 ARCHIVED]
Status: Replaced by ADR-020 (2026-01-27)
Target: 2.0.4-beta.8 (historical, implemented)
Depends on: mlx-vlm ≥0.3.10 (GitHub only, not yet on PyPI)
Replaced by: ADR-020 (Audio Backend Architecture)
NOTE: This ADR documents the Beta.8 mlx-vlm-only audio implementation (Gemma-3n). For current architecture with mlx-audio support and auto-routing (Beta.9+), see ADR-020.
Context
mlx-vlm documents audio input for Omni models in main branch (post-0.3.9):
- Gemma-3n (Google): Vision + Audio + Text — explicitly documented in README
Not Supported:
- Qwen3-Omni (Alibaba): Architecture qwen3_omni_moe not in mlx-lm/mlx-vlm
  - See: github.com/ml-explore/mlx-lm/issues/497
  - Community PR #574 is text-only (Audio-Tower removed)
These models use the same mlx-vlm backend as vision models. mlx-knife can add audio support with minimal effort.
Current State (2026-01-19):
- mlx-vlm 0.3.10 available on GitHub, not yet on PyPI
- pyproject.toml uses Git URL dependency (blocks PyPI release)
- Audio functionality verified with Gemma-3n
Decision
Implement audio input via mlx-vlm's native audio processing (Gemma-3n multimodal support).
Scope: CLI-first
| Command | Audio Support |
|---|---|
| list | Shows +audio in type field |
| show | Shows audio capability in details |
| run | --audio file.wav parameter |
| health | Checks audio-capable models |
| serve | Phase 4 - pending |
Out of Scope
- STT/Transcription (separate feature via mlx-audio-plus, see Brainstorm ADR)
- TTS/Audio output (Qwen3-Omni can do it, but not exposed)
- stdin audio (--audio -) - later
- Audio segments/timestamps (STT feature)
Design
CLI Interface
# Audio-only (transcription) - uses default prompt and temperature 0.2
mlxk run gemma-3n-E2B-it-4bit --audio voice.wav
# With explicit prompt
mlxk run gemma-3n-E2B-it-4bit --audio voice.wav --prompt "Transcribe exactly what is said"
# Lower temperature for maximum consistency
mlxk run gemma-3n-E2B-it-4bit --audio voice.wav --temperature 0
Capability Detection
# capabilities.py
AUDIO_MODEL_TYPES = frozenset({
"gemma3n", # Google Gemma 3n (verified in mlx-vlm README)
# "qwen3_omni_moe", # Alibaba Qwen3-Omni — NOT SUPPORTED (mlx-lm #497)
})
# Detection via config.json:
# - model_type in AUDIO_MODEL_TYPES
# - "audio_config" present
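The detection rule above can be sketched as a small function over a parsed config.json. This is a minimal illustration, not the project's actual code: the ADR names `detect_audio_capability()` but does not show its signature, so the signature here is an assumption, and so is the choice to require both signals (the ADR lists the two checks without saying how they combine). `load_config` is a hypothetical helper.

```python
import json
from pathlib import Path

# Mirrors the set shown above; only Gemma-3n is listed as verified.
AUDIO_MODEL_TYPES = frozenset({"gemma3n"})


def detect_audio_capability(config: dict) -> bool:
    """Return True when a parsed config.json looks audio-capable.

    Assumption: both signals are required, i.e. a known model_type
    AND an "audio_config" section.
    """
    return (
        config.get("model_type") in AUDIO_MODEL_TYPES
        and "audio_config" in config
    )


def load_config(model_dir: str) -> dict:
    """Read config.json from a local model directory (hypothetical helper)."""
    path = Path(model_dir) / "config.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

With this rule, a Gemma-3n checkpoint whose config.json lacks the audio section would be reported as not audio-capable, which seems the safer failure mode for a CLI that rejects `--audio` on unsupported models.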
list/show Output
$ mlxk list
NAME TYPE HEALTH SIZE
gemma-3n-E2B-it-4bit chat+vision+audio healthy 2.1GB
pixtral-12b-4bit chat+vision healthy 6.8GB
llama-3.2-3b-4bit chat healthy 1.8GB
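One way the TYPE column could be composed from capability flags is sketched below. The function name and the assumption that every listed model has a "chat" base type are illustrative only; the ADR shows the output format but not how it is produced.

```python
def format_type_field(is_vision: bool = False, is_audio: bool = False) -> str:
    """Join capability names with '+' in the order shown by `mlxk list`.

    Assumption: "chat" is always the base type for the models listed here.
    """
    parts = ["chat"]
    if is_vision:
        parts.append("vision")
    if is_audio:
        parts.append("audio")
    return "+".join(parts)
```

Keeping the order fixed (chat, vision, audio) means the audio suffix is appended the same way the vision suffix already is, which matches the "consistent with vision capability display" goal.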
Code Changes (Phase 1-2)
| File | Change |
|---|---|
| cli.py | --audio argument, context-aware temperature default |
| run.py | Audio passthrough to VisionRunner, capability checks |
| vision_runner.py | _prepare_audio(), audio= parameter, num_audios= |
| capabilities.py | AUDIO_MODEL_TYPES, Capability.AUDIO |
| common.py | detect_audio_capability() |
Implementation Phases
Phase 1: Capability Detection ✅
- AUDIO_MODEL_TYPES in capabilities.py
- is_audio detection in ModelCapabilities (via audio_config in config.json)
- list/show shows audio capability
- Consistent with vision capability display
Phase 2: CLI Run ✅
- --audio argument in cli.py
- VisionRunner audio passthrough
- Error handling (audio without audio-capable model → clear error)
- 5MB file size limit (~2-3 min at 16kHz mono; token count is the real constraint)
- Multi-audio blocked with clear error (mlx-vlm limitation)
- Context-aware temperature default (0.2 for audio, 0.7 for text/vision)
- Default prompt: "Transcribe what is spoken in this audio."
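The Phase 2 rules above (capability check, single file, 5MB limit, context-aware defaults) can be sketched as one validation step. This is an illustration under stated assumptions: `resolve_run_options` and `AudioInputError` are hypothetical names, and reading "5MB" as 5 × 1024 × 1024 bytes is a guess.

```python
from pathlib import Path
from typing import Optional, Sequence, Tuple

MAX_AUDIO_BYTES = 5 * 1024 * 1024  # 5MB limit; the binary interpretation is an assumption
DEFAULT_AUDIO_PROMPT = "Transcribe what is spoken in this audio."
AUDIO_TEMPERATURE = 0.2  # default when audio is attached
TEXT_TEMPERATURE = 0.7   # default for text/vision


class AudioInputError(ValueError):
    """Hypothetical error type for rejected audio input."""


def resolve_run_options(
    audio_paths: Sequence[str],
    prompt: Optional[str],
    temperature: Optional[float],
    model_has_audio: bool,
) -> Tuple[Optional[str], float]:
    """Validate audio arguments and fill in context-aware defaults."""
    if audio_paths:
        if not model_has_audio:
            raise AudioInputError("model does not support audio input")
        if len(audio_paths) > 1:
            raise AudioInputError("only one audio file per request is supported")
        size = Path(audio_paths[0]).stat().st_size
        if size > MAX_AUDIO_BYTES:
            raise AudioInputError(
                f"audio file is {size} bytes; limit is {MAX_AUDIO_BYTES}"
            )
        if prompt is None:
            prompt = DEFAULT_AUDIO_PROMPT
    if temperature is None:
        # An explicit 0.0 must survive, so test for None rather than falsiness.
        temperature = AUDIO_TEMPERATURE if audio_paths else TEXT_TEMPERATURE
    return prompt, temperature
```

Checking `temperature is None` rather than truthiness matters here because `--temperature 0` is a valid and recommended setting for transcription.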
Phase 3: Evaluation ✅
Tested with gemma-3n-E2B-it-4bit (mlx-community).
Findings:
| Question | Result |
|---|---|
| Empty prompt allowed? | Yes, but results poor (model confused) |
| Audio-only without image? | ✅ Works well |
| Text prompt required? | Not required, but recommended for quality |
| Multiple audio files? | ❌ Not supported (mlx-vlm token mismatch bug) |
| Audio + Vision combined? | ❌ Audio silently ignored when images present (cause unclear: mlx-vlm or model) |
Temperature Impact:
| Temperature | Behavior |
|---|---|
| 0.7 (text default) | Multilingual drift (outputs Arabic/Hindi text after misreading "A man" as "Amen") |
| 0.2 (audio default) | Stable, good transcription quality |
| 0.0 | Most consistent, recommended for STT tasks |
Implemented Defaults:
- Audio temperature: 0.2 (vs 0.7 for text/vision)
- Audio prompt: "Transcribe what is spoken in this audio."
Known Limitations:
- Phonetic errors observed ("A man" → "Amen"); cause unclear (model limitation vs quantization)
- Multi-audio causes token mismatch error in mlx-vlm
- Audio+Vision combined: audio silently ignored (not blocked by mlx-knife to allow future compatibility)
- WAV format confirmed; other formats untested
Audio Duration Limit (~30 seconds):
Gemma-3n uses a fixed 188 soft tokens per audio clip, regardless of audio length:
- Encoding rate: 6.25 tokens/second (from model spec)
- Maximum duration: 188 ÷ 6.25 = 30.08 seconds (~30 s)
| Audio Duration | Transcribed | Notes |
|---|---|---|
| ≤29 seconds | Full | ✅ Reliable |
| 30-35 seconds | ~29 seconds | Missing last few seconds |
| 43 seconds | ~29 seconds | Only first 2/3 transcribed |
Cause: Model architecture constraint in Gemma-3n. Audio is encoded to a fixed-size representation before being passed to the language model. Content beyond ~30 seconds is silently dropped during encoding.
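The arithmetic behind the ~30 second limit, and why it binds long before the 5MB file limit does, can be checked with a few lines. The helper names are hypothetical, and `wav_seconds` assumes 16-bit PCM and ignores the WAV header; neither detail is stated in this ADR.

```python
AUDIO_SOFT_TOKENS = 188    # fixed soft tokens per clip (Gemma3nProcessor.audio_seq_length)
TOKENS_PER_SECOND = 6.25   # encoding rate from the model spec

# 188 / 6.25 = 30.08 seconds of audio fit into the fixed representation.
MAX_AUDIO_SECONDS = AUDIO_SOFT_TOKENS / TOKENS_PER_SECOND


def transcribed_fraction(duration_s: float) -> float:
    """Upper bound on the share of a clip that can be transcribed."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return min(1.0, MAX_AUDIO_SECONDS / duration_s)


def wav_seconds(num_bytes: int, sample_rate: int = 16_000,
                channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Approximate duration of raw PCM data (assumes 16-bit samples, no header)."""
    return num_bytes / (sample_rate * channels * bytes_per_sample)
```

For the 43 second clip in the table, `transcribed_fraction(43)` comes out at about 0.70, in line with the roughly two-thirds observed. A 5MB file at 16 kHz mono holds well over two minutes of audio under these assumptions, so the token budget is the limit that actually applies.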
Sources:
- Gemma3nProcessor.audio_seq_length = 188 (transformers 5.0+)
- HuggingFace Gemma3n Docs
- Empirical testing with Gemma-3n
Phase 4: Server API (Pending)
Note: This design may evolve based on client ecosystem developments.
Goal: OpenAI-compatible audio input for /v1/chat/completions
OpenAI Standard Format (input_audio):
{
"model": "gemma-3n-E2B-it-4bit",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio"},
{
"type": "input_audio",
"input_audio": {
"data": "<base64-encoded-wav>",
"format": "wav"
}
}
]
}]
}
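A client could assemble the request body above from raw WAV bytes roughly as follows. This is a sketch using only the standard library; `build_audio_chat_request` is a hypothetical helper name, and sending the payload to the server is left out because the endpoint is still pending in Phase 4.

```python
import base64
import json


def build_audio_chat_request(wav_bytes: bytes, model: str, prompt: str) -> dict:
    """Build a /v1/chat/completions body with one input_audio content block."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "input_audio",
                    "input_audio": {"data": encoded, "format": "wav"},
                },
            ],
        }],
    }


def to_json(payload: dict) -> str:
    """Serialize the request body for an HTTP POST."""
    return json.dumps(payload)
```

Usage would look like `build_audio_chat_request(open("voice.wav", "rb").read(), "gemma-3n-E2B-it-4bit", "Transcribe this audio")`, with the result passed to any HTTP client as JSON.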
Two Approaches for Audio Input:
| Approach | How it works | Client Support |
|---|---|---|
| STT → LLM | Audio → separate Whisper service → Text → LLM | Open WebUI, many clients |
| Native Audio | Audio directly to LLM (GPT-4o-audio, Gemma-3n) | New, limited client support |
mlx-knife uses native audio (Gemma-3n understands audio directly).
Most clients (e.g. Open WebUI) expect a separate STT service rather than input_audio content blocks.
WebUI Integration (nChat):
Browser provides native APIs for microphone access:
// Microphone access (await requires an async function or an ES module)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
// Recording
const mediaRecorder = new MediaRecorder(stream)
mediaRecorder.ondataavailable = (e) => { /* e.data is the recorded audio Blob */ }
Workflow: Record → Convert to WAV (JS library) → Base64 → JSON input_audio
Implementation TODO:
- Parse input_audio content blocks in server_base.py
- Base64 decode → temp file → VisionRunner
- 5MB limit enforcement (base64 ~33% overhead)
- Error handling for unsupported formats
- Live tests: test_audio_server_e2e.py
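The first four TODO items could be sketched as a single extraction step on the server side. Nothing below is the planned implementation: `extract_input_audio` is a hypothetical name, the binary reading of "5MB" is an assumption, and handing the resulting paths to VisionRunner is omitted.

```python
import base64
import binascii
import tempfile

MAX_AUDIO_BYTES = 5 * 1024 * 1024                    # decoded limit (assumed binary MB)
MAX_BASE64_CHARS = 4 * ((MAX_AUDIO_BYTES + 2) // 3)  # same limit after ~33% base64 growth
SUPPORTED_FORMATS = {"wav"}                          # only WAV is confirmed in this ADR


def extract_input_audio(messages: list) -> list:
    """Decode every input_audio block to a temp file and return the paths.

    The caller is responsible for deleting the files after inference.
    """
    paths = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content carries no audio
        for block in content:
            if not isinstance(block, dict) or block.get("type") != "input_audio":
                continue
            audio = block.get("input_audio") or {}
            fmt = audio.get("format")
            if fmt not in SUPPORTED_FORMATS:
                raise ValueError(f"unsupported audio format: {fmt!r}")
            data = audio.get("data", "")
            if len(data) > MAX_BASE64_CHARS:
                # Cheap pre-check on the encoded size before decoding anything.
                raise ValueError("audio exceeds the 5MB limit")
            try:
                raw = base64.b64decode(data, validate=True)
            except binascii.Error as exc:
                raise ValueError("input_audio.data is not valid base64") from exc
            if len(raw) > MAX_AUDIO_BYTES:
                raise ValueError("audio exceeds the 5MB limit")
            with tempfile.NamedTemporaryFile(suffix=f".{fmt}", delete=False) as tmp:
                tmp.write(raw)
            paths.append(tmp.name)
    return paths
```

Because multi-audio is blocked in the CLI, a server handler built on this would presumably reject requests where more than one path comes back, mirroring the Phase 2 behavior.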
Risks
| Risk | Mitigation |
|---|---|
| mlx-vlm 0.3.10 not on PyPI | Git dependency in pyproject.toml; blocks mlx-knife PyPI release |
| mlx-vlm API changes | Thin wrapper, quickly adaptable |
| Audio format restrictions | Documented: WAV confirmed, others untested |
| Memory requirements unclear | Document in README, profile before server support |
| Gemma-3n index/shard issue | mlxk convert --repair-index before use |
| Qwen3-Omni not supported | Only Gemma-3n as target, Qwen3-Omni out-of-scope |
| Phonetic errors | Document limitation, recommend temperature 0 |
References
- mlx-vlm README (main): Multi-Modal Example (Image + Audio)
- mlx-lm Issue #497: Qwen3-Omni Support Request (open since Sep 2025)
- mlx-lm PR #574: qwen3_omni_moe Text-only (Audio-Tower removed)
- Gemma-3n: mlx-community/gemma-3n-E2B-it-4bit