mlx-knife/docs/ADR/ADR-019-Audio-Input-Support-beta8.md
The BROKE Cluster Team bf7480d042 Release 2.0.4-beta.9: Audio transcription via mlx-audio
Major Features:
- Audio transcription via mlx-audio backend (Whisper, >10min duration)
- OpenAI /v1/audio/transcriptions endpoint
- Memory Gate System (Vision: 8GB, Audio: 4GB)
- Config-based backend routing (ADR-020)
- Benchmark toolchain (memmon/memplot, Schema v0.2.2)

Key Fixes:
- EuroLLM tokenizer decoding
- Vision-model text-only routing regression
- Multimodal model context length detection
- Memory cleanup bug (mx.metal.clear_cache)
- Orphan process bug

Test Results:
- Unit tests: 647 passed, 11 skipped (Python 3.10-3.12)
- wet-umbrella: 171 passed total

See CHANGELOG.md for complete details and known issues.
2026-02-04 03:10:30 +01:00


ADR-019 — Audio Input Support (via mlx-vlm) [BETA.8 ARCHIVED]

Status: Replaced by ADR-020 (2026-01-27)
Target: 2.0.4-beta.8 (historical, implemented)
Depends on: mlx-vlm ≥0.3.10 (GitHub only, not yet on PyPI)
Replaced by: ADR-020 — Audio Backend Architecture


NOTE: This ADR documents the Beta.8 mlx-vlm-only audio implementation (Gemma-3n). For current architecture with mlx-audio support and auto-routing (Beta.9+), see ADR-020.


Context

The mlx-vlm main branch (post-0.3.9) documents audio input for omni models:

  • Gemma-3n (Google): Vision + Audio + Text — explicitly documented in README

Not Supported:

  • Qwen3-Omni (Alibaba): Architecture qwen3_omni_moe not in mlx-lm/mlx-vlm
    • See: github.com/ml-explore/mlx-lm/issues/497
    • Community PR #574 is text-only (Audio-Tower removed)

Audio-capable models such as Gemma-3n run on the same mlx-vlm backend as vision models, so mlx-knife can add audio support by extending its existing vision path instead of integrating a new backend.

Current State (2026-01-19):

  • mlx-vlm 0.3.10 available on GitHub, not yet on PyPI
  • pyproject.toml uses Git URL dependency (blocks PyPI release)
  • Audio functionality verified with Gemma-3n

Decision

Implement audio input via mlx-vlm's native audio processing (Gemma-3n multimodal support).

Scope: CLI-first

| Command | Audio Support |
|---------|---------------|
| list | Shows +audio in type field |
| show | Shows audio capability in details |
| run | --audio file.wav parameter |
| health | Checks audio-capable models |
| serve | Phase 4 - pending |

Out of Scope

  • STT/Transcription (separate feature via mlx-audio-plus, see Brainstorm ADR)
  • TTS/Audio output (Qwen3-Omni can do it, but not exposed)
  • stdin audio (--audio -) - later
  • Audio segments/timestamps (STT feature)

Design

CLI Interface

# Audio-only (transcription) - uses default prompt and temperature 0.2
mlxk run gemma-3n-E2B-it-4bit --audio voice.wav

# With explicit prompt
mlxk run gemma-3n-E2B-it-4bit --audio voice.wav --prompt "Transcribe exactly what is said"

# Lower temperature for maximum consistency
mlxk run gemma-3n-E2B-it-4bit --audio voice.wav --temperature 0

Capability Detection

# capabilities.py
AUDIO_MODEL_TYPES = frozenset({
    "gemma3n",        # Google Gemma 3n (verified in mlx-vlm README)
    # "qwen3_omni_moe", # Alibaba Qwen3-Omni — NOT SUPPORTED (mlx-lm #497)
})

# Detection via config.json:
# - model_type in AUDIO_MODEL_TYPES
# - "audio_config" present
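
As a minimal sketch of the detection rule described in the comments above, the following reads a parsed config.json and checks both signals. The function names `has_audio_capability` and `has_audio_capability_from_dir` are hypothetical, and requiring both conditions at once is an assumption; the actual `detect_audio_capability()` in common.py may combine them differently.

```python
import json
from pathlib import Path

# Mirrors the set shown above; only Gemma-3n is listed as verified.
AUDIO_MODEL_TYPES = frozenset({"gemma3n"})


def has_audio_capability(config: dict) -> bool:
    """Return True when a parsed config.json describes an audio-capable model.

    Assumption: both signals (known model_type and an audio_config section)
    must be present.
    """
    return (
        config.get("model_type") in AUDIO_MODEL_TYPES
        and "audio_config" in config
    )


def has_audio_capability_from_dir(model_dir: str) -> bool:
    """Load <model_dir>/config.json and apply the check; missing file means no audio."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return False
    with config_path.open(encoding="utf-8") as fh:
        return has_audio_capability(json.load(fh))
```

Keeping the check on a plain dict makes it easy to unit-test without a model on disk.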

list/show Output

$ mlxk list
NAME                      TYPE                 HEALTH   SIZE
gemma-3n-E2B-it-4bit     chat+vision+audio    healthy  2.1GB
pixtral-12b-4bit         chat+vision          healthy  6.8GB
llama-3.2-3b-4bit        chat                 healthy  1.8GB

Code Changes (Phase 1-2)

| File | Change |
|------|--------|
| cli.py | --audio argument, context-aware temperature default |
| run.py | Audio passthrough to VisionRunner, capability checks |
| vision_runner.py | _prepare_audio(), audio= parameter, num_audios= |
| capabilities.py | AUDIO_MODEL_TYPES, Capability.AUDIO |
| common.py | detect_audio_capability() |

Implementation Phases

Phase 1: Capability Detection

  • AUDIO_MODEL_TYPES in capabilities.py
  • is_audio detection in ModelCapabilities (via audio_config in config.json)
  • list/show shows audio capability
  • Consistent with vision capability display

Phase 2: CLI Run

  • --audio argument in cli.py
  • VisionRunner audio passthrough
  • Error handling (audio without audio-capable model → clear error)
  • 5MB file size limit (~2-3 min at 16kHz mono; token count is the real constraint)
  • Multi-audio blocked with clear error (mlx-vlm limitation)
  • Context-aware temperature default (0.2 for audio, 0.7 for text/vision)
  • Default prompt: "Transcribe what is spoken in this audio."
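
The Phase 2 rules above (single audio file, size limit, context-aware defaults) could be sketched roughly as follows. `resolve_run_options` and the constant names are hypothetical and not taken from cli.py or run.py, and reading "5MB" as 5 MiB is an assumption.

```python
import os

AUDIO_DEFAULT_TEMPERATURE = 0.2
TEXT_DEFAULT_TEMPERATURE = 0.7
DEFAULT_AUDIO_PROMPT = "Transcribe what is spoken in this audio."
# Assumption: "5MB" is interpreted as 5 MiB.
# 5 MiB / (16,000 samples/s * 2 bytes) is about 164 s, assuming 16-bit mono PCM.
MAX_AUDIO_BYTES = 5 * 1024 * 1024


def resolve_run_options(audio_paths, prompt=None, temperature=None):
    """Validate audio arguments and fill in context-aware defaults.

    Returns a (prompt, temperature) tuple. Raises ValueError with a clear
    message when more than one audio file is given or a file is too large.
    """
    audio_paths = list(audio_paths or [])
    if len(audio_paths) > 1:
        raise ValueError("Only one audio file is supported per request.")
    for path in audio_paths:
        size = os.path.getsize(path)
        if size > MAX_AUDIO_BYTES:
            raise ValueError(
                f"Audio file is {size} bytes; the limit is {MAX_AUDIO_BYTES} bytes."
            )

    has_audio = bool(audio_paths)
    # Compare against None so an explicit temperature of 0 is respected.
    if temperature is None:
        temperature = (
            AUDIO_DEFAULT_TEMPERATURE if has_audio else TEXT_DEFAULT_TEMPERATURE
        )
    # Only a missing prompt is replaced; an explicit empty prompt is passed through.
    if prompt is None and has_audio:
        prompt = DEFAULT_AUDIO_PROMPT
    return prompt, temperature
```

Checking `temperature is None` rather than truthiness matters here because `--temperature 0` is a recommended setting for transcription.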

Phase 3: Evaluation

Tested with gemma-3n-E2B-it-4bit (mlx-community).

Findings:

| Question | Result |
|----------|--------|
| Empty prompt allowed? | Yes, but results poor (model confused) |
| Audio-only without image? | Works well |
| Text prompt required? | Not required, but recommended for quality |
| Multiple audio files? | Not supported (mlx-vlm token mismatch bug) |
| Audio + Vision combined? | Audio silently ignored when images present (cause unclear: mlx-vlm or model) |

Temperature Impact:

| Temperature | Behavior |
|-------------|----------|
| 0.7 (text default) | Multilingual drift (output switches to Arabic/Hindi after "A man" is misheard as "Amen") |
| 0.2 (audio default) | Stable, good transcription quality |
| 0.0 | Most consistent, recommended for STT tasks |

Implemented Defaults:

  • Audio temperature: 0.2 (vs 0.7 for text/vision)
  • Audio prompt: "Transcribe what is spoken in this audio."

Known Limitations:

  • Phonetic errors observed ("A man" → "Amen"); cause unclear (model limitation vs quantization)
  • Multi-audio causes token mismatch error in mlx-vlm
  • Audio+Vision combined: audio silently ignored (not blocked by mlx-knife to allow future compatibility)
  • WAV format confirmed; other formats untested

Audio Duration Limit (~30 seconds):

Gemma-3n uses a fixed 188 soft tokens per audio clip, regardless of audio length:

  • Encoding rate: 6.25 tokens/second (from model spec)
  • Maximum duration: 188 ÷ 6.25 = ~30.08 seconds

| Audio Duration | Transcribed | Notes |
|----------------|-------------|-------|
| ≤29 seconds | Full | Reliable |
| 30-35 seconds | ~29 seconds | Missing last few seconds |
| 43 seconds | ~29 seconds | Only first 2/3 transcribed |

Cause: Model architecture constraint in Gemma-3n. Audio is encoded to a fixed-size representation before being passed to the language model. Content beyond ~30 seconds is silently dropped during encoding.
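
Because the excess audio is dropped silently, a pre-flight check can warn users before a run. The sketch below derives the limit from the two figures above and measures a WAV file with the standard-library `wave` module. The helper names are hypothetical, and nothing in this ADR states that mlx-knife performs such a check.

```python
import wave

SOFT_TOKENS_PER_CLIP = 188   # fixed per clip, from the section above
TOKENS_PER_SECOND = 6.25     # encoding rate, from the section above
MAX_AUDIO_SECONDS = SOFT_TOKENS_PER_CLIP / TOKENS_PER_SECOND  # 30.08


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def truncation_warning(path: str):
    """Return a warning string if the clip exceeds the limit, else None."""
    duration = wav_duration_seconds(path)
    if duration <= MAX_AUDIO_SECONDS:
        return None
    return (
        f"Audio is {duration:.1f}s long; content beyond "
        f"~{MAX_AUDIO_SECONDS:.0f}s is likely to be dropped."
    )
```

Since the table above reports reliable results only up to about 29 seconds, a more conservative threshold may be a better fit in practice.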

Sources:

Phase 4: Server API (Pending)

Note: This design may evolve based on client ecosystem developments.

Goal: OpenAI-compatible audio input for /v1/chat/completions

OpenAI Standard Format (input_audio):

{
  "model": "gemma-3n-E2B-it-4bit",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Transcribe this audio"},
      {
        "type": "input_audio",
        "input_audio": {
          "data": "<base64-encoded-wav>",
          "format": "wav"
        }
      }
    ]
  }]
}
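
For illustration, a client could assemble this request body as sketched below. `build_audio_request` is a hypothetical helper, not part of mlx-knife or any client library; it only reproduces the JSON shape shown above.

```python
import base64
import json


def build_audio_request(wav_bytes: bytes, model: str, prompt: str) -> dict:
    """Build a chat-completions body with one text block and one input_audio block."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "input_audio",
                    "input_audio": {"data": encoded, "format": "wav"},
                },
            ],
        }],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for real WAV data in this demo.
    body = build_audio_request(b"RIFF", "gemma-3n-E2B-it-4bit", "Transcribe this audio")
    print(json.dumps(body, indent=2))
```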

Two Approaches for Audio Input:

| Approach | How it works | Client Support |
|----------|--------------|----------------|
| STT → LLM | Audio → separate Whisper service → Text → LLM | Open WebUI, many clients |
| Native Audio | Audio directly to LLM (GPT-4o-audio, Gemma-3n) | New, limited client support |

mlx-knife uses native audio (Gemma-3n understands audio directly). Most clients (e.g. Open WebUI) expect a separate STT service rather than input_audio content blocks.

WebUI Integration (nChat):

Browser provides native APIs for microphone access:

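// Microphone access (await requires an async function or an ES module)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })

// Recording
const mediaRecorder = new MediaRecorder(stream)
mediaRecorder.ondataavailable = (e) => { /* e.data is the recorded audio Blob */ }
mediaRecorder.start()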

Workflow: Record → Convert to WAV (JS library) → Base64 → JSON input_audio

Implementation TODO:

  • Parse input_audio content blocks in server_base.py
  • Base64 decode → temp file → VisionRunner
  • 5MB limit enforcement (base64 ~33% overhead)
  • Error handling for unsupported formats
  • Live tests: test_audio_server_e2e.py
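
The first four TODO items might look roughly like the sketch below. All names are hypothetical and not taken from server_base.py. Rejecting more than one audio block mirrors the CLI behaviour and is an assumption for the server, as is reading "5MB" as 5 MiB.

```python
import base64
import binascii
import tempfile

MAX_AUDIO_BYTES = 5 * 1024 * 1024          # assumption: 5 MB means 5 MiB
SUPPORTED_FORMATS = frozenset({"wav"})     # only WAV is confirmed in this ADR


def extract_input_audio(messages):
    """Return (raw_bytes, fmt) for the single input_audio block, or None."""
    found = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content carries no audio blocks
        for block in content:
            if isinstance(block, dict) and block.get("type") == "input_audio":
                found.append(block.get("input_audio") or {})
    if not found:
        return None
    if len(found) > 1:
        raise ValueError("Only one input_audio block is supported per request.")

    audio = found[0]
    fmt = audio.get("format")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {fmt!r}")
    data = audio.get("data")
    if not isinstance(data, str):
        raise ValueError("input_audio.data must be a base64 string.")
    # Base64 inflates data by about 4/3, so reject oversized payloads early.
    if len(data) > (MAX_AUDIO_BYTES * 4) // 3 + 4:
        raise ValueError("Audio payload exceeds the size limit.")
    try:
        raw = base64.b64decode(data, validate=True)
    except binascii.Error as exc:
        raise ValueError("input_audio.data is not valid base64.") from exc
    if len(raw) > MAX_AUDIO_BYTES:
        raise ValueError("Audio payload exceeds the size limit.")
    return raw, fmt


def write_temp_audio(raw: bytes, fmt: str) -> str:
    """Write decoded audio to a temp file and return its path (caller deletes it)."""
    with tempfile.NamedTemporaryFile(suffix=f".{fmt}", delete=False) as tmp:
        tmp.write(raw)
        return tmp.name
```

Checking the encoded length first avoids decoding a payload that would be rejected anyway.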

Risks

| Risk | Mitigation |
|------|------------|
| mlx-vlm 0.3.10 not on PyPI | Git dependency in pyproject.toml; blocks mlx-knife PyPI release |
| mlx-vlm API changes | Thin wrapper, quickly adaptable |
| Audio format restrictions | Documented: WAV confirmed, others untested |
| Memory requirements unclear | Document in README, profile before server support |
| Gemma-3n index/shard issue | mlxk convert --repair-index before use |
| Qwen3-Omni not supported | Only Gemma-3n as target, Qwen3-Omni out-of-scope |
| Phonetic errors | Document limitation, recommend temperature 0 |

References

  • mlx-vlm README (main): Multi-Modal Example (Image + Audio)
  • mlx-lm Issue #497: Qwen3-Omni Support Request (open since Sep 2025)
  • mlx-lm PR #574: qwen3_omni_moe Text-only (Audio-Tower removed)
  • Gemma-3n: mlx-community/gemma-3n-E2B-it-4bit