Release 2.0.2: Test infrastructure hardening & empirical validation

Stable release completing Issue #32 recovery plan - all tests passing. Bug Fixes: - Test collection regression (E2E suite parametrization) - Stop token ordering (batch + streaming modes) - E2E test temperature flakiness (deterministic sampling) - Web API framework detection (PR #42 by @limey, fixes #41) - E2E test marker fix (show_model_portfolio diagnostics) Architecture: - mlx-lm API evaluation: Keep manual text-based implementation - Stop token workarounds: All 3 validated (Phi-3, DeepSeek-R1, GPT-oss) Testing: - Portfolio Discovery: 73/81 tests, 17 models, 0 failures - E2E infrastructure hardened (TOKENIZERS, polling, gc.collect()) - Multi-Python validation: 3.9-3.13 passing Documentation: - ADR-009 Outstanding Work completed + Implementation Plan removed - TESTING-DETAILS.md: Portfolio Discovery + E2E Architecture updated - CHANGELOG.md: Complete 2.0.2 stable release notes
2026-07-01 20:44:14 -04:00 · 2025-11-15 18:50:18 +01:00
parent f88c1a273b
commit d32d3185dd
26 changed files with 2928 additions and 2254 deletions
@@ -1,6 +1,124 @@
 # Changelog

-## 2.0.1 — 2025-11-28
+## [2.0.2] - 2025-11-15
+
+**Stable Release**: Test infrastructure hardening, stop token validation with 17 models, and web API improvements.
+
+This release completes the 2.0.2 recovery plan (Issue #32) with extensive empirical validation, architecture decisions, and community contributions. Highlights: 73/81 E2E tests passing, stop token bugs fixed, web API framework detection for all MLX organizations.
+
+### Bug Fixes
+
+- **Test collection regression** (E2E test suite, ADR-011):
+  - **Problem**: `pytest tests_2.0/live/` failed with "fixture 'model_key' not found" without `-m live_e2e` marker
+  - **Root Cause**: `conftest.py:64-70` returned early without parametrizing when marker missing
+  - **Fix**: Added fallback parametrization with `["_skipped"]` - tests now collect and skip gracefully
+  - **Impact**: Collection works without markers (22 tests), with marker discovers 17 models (81 tests)
+  - **File**: `tests_2.0/live/conftest.py:68-72`
+
+- **Stop token ordering bugs** (batch AND streaming modes, ADR-009):
+  - **Problem**: Both `generate_batch()` and `generate_streaming()` filtered stop tokens by **list order** instead of **text position**
+  - **Impact**: Models generating multiple EOS tokens (e.g., Phi-3-mini: `<|end|><|endoftext|>`) could leak stop tokens into output
+  - **Evidence**: Phi-3-mini generates two token IDs: `32007='<|end|>'` then `32000='<|endoftext|>'`
+  - **Old behavior**: Checked stop tokens list `['<|endoftext|>', '</s>', '<|end|>']` → found `<|endoftext|>` first (position 146) → left `<|end|>` (position 139) in output
+  - **New behavior**: Finds **earliest stop token in text** → cuts at position 139 → clean output
+  - **Affected**: All models that generate multiple EOS tokens
+  - **Files**: `mlxk2/core/runner/__init__.py` (streaming: 441-466, batch: 619-631)
+  - **Validation**: 73/81 tests passing with diverse portfolio (Phi-3, DeepSeek-R1, GPT-oss, Llama, Qwen, Mistral, Mixtral)
+
+- **E2E test temperature flakiness** (Test reliability fix):
+  - **Problem**: CLI E2E tests used default `temperature=0.7` → non-deterministic outputs → flaky test results
+  - **Fix**: Added `temperature=0.0` to all CLI E2E tests for reproducible results
+  - **Rationale**: E2E tests validate code logic (stop token filtering), not model quality
+  - **Files**: `tests_2.0/live/test_cli_e2e.py`, `tests_2.0/live/test_utils.py` (TEST_TEMPERATURE constant)
+
+- **Web API framework detection** (PR #42 by @limey, fixes Issue #41): `/v1/models` endpoint now correctly lists MLX models from all organizations, not just `mlx-community/*`
+
+- **E2E test marker fix**: `pytest -m show_model_portfolio` now works for diagnostic model discovery
+
+### Architecture
+
+- **mlx-lm API evaluation** (ADR-009):
+  - **Question**: Migrate to `BatchGenerator(stop_tokens=...)` or keep manual implementation?
+  - **Research**: Source code analysis of mlx-lm 0.28.3 (`generate.py`, `BatchGenerator`)
+  - **Critical Finding**: BatchGenerator uses **token-ID based** stop detection (`set[int]`)
+  - **Fundamental Blockers**:
+    1. Cannot handle multi-token sequences like `"\nHuman:"` (required for Issue #14 chat turns)
+    2. No streaming support (we need SSE for `/v1/chat/completions`)
+    3. No "earliest position" logic (Phi-3-mini dual EOS breaks)
+    4. No reasoning parser integration (MXFP4 support breaks)
+  - **Historical Proof**: Issue #14 (1.x) validated text-based approach (114 tests passing, 1.0.4)
+  - **Decision**: **Keep manual text-based implementation** (migration impossible)
+  - **Impact**: No code changes needed, validation simplified
+
+- **Stop token workaround evaluation** (ADR-009):
+  - **Workaround 1** (Line 49): `<|end|>` special handling for Phi-3-mini
+    - **Validated**: 2 Phi-3 variants in portfolio (discovered_11, discovered_12)
+    - **Rationale**: Fixes `eos_token_id=null` bug, empirically stable
+    - **Decision**: **Keep** (0 failures, production stable)
+  - **Workaround 2** (Line 98): `reasoning_end` removal for DeepSeek-R1
+    - **Validated**: DeepSeek-R1-Distill-8B in portfolio (discovered_01)
+    - **Rationale**: Reasoning models need full output until final marker
+    - **Decision**: **Keep** (supports ADR-010 reasoning roadmap)
+  - **Workaround 3** (Line 100): `<|return|>` addition for GPT-oss
+    - **Validated**: gpt-oss-20b-MXFP4 in portfolio (discovered_16)
+    - **Rationale**: GPT-oss reasoning format requires special marker
+    - **Decision**: **Keep** (future-proof for larger reasoning models)
+  - **Evidence**: All 3 workarounds validated with 73/81 tests passing, 0 failures
+  - **File**: `mlxk2/core/runner/stop_tokens.py`
+
+### Testing
+
+- **Portfolio Discovery validation** (ADR-009, Issue #32):
+  - **Scope**: 17 models discovered, 15 testable (60% RAM budget), 2 skipped (RAM constraints)
+  - **Results**: 73/81 tests passing, 0 failures
+  - **Portfolio Families**: Phi-3, DeepSeek-R1, GPT-oss, Llama, Qwen, Mistral, Mixtral
+  - **Validation**: All 3 stop token workarounds actively used and validated
+  - **Performance**: 7:55 minutes (64GB M2 Max, sequential execution, no system freeze)
+  - **Command**: `HF_HOME=/path/to/cache pytest -m live_e2e tests_2.0/live/ -v`
+
+- **E2E test infrastructure hardening** (ADR-011):
+  - **TOKENIZERS_PARALLELISM=false**: Prevents fork warnings and potential deadlocks
+  - **Active cleanup polling**: Waits for actual process termination (not blind timeout)
+  - **Explicit garbage collection**: `gc.collect()` + 2s Metal memory buffer prevents RAM overlap
+  - **Conservative timeout**: 45s max wait for very large models (>40GB), polls every 500ms
+  - **Sequential execution warning**: TESTING-DETAILS.md documents parallel execution risks (Lines 91-128)
+  - **Files**:
+    - `tests_2.0/live/conftest.py` - TOKENIZERS_PARALLELISM + fallback parametrization
+    - `tests_2.0/live/server_context.py` - Active polling + gc.collect()
+
+### Documentation
+
+- **ADR-009 Outstanding Work completed**:
+  - **Portfolio Discovery**: Implemented (2.0.1), validated (2.0.2-beta.1)
+  - **Workaround Evaluation**: Completed with empirical evidence (3/3 kept)
+  - **Empirical Validation**: Expanded from 3 → 15 models tested
+  - **File**: `docs/ADR/ADR-009-Stop-Token-Detection-Fix.md` Lines 180-206
+
+- **TESTING-DETAILS.md harmonization**:
+  - **Portfolio Discovery section** (Line 24): Updated with current validation (17 models, 73/81 tests)
+  - **E2E Test Architecture section** (Lines 465-493): Updated with model_key fix, collection warning, current results
+  - **De-versioned**: Changed from "Validation (2.0.2)" to "Current validation" (timeless guide, not version-specific)
+  - **ADR references maintained**: Architecture context preserved
+
+- **CLAUDE.md accuracy audit**:
+  - **ADR-009 status**: Updated to reflect 2.0.1 + 2.0.2 completion timeline
+  - **ADR-011 status**: Updated to 73/81 tests passing, 17 models discovered
+  - **Roadmap**: Updated with recovery plan progress
+  - **All claims evidence-based**: No false "completed" claims
+
+**Production Code Changes**:
+- `mlxk2/__init__.py` - Version 2.0.2
+- `mlxk2/core/runner/__init__.py` - Stop token ordering fix (streaming + batch modes)
+- `mlxk2/core/runner/stop_tokens.py` - Workarounds validated and documented
+- `mlxk2/core/server_base.py` - Web API framework detection (PR #42)
+
+**Test Infrastructure & Documentation**:
+- `tests_2.0/live/*` - New E2E test infrastructure (conftest, server_context, test suites)
+- `docs/ADR/ADR-009-Stop-Token-Detection-Fix.md` - Outstanding Work completed
+
+---
+
+## 2.0.1 — 2025-11-8

 **Bug Fix & Enhancement Release**: CLI exit code propagation fixes + Portfolio Discovery for stop token validation.

@@ -21,12 +139,13 @@
    - Runtime exceptions: OOM, loading failures → exit 1 (was: exit 0)

 - **Stop token validation Portfolio Discovery** (GitHub Issue #32, ADR-009): Live stop token tests now support dynamic model discovery and HF_HOME-optional testing
-  - **Portfolio Discovery**: Auto-discovers all MLX chat models in `HF_HOME` cache (filter: MLX + healthy + runtime_compatible + chat)
+  - **Portfolio Discovery**: Auto-discovers all MLX chat models via `mlxk list --json` (filter: MLX + healthy + runtime_compatible + chat)
+  - **Refactored in 2.0.2**: Now uses production command instead of duplicating cache logic (~70 LOC eliminated)
  - **RAM-Aware Testing**: Progressive RAM budgets (40-70%) prevent OOM during multi-model validation
  - **Empirical Reporting**: Generates `stop_token_config_report.json` with cross-model stop token findings
  - **Fallback Support**: Tests work without `HF_HOME` using 3 predefined models (MXFP4, Qwen, Llama)
  - **Marker-Required**: Tests excluded from default suite, use `pytest -m live_stop_tokens` to run
-  - **Implementation**: `tests_2.0/test_stop_tokens_live.py` (~110 LOC for discovery + RAM gating)
+  - **Implementation**: `tests_2.0/test_stop_tokens_live.py` (~50 LOC for discovery + RAM gating)

 ### Testing

@@ -1,12 +1,12 @@
-# <img src="https://github.com/mzau/mlx-knife/raw/main/broke-logo.png" alt="BROKE Logo" width="60" style="vertical-align: middle;"> MLX-Knife 2.0
+# <img src="https://github.com/mzau/mlx-knife/raw/main/broke-logo.png" alt="BROKE Logo" width="60" align="middle"> MLX-Knife 2.0

 <p align="center">
  <img src="https://github.com/mzau/mlx-knife/raw/main/mlxk-demo.gif" alt="MLX Knife Demo" width="900">
 </p>

-**Current Stable Version: 2.0.1**
+**Current Stable Version: 2.0.2**

-[![GitHub Release](https://img.shields.io/badge/version-2.0.1-green.svg)](https://github.com/mzau/mlx-knife/releases)
+[![GitHub Release](https://img.shields.io/badge/version-2.0.2-green.svg)](https://github.com/mzau/mlx-knife/releases)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
 [![Apple Silicon](https://img.shields.io/badge/Apple%20Silicon-green.svg)](https://support.apple.com/en-us/HT211814)
@@ -46,7 +46,7 @@ MLX Knife has been comprehensively tested and verified on:
 pip install mlx-knife

 # Verify installation
-mlxk --version  # → mlxk 2.0.1
+mlxk --version  # → mlxk 2.0.2
 ```

 ### Development Installation
@@ -60,7 +60,7 @@ cd mlx-knife
 pip install -e ".[dev,test]"

 # Verify installation
-mlxk --version  # → mlxk 2.0.1
+mlxk --version  # → mlxk 2.0.2

 # Run tests and quality checks (before committing)
 pytest -v
@@ -573,7 +573,7 @@ Apache License 2.0 — see `LICENSE` (root) and `mlxk2/NOTICE`.
 ---

 <p align="center">
-  <b>Made with ❤️ by The BROKE team <img src="broke-logo.png" alt="BROKE Logo" width="30" style="vertical-align: middle;"></b><br>
-  <i>Version 2.0.1 | November 2025</i><br>
+  <b>Made with ❤️ by The BROKE team <img src="broke-logo.png" alt="BROKE Logo" width="30" align="middle"></b><br>
+  <i>Version 2.0.2 | November 2025</i><br>
  <a href="https://github.com/mzau/broke-cluster">🔮 Next: BROKE Cluster for multi-node deployments</a>
 </p>
@@ -142,7 +142,8 @@ We provide security updates for these versions:

 | Version | Security Support   |
 | ------- | ------------------ |
-| 2.0.1   | :white_check_mark: Current stable |
+| 2.0.2   | :white_check_mark: Current stable |
+| 2.0.1   | :white_check_mark: Supported |
 | 2.0.0   | :white_check_mark: Supported |
 | < 2.0.0 | :x: Upgrade recommended |

@@ -0,0 +1,696 @@
+# MLX Knife Testing - Detailed Documentation
+
+This document contains version-specific details, complete file listings, and implementation specifics for the MLX Knife test suite. For timeless testing philosophy and quick start instructions, see [TESTING.md](TESTING.md).
+
+## Current Status
+
+✅ **306/306 unit tests passing** (November 2025) — 2.0.2 Stable; 20 skipped (opt-in)
+✅ **73/81 E2E tests passing** (November 2025) — ADR-011 completed; 8 skipped (RAM budget)
+✅ **Test environment:** macOS 14.x, M2 Max, Python 3.9-3.13
+✅ **Production verified & reported:** M1, M1 Max, M2 Max in real-world use
+✅ **License:** Apache 2.0 (was MIT in 1.x)
+✅ **Isolated test system** - user cache stays pristine with temp cache isolation
+✅ **3-category test strategy** - optimized for performance and safety
+
+### Skipped Tests Breakdown (20 total, standard run without HF_HOME)
+- **4 Live Stop Tokens tests** - Stop token validation with real models (requires `pytest -m live_stop_tokens`, ADR-009)
+- **1 Live Run test** - Private/org model detection (requires `pytest -m live_run`, Issue #37)
+- **3 Live Clone tests** - APFS same-volume clone workflow (requires `MLXK2_LIVE_CLONE=1`)
+- **1 Live List test** - Tests against user cache (requires HF_HOME with models)
+- **1 Live Push test** - Real HuggingFace push (requires `MLXK2_LIVE_PUSH=1`)
+- **7 Issue #27 tests** - Real-model health validation (requires HF_HOME or MLXK2_USER_HF_HOME setup)
+- **3 Additional opt-in tests** - Various live validation scenarios
+
+**Portfolio Discovery** (ADR-009) is implemented in `tests_2.0/test_stop_tokens_live.py`. When `HF_HOME` is set, tests auto-discover all MLX chat models in user cache using `mlxk list --json` (production command). This ensures Issue #32 fix is validated across the full model portfolio. **Current validation:** 17 models discovered, 15 testable (60% RAM budget), 73/81 tests passing, 0 failures. Portfolio includes: Phi-3, DeepSeek-R1, GPT-oss, Llama, Qwen, Mistral, Mixtral families.
+
+For complete test file structure, see [Appendix](#complete-test-file-structure-201).
+
+---
+
+## Test Execution Guide
+
+| Target | How to Run | Markers / Env | Includes | Network |
+|---|---|---|---|---|
+| Default 2.0 suite | `pytest -v` | — | JSON-API (list/show/health), Human-Output, Model-Resolution, Health-Policy, Push Offline (`--check-only`, `--dry-run`), Spec/Schema checks | No |
+| Spec-only | `pytest -m spec -v` | `spec` | Schema/contract tests, version sync, docs example validation | No |
+| Exclude Spec | `pytest -m "not spec" -v` | `not spec` | Everything except spec/schema checks | No |
+| Push offline | `pytest -k push -v` | — | Push offline tests (tests alpha feature: `--check-only`, `--dry-run`, error handling); no network, no credentials needed | No |
+| ⏭️ Live Push | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m live_push -v` | `live_push` (subset of `wet`) + Env: `MLXK2_ENABLE_ALPHA_FEATURES=1`, `MLXK2_LIVE_PUSH=1`, `HF_TOKEN`, `MLXK2_LIVE_REPO`, `MLXK2_LIVE_WORKSPACE` | JSON push against the real Hub; on errors the test SKIPs (diagnostic) | Yes |
+| ⏭️ Live List | `pytest -m live_list -v` | `live_list` (subset of `wet`) + Env: `HF_HOME` (user cache with models) | Tests list/health against user cache models | No (uses local cache) |
+| Clone offline | `pytest -k clone -v` | — | Clone offline tests (tests alpha feature: APFS validation, temp cache, CoW workflow); no network needed | No |
+| ⏭️ Live Clone (ADR-007) | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m live_clone -v` | `live_clone` + Env: `MLXK2_ENABLE_ALPHA_FEATURES=1`, `MLXK2_LIVE_CLONE=1`, `HF_TOKEN`, `MLXK2_LIVE_CLONE_MODEL`, `MLXK2_LIVE_CLONE_WORKSPACE` | Real clone workflow: pull→temp cache→APFS same-volume clone→workspace (ADR-007 Phase 1 constraints: same volume + APFS required) | Yes |
+| 🔒 Live Stop Tokens (ADR-009) | `pytest -m live_stop_tokens -v` | `live_stop_tokens` (required); Optional: `HF_HOME` (enables portfolio discovery) | Issue #32: Validates stop token behavior with real models. **With HF_HOME:** Portfolio Discovery auto-discovers all MLX chat models (filter: MLX+healthy+runtime+chat), RAM-aware skip, empirical report. **Without HF_HOME:** Uses 3 predefined models (see "Optional Setup" section for model requirements). | No (uses local cache) |
+| ⏭️ Live Run | `pytest -m live_run -v` | `live_run` + Env: `MLXK2_USER_HF_HOME` or `HF_HOME` (user cache with `mlx-community/Phi-3-mini-4k-instruct-4bit`) | Regression tests for Issue #37: Validates private/org MLX model framework detection in run command (renames Phi-3 to simulate private-org model) | No (uses local cache) |
+| 🔒 Live E2E (ADR-011) | `HF_HOME=/path/to/cache pytest -m live_e2e -v` | `live_e2e` (required) + Env: `HF_HOME` (optional, enables Portfolio Discovery); Requires: `httpx` installed | **✅ Working:** Server/HTTP/CLI validation with real models. Portfolio Discovery auto-discovers all MLX chat models via `mlxk list --json` (filter: MLX+healthy+runtime+chat), parametrized tests (one server per model), RAM-aware skip. | No (uses local cache) |
+| 🔍 Show E2E Portfolio | `HF_HOME=/path/to/cache pytest -m show_model_portfolio -s` | `show_model_portfolio` + Env: `HF_HOME` | **Convenience:** Displays which models would be tested by `live_e2e` tests. Shows table with model keys (discovered_XX), RAM requirements, and test/skip status. No actual testing performed - just displays portfolio. | No (uses local cache) |
+| 🔍 Manual Debug Mode | `mlxk run <model> "test prompt" --verbose` | Manual CLI usage with `--verbose` flag | **Quality Analysis:** Shows token generation details including multiple EOS token warnings. Use this for manual debugging of model quality issues. Output includes `[DEBUG] Token generation analysis` and `⚠️ WARNING: Multiple EOS tokens detected` for broken models. | No (uses local cache) |
+| ⏭️ Issue #27 real-model | `pytest -m issue27 tests_2.0/test_issue_27.py -v` | Marker: `issue27`; Env (required): `MLXK2_USER_HF_HOME` or `HF_HOME` (user cache, read-only). Env (optional): `MLXK2_ISSUE27_MODEL`, `MLXK2_ISSUE27_INDEX_MODEL`, `MLXK2_SUBSET_COUNT=0`. | Copies real models from user cache into isolated test cache; validates strict health policy on index-based models (no network) | No (uses local cache) |
+| Server tests (included) | `pytest -k server -v` | — | Basic server API tests (minimal, uses MLX stubs) | No |
+
+**Useful commands:**
+```bash
+# Only Spec
+pytest -m spec -v
+
+# Push tests (offline)
+pytest -k "push and not live" -v
+
+# Clone tests (offline)
+pytest -k "clone and not live" -v
+
+# Exclude Spec
+pytest -m "not spec" -v
+
+# Live Push only
+MLXK2_ENABLE_ALPHA_FEATURES=1 MLXK2_LIVE_PUSH=1 HF_TOKEN=... MLXK2_LIVE_REPO=... MLXK2_LIVE_WORKSPACE=... pytest -m live_push -v
+
+# Live Clone only
+MLXK2_ENABLE_ALPHA_FEATURES=1 MLXK2_LIVE_CLONE=1 HF_TOKEN=... MLXK2_LIVE_CLONE_MODEL=... MLXK2_LIVE_CLONE_WORKSPACE=... pytest -m live_clone -v
+
+# Live List only
+HF_HOME=/path/to/user/cache pytest -m live_list -v
+
+# Live Stop Tokens only (ADR-009)
+pytest -m live_stop_tokens -v  # Optional: HF_HOME=/path/to/cache for portfolio discovery
+
+# Live Run only
+HF_HOME=/path/to/user/cache pytest -m live_run -v
+
+# Live E2E only (ADR-011)
+HF_HOME=/path/to/user/cache pytest -m live_e2e -v  # See model list: pytest tests_2.0/live/test_server_e2e.py::TestChatCompletionsBatch --collect-only -q
+
+# Issue #27 only
+MLXK2_USER_HF_HOME=/path/to/user/cache pytest -m issue27 tests_2.0/test_issue_27.py -v
+
+# All live tests (umbrella)
+MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m wet -v
+```
+
+---
+
+## ⚠️ CRITICAL: Sequential Execution Required for E2E Tests
+
+**DO NOT use parallel execution with `-m live_e2e` tests:**
+
+```bash
+# ✅ SAFE - Sequential execution (one model at a time)
+HF_HOME=/path/to/cache pytest -m live_e2e -v
+
+# 🔥 DEADLY - Parallel execution (multiple large models simultaneously)
+HF_HOME=/path/to/cache pytest -m live_e2e -n auto  # ← NEVER DO THIS!
+```
+
+**Why parallel execution is dangerous:**
+
+| Risk | Impact | Evidence |
+|------|--------|----------|
+| **Multiple large models loading simultaneously** | System freeze requiring hardware reset | Experienced 2025-11-12 during development |
+| **RAM budget violation** | 8 workers × 20GB models = 160GB peak RAM usage | Even 64GB M2 Max cannot handle this |
+| **Metal GPU memory exhaustion** | MLX Metal cache shared across processes | Leads to GPU hang + system unresponsive |
+
+**Architecture protections (sequential mode only):**
+- ✅ **One server per test:** No parallel inference within a single test
+- ✅ **Active cleanup polling:** Waits for actual process termination (not blind timeout)
+- ✅ **Explicit garbage collection:** Forces Python GC + 2s Metal memory buffer
+- ✅ **Conservative timeout:** 45s max wait for very large models (>40GB), but polls every 500ms
+- ⚠️ **Large model transitions:** Models >20GB may have 10-15s RAM overlap during cleanup
+
+**Safe execution guidelines:**
+- Always run `pytest -m live_e2e` without `-n auto` or `-n <workers>`
+- If using pytest-xdist, ensure it's NOT active for E2E tests
+- Monitor system RAM during first run to understand your hardware limits
+- Expected duration: ~7-10 minutes for 15 models (sequential, with cleanup)
+
+**Note on `-n auto` (pytest-xdist):**
+- `-n auto`: Spawns one worker per CPU core (e.g., 8 workers on 8-core M2)
+- Each worker loads a separate model instance simultaneously
+- Safe for unit tests (mocked, no real models), DEADLY for E2E tests (real models)
+
+---
+
+## Python Version Verification Results
+
+**All standard tests validated on Apple Silicon with enhanced isolation**
+
+| Python Version | Status | Tests Passing | Skipped |
+|----------------|--------|---------------|---------|
+| 3.9.6 (macOS)  | ✅ Verified | 306/306 | 20 |
+| 3.10.x         | ✅ Verified | 306/306 | 20 |
+| 3.11.x         | ✅ Verified | 306/306 | 20 |
+| 3.12.x         | ✅ Verified | 306/306 | 20 |
+| 3.13.x         | ✅ Verified | 306/306 | 20 |
+
+**Note:** 20 skipped tests are opt-in (live tests, alpha features). Skipped count may vary by environment:
+- Without `HF_TOKEN`: +1 skip (live push test)
+- Without `MLXK2_ENABLE_ALPHA_FEATURES=1`: +3 skips (alpha feature tests)
+- Without `jsonschema`: +1 skip (spec validation test)
+
+All versions tested with `isolated_cache` system and MLX stubs for fast execution without model downloads.
+
+## Push Testing Details (2.0)
+
+This section summarizes what our test suite covers for the experimental `push` feature and what still requires live/manual checks.
+
+### Reference: Push CLI and JSON
+
+- Usage: `mlxk2 push <local_dir> <org/model> --private [--create] [--branch main] [--commit <msg>] [--check-only] [--json] [--verbose]`
+- Args:
+  - `--private` (required in alpha): Safety gate to avoid public uploads.
+  - `--create`: Create the repository if it does not exist (model repo).
+  - `--branch`: Target branch, default `main`. Missing branches are tolerated; with `--create`, the branch is proactively created (and upload retried once if the hub initially rejects the revision).
+  - `--commit`: Commit message, default `"mlx-knife push"`.
+  - `--check-only`: Analyze workspace locally; no network call; returns `data.workspace_health`.
+  - `--dry-run`: Compare local workspace to the remote branch and summarize changes without uploading (requires repo read access).
+  - `--json`: Print JSON response; in JSON mode, logs/progress are suppressed by default.
+  - `--verbose`: Human mode — append details (e.g., commit URL). In JSON mode, only toggles console log verbosity; the JSON payload is unchanged.
+
+- JSON fields (`data`):
+  - `repo_id: string` — target `org/model`.
+  - `branch: string` — target branch.
+  - `commit_sha: string|null` — commit id; null when `no_changes:true` or on noop.
+  - `commit_url: string|null` — link to commit; null when no commit created.
+  - `repo_url: string` — `https://huggingface.co/<org/model>`.
+  - `uploaded_files_count: int|null` — number of changed files; set to `0` on `no_changes:true`.
+  - `local_files_count: int|null` — approximate local file count scanned.
+  - `no_changes: boolean` — true when hub reports an empty commit (preferred signal) or no file operations are detected.
+  - `created_repo: boolean` — true when repo was created (with `--create`).
+  - `change_summary: {added:int, modified:int, deleted:int}` — optional; derived from hub response when available.
+  - `message: string|null` — short human hint; mirrors hub on no-op.
+  - `hf_logs: string[]` — buffered hub log lines (not printed in JSON mode unless `--verbose`).
+  - `experimental: true` and `disclaimer: string` — feature state markers.
+  - `workspace_health: {...}` — present only with `--check-only`:
+    - `healthy: bool`, `anomalies: []`, `config`, `weights.index`, `weights.pattern_complete`, etc.
+  - `dry_run: true` — present only with `--dry-run`.
+  - `dry_run_summary: {added:int, modified:int, deleted:int}` — present with `--dry-run`.
+  - `would_create_repo: bool` / `would_create_branch: bool` — planning hints when target does not exist.
+
+- Error types (`error.type`):
+  - `dependency_missing` — `huggingface-hub` not installed.
+  - `auth_error` — missing `HF_TOKEN` (unless `--check-only`).
+  - `workspace_not_found` — local_dir missing/not a directory.
+  - `repo_not_found` — repo missing without `--create`.
+  - `upload_failed` — hub returned an error (e.g., 403/permission).
+  - `push_operation_failed` — unexpected internal failure wrapper.
+
+- Exit codes: success → `0`; any `status:error` → `1`.
+
+### Automated (offline)
+
+- **Token/Workspace errors:** Missing `HF_TOKEN` and missing workspace produce proper JSON errors.
+- **CLI args (JSON mode):** Missing positional args emit JSON errors rather than usage text.
+- **Schema shape:** Push success/error outputs validate against `docs/json-api-schema.json`.
+- **No-op push:** Detects `no_changes: true`, sets `uploaded_files_count: 0`, carries hub message into JSON (`message`/`hf_logs`), and human output shows "no changes" without duplicate logs.
+- **Commit path:** Extracts `commit_sha`, `commit_url`, `change_summary` (+/~/−), correct `uploaded_files_count`; human `--verbose` includes URL.
+- **Repo/Branch handling:** Missing repo requires `--create`; with `--create` sets `created_repo: true`. Missing branch is tolerated; upload attempts proceed. With `--create`, the branch is proactively created and the upload is retried once if the hub rejects the revision (e.g., "Invalid rev id").
+- **Ignore rules:** `.hfignore` is merged with default ignores and forwarded to the hub.
+
+**Files:**
+- `tests_2.0/test_cli_push_args.py` (CLI errors and JSON outputs)
+- `tests_2.0/test_push_extended.py` (no-op vs commit, branch/repo, .hfignore, human; includes retry on invalid revision with `--create`)
+- `tests_2.0/spec/test_push_output_matches_schema.py` (schema success path)
+
+**Run (venv39):**
+```bash
+source venv39/bin/activate && pip install -e .
+pytest -q tests_2.0/test_cli_push_args.py tests_2.0/test_push_extended.py
+pytest -q tests_2.0/spec/test_push_output_matches_schema.py
+pytest -q tests_2.0/test_push_extended.py::test_push_retry_creates_branch_on_upload_revision_error
+```
+
+### Live (opt-in / wet)
+
+- Purpose: sanity-check real HF behavior (auth, no-op vs commit, URLs).
+- Defaults: Live tests are skipped. Enable with env vars and markers.
+- Env:
+  - `MLXK2_LIVE_PUSH=1`
+  - `HF_TOKEN` (write-enabled)
+  - `MLXK2_LIVE_REPO='org/model'`
+  - `MLXK2_LIVE_WORKSPACE='/abs/path/to/workspace'`
+- Command:
+  - `pytest -q -m wet tests_2.0/live/test_push_live.py`
+  - or `pytest -q -m live_push`
+
+## Pull/Preflight (Issue #30)
+
+Goal: Gated/private/not-found repos must not pollute the cache and should fail fast.
+
+- Behavior (2.0):
+  - Preflight uses `huggingface_hub.HfApi.model_info()` (metadata only; no download).
+  - Gated/Forbidden/Unauthorized/NotFound → `access_denied` before download; clear hint to set `HF_TOKEN`.
+  - Network timeouts/unspecific HTTP errors in preflight → degrade to a warning; allow the download layer (to surface meaningful error/timeout paths).
+  - Tokens: prefer `HF_TOKEN` (legacy `HUGGINGFACE_HUB_TOKEN` is read, but not promoted).
+  - Tests use isolated caches; the user cache is never touched.
+
+- Relevant tests: `tests_2.0/test_issue_30_preflight.py`
+  - `test_preflight_private_model_without_token`
+  - `test_preflight_nonexistent_model`
+  - `test_preflight_integration_in_pull`
+  - `test_preflight_prevents_cache_pollution`
+
+- Quick checks:
+  - `pytest -q tests_2.0/test_issue_30_preflight.py`
+  - CLI: `unset HF_TOKEN HUGGINGFACE_HUB_TOKEN; mlxk-json pull meta-llama/Llama-2-7b-hf --json`
+
+## Runner: Interruption & Recovery
+
+- Semantics (2.0): A new generation resets `_interrupted = False` at the start (recovery behavior). A previous Ctrl-C does not block the next generation.
+- Streaming:
+  - During an active generation, the runner yields a line `"[Generation interrupted by user]"` and stops.
+  - Token diffing in streaming is robust against minimal mocks (no StopIteration due to short `decode` sequences).
+- Batch:
+  - Resets the flag at the start of a new generation; filters stop tokens; chat stop tokens optional via `use_chat_stop_tokens=True`.
+- Relevant tests:
+  - `tests_2.0/test_ctrl_c_handling.py` (SIGINT, interruption behavior, interactive)
+  - `tests_2.0/test_interruption_recovery.py` (resetting the flag for new generations)
+  - `tests_2.0/test_runner_core.py` (consistency/batch/streaming, error handling)
+
+## Server Minimal Tests
+
+- Dependencies: `httpx`, `fastapi`, `uvicorn`, `pydantic` (via `[test]`).
+- Scope: OpenAI-compatible endpoints (minimal smoke); no real models required.
+- Optional for local verification; in CI currently "nice to have" (Backlog, not part of the 2.0 Guide).
+
+## Known Warnings
+
+- urllib3 LibreSSL notice on macOS Python 3.9
+  - Message: "urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3' …"
+  - Status: Harmless for our usage; suppressed in production code (see `mlxk2/__init__.py`, `warnings.filterwarnings(...)`).
+  - Tests: May still appear in pytest summary if third-party dependencies import `urllib3` before our package.
+  - Optional suppression in tests: add to `pytest.ini`:
+
+    ```ini
+    filterwarnings =
+        ignore:urllib3 v2 only supports OpenSSL 1.1.1+
+    ```
+
+## Issue #27 Tests (Real Multi-Shard Model Health)
+
+### Quick Start (Minimal)
+
+```bash
+# Set your HF cache (external SSD recommended)
+export HF_HOME=/Volumes/your-ssd/huggingface/cache
+
+# Select a model with index file (upstream repo)
+export MLXK2_ISSUE27_MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"
+
+# Optional: Bootstrap index if not in cache
+export MLXK2_BOOTSTRAP_INDEX=1
+
+# Run tests
+pytest tests_2.0/test_issue_27.py -v
+```
+
+### Purpose
+
+These tests validate the strict health policy against real upstream Hugging Face repositories that ship multi-shard safetensors with a `model.safetensors.index.json`. They complement the deterministic unit tests by exercising real-world layouts.
+
+### When to Run
+
+**Run them when:**
+- Your user cache contains at least one upstream PyTorch repo with a safetensors index (not MLX/GGUF conversions). Good candidates:
+  - `mistralai/Mistral-7B-Instruct-v0.2` or `-v0.3`
+  - `Qwen/Qwen1.5-7B-Chat`, `Qwen/Qwen2-7B-Instruct`
+  - `teknium/OpenHermes-2.5-Mistral`
+  - Gated: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-3-8B-Instruct`, `google/gemma-7b-it`
+- You want to sanity-check index-based completeness, shard deletion/truncation, and LFS pointer detection against real artifacts.
+
+**They are not useful when:**
+- Your cache only has MLX Community models (no `model.safetensors.index.json`) or GGUF models — the index-based tests will skip by design. In that case, rely on `tests_2.0/test_health_multifile.py` for deterministic coverage.
+
+### Environment Setup
+
+```bash
+# Set user cache (EITHER)
+export MLXK2_USER_HF_HOME=/absolute/path/to/huggingface/cache
+# OR
+export HF_HOME=/absolute/path/to/huggingface/cache  # Test harness preserves this
+
+# Select model with index file (recommended)
+export MLXK2_ISSUE27_MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"
+
+# Optional: Minimize copy size
+export MLXK2_SUBSET_COUNT=1      # Default 1
+export MLXK2_MIN_FREE_MB=512     # Default 512 MB
+
+# Run tests
+PYTHONPATH=. pytest tests_2.0/test_issue_27.py -v
+```
+
+### Optional Bootstrap (Opt-in, Minimal Workflow)
+
+```bash
+# Enable index bootstrap (fetches only index files, never modifies user cache)
+export MLXK2_BOOTSTRAP_INDEX=1
+
+# Optional: Separate model for index tests
+export MLXK2_ISSUE27_INDEX_MODEL="org/model-with-index"
+
+# Run
+pytest tests_2.0/test_issue_27.py -v
+```
+
+**Note:** Network is only needed if your user cache does not already contain an index file for the chosen repo. If the index exists in your cache, the tests copy it into the isolated cache and no network is required.
+
+### Troubleshooting
+
+**If you see SKIPs:**
+- "No safetensors index found" → The chosen model snapshot lacks an index file. Pick a model that has `model.safetensors.index.json` (or `pytorch_model.bin.index.json`).
+- "Not enough free space" → Free disk space; tests create a subset copy into an isolated temp cache.
+- "User model not found" → Verify your model exists in the user cache and `MLXK2_USER_HF_HOME` points to the `.../huggingface/cache` root.
+
+**Quick helper to list index-bearing models in your user cache:**
+
+```bash
+find "$MLXK2_USER_HF_HOME/hub" -type f \
+  \( -name 'model.safetensors.index.json' -o -name 'pytorch_model.bin.index.json' \) \
+| sed 's#.*/hub/models--\(.*\)/snapshots/.*#\1#; s#--#/#g' | sort -u
+```
+
+### Resource Considerations
+
+- **Disk:** Tests copy a minimal subset of files into an isolated cache (index + 1 smallest shard, or 1 Pattern-Shard).
+- **Network:** If you need to fetch a candidate model first, prefer downloading only `config.json`, `model.safetensors.index.json`, and 1-2 small shards to keep it light.
+
+## Manual MLX Chat Model Smoke Test (2.0)
+
+Goal: Pull a small MLX chat model, verify classification, prepare a local workspace, validate it offline, and push to a private repo while preserving chat intent. This helps issuers validate iOS-focused workflows.
+
+**Model choice (example):**
+- `mlx-community/Qwen2.5-0.5B-Instruct-4bit` (small, chat-oriented)
+
+### Steps
+
+1. **Pull (venv39):**
+   ```bash
+   mlxk2 pull mlx-community/Qwen2.5-0.5B-Instruct-4bit
+   ```
+
+2. **Verify in cache:**
+   ```bash
+   mlxk2 list --health "Qwen2.5-0.5B-Instruct-4bit"
+   # Expect: Framework MLX, Type chat, capabilities include chat
+   ```
+
+3. **Prepare local workspace from cache (dereference symlinks):**
+   ```bash
+   # Ensure HF_HOME points to your HF cache
+   # Compute cache path: $HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit
+   # Find latest snapshot hash under snapshots/
+   # Copy to workspace and dereference symlinks:
+   rsync -aL "$HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit/snapshots/<HASH>/" ./mymodel_test_workspace/
+   ```
+
+4. **Recommended README front-matter (to preserve intent on push):**
+   - Include YAML with tags and pipeline tag, e.g.:
+     - `tags: [mlx, chat]`
+     - `pipeline_tag: text-generation`
+     - `base_model: <upstream_base>`
+   - Keep model name containing `Instruct` or `chat` to aid chat detection
+
+5. **Offline validation (no network):**
+   ```bash
+   mlxk2 push --check-only ./mymodel_test_workspace <org/model> --json
+   # Expect: workspace_health.healthy: true
+   ```
+
+6. **Push to private repo:**
+   ```bash
+   mlxk2 push --private --create ./mymodel_test_workspace <org/model> --json
+   # Re-push without changes should show no_changes: true
+   ```
+
+7. **Post-push verification:**
+   ```bash
+   mlxk2 list --all --health <org/model>
+   # Current limitation: Framework may show PyTorch for non-mlx-community orgs
+   # This does not affect content; future M1 will parse model card tags (mlx)
+   ```
+
+## Real-Model Testing (Implemented in 2.0.1)
+
+**Status:** ✅ Live in 2.0.1 (Portfolio Discovery, ADR-009)
+
+### Portfolio Discovery
+
+Auto-discovers and tests all MLX chat models in user cache.
+
+**Location:** `test_stop_tokens_live.py` (Category 2: Live Tests)
+**Marker:** `live_stop_tokens`
+
+**Usage:**
+```bash
+# With HF_HOME: Auto-discovers all MLX chat models
+export HF_HOME=/path/to/cache
+pytest -m live_stop_tokens -v
+
+# Without HF_HOME: Uses 3 predefined models (must exist in cache)
+pytest -m live_stop_tokens -v  # → Runs if models present, else fails
+```
+
+**Features:**
+- ✅ **Model Filtering:** MLX + healthy + runtime_compatible + chat only
+- ✅ **Portfolio Discovery:** Uses `mlxk list --json` to discover all qualifying models (refactored: production command, ~70 LOC eliminated)
+- ✅ **RAM-Aware:** Progressive budgets prevent OOM (40%-70% of system RAM)
+- ✅ **Empirical Report:** Generates `stop_token_config_report.json` with findings
+- ✅ **Fallback:** Uses 3 predefined models (MXFP4, Qwen, Llama) if HF_HOME not set - models must exist in HF cache
+
+**Required models for fallback (without HF_HOME):**
+```bash
+mlxk pull mlx-community/gpt-oss-20b-MXFP4-Q8         # ~12GB RAM
+mlxk pull mlx-community/Qwen2.5-0.5B-Instruct-4bit   # ~1GB RAM
+mlxk pull mlx-community/Llama-3.2-3B-Instruct-4bit   # ~4GB RAM
+```
+
+### E2E Tests with Portfolio Discovery (ADR-011)
+
+**Status:** ✅ Working (refactored Nov 2025)
+
+Auto-discovers and validates Server/HTTP/CLI interfaces with real models.
+
+**Location:** `tests_2.0/live/` (test_server_e2e.py, test_cli_e2e.py, test_streaming_parity.py)
+**Marker:** `live_e2e`
+
+**Usage:**
+```bash
+# With HF_HOME: Auto-discovers all MLX chat models
+export HF_HOME=/path/to/cache
+pytest -m live_e2e -v
+
+# See which models will be tested
+pytest tests_2.0/live/test_server_e2e.py::TestChatCompletionsBatch --collect-only -q
+
+# ⚠️ IMPORTANT: Always test collection before release
+pytest -m live_e2e --collect-only  # Should work without errors
+```
+
+**Architecture:**
+- ✅ **Portfolio Discovery via `mlxk list --json`:** Uses production command instead of duplicating cache logic (~70 LOC eliminated)
+- ✅ **Parametrized Tests:** One pytest test per model (prevents RAM leaks from loop-based architecture)
+- ✅ **model_key parametrization:** Collection regression fixed with fallback to empty list
+- ✅ **Clean Lifecycle:** Each test gets its own server instance (45s timeout for MLX cleanup)
+- ✅ **RAM-Aware:** Same progressive budgets as stop token tests (40%-70%)
+- ✅ **Current result:** 73/81 tests passing (17 models discovered, 15 testable, 8 skipped: RAM budget) - no system freeze
+
+**Tests Covered:**
+- Server health/metadata endpoints
+- Chat completions (batch + streaming)
+- Text completions (batch + streaming)
+- CLI `mlxk run` (text + JSON output)
+- Streaming parity validation (Issue #20)
+- Stop token filtering (Issue #32)
+
+### RAM-Aware Model Selection
+
+**Implementation:** `get_safe_ram_budget_gb()`, `should_skip_model()`
+
+**Progressive RAM Budgets:**
+
+| System RAM | Budget | Available for Models |
+|------------|--------|---------------------|
+| 16GB | 40% | 6.4GB |
+| 32GB | 50% | 16GB |
+| 64GB | 60% | 38.4GB |
+| 96GB+ | 70% | 67GB+ |
+
+**Rationale:** OS overhead is ~4-6GB (constant), larger systems have more headroom.
+
+**Behavior:**
+- Models exceeding budget → Auto-skipped
+- Skip reason: "Model requires XGB but only YGB available"
+- Empirical report tracks skipped models
+
+**Example:**
+```python
+# 32GB system → 16GB budget
+# Qwen-0.5B (1GB) → ✅ RUN
+# Llama-3.2-3B (4GB) → ✅ RUN
+# Mistral-7B (8GB) → ✅ RUN
+# Mixtral-8x7B (32GB) → ⏭️ SKIP (exceeds 16GB budget)
+```
+
+## Future: Server E2E Testing (ADR-011)
+
+**Status:** Planned for post-2.0.1
+
+### Scope
+
+End-to-end validation of Server/HTTP/CLI with real models:
+- **HTTP API:** `/v1/chat/completions` (streaming + non-streaming)
+- **SSE Format:** Server-Sent Events validation
+- **CLI Integration:** `mlxk run`, `mlxk server` subprocess tests
+- **Streaming Parity:** Issue #20 regression protection
+
+### Planned Implementation
+
+**Location:** `tests_2.0/live/test_server_e2e.py`, `test_streaming_parity.py`, `test_cli_e2e.py`
+**Marker:** `live_e2e` (future)
+**Infrastructure:** Reuses Portfolio Discovery + RAM-Aware logic from ADR-009
+
+**Example:**
+```python
+@pytest.mark.live_e2e
+def test_server_streaming_portfolio(portfolio_models):
+    """Validate /v1/chat/completions SSE streaming across portfolio."""
+    for model in portfolio_models:
+        with LocalServer(model) as server:
+            response = requests.post(f"{server.url}/v1/chat/completions",
+                                    json={"stream": True, ...})
+            # Validate SSE format, stop tokens, no visible EOS
+```
+
+**See:** ADR-011 for detailed architecture
+
+---
+
+## Appendix
+
+### Complete Test File Structure (2.0.2)
+
+```
+tests_2.0/
+├── __init__.py
+├── conftest.py                        # Isolated test cache (HF_HOME override), safety sentinel, core fixtures
+├── conftest_runner.py                 # Runner-specific fixtures/mocks
+├── stubs/                             # Minimal mlx/mlx_lm stubs for unit/spec tests
+│   ├── mlx/
+│   │   └── core.py
+│   └── mlx_lm/
+│       ├── __init__.py
+│       ├── generate.py
+│       └── sample_utils.py
+├── spec/                              # JSON API spec/contract validation
+│   ├── test_cli_commands_json_flag.py         # CLI JSON flag behavior
+│   ├── test_cli_version_output.py             # Version command JSON shape
+│   ├── test_code_outputs_validate_against_schema.py  # Code outputs validate against schema
+│   ├── test_push_error_matches_schema.py      # Push error output matches schema
+│   ├── test_push_output_matches_schema.py     # Push success output matches schema
+│   ├── test_spec_doc_examples_validate.py     # Docs examples validate against JSON schema
+│   └── test_spec_version_sync.py              # Code/docs version consistency check
+├── live/                              # Opt-in live tests (markers)
+│   ├── __init__.py
+│   ├── conftest.py                              # Shared fixtures for live E2E tests (portfolio_models, pytest_generate_tests hook)
+│   ├── server_context.py                       # LocalServer context manager for E2E testing (30s timeout for MLX cleanup)
+│   ├── sse_parser.py                           # SSE parsing utilities for streaming validation
+│   ├── test_utils.py                           # Portfolio Discovery (via mlxk list --json), RAM gating utilities
+│   ├── test_cli_e2e.py                         # CLI integration E2E tests (ADR-011, parametrized)
+│   ├── test_clone_live.py                      # Live clone flow (requires MLXK2_LIVE_CLONE, HF_TOKEN)
+│   ├── test_list_human_live.py                 # Live list/health against user cache (requires HF_HOME)
+│   ├── test_push_live.py                       # Live push flow (requires MLXK2_LIVE_PUSH, HF_TOKEN)
+│   ├── test_server_e2e.py                      # Server E2E tests with real models (ADR-011, parametrized)
+│   └── test_streaming_parity.py                # Streaming vs batch parity tests (Issue #20, ADR-011, parametrized)
+├── test_adr004_error_logging.py       # ADR-004 error logging and redaction (tokens, paths)
+├── test_cli_log_json_flag.py          # CLI --log-json flag behavior and JSON log format
+├── test_cli_push_args.py              # Push CLI args and JSON error/output handling (offline)
+├── test_cli_run_exit_codes.py         # CLI exit codes for run command errors (Issue #38)
+├── test_clone_operation.py            # Clone operations with APFS optimization
+├── test_ctrl_c_handling.py            # SIGINT handling during run/interactive flows
+├── test_detection_readme_tokenizer.py # README/tokenizer-based framework detection
+├── test_edge_cases_adr002.py          # Naming/health edge cases (ADR-002)
+├── test_health_multifile.py           # Multi-file health completeness (index vs pattern)
+├── test_human_output.py               # Human rendering of list/health views
+├── test_integration.py                # Model resolution and health integration
+├── test_interactive_mode.py           # Interactive CLI mode prompts/history/streaming
+├── test_interruption_recovery.py      # Recovery semantics after interruption (flag reset)
+├── test_issue_27.py                   # Health policy exploration with real models (marker: issue27)
+├── test_issue_30_preflight.py         # Preflight for gated/private/not-found repos (Issue #30)
+├── test_issue_37_private_org_regression.py  # Issue #37 private/org MLX model detection (marker: live_run)
+├── test_json_api_list.py              # JSON API list contract (shape/fields)
+├── test_json_api_show.py              # JSON API show contract (base/files/config)
+├── test_legacy_formats.py             # Legacy model format detection (Issue #37)
+├── test_model_naming.py               # Conversion rules, bijection, parsing
+├── test_push_dry_run.py               # Push dry-run diff planning (added/modified/deleted)
+├── test_push_extended.py              # Extended push: no-op vs commit, branch/retry, .hfignore
+├── test_push_minimal.py               # Minimal push scenarios (offline)
+├── test_push_workspace_check.py       # Push check-only: workspace validation without network
+├── test_robustness.py                 # Robustness for rm/pull/disk/timeout/concurrency
+├── test_run_complete.py               # End-to-end run command (stream/batch/params)
+├── test_runner_core.py                # MLXRunner core generation/memory/stop tokens
+├── test_runtime_compatibility_reason_chain.py  # Runtime compatibility reason field decision chain (Issue #36)
+├── test_server_api_minimal.py         # Minimal OpenAI-compatible server endpoints (SSE, JSON)
+├── test_server_api.py.disabled        # Disabled server API tests (WIP/expanded scenarios)
+├── test_server_models_and_errors.py   # Server model loading and error handling
+├── test_server_streaming_minimal.py   # Server SSE streaming functionality
+├── test_server_token_limits_api.py    # Server token limit enforcement
+├── test_stop_tokens_live.py           # Stop token validation with real models (marker: live_stop_tokens, ADR-009)
+└── test_token_limits.py               # Dynamic token calculation; server vs run policies
+```
+
+---
+
+## Known Model Quality Issues
+
+Models with documented quality issues discovered during testing. Tests **will fail** when these issues occur (no workarounds).
+
+### Multiple EOS Token Generation
+
+| Model | Token IDs | Status | Evidence Date |
+|-------|-----------|--------|---------------|
+| Phi-3-mini-4k-instruct-4bit | 32007=`<\|end\|>`, 32000=`<\|endoftext\|>` | Fixed 2.0.2 | 2025-11-13 |
+
+**Issue:** Model generates multiple EOS tokens instead of stopping at first.
+**Detection:** Use `mlxk run --verbose` to see token generation details.
+**Fix:** MLX-Knife 2.0.2+ filters by earliest position in text (not list order).
+
+**Example usage:**
+```bash
+# Manual debugging with verbose mode
+mlxk run mlx-community/Phi-3-mini-4k-instruct-4bit "Write one sentence about cats." --verbose
+
+# Look for:
+# [DEBUG] Token generation analysis:
+# [DEBUG]   Last 3 tokens: ["29889='.'", "32007='<|end|>'", "32000='<|endoftext|>'"]
+# [DEBUG]   ⚠️ WARNING: Multiple EOS tokens detected (2) - model quality issue
+```
+
+---
+
+### Version History
+
+### 2.0.2 (2025-11-14)
+- ✅ Test infrastructure hardening (TOKENIZERS_PARALLELISM, active polling, gc.collect())
+- ✅ Portfolio Discovery validation complete (73/81 E2E tests, 17 models discovered)
+- ✅ Sequential execution warning added (TESTING-DETAILS.md CRITICAL section)
+- ✅ RAM logging infrastructure added (server_context.py, for future benchmark tooling)
+
+### 2.0.2-dev (2025-11-13)
+- ✅ Stop token ordering bugs fixed (batch + streaming modes)
+- ✅ E2E test suite completed (ADR-011)
+- ✅ 72/80 E2E tests passing baseline (15 models tested)
+- ✅ TESTING.md restructuring (timeless policy + version-specific details split)
+
+### 2.0.1 (2025-11-28)
+- ✅ Portfolio Discovery (Issue #32, ADR-009)
+- ✅ CLI exit code propagation (Issue #38)
+- ✅ 306/306 tests passing
+
+### 2.0.0 (2025-11-06)
+- ✅ Complete rewrite with Apache 2.0 license
+- ✅ JSON-first architecture
+- ✅ Isolated test system with sentinel protection
+- ✅ MLX stubs for fast testing without model downloads
+- ✅ 3-category test strategy
+
+---
+
+*MLX-Knife 2.0.2 Testing Details*
@@ -1,612 +0,0 @@
-# 2.0 Server/Run Implementation Guide
-
-**Purpose**: Step-by-step guide for Sonnet sessions implementing server/run functionality  
-**Created**: 2025-09-10  
-**Target**: 2.0.0-beta.1-local through beta.3 (public)
-
-## Quick Reference for Sonnet
-
-### What You're Building
- Port server/run functionality from 1.x (`main` branch) to 2.0 (`feature/2.0.0-alpha.1`)
- Preserve 2.0's modular architecture (`mlxk2/core/`, `mlxk2/operations/`, `mlxk2/output/`)
- Test-first approach using specifications in `docs/2.0-TEST-SPECIFICATIONS.md`
-
-### Key Files to Reference
-```bash
-# 1.x source files (use git show to view)
-git show main:mlx_knife/server.py          # FastAPI server implementation
-git show main:mlx_knife/mlx_runner.py      # MLX execution engine
-git show main:mlx_knife/reasoning_utils.py # Reasoning model support
-git show main:mlx_knife/cli.py             # CLI command definitions
-
-# 2.0 existing structure
-mlxk2/core/cache.py      # Extend with model detection
-mlxk2/operations/*.py    # Add run.py, serve.py, chat.py
-mlxk2/output/*.py        # Extend for streaming support
-mlxk2/cli.py            # Add new commands
-```
-
-## Implementation Steps
-
-### Step 1.0: Core Runner Implementation
-
-**File**: `mlxk2/core/runner.py`
-
-```python
-# Key components to port from mlx_runner.py:
-class MLXRunner:
-    """Core MLX model execution engine"""
-    
-    def __init__(self, model_name_or_path):
-        # Model loading logic
-        # Memory tracking
-        
-    def __enter__(self):
-        # Context manager entry
-        return self
-        
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        # CRITICAL: Cleanup even on exception
-        
-    def generate_streaming(self, prompt, **kwargs):
-        # Generator for token-by-token output
-        yield from self._generate_tokens(prompt, **kwargs)
-        
-    def generate_batch(self, prompt, **kwargs):
-        # Complete generation at once
-        return "".join(self.generate_streaming(prompt, **kwargs))
-```
-
-**Critical Requirements**:
-1. Context manager pattern for memory safety
-2. Separate streaming vs batch generation
-3. Stop token filtering (CHAT_STOP_TOKENS)
-4. Dynamic token limits based on model context
-
-### Step 1.1: Complete Run Command 
-
-**File**: `mlxk2/operations/run.py`
-
-```python
-from mlxk2.core.runner import MLXRunner
-
-def run_model(
-    model_spec: str,
-    prompt: Optional[str] = None,
-    stream: bool = True,
-    max_tokens: Optional[int] = None,
-    temperature: float = 0.7,
-    top_p: float = 0.9,
-    **kwargs
-):
-    """Execute model with prompt - supports both single-shot and interactive modes.
-    
-    Args:
-        model_spec: Model specification
-        prompt: Input prompt (None = interactive mode)
-        stream: Enable streaming output
-        max_tokens: Maximum tokens (None = full model context)
-        temperature: Sampling temperature
-        top_p: Top-p sampling parameter
-    """
-    with MLXRunner(model_spec) as runner:
-        # Interactive mode: no prompt provided
-        if prompt is None:
-            interactive_chat(runner, stream=stream, max_tokens=max_tokens, **kwargs)
-        else:
-            # Single-shot mode: prompt provided
-            single_shot_generation(runner, prompt, stream=stream, max_tokens=max_tokens, **kwargs)
-
-def interactive_chat(runner, stream=True, **kwargs):
-    """Interactive conversation mode with history tracking."""
-    print("Starting interactive chat. Type 'exit' or 'quit' to end.\n")
-    
-    conversation_history = []
-    
-    while True:
-        try:
-            user_input = input("You: ").strip()
-            
-            if user_input.lower() in ['exit', 'quit', 'q']:
-                print("\nGoodbye!")
-                break
-                
-            if not user_input:
-                continue
-                
-            # Add user message to conversation history
-            conversation_history.append({"role": "user", "content": user_input})
-            
-            # Format conversation using chat template
-            formatted_prompt = runner._format_conversation(conversation_history)
-            
-            # Generate response
-            print("\nAssistant: ", end="", flush=True)
-            
-            if stream:
-                # Streaming mode
-                response_tokens = []
-                for token in runner.generate_streaming(formatted_prompt, use_chat_template=False, **kwargs):
-                    print(token, end="", flush=True)
-                    response_tokens.append(token)
-                response = "".join(response_tokens).strip()
-            else:
-                # Batch mode
-                response = runner.generate_batch(formatted_prompt, use_chat_template=False, **kwargs)
-                print(response)
-            
-            # Add assistant response to history
-            conversation_history.append({"role": "assistant", "content": response})
-            print()  # Newline after response
-            
-        except KeyboardInterrupt:
-            print("\n\nChat interrupted. Goodbye!")
-            break
-        except Exception as e:
-            print(f"\n[ERROR] {e}")
-            continue
-
-def single_shot_generation(runner, prompt, stream=True, **kwargs):
-    """Single prompt generation."""
-    if stream:
-        for token in runner.generate_streaming(prompt, **kwargs):
-            print(token, end="", flush=True)
-        print()  # Final newline
-    else:
-        result = runner.generate_batch(prompt, **kwargs)
-        print(result)
-```
-
-**CLI Integration** (`mlxk2/cli.py`):
-```python
-# Run command parser
-run_parser = subparsers.add_parser("run", help="Run model with prompt")
-run_parser.add_argument("model", help="Model name to run")
-run_parser.add_argument("prompt", nargs="?", help="Input prompt (optional - triggers interactive mode if omitted)")
-run_parser.add_argument("--max-tokens", type=int, help="Maximum tokens to generate (default: full model context)")
-run_parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
-run_parser.add_argument("--top-p", type=float, default=0.9, help="Top-p sampling parameter")
-run_parser.add_argument("--no-stream", action="store_true", help="Disable streaming output (batch mode)")
-run_parser.add_argument("--json", action="store_true", help="Output in JSON format")
-run_parser.add_argument("--verbose", action="store_true", help="Show detailed output")
-
-# Usage examples:
-# mlxk2 run model "prompt"                    # Single-shot streaming
-# mlxk2 run model "prompt" --no-stream        # Single-shot batch
-# mlxk2 run model                             # Interactive streaming  
-# mlxk2 run model --no-stream                 # Interactive batch
-```
-
-**Key Changes from Basic to Complete:**
- ✅ **Interactive mode**: `prompt` parameter is now optional
- ✅ **Conversation history**: Tracks full chat context
- ✅ **Stream control**: `--no-stream` works in both modes
- ✅ **Full context tokens**: No arbitrary limits for run command
- ✅ **Chat template integration**: Uses model's native conversation format
-
-### Step 1.2: Beta.1 Completion
-
-**Complete the remaining Beta.1 requirements:**
-
-#### 1.2.1: Full Context Token Limits
-
-**File**: `mlxk2/core/runner.py`
-
-```python
-def _calculate_dynamic_max_tokens(self, server_mode: bool = False) -> int:
-    """Calculate dynamic max tokens based on model context and usage mode."""
-    if not self._context_length:
-        return 2048
-    
-    if server_mode:
-        # Server: half context for DoS protection
-        return self._context_length // 2
-    else:
-        # Run command: full context (user's own machine, be generous)
-        return self._context_length
-
-# Update generate_streaming and generate_batch to use:
-effective_max_tokens = max_tokens if max_tokens is not None else self._calculate_dynamic_max_tokens(server_mode=False)
-```
-
-#### 1.2.2: Ctrl-C Handling
-
-**Already implemented in our MLXRunner**: ✅
- Signal handler in `__init__`
- `_interrupted` flag checking during generation
- Graceful interruption with user message
-
-#### 1.2.3: Interactive Mode Implementation
-
-### Server Model Caching (Hot‑Swap, kein Reload pro Prompt)
-
-Ziel: Die UX‑Verbesserung aus 1.1.1 beibehalten – der Server lädt Modelle nicht für jeden Prompt neu.
-
- Mechanik:
-  - In `mlxk2/core/server_base.py` existiert ein globaler Runner‑Cache:
-    - `_model_cache: Dict[str, MLXRunner]` und `_current_model_path: Optional[str]`.
-    - `get_or_load_model(model_spec)`: gibt einen bestehenden `MLXRunner` zurück, falls bereits geladen; lädt nur bei Modellwechsel neu.
-    - Beim Wechsel wird der alte Runner unter Lock bereinigt (`runner.cleanup()`), dann der neue geladen (Hot‑Swap).
-  - Für den Server wird `MLXRunner(..., install_signal_handlers=False)` verwendet (keine Signal‑Handler‑Konflikte).
- Verhalten:
-  - Gleiches Modell über mehrere Requests → kein Reload → zügige Antworten, stabile UX.
-  - Anderes Modell → altes Modell freigeben, neues laden (Hot‑Swap), weiterhin kein Reload pro Prompt.
- Kontextlänge (Erinnerung):
-  - Run‑Command nutzt volle Kontextlänge; Server nutzt halbe Kontextlänge als DoS‑Schutz (`get_effective_max_tokens(..., server_mode=True)`).
-
-**File**: `mlxk2/operations/run.py` - Add missing methods:
-
-```python
-def _format_conversation(self, messages: List[Dict[str, str]]) -> str:
-    """Format conversation history into a prompt using chat template."""
-    if hasattr(self.tokenizer, 'chat_template') and self.tokenizer.chat_template:
-        try:
-            return self.tokenizer.apply_chat_template(
-                messages,
-                tokenize=False,
-                add_generation_prompt=True
-            )
-        except Exception:
-            # Fall back to legacy format
-            pass
-    
-    # Legacy Human:/Assistant: format
-    formatted_parts = []
-    for msg in messages:
-        role = msg["role"]
-        content = msg["content"]
-        if role == "system":
-            formatted_parts.append(f"System: {content}")
-        elif role == "user":
-            formatted_parts.append(f"Human: {content}")
-        elif role == "assistant":
-            formatted_parts.append(f"Assistant: {content}")
-    
-    return "\n\n".join(formatted_parts) + "\n\nAssistant: "
-```
-
-#### 1.2.4: Update CLI for Interactive Mode
-
-**File**: `mlxk2/cli.py`
-
-```python
-# Update run command argument parser
-run_parser.add_argument("prompt", nargs="?", help="Input prompt (optional - triggers interactive mode if omitted)")
-
-# Update run command handler
-elif args.command == "run":
-    result_text = run_model_enhanced(
-        model_spec=args.model,
-        prompt=args.prompt,  # Can be None for interactive mode
-        stream=not args.no_stream,
-        # ... other parameters
-    )
-```
-
-#### 1.2.5: Beta.1 Test Coverage
-
-**Files**: Complete test implementation for:
- `tests_2.0/test_run_complete.py` - All run command scenarios
- `tests_2.0/test_interactive_mode.py` - Conversation history and chat templates
- `tests_2.0/test_token_limits.py` - Full context vs server context
- `tests_2.0/test_ctrl_c_handling.py` - Interruption scenarios
-
-**Coverage Target**: 80% for run command functionality
-
-### Step 2.0: Server Implementation (Beta.2-local Core)
-
-**File**: `mlxk2/core/server_base.py`
-
-```python
-from fastapi import FastAPI
-from fastapi.middleware.cors import CORSMiddleware
-from pydantic import BaseModel
-
-# OpenAI-compatible request/response models
-class ChatCompletionRequest(BaseModel):
-    model: str
-    messages: List[Dict[str, str]]
-    stream: Optional[bool] = False
-    max_tokens: Optional[int] = None
-    
-class ChatCompletionResponse(BaseModel):
-    choices: List[Dict]
-    model: str
-    usage: Dict
-```
-
-**File**: `mlxk2/operations/serve.py`
-
-```python
-def start_server(model=None, port=8000, host="127.0.0.1"):
-    """Start OpenAI-compatible API server"""
-    # 1. Create FastAPI app
-    # 2. Setup endpoints (/v1/chat/completions, /v1/models)
-    # 3. Handle streaming vs non-streaming with SSE
-    # 4. Model hot-swapping support
-    # 5. Half context token limits (DoS protection)
-```
-
-### Step 2.1: Beta.2 Parity Features
-
-#### 2.1.1: Reasoning Support (GPT-OSS/MXFP4)
-
-**CRITICAL**: This is already implemented in 1.1.1-beta.3 and must be ported for parity!
-
-**File**: `mlxk2/core/reasoning.py`
-
-```python
-# Port from mlx_knife/reasoning_utils.py (1.x main branch)
-class ReasoningExtractor:
-    """Extract reasoning from GPT-OSS/MXFP4 models"""
-    
-    PATTERNS = {
-        'gpt-oss': {
-            'reasoning': r'<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>',
-            'final': r'<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|$)',
-        }
-    }
-    
-class StreamingReasoningParser:
-    """Parse reasoning tokens in real-time"""
-    # Real-time token classification
-    # Format as **[Reasoning]** / **[Answer]**
-```
-
-**Integration**: 
- Runner detects MXFP4/GPT-OSS models via `_is_reasoning_model()`
- Formats output as **[Reasoning]** ... --- **[Answer]**
- Server API includes reasoning in response metadata (optional)
-
-#### 2.1.2: Issue #30 - Gated Models Preflight
-
-**File**: `mlxk2/operations/pull.py`
-
-```python
-def preflight_repo_access(model_spec):
-    """Check repository access before download."""
-    try:
-        HfApi().model_info(repo_id, token=os.getenv("HUGGINGFACE_HUB_TOKEN"))
-    except HTTPError as e:
-        if e.response.status_code in [401, 403]:
-            return {"error": "Model requires authentication"}
-    return {"status": "accessible"}
-```
-
-## Testing Strategy
-
-### Test Organization
-```
-tests_2.0/
-├── test_runner_core.py         # Core MLXRunner tests
-├── test_run_command.py         # CLI run tests
-├── test_server_api.py          # OpenAI API compliance
-├── test_reasoning.py           # GPT-OSS reasoning
-└── test_chat_mode.py          # Interactive chat
-```
-
-### Test Fixtures to Use
-```python
-# From tests_2.0/conftest.py
-@pytest.fixture
-def temp_cache_dir():
-    """Isolated cache for testing"""
-    
-@pytest.fixture
-def mock_tiny_model():
-    """Minimal model for fast tests"""
-```
-
-## CRITICAL NOTES FOR SONNET
-
-### ⚠️ Open Issues to Fix During Port
-
-#### Issue #30: Gated Models Preflight Check [Beta.2]
-**Problem**: Pull von gated models startet Download, dann 403 → Cache pollution
-**Target**: 2.0.0-beta.2-local
-**Solution für 2.0**:
-```python
-# In mlxk2/operations/pull.py
-def preflight_repo_access(model_spec):
-    try:
-        HfApi().model_info(repo_id, token=os.getenv("HUGGINGFACE_HUB_TOKEN"))
-    except HTTPError as e:
-        if e.response.status_code in [401, 403]:
-            # Fail fast BEVOR Download
-            return {"error": "Model requires authentication. Please accept terms and set HUGGINGFACE_HUB_TOKEN"}
-```
-
-#### Ctrl-C Handling [Beta.1] (Nicht als Issue dokumentiert)
-**Problem**: Run/Server blockiert während Model-Generation, Ctrl-C funktioniert nicht
-**Target**: 2.0.0-beta.1-local (Core functionality!)
-**Solution für 2.0**:
-```python
-import signal
-import threading
-
-class MLXRunner:
-    def __init__(self):
-        self._interrupted = False
-        signal.signal(signal.SIGINT, self._handle_interrupt)
-    
-    def _handle_interrupt(self, signum, frame):
-        self._interrupted = True
-        # Generation-Loop checkt self._interrupted
-    
-    def generate_streaming(self):
-        for token in model.generate():
-            if self._interrupted:
-                yield "\n[Generation interrupted by user]"
-                break
-            yield token
-```
-
-### ⚠️ Model Loading & Caching
-**WICHTIG**: Der Server in 1.x cached Modelle im Memory. In 2.0:
- Model-Cache global in `mlxk2/core/server_base.py`
- NICHT bei jedem Request neu laden!
- Hot-swapping = nur wenn anderes Modell requested
-
-### ⚠️ JSON vs Human Output (CLI-Ebene)
-**WICHTIG**: 2.0 hat BEIDE Output-Modi auf CLI-Ebene:
- Default ohne `--json`: Human-readable output (wie 1.x)
- Mit `--json`: JSON output auf stdout
- Server API: Immer OpenAI-JSON Format (unabhängig von CLI)
- Streaming: Technisch separate Implementierung (SSE für Server, direktes Token-Streaming für CLI)
-
-### ⚠️ Stop Tokens & Code-Sharing
-**DESIGN-PRINZIP**: Server baut maximal auf run-Funktionalität auf!
-```python
-# Runner implementiert die Core-Logik
-CHAT_STOP_TOKENS = ["\nHuman:", "\nAssistant:", "\nUser:", "\nYou:"]
-
-# Server nutzt Runner - KEINE Duplikation
-from mlxk2.core.runner import MLXRunner
-# Server ruft runner.generate_streaming() oder runner.generate_batch()
-```
-**VORTEIL**: Einmal richtig implementiert, überall korrekt
-
-### ⚠️ Test Models & RAM-aware Filtering
-**LOKALE TESTS**: RAM-aware Filtering aus 1.x BEIBEHALTEN!
-```python
-# Aus 1.x TESTING.md - diese Logik portieren:
- 8GB Mac: Nur tiny models
- 16GB Mac: Bis zu 7B models  
- 32GB+ Mac: Alle models möglich
-```
-**GitHub CI**: Nicht möglich (keine Apple Silicon Runner)
- Docs müssen klar sagen: "Lokale Tests only"
- Badge "166/166 tests" bezieht sich auf lokale Ausführung
-
-## Common Pitfalls & Solutions
-
-### 1. Memory Leaks & Process Monitoring
-**Problem**: Model stays in memory after error / Zombie processes
-**Solution**: 
- Context manager mit garantiertem cleanup in `__exit__`
- Portiere Process-Monitoring aus 1.x beta.2:
-  - `test_server_functionality.py`: Server lifecycle tests
-  - Process guards gegen orphaned Python processes
-  - Automatic cleanup on Ctrl-C/SIGTERM
-
-### 2. Streaming vs Batch Inconsistency
-**Problem**: Different output between modes  
-**Solution**: Filter stop tokens in BOTH paths
-
-### 3. Token Limits
-**Problem**: Hardcoded limits truncate output  
-**Solution**: Dynamic limits aus 1.x (funktioniert gut!)
-```python
-# Von 1.x beibehalten:
- max_tokens=None → Dynamische Limits basierend auf Model-Context
- Explicit max_tokens → Respektieren
- Formel aus 1.x mlx_runner.py übernehmen
-```
-**Mögliche Verbesserung**: Config-basierte Overrides für spezielle Modelle
-
-### 4. Model Path Resolution
-**Problem**: Can't find models in cache  
-**Solution**: Use existing `mlxk2/core/cache.py` resolution
-
-## Version Milestones
-
-### 2.0.0-beta.1-local
-**Step 1.0**: ✅ MLXRunner core engine  
-**Step 1.1**: ✅ Complete run command (single-shot + interactive)  
-**Step 1.2**: 🔄 Beta.1 completion
- [ ] Full context token limits (no DoS protection)
- [ ] Interactive mode implementation
- [ ] CLI integration for interactive mode  
- [ ] 80% test coverage
- [x] **Ctrl-C handling** (already implemented)
-
-### 2.0.0-beta.2-local  
-**Goal**: 1.1.1-beta.3 parity + core stability
-**Step 2.0**: 🔄 Server implementation  
-**Step 2.1**: 🔄 Parity features (required for 1.x compatibility)
- [ ] OpenAI-compatible API server
- [ ] Half context token limits for server (DoS protection)
- [ ] Model hot-swapping support  
- [ ] SSE streaming endpoints
- [ ] **Reasoning models (GPT-OSS/MXFP4)** ← ALREADY IN 1.1.1-beta.3!
- [ ] Issue #30: Gated models preflight
- [ ] Enhanced error handling and logging
- [ ] Server lifecycle management (Ctrl-C, cleanup)
- [ ] 90% test coverage
-
-### 2.0.0-beta.3 (public)
-**Goal**: Production-ready with 1.1.1-beta.3 complete parity
- [ ] All core features stable and battle-tested
- [ ] Performance optimized
- [ ] Documentation complete
- [ ] 95%+ test coverage
- [ ] Integration testing with real-world scenarios
-
-### Beyond 2.0.0-beta.3 (Future Releases)
-**New features for post-beta.3 versions:**
- **System Prompt CLI Support** (`--system` parameter) - not yet specified
- Advanced reasoning model support (DeepSeek R1, QwQ, etc.)
- Custom reasoning token markers (`--reasoning-start`, `--reasoning-end`)
- Enhanced chat template system
-
-## Push Function Notes
-
-The `push` operation (experimental in alpha.3) remains functional throughout beta phases:
- May receive fixes between beta versions
- Minor enhancements possible
- Not blocking for server/run implementation
- Already working with user's workflow
-
-## Quick Commands for Development
-
-```bash
-# View 1.x implementation
-git show main:mlx_knife/server.py | less
-
-# Run 2.0 tests
-pytest tests_2.0/
-
-# Test specific functionality
-pytest tests_2.0/test_runner_core.py -v
-
-# Check coverage
-pytest tests_2.0/ --cov=mlxk2 --cov-report=term-missing
-
-# Create local beta tag (not pushed)
-git tag -a 2.0.0-beta.1-local -m "Initial server/run port"
-
-# Run local 2.0 version
-python -m mlxk2.cli run model "prompt"
-```
-
-## References for Each Step
-
-### Step 1.0 (Runner Core)
- Source: `git show main:mlx_knife/mlx_runner.py`
- Tests: `git show main:tests/unit/test_mlx_runner_memory.py`
-
-### Step 1.1 (Run Command)
- Source: `git show main:mlx_knife/cli.py` (run_model function)
- Tests: `git show main:tests/integration/test_run_command_advanced.py`
-
-### Step 2.0 (Server)
- Source: `git show main:mlx_knife/server.py`
- Tests: `git show main:tests/integration/test_server_functionality.py`
-
-### Step 3.0 (Reasoning)
- Source: `git show main:mlx_knife/reasoning_utils.py`
- Context: CLAUDE.md reasoning architecture section
-
-### Step 3.1 (Chat)
- Source: Search for "interactive_chat" in main branch
- Tests: Look for chat-related tests in integration
-
-## Success Criteria
-
-Each Sonnet session should:
-1. Write tests first (TDD)
-2. Implement minimal working version
-3. Verify tests pass
-4. Document any deviations from 1.x
-
-Remember: The goal is feature parity with 1.1.1-beta.3, not innovation. Port conservatively.
@@ -1,318 +0,0 @@
-# 2.0 Server/Run Test Specifications
-
-**Purpose**: Abstract test specifications extracted from 1.x for implementation in 2.0  
-**Created**: 2025-09-10  
-**For**: Sonnet implementation sessions
-
-## Open Issues to Address
-
-### Issue #30: Gated Models Preflight
- Test: Mock 403 response → Verify NO cache writes
- Test: Clear error message with actionable guidance
- Test: Successful auth → Normal pull flow
-
-### Ctrl-C Interruption Support
- Test: Long generation → Ctrl-C → Clean interruption
- Test: Server request → Ctrl-C → Graceful shutdown
- Test: No zombie processes after interrupt
-
-## Core Principles
-
-1. **Test-First**: Write failing tests before implementation
-2. **Isolated Caches**: Use temp_cache_dir fixtures, never touch user cache
-3. **Abstract Contracts**: Test behaviors, not implementations
-4. **Model-Agnostic**: Use tiny test models where possible
-
-## Server API Contract Tests
-
-### 1. OpenAI Compatibility (`test_server_api_compliance.py`)
-
-```python
-class TestOpenAICompliance:
-    """Verify OpenAI API compatibility"""
-    
-    def test_models_endpoint(self):
-        # GET /v1/models
-        # Returns: {"data": [{"id": "model-name", "object": "model", ...}]}
-        
-    def test_chat_completions_basic(self):
-        # POST /v1/chat/completions
-        # Body: {"model": "...", "messages": [...], "stream": false}
-        # Returns: {"choices": [{"message": {"content": "..."}}]}
-        
-    def test_chat_completions_streaming(self):
-        # POST /v1/chat/completions with stream=true
-        # Returns: SSE stream with data: prefixed chunks
-        # Final: data: [DONE]
-        
-    def test_completions_endpoint(self):
-        # POST /v1/completions
-        # Body: {"model": "...", "prompt": "...", "stream": false}
-        # Returns: {"choices": [{"text": "..."}]}
-```
-
-### 2. Dynamic Token Management (`test_server_token_limits.py`)
-
-```python
-class TestDynamicTokens:
-    """Test model-aware token limits (Issue #15/16)"""
-    
-    def test_no_max_tokens_uses_dynamic(self):
-        # Given: Model with 8K context
-        # When: max_tokens=None in request
-        # Then: Server uses appropriate dynamic limit (~2000-4000)
-        
-    def test_respects_explicit_max_tokens(self):
-        # Given: Any model
-        # When: max_tokens=500 in request
-        # Then: Server respects explicit limit
-        
-    def test_large_context_models(self):
-        # Given: 30K+ context model
-        # When: max_tokens=None
-        # Then: Larger dynamic limit applied
-```
-
-### 3. Model Hot-Swapping (`test_server_model_switching.py`)
-
-```python
-class TestModelSwitching:
-    """Test model switching without restart"""
-    
-    def test_switch_between_models(self):
-        # Given: Server running with model A
-        # When: Request specifies model B
-        # Then: Model B loads, A unloads, response from B
-        
-    def test_concurrent_model_requests(self):
-        # Given: Multiple requests with different models
-        # Then: Proper queueing/switching without crashes
-```
-
-### 4. Stop Token Filtering (`test_server_stop_tokens.py`)
-
-```python
-class TestStopTokens:
-    """Test stop token handling (Issue #14, #20)"""
-    
-    def test_chat_stop_tokens_filtered(self):
-        # Given: Chat mode
-        # Then: "\nHuman:", "\nAssistant:" never in output
-        
-    def test_streaming_vs_batch_consistency(self):
-        # Given: Same prompt
-        # When: stream=true vs stream=false
-        # Then: Identical output (no extra tokens)
-```
-
-## Run Command Contract Tests
-
-### 1. Complete Run Command (`test_run_complete.py`)
-
-```python
-class TestRunBasic:
-    """Basic run command functionality"""
-    
-    def test_run_single_shot_streaming(self):
-        # mlxk run model "prompt"
-        # Returns: Generated text to stdout, token-by-token
-        
-    def test_run_single_shot_batch(self):
-        # mlxk run model "prompt" --no-stream
-        # Returns: Complete output at once
-        
-    def test_run_interactive_streaming(self):
-        # mlxk run model (no prompt)
-        # Triggers: Interactive chat mode with streaming responses
-        
-    def test_run_interactive_batch(self):
-        # mlxk run model --no-stream (no prompt)
-        # Triggers: Interactive chat mode with batch responses
-        
-    def test_run_full_context_tokens(self):
-        # mlxk run model "prompt"
-        # Uses: Full model context length (no DoS protection)
-        # Verify: max_tokens defaults to model's full context
-        
-    def test_conversation_history_tracking(self):
-        # Interactive mode maintains conversation context
-        # Each new input includes previous conversation
-        
-    def test_chat_template_integration(self):
-        # Uses model's native chat template for conversation formatting
-        # Falls back to Human:/Assistant: if no template available
-```
-
-### 2. Server Token Management (`test_server_tokens.py`)
-
-```python
-class TestServerTokens:
-    """Server-specific token limit behavior"""
-    
-    def test_server_half_context_protection(self):
-        # Server mode uses half model context for DoS protection
-        # Given: Model with 8K context
-        # Server: Uses max 4K tokens by default
-        # Run: Uses full 8K tokens by default
-        
-    def test_server_vs_run_token_limits(self):
-        # Verify different token policies:
-        # Run command: Full context (generous)
-        # Server API: Half context (defensive)
-```
-
-### 3. Reasoning Models (`test_reasoning_models.py`)
-
-```python
-class TestReasoningModels:
-    """GPT-OSS/MXFP4 reasoning support"""
-    
-    def test_gpt_oss_reasoning_detection(self):
-        # Model name contains "gpt-oss" or "mxfp4"
-        # Automatic reasoning extraction
-        
-    def test_reasoning_formatting(self):
-        # Output: **[Reasoning]** ... **[Answer]** ...
-        
-    def test_hide_reasoning_flag(self):
-        # mlxk run model "prompt" --hide-reasoning
-        # Shows only answer, no reasoning
-```
-
-### 4. Memory Management (`test_memory_safety.py`)
-
-```python
-class TestMemorySafety:
-    """Context manager and cleanup"""
-    
-    def test_context_manager_cleanup(self):
-        # Model loaded in context
-        # Automatic cleanup on exit/exception
-        
-    def test_exception_safety(self):
-        # Exception during generation
-        # Resources still cleaned up
-```
-
-## Show Command Enhancements
-
-### Quantization Display (`test_show_quantization.py`)
-
-```python
-class TestShowQuantization:
-    """Enhanced quantization info (beta.3)"""
-    
-    def test_mxfp4_detection(self):
-        # Config has quantization.mode = "mxfp4"
-        # Shows: "Advanced mode 'mxfp4' (requires MLX ≥0.29.0)"
-        
-    def test_gguf_variants(self):
-        # Multiple .gguf files
-        # Lists all variants with sizes
-        
-    def test_precision_display(self):
-        # Shows: int4, int8, gguf, etc.
-```
-
-## Test Data Requirements
-
-### ⚠️ CRITICAL: Test Model Strategy
-
-**NIEMALS** user cache für Tests verwenden! Immer `temp_cache_dir` fixture!
-
-### Minimal Test Models
-```yaml
-tiny-models:
-  - hf-internal-testing/tiny-random-gpt2  # 12MB, for basic tests
-  - local-mock-models/fake-mxfp4-model     # Mock config.json only
-  - local-mock-models/fake-reasoning-model # Mock with reasoning markers
-
-real-models-optional:  # For @pytest.mark.server tests only
-  - mlx-community/Phi-3-mini-4k-instruct-4bit
-  - gpt-oss-20b-MXFP4-Q8  # For reasoning tests
-```
-
-## Implementation Priority
-
-### Priority A: Beta.1 - Complete Run Command (CRITICAL - Must Have)
-1. `mlxk2/core/runner.py` - MLX execution engine ✅
-2. Single-shot run: `mlxk2 run model "prompt"` ✅
-3. Interactive run: `mlxk2 run model` (no prompt)
-4. Streaming and batch modes for both
-5. Full context token limits (no DoS protection)
-6. Conversation history tracking
-7. Chat template integration
-8. Ctrl-C handling
-
-### Priority B: Beta.2 - Server Implementation (HIGH - Should Have) 
-1. OpenAI-compatible API server
-2. Half context token limits for server (DoS protection)
-3. Model hot-swapping support
-4. SSE streaming in server endpoints
-5. Reasoning model support
-6. System prompt support
-
-### Priority C: Beta.3 - Advanced Features (MEDIUM - Could Have)
-1. Performance optimizations
-2. Enhanced error handling
-3. Advanced reasoning features
-4. Issue #30: Gated models preflight
-
-## Critical Implementation Notes
-
-### 1. Streaming Architecture
-```python
-# 1.x uses generator pattern - PRESERVE THIS
-def generate_streaming(prompt, **kwargs):
-    for token in model.generate(...):
-        yield token
-        
-# Server SSE format - MUST MATCH
-data: {"choices": [{"delta": {"content": "token"}}]}
-data: [DONE]
-```
-
-### 2. Stop Token Management
-```python
-# Priority order (from 1.x mlx_runner.py)
-CHAT_STOP_TOKENS = ["\nHuman:", "\nAssistant:", "\nUser:", "\nYou:"]
-
-# 1. Check model's native stop tokens first
-# 2. Add chat stop tokens as fallback
-# 3. Filter from output in both streaming and batch
-```
-
-### 3. Model Loading Pattern
-```python
-# Context manager pattern from 1.x - CRITICAL
-class MLXRunner:
-    def __enter__(self):
-        self.load_model()
-        return self
-        
-    def __exit__(self, ...):
-        self.cleanup()  # MUST cleanup even on exception
-```
-
-## Version Strategy
-
-### Local Git Tags (Not Published)
- `2.0.0-beta.1-local` - Basic server/run port
- `2.0.0-beta.2-local` - Full reasoning support
-
-### Public Release
- `2.0.0-beta.3` - First public beta (fully tested)
-
-## Gotchas for Sonnet Sessions
-
-1. **Don't forget MLX version checks**: MXFP4 requires MLX ≥0.29.0
-2. **Test with isolated caches**: Never assume user has models
-3. **Preserve 1.x CLI interface**: Same commands, same flags
-4. **Keep modular boundaries**: Core vs Operations vs Output
-5. **Test streaming separately**: Different code paths
-
-## References
-
- 1.x source: `git show main:mlx_knife/server.py`
- 1.x tests: `git show main:tests/integration/test_server_functionality.py`
- Test patterns: `tests_2.0/conftest.py` for fixtures
@@ -177,10 +177,33 @@ if 'mxfp4' in str(model_path).lower():
  - Rationale: Deterministic guard for MXFP4-class models (pragmatic workaround)
 - No tokenizer state mutation side-effects observed (callable check + exception guard)

-**Outstanding Work:**
- Portfolio discovery not yet implemented (hard-coded 3 models in test suite)
- Workaround cleanup evaluation (lines 49, 99 in `stop_tokens.py`)
- Empirical validation scope expansion (currently 3 models, aim for full cache coverage)
+**Outstanding Work:** ✅ **COMPLETED (2.0.2)**
+
+All items completed in 2.0.2 recovery plan (2025-11-14):
+
+- ✅ **Portfolio Discovery:** Implemented (2.0.1), validated (2.0.2 Phase 2)
+  - Implementation: `discover_mlx_models_in_user_cache()` in `test_stop_tokens_live.py:167`
+  - Filter: MLX + healthy + runtime_compatible + chat (via `mlxk list --json`)
+  - Validation: 17 models discovered, 15 testable, 73/81 tests passing
+  - Evidence: E2E test suite (ADR-011) with portfolio parametrization
+
+- ✅ **Workaround Evaluation:** Evaluated (2.0.2 Phase 4), kept with rationale
+  - **Workaround 1 (Line 49):** `<|end|>` special handling for Phi-3-mini
+    - Validated: 2 Phi-3 variants in portfolio (discovered_11, discovered_12)
+    - Rationale: Fixes `eos_token_id=null` bug, empirically stable
+  - **Workaround 2 (Line 98):** `reasoning_end` removal for DeepSeek-R1
+    - Validated: DeepSeek-R1-Distill-8B in portfolio (discovered_01)
+    - Rationale: Reasoning models need full output until final marker
+  - **Workaround 3 (Line 100):** `<|return|>` addition for GPT-oss
+    - Validated: gpt-oss-20b-MXFP4 in portfolio (discovered_16)
+    - Rationale: GPT-oss reasoning format requires special marker
+  - **Decision:** Keep all workarounds (0 failures, production stable, future-proof)
+
+- ✅ **Empirical Validation:** Expanded from 3 → 15 models (2.0.2 Phase 2)
+  - Portfolio: 17 discovered (mlx-community cache), 15 testable (60% RAM budget)
+  - Coverage: Phi-3, DeepSeek-R1, GPT-oss, Llama, Qwen, Mistral, Mixtral families
+  - Results: 73/81 tests passing (90.1%), 8 skipped (RAM budget), 0 failures
+  - Infrastructure: Sequential execution, active cleanup polling, no system freeze

 ### Non-Goals (Beta.6)

@@ -294,27 +317,6 @@ def test_llama_regression():

 ---

-## Implementation Plan
-
-**Priority:** CRITICAL (Issue #32 open since September)
-
-**Tasks:**
-1. ✅ Research findings documented (this ADR)
-2. ⏳ Implement real-model test suite (`test_stop_tokens_live.py`)
-3. ⏳ Baseline measurement (document current behavior with all 3 models)
-4. ⏳ Apply 2-LOC fix (runner/__init__.py:468, 589)
-5. ⏳ Re-test & evaluate (compare before/after behavior)
-6. ⏳ Conditional: Implement `add_eos_token()` integration (ONLY if tests fail)
-7. ⏳ Conditional: Remove obsolete workarounds (ONLY if tests pass without them)
-8. ⏳ Update TESTING.md + CHANGELOG.md + close Issue #32
-
-**Estimated Effort:** 2-3 sessions (test suite implementation is non-trivial)
-**Blocker for:** 2.0.0 stable release
-
-**Key Decision Gate:** Step 5 → Step 6 (empirical testing determines if `add_eos_token()` is needed)
-
---
-
 ## References

 ### mlx-lm APIs Used
@@ -1,9 +1,9 @@
 # ADR-011: E2E Live Test Architecture

-**Status:** Proposed (Planned for Post-Beta.6 / Stable 2.0)
-**Date:** 2025-10-21
+**Status:** ✅ IMPLEMENTED (2.0.1+)
+**Date:** 2025-10-21 (Updated: 2025-11-12)
 **Supersedes:** 1.1.1 `test_end_token_issue.py` comprehensive testing
-**Affects:** Test Suite (Stable 2.0)
+**Affects:** Test Suite (Stable 2.0.1+)
 **Related:** ADR-009 (Stop Token Detection - provides Portfolio Discovery infrastructure)

 ---
@@ -125,41 +125,125 @@ def test_run_command_portfolio():
 **Timeline:** Post-Beta.6, before Stable release

 **Tasks:**
-1. ⏳ **Implement ADR-009 Portfolio Discovery** (prerequisite for E2E tests)
+1. ✅ **Implement ADR-009 Portfolio Discovery** (prerequisite for E2E tests)
   - `discover_mlx_models_in_cache()` helper
   - RAM gating logic (`should_skip_model()`)
-2. ⏳ **Server E2E Tests** (`test_server_e2e.py`)
+2. ✅ **Server E2E Tests** (`test_server_e2e.py`)
   - HTTP API validation
   - SSE streaming format
-3. ⏳ **Streaming Parity Tests** (`test_streaming_parity.py`)
+3. ✅ **Streaming Parity Tests** (`test_streaming_parity.py`)
   - Issue #20 regression protection
-4. ⏳ **CLI Integration Tests** (`test_cli_e2e.py`)
+4. ✅ **CLI Integration Tests** (`test_cli_e2e.py`)
   - `mlxk run` validation
   - Exit codes, error messages
-5. ⏳ **Documentation Updates**
+5. ✅ **Documentation Updates**
   - TESTING.md: E2E test coverage section

 ---

-## Implementation Status (2025-10-21)
+## Implementation Status

-**Status: NOT STARTED**
+**Status: ✅ COMPLETED** (2025-11-13)

-All tasks above are pending. This ADR documents the **planned architecture** for E2E tests.
+E2E test suite implemented and validated with 17 real MLX chat models.

-**Current Reality:**
- No E2E test suite exists (`tests_2.0/live/test_server_e2e.py` etc. not created)
- Portfolio discovery not implemented (hard-coded 3 models in `test_stop_tokens_live.py:174`)
- ADR-009 provides **test plan** for portfolio discovery, but implementation deferred
+### Current Results

-**Blocker:**
- Requires Portfolio Discovery implementation (ADR-009 Step 1, currently incomplete)
+**Command:** `HF_HOME=/path/to/cache pytest -m live_e2e -v`

-**Next Steps:**
-1. Complete ADR-009 Portfolio Discovery (Beta.6 scope)
-2. Implement E2E test suite (Post-Beta.6, pre-Stable 2.0)
+- ✅ **72/80 tests passing** (Portfolio: 17 models discovered, 15 testable, 2 RAM-skipped, 8 total skipped)
+- ✅ Server E2E: 35 tests (health, models, chat completions, completions - batch + streaming)
+- ✅ CLI Integration: 30 tests (text + JSON output, exit codes, stop token filtering)
+- ✅ Streaming Parity: 6 tests (Issue #20 protection)
+- ✅ Exit Code Tests: 2 tests (Issue #38 validation)
+- ⏱️ Duration: ~6-7 minutes

-**Estimated Effort:** 2-3 sessions (reuses ADR-009 infrastructure)
+### Key Fixes Applied
+
+For detailed bug analysis and fixes, see CHANGELOG.md 2.0.2 section. Summary:
+
+1. **Stop token ordering bug** (production bug) - Both `generate_batch()` and `generate_streaming()` now filter by earliest position in text
+2. **Test temperature flakiness** (test fix) - E2E tests use `temperature=0.0` for deterministic results
+3. **Portfolio Discovery collection** (test fix) - Marker check before discovery (keeps default `pytest` fast)
+4. **SSE sentinel validation** (test fix) - Explicit `[DONE]` check prevents client hangs
+5. **CLI subprocess args** (test fix) - Positional argument instead of `--prompt` flag
+6. **MXFP4 reasoning parity** (documented) - Removed from parity tests (ADR-010 known issue)
+
+### Quality Infrastructure
+
+- **Verbose Mode:** `mlxk run --verbose` shows token generation details including multiple EOS token warnings
+- **Quality Database:** Known Model Quality Issues tracked in TESTING-DETAILS.md
+- **Philosophy:** No hidden workarounds - broken models fail tests and are documented
+- **Note:** Initial `MLXK2_DEBUG_TOKENS` E2E test support removed (caused false positives matching metadata)
+
+---
+
+## Implementation Status (2025-11-12)
+
+**Status: ✅ COMPLETED - E2E Test Suite Refactored & Validated**
+
+E2E test suite successfully refactored with production-grade architecture. Validated with 17 real MLX models (15 passed, 2 skipped).
+
+### Refactoring Summary
+
+**1. Portfolio Discovery Refactored** (~70 LOC eliminated)
+- **Before:** Duplicated `mlxk list` logic (cache scanning, build_model_object, filtering)
+- **After:** Uses `mlxk list --json` via subprocess (production command)
+- **Location:** `tests_2.0/test_stop_tokens_live.py` Lines 167-234
+- **Benefit:** Tests use production code, automatically benefit from fixes
+
+**2. Test Architecture Fixed** (5 files refactored)
+- **Before:** `for model in portfolio:` loops → Server RAM leaks → System freeze
+- **After:** `@pytest.mark.parametrize` → One server per test → Clean lifecycle
+- **Files Modified:**
+  - `tests_2.0/live/server_context.py` - Timeout 5s→30s
+  - `tests_2.0/live/conftest.py` - pytest_generate_tests hook + model_info fixture
+  - `tests_2.0/live/test_server_e2e.py` - 2 tests parametrized
+  - `tests_2.0/live/test_streaming_parity.py` - 2 tests parametrized
+  - `tests_2.0/live/test_cli_e2e.py` - 2 tests parametrized
+
+**3. Marker-Required Fixture Added**
+- **File:** `tests_2.0/live/conftest.py` - Autouse fixture `_skip_unless_live_e2e_marker`
+- **Effect:** E2E tests skipped in default `pytest -v` run (marker-required 🔒)
+- **Test counts:** 306 passed, 46 skipped (26 E2E tests auto-skipped)
+
+### Validation Results
+
+**Test Execution:** `TestChatCompletionsBatch` with 17 discovered MLX models
+- ✅ **15 passed, 2 skipped** in 82.75s
+- ✅ No system freeze, clean RAM cleanup
+- ✅ Models tested: Qwen2.5 0.5B, Llama 3.2 3B, Mistral 7B, Mixtral 8x7B, etc.
+
+### Performance Comparison
+
+| Metric | OLD (Broken) | NEW (Fixed) |
+|--------|--------------|-------------|
+| Architecture | Loop-based | Parametrized |
+| Server timeout | 5s → SIGKILL | 30s → clean |
+| Test isolation | RAM leaks | Clean per test |
+| Discovery | 70 LOC duplicated | `mlxk list --json` |
+| Test success | 6/14 → freeze | **15/17 → success** |
+| Code reduction | — | ~500 LOC removed |
+
+### Infrastructure Delivered
+
+- ✅ `tests_2.0/live/server_context.py` - LocalServer with 30s timeout
+- ✅ `tests_2.0/live/sse_parser.py` - SSE parser utilities
+- ✅ `tests_2.0/live/conftest.py` - pytest_generate_tests + marker-required fixture
+- ✅ `tests_2.0/live/test_utils.py` - Utility functions
+- ✅ `tests_2.0/live/test_server_e2e.py` - Server E2E (parametrized)
+- ✅ `tests_2.0/live/test_streaming_parity.py` - Streaming parity (parametrized)
+- ✅ `tests_2.0/live/test_cli_e2e.py` - CLI integration (parametrized)
+- ✅ Marker: `live_e2e` added to pytest.ini
+- ✅ Documentation: TESTING-DETAILS.md updated
+
+### Reused Infrastructure
+
+- ✅ Portfolio Discovery from ADR-009 (refactored to `mlxk list --json`)
+- ✅ RAM gating logic
+- ✅ MLX modules fixture
+
+**Total Effort:** ~4 hours (refactoring + debugging + validation)

 ---

@@ -170,9 +254,13 @@ All tasks above are pending. This ADR documents the **planned architecture** for
 tests_2.0/
 ├── test_stop_tokens_live.py       # ADR-009: Runner stop tokens + portfolio
 ├── live/
-│   ├── test_server_e2e.py         # ADR-011: Server/HTTP
-│   ├── test_streaming_parity.py   # ADR-011: Issue #20
-│   └── test_cli_e2e.py            # ADR-011: CLI
+│   ├── conftest.py                # Shared fixtures + pytest_generate_tests
+│   ├── server_context.py          # LocalServer context manager (30s timeout)
+│   ├── sse_parser.py              # SSE parsing utilities
+│   ├── test_utils.py              # Utility functions
+│   ├── test_server_e2e.py         # ADR-011: Server/HTTP (parametrized)
+│   ├── test_streaming_parity.py   # ADR-011: Issue #20 (parametrized)
+│   └── test_cli_e2e.py            # ADR-011: CLI (parametrized)
 ```

 **Markers:**
@@ -201,8 +289,9 @@ pytest                      # Unit tests (skips all live)
 - ✅ Reusable infrastructure from ADR-009

 ### Negative
- ⚠️ Portfolio tests may take 10-30 minutes (10-50 models)
+- ⚠️ Portfolio tests take ~80s for 17 models (marker-required to avoid slowing default suite)
 - ⚠️ Maintenance overhead if Server API changes
+- ⚠️ Requires 30s timeout per test (longer than typical unit tests)

 ### Trade-offs

@@ -246,4 +335,8 @@ pytest                      # Unit tests (skips all live)
 pytest -m live_e2e -v  # All tests pass or skip gracefully
 ```

-No failures - only passes or skips (RAM/availability).
+**✅ ACHIEVED (2025-11-12):**
+- 15/17 models tested successfully (2 skipped due to RAM)
+- No test failures, only passes and graceful skips
+- System stability validated (no freeze, clean RAM cleanup)
+- Production-grade architecture (parametrized tests, 30s timeout)
@@ -0,0 +1,65 @@
+# ADR-012 — Vision / Multimodal Support Roadmap (Draft)
+
+- **Status:** Draft (internal evaluation)
+- **Authors:** mlx-knife maintainers
+- **Date:** 2024-11-11
+
+## Context
+
+Community feedback (e.g., Reddit comments comparing `mlx_lm.server + llama swap`) highlights the lack of multimodal support in MLX-Knife 2.0. The 2.0 CLI deliberately focused on LLM lifecycle management (cache, health, server, JSON automation). Before publicly committing to a multimodal feature set we need to document what “vision support” actually means for mlx-knife and how it fits with existing abstractions.
+
+## Goals
+
+1. Deliver an alpha-level `mlxk run` flow that accepts still images (one-shot prompt or chat) without breaking current UX patterns.
+2. Extend JSON API schemas so callers can submit/receive multimodal payloads predictably once the runner core is solid.
+3. Keep HF cache semantics identical to text-only models (no bespoke asset folders) and rely on upstream `mlx_lm` vision support.
+4. Preserve CLI-first ergonomics—no GUI dependencies; the sample Web client only mirrors server capabilities.
+
+## Proposed Phases
+
+### Phase 0 — Research & Scoping
+- Inventory MLX models on Hugging Face that expose vision/multimodal configs.
+- Validate mlx-lm backend hooks for image preprocessing (tokenizer, pixel transforms).
+- Capture UX expectations from comparable tools (Ollama `/api/chat` with `images`, OpenAI `responses` API).
+
+### Phase 1 — `mlxk run` Alpha (highest priority)
+- Implement `--image <path>` (repeatable) for `mlxk run` one-shot prompts. Optional text input keeps behaving exactly like today.
+- Inject a default system prompt (e.g., “You are a vision assistant. Describe the image.”) when the user supplies only `--image` so models never silently no-op.
+- Define chat semantics: the first turn may include both the image and text; follow-up turns stay text-only but retain the original image context.
+- Allow pure image invocations to respond immediately (default description) instead of forcing batch mode, unless other batch flags are set.
+
+### Phase 2 — Runner Hardening & Metadata
+- Wire preprocessing (resize/normalize) into `mlxk2/operations/run.py` in a model-agnostic way so upgrades track upstream `mlx_lm`.
+- Emit capability metadata (`capabilities: ["text","vision-understanding"]`) in `mlxk list --json` and `mlxk show` so automation can distinguish “image understanding” from text-only or potential future text→image cases.
+- Update health checks to verify auxiliary assets (tokenizers, image processors) are present in the HF snapshot.
+- Extend `docs/json-api-specification.md` with alpha notes for `inputs.images[]`, the “image-on-first-turn” contract, and explicit output modality indicators (text vs future image generation).
+
+### Phase 3 — Server & Web Hooks
+- Teach `mlxk serve` the OpenAI-style `content` format (`[{type:"input_text"}, {type:"input_image", image_url:{...}}]`) so off-the-shelf OpenAI clients and connectors can upload base64 images.
+- Introduce optional text-file uploads first (low-risk rehearsal) to validate multipart handling, temp storage, and streaming responses before turning on images.
+- Update the sample Web client to support drag-and-drop text files (and later, images) without adding GUI dependencies to the CLI core.
+
+### Phase 4 — Tooling & Docs
+- Document workflows in README (new “Vision Models” section) and add regression tests under `tests_2.0/vision/`.
+- Provide sample HF models + scripted downloads for contributors.
+- Evaluate need for GPU/ANE configuration toggles specific to large pixel encoders.
+
+## Non-Goals
+
+- Shipping a GUI or image annotation tool.
+- Supporting arbitrary video/audio inputs in the same release (separate ADR if needed).
+
+## Open Questions
+
+1. Do we gate vision support behind `MLXK2_ENABLE_ALPHA_FEATURES`, or carve out a dedicated `MLXK2_ENABLE_ALPHA_VISION=1` gate so releases can graduate independently?
+2. How do we represent mixed-mode outputs (e.g., returning base64 images) in the JSON schema without breaking OpenAI compatibility?
+3. Should we expose preprocessing hooks so advanced users can override transforms, or keep the runner opinionated?
+4. When (if ever) do we allow mid-chat image updates, or do we enforce “image only on first turn” permanently?
+
+## Next Steps
+
+1. Complete Phase 0 (catalog MLX vision-ready models, verify `mlx_lm` support status).
+2. Implement Phase 1 + 2 behind `MLXK2_ENABLE_ALPHA_VISION` and cut a 2.0.3 alpha release focused on end-user feedback.
+3. Collect real-world reports (CLI-only) to validate preprocessing, default prompts, and capability metadata.
+4. Schedule Phase 3 (server + web hooks) as a 2.0.4/2.0.5 beta milestone once alpha feedback is digested.
+5. Promote to 2.1 stable after Phase 4 (docs/tests) and when the env flag can be removed confidently.
@@ -16,9 +16,11 @@ This directory contains Architecture Decision Records (ADRs) that document signi
 | [ADR-006](ADR-006-Clone-Implementation-Revised.md) | Clone Implementation Revised | Superseded by ADR-007 | 2025-09-18 |
 | [ADR-007](ADR-007-Clone-Implementation-Fixed.md) | Clone Implementation Fixed Strategy | Accepted | 2025-09-18 |
 | [ADR-008](ADR-008-MLXModel-Package-Format.md) | MLXModel Package Format | Accepted | 2025-10-17 |
-| [ADR-009](ADR-009-Stop-Token-Detection-Fix.md) | Stop Token Detection Fix | Accepted | 2025-10-21 |
+| [ADR-009](ADR-009-Stop-Token-Detection-Fix.md) | Stop Token Detection Fix | Implemented | 2025-10-21 |
 | [ADR-010](ADR-010-Reasoning-Content-API.md) | Reasoning Content API | Draft | 2025-10-21 |
-| [ADR-011](ADR-011-E2E-Live-Test-Architecture.md) | E2E Live Test Architecture | Accepted | 2025-10-21 |
+| [ADR-011](ADR-011-E2E-Live-Test-Architecture.md) | E2E Live Test Architecture | Implemented | 2025-10-21 |
+| [ADR-012](ADR-012-Vision-Support-Roadmap.md) | Vision Support Roadmap | Draft | 2025-11-12 |
+| [ADR-013](ADR-013-Community-Model-Quality-Database.md) | Community Model Quality Database | Planned | 2025-11-13 |

 ## ADR Format

@@ -1,213 +0,0 @@
-# MLX-Knife 2.0 Versioning Strategy
-
-**Document Status:** Living (pre-release)  
-**Purpose:** Principles for versioning and deployment of MLX‑Knife 2.0 until stable
-
-## Versioning Schema
-
-### **2.0.0-alpha** (Feature-Complete for JSON-Only)
-**Scope:** Core JSON operations without server/run functionality
-
-**Features:**
- ✅ All 5 Operations: `list`, `health`, `show`, `pull`, `rm`
- ✅ JSON API fully implemented per specification
- ✅ Core functionality working (broke-cluster compatible)
- ⚠️ Experimental features MAY be present; they MUST be clearly labeled "experimental", safe by default, and must not break existing behavior
- ❌ Pre-release level testing
- ❌ No `server` or `run` commands
-
-**Quality Gate (alpha):**
- Core operations functional in isolation
- JSON schema stable and documented
- Basic edge case handling
-
-**Target Users:**
- Broke-cluster integration (POC environment)
- Early adopters for JSON automation
- Parallel deployment alongside 1.x
-
-### **2.0.0-beta** (Feature-Complete with Server/Run)
-**Scope:** All alpha features PLUS server/run functionality from 1.x
-
-**New in Beta:**
- ✅ `server` command - OpenAI-compatible API from 1.x
- ✅ `run` command - Interactive model execution from 1.x  
- ✅ Reasoning model support (GPT-OSS/MXFP4)
- ✅ Human output backend (already in alpha.3)
- ✅ **100% test coverage** including server/run tests
-
-**Version Strategy:**
- `2.0.0-beta.1-local` - Initial server/run port (git tag only)
- `2.0.0-beta.2-local` - Full reasoning support (git tag only)
- `2.0.0-beta.3` - First public beta release (PyPI)
-
-**Quality Gate:**
- Feature parity with 1.1.1-beta.3
- All server/run tests passing
- Reasoning models working
- Documentation complete
-
-**Target Users:**
- Internal testing and validation
- Beta.3: Public beta testers
- Full MLX-Knife functionality seekers
-
-### **2.0.0-rc** (Feature-Complete vs 1.x)
-**Scope:** Full feature parity with MLX-Knife 1.x
-
-**New Features:**
- ✅ `server` command - OpenAI-compatible API server
- ✅ `run` command - Interactive model execution
- ✅ `embed` command - Embedding generation (if merged from 1.x)
- ✅ Human-readable output via CLI layer formatting
-
-**Quality Gate:**
- All 1.x functionality replicated
- Migration path documented
- Performance parity or better
- Server functionality validated
-
-**Target Users:**
- Full 1.x replacement candidates
- Users requiring both JSON and human output
- Server-mode applications
-
-### **2.0.0-stable**
-**Scope:** Production-ready replacement for MLX-Knife 1.x
-
-**Requirements:**
- ✅ All RC features stable and documented
- ✅ Migration guide with examples
- ✅ Community feedback incorporated
- ✅ Long-term support commitment
- ✅ Package management (pip/brew) ready
-
-**Target Users:**
- All MLX-Knife users
- General availability deployment
-
-## Deployment Strategy
-
-### Broke-Cluster POC Environment
-
-**Parallel Deployment Architecture:**
-```bash
-# System-wide: MLX-Knife 1.1.0 (stable server functionality)
-pip install mlx-knife==1.1.0
-
-# Local development: MLX-Knife 2.0.0-alpha (JSON management)
-pip install -e /path/to/mlx-knife  # Local install (current 2.0 feature branch)
-```
-
-**Usage Pattern:**
-```bash
-# Server operations: Use 1.x (stable, proven)
-mlxk server --model "Phi-3-mini" --port 8000
-
-# Management operations: Use 2.0.0-alpha (JSON automation)
-mlxk-json list --json | jq '.data.models[].name'
-mlxk-json health --json | jq '.data.summary'
-mlxk-json pull "new-model" --json
-```
-
-**Benefits:**
- ✅ **Risk mitigation**: Server stability maintained with 1.x
- ✅ **Feature validation**: JSON API tested in production environment  
- ✅ **Gradual migration**: Teams can adopt 2.0 features incrementally
- ✅ **Rollback safety**: Can disable 2.0 without affecting server operations
-
-### Package Naming Strategy
-
-**Development Phase:**
- `mlx-knife` (1.1.0) - Stable production version
- `mlxk2` / `mlxk-json` - Development 2.0.0-alpha local install (single long‑lived 2.0 branch; releases via annotated tags)
-
-**Production Phase:**
- `mlx-knife` (2.0.0+) - New major version
- `mlx-knife-v1` (1.1.0) - Legacy support if needed
-
-## Quality Gates Summary
-
-| Version | Test Coverage | Features | Server Mode | Production Ready |
-|---------|---------------|----------|-------------|------------------|
-| **alpha** | ~70% (mock issues) | JSON-only (5 ops) | ❌ | Limited |
-| **beta** | 100% | JSON-only (5 ops) | ❌ | Yes (JSON) |
-| **rc** | 100% | Full parity | ✅ | Yes (All) |
-| **stable** | 100% + community | Full parity | ✅ | Yes (LTS) |
-
-## Success Metrics
-
-### Alpha Success Criteria
- [ ] Broke-cluster integration working
- [ ] Core JSON operations stable
- [ ] No user cache corruption in testing
- [ ] JSON schema documentation complete
-
-### Beta Success Criteria  
- [ ] 100% test pass rate
- [ ] Performance benchmarks established
- [ ] All ADR-002 edge cases handled
- [ ] Production deployment successful
-
-### RC Success Criteria
- [ ] Feature parity with 1.x achieved
- [ ] Migration guide validated
- [ ] Server mode performance acceptable
- [ ] Community feedback positive
-
-### Stable Success Criteria
- [ ] 6+ months beta stability
- [ ] Multiple production deployments
- [ ] Documentation comprehensive
- [ ] Long-term support plan
-
-## Timeline Estimates
-
-**Current Status:** Active alpha cycle with tagged pre‑releases
-  - JSON CLI stable for broke‑cluster use
-
-**Projected Milestones:**
- **2.0.0-alpha**: rolling alphas (tagged), experimental features allowed but clearly marked and safe by default
- **2.0.0-beta**: 4-6 weeks (robust testing)
- **2.0.0-rc**: 8-12 weeks (server/run implementation)  
- **2.0.0-stable**: 16-20 weeks (community validation)
-
-## Risk Mitigation
-
-### HuggingFace Cache Compatibility (CRITICAL)
-
-**Apple MLX Team & HuggingFace Hub Integration:**
- **~20+ MLX ecosystem users** depend on cache stability
- **HuggingFace Hub attention** - changes monitored by upstream
- **Cache structure**: MLX-Knife follows HuggingFace standards
-
-**Cache Safety Guidelines:**
-```markdown
-### Shared Cache Environment Best Practices
- **Read operations** (`list`, `health`, `show`): Always safe with concurrent processes
- **Write operations** (`pull`, `rm`): Coordinate with team during maintenance windows
- **Lock cleanup**: Automatic in MLX-Knife, avoid during active HuggingFace downloads
- **User responsibility**: Coordinate cache access, no special flags needed
-```
-
-### Parallel Deployment Risks
- **Configuration conflicts**: Different cache paths, environment variables
- **User confusion**: Clear naming and documentation required
- **Maintenance burden**: Supporting two codebases temporarily
-
-### Mitigation Strategies
- **Clear separation**: Different package names, installation paths
- **Comprehensive docs**: Usage examples, best practices, cache guidelines
- **Automated testing**: Both versions in CI/CD pipeline
- **Community support**: Active communication about roadmap
-
-## Decision Authority
-
-**Architecture Decisions:** Development team consensus required
-**Version Releases:** Lead maintainer approval + community review
-**Breaking Changes:** Major version bump + migration period
-**Support Policy:** LTS for stable versions, best-effort for pre-release
-
---
-
-This versioning strategy provides a clear path from current alpha-quality code to production-ready 2.0.0 while maintaining stability through parallel deployment with 1.x versions.
@@ -7,4 +7,4 @@ import warnings
 # Issue parity with 1.1.0 (Issue #22)
 warnings.filterwarnings('ignore', message='urllib3 v2 only supports OpenSSL 1.1.1+')

-__version__ = "2.0.1"
+__version__ = "2.0.2"
@@ -437,11 +437,22 @@ class MLXRunner:
                stop_tokens_to_check = [t for t in stop_tokens_to_check if isinstance(t, str) and t]
                if use_chat_stop_tokens:
                    stop_tokens_to_check.extend(self._chat_stop_tokens)
-                
-                for stop_token in stop_tokens_to_check:
-                    if stop_token in accumulated_response:
-                        stop_pos = accumulated_response.find(stop_token)
-                        text_before_stop = accumulated_response[:stop_pos]
+
+                # Find earliest stop token in accumulated response (ADR-011: multiple EOS token handling)
+                if stop_tokens_to_check:
+                    earliest_pos = len(accumulated_response)
+                    earliest_token = None
+
+                    for stop_token in stop_tokens_to_check:
+                        if stop_token in accumulated_response:
+                            pos = accumulated_response.find(stop_token)
+                            if pos < earliest_pos:
+                                earliest_pos = pos
+                                earliest_token = stop_token
+
+                    if earliest_token:
+                        # Found stop token - yield remaining text before it and stop
+                        text_before_stop = accumulated_response[:earliest_pos]
                        previously_yielded_length = len(accumulated_response) - len(new_text)
                        if len(text_before_stop) > previously_yielded_length:
                            new_part_before_stop = text_before_stop[previously_yielded_length:]
@@ -593,6 +604,26 @@ class MLXRunner:
        # Decode full response
        full_response = self.tokenizer.decode(all_tokens)

+        # Debug: Show raw generated tokens for quality analysis (enabled via --verbose)
+        if self.verbose:
+            print("\n[DEBUG] Token generation analysis:")
+            print(f"[DEBUG]   Generated {len(generated_tokens)} tokens")
+            if len(generated_tokens) >= 3:
+                last_3_ids = generated_tokens[-3:]
+                last_3_decoded = []
+                for tid in last_3_ids:
+                    try:
+                        decoded = self.tokenizer.decode([tid])
+                        last_3_decoded.append(f"{tid}={decoded!r}")
+                    except Exception:
+                        last_3_decoded.append(f"{tid}=<error>")
+                print(f"[DEBUG]   Last 3 tokens: {last_3_decoded}")
+
+                # Check for multiple EOS tokens (quality issue indicator)
+                eos_count = sum(1 for tid in last_3_ids if tid in self.tokenizer.eos_token_ids)
+                if eos_count > 1:
+                    print(f"[DEBUG]   ⚠️ WARNING: Multiple EOS tokens detected ({eos_count}) - model quality issue")
+
        # Remove prompt part (guard types to tolerate mocks)
        if isinstance(full_response, str) and isinstance(formatted_prompt, str) and full_response.startswith(formatted_prompt):
            response = full_response[len(formatted_prompt):]
@@ -601,18 +632,33 @@ class MLXRunner:
            response = decoded if isinstance(decoded, str) else str(decoded)

        # Filter stop tokens (strings only)
+        # Find the EARLIEST stop token in the response (not first in list)
        if self._stop_tokens:
-            for stop_token in [t for t in self._stop_tokens if isinstance(t, str) and t]:
-                if stop_token and stop_token in response:
-                    response = response[:response.find(stop_token)]
-                    break
+            stop_tokens_filtered = [t for t in self._stop_tokens if isinstance(t, str) and t]
+            earliest_pos = len(response)
+            earliest_token = None
+
+            for stop_token in stop_tokens_filtered:
+                if stop_token in response:
+                    pos = response.find(stop_token)
+                    if pos < earliest_pos:
+                        earliest_pos = pos
+                        earliest_token = stop_token
+
+            if earliest_token:
+                response = response[:earliest_pos]

        # Optionally filter chat stop tokens to prevent self-conversations in batch mode
+        # Find the EARLIEST chat stop token (same logic as above)
        if use_chat_stop_tokens and self._chat_stop_tokens:
+            earliest_pos = len(response)
            for stop_token in self._chat_stop_tokens:
                if stop_token and stop_token in response:
-                    response = response[:response.find(stop_token)]
-                    break
+                    pos = response.find(stop_token)
+                    if pos < earliest_pos:
+                        earliest_pos = pos
+            if earliest_pos < len(response):
+                response = response[:earliest_pos]

        # Format reasoning models output
        response = self._format_reasoning_response(response)
@@ -47,6 +47,8 @@ def extract_stop_tokens(tokenizer: Any, verbose: bool = False) -> StopTokenInfo:
                if isinstance(token_content, str) and token_content:
                    token_lower = token_content.lower()
                    if token_content == '<|end|>':
+                        # Always add <|end|> as stop token (fixes Phi-3-mini with eos_token_id=null)
+                        stop_tokens.add(token_content)
                        add_eos_token = getattr(tokenizer, 'add_eos_token', None)
                        if callable(add_eos_token):
                            try:
@@ -11,6 +11,8 @@ markers =
    live_clone: Alias for wet; clone live tests (require env, ADR-007 Phase 1)
    live_run: Opt-in run command tests with real models (require user cache model)
    live_stop_tokens: Opt-in stop token tests with real models (Issue #32, ADR-009)
+    live_e2e: E2E tests with real models and server (require HF_HOME, ADR-011)
+    show_model_portfolio: Display E2E test portfolio (convenience, no testing)
    issue27: Real-model health policy tests (opt-in; read-only user cache)
    slow: Tests that take >1 minute to run
 filterwarnings =
@@ -0,0 +1 @@
+"""Live E2E tests package (ADR-011)."""
@@ -0,0 +1,158 @@
+"""Shared fixtures for live E2E tests (ADR-011).
+
+This conftest.py provides pytest fixtures for the live/ test package.
+For utility functions and constants, see test_utils.py.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+import pytest
+
+# Prevent tokenizer fork warnings and potential deadlocks
+# See: https://github.com/huggingface/tokenizers/issues/1047
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+from pathlib import Path
+from typing import Dict, Any
+
+# Import utilities from test_utils
+from .test_utils import (
+    discover_mlx_models_in_user_cache,
+    TEST_MODELS,
+)
+
+# Import the real MLX modules fixture from parent test module
+# This is needed for tests that use MLXRunner directly (e.g., streaming parity)
+# The fixture is already decorated with @pytest.fixture in test_stop_tokens_live.py
+# We just import and re-export it here so it's available to tests in this package
+_parent_dir = Path(__file__).parent.parent
+sys.path.insert(0, str(_parent_dir))
+try:
+    from test_stop_tokens_live import _use_real_mlx_modules
+finally:
+    sys.path.remove(str(_parent_dir))
+
+# The imported fixture is now available to all tests in this package
+
+
+@pytest.fixture(scope="function", autouse=True)
+def _skip_unless_live_e2e_marker(request):
+    """Auto-skip E2E tests unless -m live_e2e is explicitly used.
+
+    E2E tests are marker-required (🔒) - they require real models and httpx.
+    This fixture ensures they are skipped in the default pytest run.
+
+    Exception: show_model_portfolio marker is allowed (convenience diagnostics).
+    """
+    # Check if test has live_e2e marker
+    if request.node.get_closest_marker("live_e2e"):
+        # Check if -m live_e2e or -m show_model_portfolio was specified
+        selected_markers = request.config.getoption("-m") or ""
+        if "live_e2e" not in selected_markers and "show_model_portfolio" not in selected_markers:
+            pytest.skip("Run with -m live_e2e to enable E2E tests with real models")
+
+
+def pytest_generate_tests(metafunc):
+    """Generate parametrized tests for model_key parameter.
+
+    If a test function has 'model_key' in its signature, this hook
+    automatically parametrizes it over all models in the portfolio.
+    This replaces the old loop-based approach (which caused RAM leaks)
+    with pytest-native parametrization for proper test isolation.
+
+    Each parametrized test gets its own server instance lifecycle,
+    preventing accumulated RAM leaks from improper cleanup.
+
+    IMPORTANT: This hook runs during COLLECTION phase. We check for
+    live_e2e marker BEFORE doing portfolio discovery to avoid slow
+    collection when marker is not requested (maintains marker-required 🔒).
+    """
+    if "model_key" in metafunc.fixturenames:
+        # Check if live_e2e marker is requested (COLLECTION-TIME check)
+        selected_markers = metafunc.config.getoption("-m") or ""
+        if "live_e2e" not in selected_markers:
+            # Parametrize with dummy value to allow collection
+            # Tests will be skipped by _skip_unless_live_e2e_marker fixture
+            # This prevents "fixture 'model_key' not found" errors
+            metafunc.parametrize("model_key", ["_skipped"])
+            return
+
+        # Portfolio Discovery at collection time (uses subprocess mlxk list)
+        discovered = discover_mlx_models_in_user_cache()
+
+        if discovered:
+            # Use discovered models - generate keys matching portfolio_models fixture
+            model_keys = [f"discovered_{i:02d}" for i in range(len(discovered))]
+        else:
+            # Fallback to hardcoded test models
+            model_keys = list(TEST_MODELS.keys())
+
+        # Parametrize the test over all model keys
+        metafunc.parametrize("model_key", model_keys)
+
+
+@pytest.fixture(scope="module")
+def portfolio_models():
+    """Dynamic model portfolio: discovered models OR hardcoded fallback.
+
+    Reuses Portfolio Discovery from ADR-009 (test_stop_tokens_live.py).
+    Enables portfolio testing when HF_HOME is set, falls back to
+    3 hardcoded test models otherwise (backward compatibility).
+
+    Returns:
+        Dict[str, Dict[str, Any]]: Model portfolio keyed by model_key
+            {
+                "discovered_00": {
+                    "id": "mlx-community/Llama-3.2-3B-Instruct-4bit",
+                    "ram_needed_gb": 4.0,
+                    "expected_issue": None,
+                    "description": "Discovered: ..."
+                },
+                ...
+            }
+    """
+    discovered = discover_mlx_models_in_user_cache()
+
+    if discovered:
+        # Convert discovered models to TEST_MODELS format
+        result = {}
+        for i, model in enumerate(discovered):
+            key = f"discovered_{i:02d}"
+            result[key] = {
+                "id": model["model_id"],
+                "ram_needed_gb": model["ram_needed_gb"],
+                "expected_issue": None,  # Unknown for discovered models
+                "description": f"Discovered: {model['model_id']} ({model['weight_count']} weights)"
+            }
+
+        print(f"\n🔍 Portfolio Discovery: Found {len(result)} MLX models in cache")
+        return result
+    else:
+        # Fallback to hardcoded test models
+        print(f"\n📋 Using hardcoded TEST_MODELS (3 models)")
+        return TEST_MODELS
+
+
+@pytest.fixture
+def model_info(portfolio_models, model_key):
+    """Get model info for the current parametrized model_key.
+
+    This fixture provides convenient access to model metadata in
+    parametrized tests. It automatically looks up the model_key
+    in the portfolio and returns the model info dict.
+
+    Usage:
+        def test_something(model_info):
+            model_id = model_info["id"]
+            ram_needed = model_info["ram_needed_gb"]
+            ...
+
+    Returns:
+        Dict[str, Any]: Model metadata with keys:
+            - id: Model ID (e.g., "mlx-community/Llama-3.2-3B-Instruct-4bit")
+            - ram_needed_gb: Estimated RAM requirement
+            - expected_issue: Known issue or None
+            - description: Human-readable description
+    """
+    return portfolio_models[model_key]
@@ -0,0 +1,143 @@
+"""LocalServer context manager for E2E testing (ADR-011).
+
+Provides a clean subprocess-based server lifecycle for testing:
+- Starts server with pre-loaded model
+- Waits for health check before yielding
+- Ensures graceful cleanup on exit
+"""
+
+from __future__ import annotations
+
+import gc
+import sys
+import time
+import subprocess
+from contextlib import contextmanager
+from typing import Optional
+
+try:
+    import httpx
+except ImportError:
+    httpx = None  # Will fail at test time with clear error
+
+# Optional: RAM monitoring for debugging (requires psutil)
+# Uncomment to enable RAM logging during test runs
+# try:
+#     import psutil
+#     def log_ram_status(stage: str) -> None:
+#         """Log current RAM status (non-blocking)."""
+#         mem = psutil.virtual_memory()
+#         free_gb = mem.available / (1024**3)
+#         total_gb = mem.total / (1024**3)
+#         print(f"[RAM-{stage}] Free: {free_gb:.1f}GB / {total_gb:.1f}GB ({mem.percent:.1f}% used)")
+# except ImportError:
+#     def log_ram_status(stage: str) -> None:
+#         pass  # psutil not installed, skip logging
+
+
+@contextmanager
+def LocalServer(
+    model: str,
+    port: int = 8765,
+    timeout: int = 60,
+    log_level: str = "warning"
+):
+    """Start a local mlx-knife server for E2E testing.
+
+    Context manager that:
+    1. Launches server subprocess with pre-loaded model
+    2. Waits for /health endpoint to respond (up to timeout)
+    3. Yields server URL for testing
+    4. Ensures graceful shutdown (SIGTERM → SIGKILL fallback)
+
+    Args:
+        model: Model ID to pre-load (e.g., "mlx-community/Llama-3.2-3B-Instruct-4bit")
+        port: Server port (default 8765, non-standard to avoid conflicts)
+        timeout: Startup timeout in seconds (default 60s for model loading)
+        log_level: Server log level (default "warning" to reduce noise)
+
+    Yields:
+        server_url: str like "http://127.0.0.1:8765"
+
+    Raises:
+        TimeoutError: If server fails to start within timeout
+        RuntimeError: If httpx not installed
+
+    Example:
+        >>> with LocalServer("mlx-community/Llama-3.2-3B-Instruct-4bit") as url:
+        ...     response = httpx.post(f"{url}/v1/chat/completions", json={...})
+        ...     assert response.status_code == 200
+    """
+    if httpx is None:
+        raise RuntimeError("httpx required for E2E tests (pip install httpx)")
+
+    # Start server subprocess
+    proc = subprocess.Popen(
+        [
+            sys.executable, "-m", "mlxk2.cli",
+            "serve",
+            "--model", model,
+            "--port", str(port),
+            "--host", "127.0.0.1",
+            "--log-level", log_level
+        ],
+        stdout=subprocess.PIPE,
+        stderr=subprocess.PIPE,
+        text=True
+    )
+
+    server_url = f"http://127.0.0.1:{port}"
+
+    # Wait for server health check
+    start_time = time.time()
+    last_error: Optional[Exception] = None
+
+    while time.time() - start_time < timeout:
+        try:
+            response = httpx.get(f"{server_url}/health", timeout=2.0)
+            if response.status_code == 200:
+                # Server ready
+                break
+        except Exception as e:
+            last_error = e
+            time.sleep(0.5)
+    else:
+        # Timeout: kill server and report error
+        proc.kill()
+        stdout, stderr = proc.communicate()
+
+        error_msg = (
+            f"Server failed to start within {timeout}s\n"
+            f"Last error: {last_error}\n"
+            f"--- STDOUT ---\n{stdout}\n"
+            f"--- STDERR ---\n{stderr}"
+        )
+        raise TimeoutError(error_msg)
+
+    try:
+        yield server_url
+    finally:
+        # Graceful shutdown with active polling for MLX memory cleanup
+        proc.terminate()
+
+        # Active polling: Check if process actually terminated (smart wait)
+        # instead of blindly waiting 45s every time
+        max_wait = 45  # Conservative timeout for very large models (>40GB)
+        start = time.time()
+
+        while time.time() - start < max_wait:
+            if proc.poll() is not None:
+                # Process terminated successfully
+                break
+            time.sleep(0.5)  # Poll every 500ms
+        else:
+            # Timeout: Process frozen after 45s, force kill
+            # Note: This should be rare - most models cleanup in 10-20s
+            proc.kill()
+            proc.wait()
+
+        # Explicit garbage collection + buffer time for Metal memory release
+        # This helps prevent RAM overlap when transitioning between large models
+        # (e.g., Mixtral 29GB → OpenCode 21GB back-to-back tests)
+        gc.collect()
+        time.sleep(2)  # Buffer for GPU memory deallocation (reduced from 5s)
@@ -0,0 +1,155 @@
+"""SSE (Server-Sent Events) parsing utilities for E2E testing (ADR-011).
+
+Provides utilities for parsing SSE streams from httpx responses.
+Handles OpenAI-compatible SSE format with "data:" lines and [DONE] sentinel.
+"""
+
+from __future__ import annotations
+
+import json
+from typing import Iterator, Dict, Any
+
+
+def parse_sse_stream(response) -> Iterator[Dict[str, Any]]:
+    """Parse SSE event stream from httpx response.
+
+    OpenAI-compatible SSE format:
+        data: {"id": "cmpl-xxx", "object": "chat.completion.chunk", ...}
+        data: {"choices": [{"delta": {"content": "token"}, ...}]}
+        ...
+        data: [DONE]
+
+    Args:
+        response: httpx.Response with streaming enabled (e.g., httpx.stream())
+
+    Yields:
+        Parsed JSON objects from each "data:" line (excludes [DONE])
+
+    Raises:
+        json.JSONDecodeError: If data line contains invalid JSON
+
+    Example:
+        >>> with httpx.stream("POST", url, json={...}) as response:
+        ...     for chunk in parse_sse_stream(response):
+        ...         print(chunk["choices"][0]["delta"].get("content", ""))
+    """
+    for line in response.iter_lines():
+        # SSE lines are "data: <json>" or "data: [DONE]"
+        if line.startswith("data: "):
+            data = line[6:]  # Strip "data: " prefix
+
+            # [DONE] sentinel marks end of stream
+            if data == "[DONE]":
+                break
+
+            # Parse JSON payload
+            try:
+                yield json.loads(data)
+            except json.JSONDecodeError as e:
+                # Re-raise with context for debugging
+                raise json.JSONDecodeError(
+                    f"Invalid SSE JSON: {data!r}",
+                    e.doc,
+                    e.pos
+                ) from e
+
+
+def collect_sse_content(response) -> str:
+    """Collect complete text content from SSE stream.
+
+    Convenience function that extracts "content" fields from SSE chunks
+    and concatenates them into a single string.
+
+    Args:
+        response: httpx.Response with streaming enabled
+
+    Returns:
+        Complete text content from all SSE chunks
+
+    Example:
+        >>> with httpx.stream("POST", url, json={"stream": True, ...}) as response:
+        ...     text = collect_sse_content(response)
+        ...     assert "<|end|>" not in text  # No visible stop tokens
+    """
+    content_parts = []
+
+    for chunk in parse_sse_stream(response):
+        # Extract content from delta
+        if "choices" in chunk:
+            for choice in chunk["choices"]:
+                delta = choice.get("delta", {})
+                if "content" in delta:
+                    content_parts.append(delta["content"])
+
+    return "".join(content_parts)
+
+
+def validate_sse_format(response) -> tuple[bool, str]:
+    """Validate SSE response format compliance.
+
+    Checks:
+    - All lines start with "data: " or are empty
+    - JSON payloads are valid
+    - Stream ends with "data: [DONE]"
+    - Chunks have expected OpenAI structure
+
+    Args:
+        response: httpx.Response with streaming enabled
+
+    Returns:
+        (is_valid, error_message) tuple
+
+    Example:
+        >>> with httpx.stream("POST", url, json={"stream": True, ...}) as response:
+        ...     valid, error = validate_sse_format(response)
+        ...     assert valid, f"Invalid SSE format: {error}"
+    """
+    try:
+        # Collect all lines to validate sentinel presence
+        all_lines = []
+        chunks = []
+        found_done_sentinel = False
+
+        for line in response.iter_lines():
+            all_lines.append(line)
+
+            # Parse SSE data lines
+            if line.startswith("data: "):
+                data = line[6:]  # Strip "data: " prefix
+
+                # Check for [DONE] sentinel
+                if data == "[DONE]":
+                    found_done_sentinel = True
+                    break
+
+                # Parse JSON chunks
+                try:
+                    chunks.append(json.loads(data))
+                except json.JSONDecodeError as e:
+                    return False, f"Invalid JSON in SSE stream: {data!r}"
+
+        # Validate [DONE] sentinel was present
+        if not found_done_sentinel:
+            return False, "Stream missing 'data: [DONE]' sentinel (clients would hang)"
+
+        if not chunks:
+            return False, "No SSE chunks received"
+
+        # Validate first chunk has required fields
+        first_chunk = chunks[0]
+        if "id" not in first_chunk:
+            return False, "First chunk missing 'id' field"
+        if "object" not in first_chunk:
+            return False, "First chunk missing 'object' field"
+
+        # Validate all chunks have choices
+        for i, chunk in enumerate(chunks):
+            if "choices" not in chunk:
+                return False, f"Chunk {i} missing 'choices' field"
+
+        return True, ""
+
+    except json.JSONDecodeError as e:
+        return False, f"Invalid JSON in SSE stream: {e}"
+    except Exception as e:
+        return False, f"SSE parsing error: {e}"
@@ -0,0 +1,278 @@
+"""CLI Integration E2E Tests (ADR-011).
+
+Validates CLI commands with real models:
+- `mlxk run` basic functionality
+- `--json` flag output formatting
+- Exit code propagation (Issue #38)
+- Stop token filtering across CLI interface
+
+Test Strategy:
+- Uses Portfolio Discovery for model selection
+- RAM-aware testing
+- Subprocess-based execution (true E2E)
+- Deterministic sampling (temperature=0.0) for reproducible code tests
+- Reuses exit code patterns from test_cli_run_exit_codes.py
+
+NOTE: Tests use temperature=0.0 (not CLI default 0.7) for deterministic
+code validation. Real usage testing with default settings is covered by
+ADR-013 (Community Model Quality Benchmarks).
+
+Opt-in via: pytest -m live_e2e
+Requires: HF_HOME set to model cache
+"""
+
+from __future__ import annotations
+
+import sys
+import json
+import os
+import subprocess
+import pytest
+from pathlib import Path
+
+# Import test utilities
+from .test_utils import (
+    should_skip_model,
+    TEST_PROMPT,
+    MAX_TOKENS,
+    TEST_TEMPERATURE,
+)
+# portfolio_models fixture is provided by conftest.py
+
+# Opt-in markers
+pytestmark = [pytest.mark.live_e2e, pytest.mark.slow]
+
+
+def _run_mlxk_subprocess(args: list[str], timeout: int = 60) -> tuple[str, str, int]:
+    """Run mlxk CLI in subprocess and capture output.
+
+    Args:
+        args: CLI arguments (e.g., ["run", "model-id", "prompt"])
+        timeout: Timeout in seconds
+
+    Returns:
+        (stdout, stderr, exit_code) tuple
+    """
+    result = subprocess.run(
+        [sys.executable, "-m", "mlxk2.cli"] + args,
+        capture_output=True,
+        text=True,
+        timeout=timeout
+    )
+    return result.stdout, result.stderr, result.returncode
+
+
+class TestRunCommandBasic:
+    """Basic `mlxk run` functionality tests.
+
+    Tests are parametrized per model via pytest_generate_tests hook.
+    Each test runs independently for clean isolation.
+    """
+
+    @pytest.mark.live_e2e
+    def test_run_command(self, portfolio_models, model_key):
+        """Validate `mlxk run` with model.
+
+        Parametrized test (one instance per model in portfolio).
+
+        Tests:
+        - Exit code 0 on success
+        - No visible stop tokens in output
+        - Output is non-empty
+        """
+        model_info = portfolio_models[model_key]
+        model_id = model_info["id"]
+
+        # RAM gating
+        should_skip, skip_reason = should_skip_model(model_key, portfolio_models)
+        if should_skip:
+            pytest.skip(skip_reason)
+
+        print(f"\nTesting {model_key}: {model_id}")
+
+        args = ["run", model_id, TEST_PROMPT, "--max-tokens", str(MAX_TOKENS), "--temperature", str(TEST_TEMPERATURE)]
+        stdout, stderr, exit_code = _run_mlxk_subprocess(args, timeout=90)
+
+        # Validate exit code
+        assert exit_code == 0, (
+            f"Expected exit code 0, got {exit_code}\n"
+            f"stdout: {stdout}\n"
+            f"stderr: {stderr}"
+        )
+
+        # Validate output is non-empty
+        assert stdout.strip(), "Output is empty"
+
+        # Validate no visible stop tokens
+        stop_tokens = [
+            "<|end|>", "<|eot_id|>", "<|im_end|>",
+            "<|endoftext|>", "</s>", "<|end_of_text|>"
+        ]
+        found_tokens = [t for t in stop_tokens if t in stdout]
+        assert not found_tokens, (
+            f"Model {model_id} has visible stop tokens: {found_tokens}\n"
+            f"Output: {stdout!r}"
+        )
+
+        print(f"✓ {model_key}: Passed (output: {len(stdout)} chars)")
+
+
+class TestRunCommandJSON:
+    """JSON output mode tests.
+
+    Tests are parametrized per model via pytest_generate_tests hook.
+    Use pytest -k or --maxfail to limit test count if needed.
+    """
+
+    @pytest.mark.live_e2e
+    def test_run_json_output(self, portfolio_models, model_key):
+        """Validate `mlxk run --json` output format.
+
+        Parametrized test (one instance per model in portfolio).
+
+        Tests:
+        - JSON envelope structure
+        - status: success on successful generation
+        - data.response contains output
+        - No visible stop tokens
+        """
+        model_info = portfolio_models[model_key]
+        model_id = model_info["id"]
+
+        # RAM gating
+        should_skip, skip_reason = should_skip_model(model_key, portfolio_models)
+        if should_skip:
+            pytest.skip(skip_reason)
+
+        print(f"\nTesting {model_key}: {model_id}")
+
+        args = ["run", model_id, TEST_PROMPT, "--max-tokens", str(MAX_TOKENS), "--temperature", str(TEST_TEMPERATURE), "--json"]
+        stdout, stderr, exit_code = _run_mlxk_subprocess(args, timeout=90)
+
+        # Validate exit code
+        assert exit_code == 0, (
+            f"Expected exit code 0, got {exit_code}\n"
+            f"stderr: {stderr}"
+        )
+
+        # Parse JSON
+        data = json.loads(stdout)
+
+        # Validate envelope structure
+        assert "status" in data, "Missing 'status' field"
+        assert data["status"] == "success", f"Expected status=success, got {data['status']}"
+        assert "data" in data, "Missing 'data' field"
+        assert "response" in data["data"], "Missing 'data.response' field"
+
+        # Extract response
+        response = data["data"]["response"]
+        assert response.strip(), "Response is empty"
+
+        # Validate no stop tokens
+        stop_tokens = ["<|end|>", "<|eot_id|>", "<|im_end|>"]
+        found_tokens = [t for t in stop_tokens if t in response]
+        assert not found_tokens, (
+            f"Model {model_id} has visible stop tokens in JSON: {found_tokens}\n"
+            f"Response: {response!r}"
+        )
+
+        print(f"✓ {model_key}: Passed (JSON output: {len(response)} chars)")
+
+
+class TestRunCommandExitCodes:
+    """Exit code propagation tests (Issue #38)."""
+
+    @pytest.mark.live_e2e
+    def test_run_invalid_model_exit_code_text(self):
+        """Validate exit code 1 for invalid model (text mode).
+
+        Tests Issue #38 fix: CLI properly propagates errors.
+        """
+        stdout, stderr, exit_code = _run_mlxk_subprocess(
+            ["run", "nonexistent/invalid-model-12345", "test"],
+            timeout=30
+        )
+
+        # Should return exit code 1 for errors
+        assert exit_code == 1, (
+            f"Expected exit code 1 for invalid model, got {exit_code}\n"
+            f"stdout: {stdout}"
+        )
+
+        # Error message should be present
+        output = stdout + stderr
+        assert "error" in output.lower() or "failed" in output.lower(), (
+            f"Expected error message in output, got: {output}"
+        )
+
+    @pytest.mark.live_e2e
+    def test_run_invalid_model_exit_code_json(self):
+        """Validate exit code 1 for invalid model (JSON mode).
+
+        Tests Issue #38 fix with --json flag.
+        """
+        stdout, stderr, exit_code = _run_mlxk_subprocess(
+            ["run", "nonexistent/invalid-model-12345", "test", "--json"],
+            timeout=30
+        )
+
+        # Should return exit code 1
+        assert exit_code == 1, (
+            f"Expected exit code 1 for invalid model, got {exit_code}\n"
+            f"stdout: {stdout}"
+        )
+
+        # JSON should have error status
+        try:
+            data = json.loads(stdout)
+            assert "status" in data, "Missing 'status' field in error JSON"
+            assert data["status"] == "error", (
+                f"Expected status=error, got {data['status']}"
+            )
+            assert "error" in data, "Missing 'error' field in error JSON"
+        except json.JSONDecodeError:
+            # If JSON parsing fails, check stderr
+            assert "error" in stderr.lower(), (
+                f"Expected error in stderr, got: {stderr}"
+            )
+
+
+class TestRunCommandStopTokens:
+    """Specific stop token filtering validation."""
+
+    @pytest.mark.live_e2e
+    def test_run_no_visible_stop_tokens_mxfp4(self, portfolio_models):
+        """Validate MXFP4 model has no visible stop tokens via CLI.
+
+        Specific regression test for Issue #32 at CLI level.
+        """
+        # Find MXFP4 model in portfolio (or skip)
+        mxfp4_model = None
+        for model_key, model_info in portfolio_models.items():
+            if "mxfp4" in model_key.lower() or "gpt-oss" in model_info["id"].lower():
+                # Check RAM
+                should_skip, skip_reason = should_skip_model(model_key, portfolio_models)
+                if not should_skip:
+                    mxfp4_model = model_info["id"]
+                    break
+
+        if mxfp4_model is None:
+            pytest.skip("MXFP4 model not available in portfolio or exceeds RAM")
+
+        print(f"\nTesting MXFP4: {mxfp4_model}")
+
+        args = ["run", mxfp4_model, TEST_PROMPT, "--max-tokens", str(MAX_TOKENS), "--temperature", str(TEST_TEMPERATURE)]
+        stdout, stderr, exit_code = _run_mlxk_subprocess(args, timeout=90)
+
+        assert exit_code == 0, f"Command failed with exit code {exit_code}"
+
+        # MXFP4-specific stop tokens
+        mxfp4_stop_tokens = ["<|end|>", "<|return|>"]
+        found_tokens = [t for t in mxfp4_stop_tokens if t in stdout]
+
+        assert not found_tokens, (
+            f"MXFP4 should filter stop tokens. Found: {found_tokens}\n"
+            f"Output: {stdout!r}"
+        )
+
+        print(f"✓ MXFP4: No visible stop tokens via CLI")
@@ -0,0 +1,381 @@
+"""Server E2E tests with real models (ADR-011).
+
+Validates server/HTTP API endpoints across model portfolio:
+- Health check endpoint
+- Model listing endpoint
+- Chat completions (batch and streaming)
+- Text completions (batch and streaming)
+- Stop token filtering (Issue #20/#32)
+- Error envelopes (ADR-004)
+
+Test Strategy:
+- Uses Portfolio Discovery (ADR-009) for model selection
+- RAM-aware testing (progressive budget: 40%-70%)
+- Subprocess-based server lifecycle (true E2E)
+- OpenAI-compatible API validation
+
+Opt-in via: pytest -m live_e2e
+Requires: HF_HOME set to model cache, httpx installed
+"""
+
+from __future__ import annotations
+
+import pytest
+from typing import Dict, Any
+
+try:
+    import httpx
+except ImportError:
+    httpx = None
+
+# Import test utilities
+from .server_context import LocalServer
+from .sse_parser import parse_sse_stream, collect_sse_content, validate_sse_format
+from .test_utils import (
+    should_skip_model,
+    TEST_PROMPT,
+    MAX_TOKENS,
+)
+# portfolio_models fixture is provided by conftest.py
+
+# Opt-in markers
+pytestmark = [
+    pytest.mark.live_e2e,
+    pytest.mark.slow,
+    pytest.mark.skipif(
+        httpx is None,
+        reason="httpx required for E2E tests (pip install httpx)"
+    )
+]
+
+
+class TestServerHealthEndpoints:
+    """Basic health and metadata endpoints."""
+
+    @pytest.mark.live_e2e
+    def test_health_endpoint(self, portfolio_models):
+        """Validate /health endpoint returns 200 OK.
+
+        Tests server basic liveness without model dependency.
+        Uses first available model from portfolio to start server.
+        """
+        # Use first model that fits in RAM
+        test_model = None
+        for model_key, model_info in portfolio_models.items():
+            should_skip, _ = should_skip_model(model_key, portfolio_models)
+            if not should_skip:
+                test_model = model_info["id"]
+                break
+
+        if test_model is None:
+            pytest.skip("No models available within RAM budget")
+
+        with LocalServer(test_model) as server_url:
+            response = httpx.get(f"{server_url}/health")
+
+            assert response.status_code == 200
+            data = response.json()
+            assert data.get("status") == "healthy"
+
+    @pytest.mark.live_e2e
+    def test_v1_models_list(self, portfolio_models):
+        """Validate /v1/models returns loaded model.
+
+        Tests model metadata endpoint with pre-loaded model.
+        """
+        # Use first model that fits in RAM
+        test_model = None
+        for model_key, model_info in portfolio_models.items():
+            should_skip, _ = should_skip_model(model_key, portfolio_models)
+            if not should_skip:
+                test_model = model_info["id"]
+                break
+
+        if test_model is None:
+            pytest.skip("No models available within RAM budget")
+
+        with LocalServer(test_model) as server_url:
+            response = httpx.get(f"{server_url}/v1/models")
+
+            assert response.status_code == 200
+            data = response.json()
+
+            # Validate OpenAI-compatible structure
+            assert "data" in data
+            assert isinstance(data["data"], list)
+            assert len(data["data"]) > 0
+
+            # Validate loaded model is listed
+            model_ids = [m["id"] for m in data["data"]]
+            assert test_model in model_ids
+
+
+class TestChatCompletionsBatch:
+    """Non-streaming chat completion tests across portfolio.
+
+    Tests are parametrized per model via pytest_generate_tests hook.
+    Each test runs with its own server instance for clean isolation.
+    """
+
+    @pytest.mark.live_e2e
+    def test_chat_completions_batch(self, portfolio_models, model_key):
+        """Validate non-streaming chat completions.
+
+        Parametrized test (one instance per model in portfolio).
+
+        Tests:
+        - Response structure (OpenAI-compatible)
+        - Stop token filtering (Issue #32)
+        - Error handling
+        """
+        model_info = portfolio_models[model_key]
+        model_id = model_info["id"]
+
+        # RAM gating: skip if model exceeds budget
+        should_skip, skip_reason = should_skip_model(model_key, portfolio_models)
+        if should_skip:
+            pytest.skip(skip_reason)
+
+        print(f"\nTesting {model_key}: {model_id}")
+
+        with LocalServer(model_id, port=8765) as server_url:
+            # Non-streaming chat completion
+            response = httpx.post(
+                f"{server_url}/v1/chat/completions",
+                json={
+                    "model": model_id,
+                    "messages": [
+                        {"role": "user", "content": TEST_PROMPT}
+                    ],
+                    "max_tokens": MAX_TOKENS,
+                    "stream": False
+                },
+                timeout=30.0
+            )
+
+            assert response.status_code == 200, f"Expected 200, got {response.status_code}"
+            data = response.json()
+
+            # Validate OpenAI structure
+            assert "id" in data, "Missing 'id' field"
+            assert "object" in data, "Missing 'object' field"
+            assert data["object"] == "chat.completion"
+            assert "choices" in data, "Missing 'choices' field"
+            assert len(data["choices"]) > 0, "Empty choices array"
+
+            # Extract response text
+            choice = data["choices"][0]
+            assert "message" in choice, "Missing 'message' field"
+            assert "content" in choice["message"], "Missing 'content' field"
+            content = choice["message"]["content"]
+
+            # Validate stop token filtering (Issue #32)
+            stop_tokens = [
+                "<|end|>", "<|eot_id|>", "<|im_end|>",
+                "<|endoftext|>", "</s>", "<|end_of_text|>"
+            ]
+            found_tokens = [t for t in stop_tokens if t in content]
+            assert not found_tokens, (
+                f"Model {model_id} has visible stop tokens: {found_tokens}\n"
+                f"Content: {content!r}"
+            )
+
+            print(f"✓ {model_key}: Passed (output: {len(content)} chars)")
+
+
+class TestChatCompletionsStreaming:
+    """SSE streaming chat completion tests across portfolio.
+
+    Tests are parametrized per model via pytest_generate_tests hook.
+    Each test runs with its own server instance for clean isolation.
+    """
+
+    @pytest.mark.live_e2e
+    def test_chat_completions_streaming(self, portfolio_models, model_key):
+        """Validate SSE streaming chat completions.
+
+        Parametrized test (one instance per model in portfolio).
+
+        Tests:
+        - SSE format compliance (data: lines, [DONE] sentinel)
+        - Chunk structure (OpenAI-compatible)
+        - Stop token filtering (Issue #32)
+        - Stream completion
+        """
+        model_info = portfolio_models[model_key]
+        model_id = model_info["id"]
+
+        # RAM gating
+        should_skip, skip_reason = should_skip_model(model_key, portfolio_models)
+        if should_skip:
+            pytest.skip(skip_reason)
+
+        print(f"\nTesting {model_key}: {model_id}")
+
+        with LocalServer(model_id, port=8765) as server_url:
+            # Streaming chat completion
+            with httpx.stream(
+                "POST",
+                f"{server_url}/v1/chat/completions",
+                json={
+                    "model": model_id,
+                    "messages": [
+                        {"role": "user", "content": TEST_PROMPT}
+                    ],
+                    "max_tokens": MAX_TOKENS,
+                    "stream": True
+                },
+                timeout=30.0
+            ) as response:
+                assert response.status_code == 200
+
+                # Validate SSE format
+                valid, error_msg = validate_sse_format(response)
+                if not valid:
+                    # Response consumed by validation, need to restart
+                    raise AssertionError(f"SSE format invalid: {error_msg}")
+
+            # Re-run to collect content (validation consumed stream)
+            with httpx.stream(
+                "POST",
+                f"{server_url}/v1/chat/completions",
+                json={
+                    "model": model_id,
+                    "messages": [
+                        {"role": "user", "content": TEST_PROMPT}
+                    ],
+                    "max_tokens": MAX_TOKENS,
+                    "stream": True
+                },
+                timeout=30.0
+            ) as response:
+                content = collect_sse_content(response)
+
+            # Validate stop token filtering
+            stop_tokens = [
+                "<|end|>", "<|eot_id|>", "<|im_end|>",
+                "<|endoftext|>", "</s>", "<|end_of_text|>"
+            ]
+            found_tokens = [t for t in stop_tokens if t in content]
+            assert not found_tokens, (
+                f"Model {model_id} has visible stop tokens in stream: {found_tokens}\n"
+                f"Content: {content!r}"
+            )
+
+            print(f"✓ {model_key}: Passed (streamed: {len(content)} chars)")
+
+
+class TestCompletionsBatch:
+    """Non-streaming text completion tests."""
+
+    @pytest.mark.live_e2e
+    def test_completions_batch_basic(self, portfolio_models):
+        """Validate non-streaming text completions.
+
+        Tests basic /v1/completions endpoint with first available model.
+        """
+        # Use first model that fits in RAM
+        test_model = None
+        test_model_key = None
+        for model_key, model_info in portfolio_models.items():
+            should_skip, _ = should_skip_model(model_key, portfolio_models)
+            if not should_skip:
+                test_model = model_info["id"]
+                test_model_key = model_key
+                break
+
+        if test_model is None:
+            pytest.skip("No models available within RAM budget")
+
+        print(f"\nTesting {test_model_key}: {test_model}")
+
+        with LocalServer(test_model) as server_url:
+            response = httpx.post(
+                f"{server_url}/v1/completions",
+                json={
+                    "model": test_model,
+                    "prompt": TEST_PROMPT,
+                    "max_tokens": MAX_TOKENS,
+                    "stream": False
+                },
+                timeout=30.0
+            )
+
+            assert response.status_code == 200
+            data = response.json()
+
+            # Validate structure
+            assert "id" in data
+            assert "object" in data
+            assert data["object"] == "text_completion"
+            assert "choices" in data
+            assert len(data["choices"]) > 0
+
+            # Validate content
+            choice = data["choices"][0]
+            assert "text" in choice
+            content = choice["text"]
+
+            # Check stop tokens
+            stop_tokens = ["<|end|>", "<|eot_id|>", "<|im_end|>"]
+            found_tokens = [t for t in stop_tokens if t in content]
+            assert not found_tokens, f"Visible stop tokens: {found_tokens}"
+
+            print(f"✓ Passed (output: {len(content)} chars)")
+
+
+class TestCompletionsStreaming:
+    """SSE streaming text completion tests."""
+
+    @pytest.mark.live_e2e
+    def test_completions_streaming_basic(self, portfolio_models):
+        """Validate SSE streaming text completions.
+
+        Tests /v1/completions with stream=True.
+        """
+        # Use first model that fits in RAM
+        test_model = None
+        test_model_key = None
+        for model_key, model_info in portfolio_models.items():
+            should_skip, _ = should_skip_model(model_key, portfolio_models)
+            if not should_skip:
+                test_model = model_info["id"]
+                test_model_key = model_key
+                break
+
+        if test_model is None:
+            pytest.skip("No models available within RAM budget")
+
+        print(f"\nTesting {test_model_key}: {test_model}")
+
+        with LocalServer(test_model) as server_url:
+            with httpx.stream(
+                "POST",
+                f"{server_url}/v1/completions",
+                json={
+                    "model": test_model,
+                    "prompt": TEST_PROMPT,
+                    "max_tokens": MAX_TOKENS,
+                    "stream": True
+                },
+                timeout=30.0
+            ) as response:
+                assert response.status_code == 200
+
+                # Collect content
+                content_parts = []
+                for chunk in parse_sse_stream(response):
+                    if "choices" in chunk:
+                        for choice in chunk["choices"]:
+                            text = choice.get("text", "")
+                            if text:
+                                content_parts.append(text)
+
+                content = "".join(content_parts)
+
+                # Check stop tokens
+                stop_tokens = ["<|end|>", "<|eot_id|>", "<|im_end|>"]
+                found_tokens = [t for t in stop_tokens if t in content]
+                assert not found_tokens, f"Visible stop tokens: {found_tokens}"
+
+                print(f"✓ Passed (streamed: {len(content)} chars)")
@@ -0,0 +1,61 @@
+"""Convenience: Show E2E test portfolio.
+
+Usage: pytest -m show_model_portfolio -s
+"""
+
+from __future__ import annotations
+import pytest
+
+
+@pytest.mark.show_model_portfolio
+@pytest.mark.live_e2e
+def test_show_portfolio(portfolio_models):
+    """Display E2E test portfolio models (no actual testing).
+
+    Shows which models would be tested by live_e2e tests.
+    Uses same portfolio_models fixture as E2E tests.
+
+    Usage:
+        HF_HOME=/path/to/cache pytest -m show_model_portfolio -s
+    """
+    from .test_utils import should_skip_model
+
+    print("\n" + "="*90)
+    print("E2E TEST PORTFOLIO (live_e2e)")
+    print("="*90)
+    print(f"\nTotal models discovered: {len(portfolio_models)}\n")
+
+    # Table header
+    print(f"{'Key':<15} {'Status':<10} {'RAM (GB)':<10} {'Model ID':<50}")
+    print("-"*90)
+
+    # Count testable vs skipped
+    testable = 0
+    skipped = 0
+
+    for key in sorted(portfolio_models.keys()):
+        info = portfolio_models[key]
+        model_id = info['id']
+        ram = info.get('ram_needed_gb', 0)
+
+        # Check if would be tested or skipped
+        should_skip, skip_reason = should_skip_model(key, portfolio_models)
+
+        if should_skip:
+            status = "⏭️  SKIP"
+            skipped += 1
+            # Truncate model_id if too long
+            display_id = model_id if len(model_id) <= 45 else model_id[:42] + "..."
+            print(f"{key:<15} {status:<10} {ram:>6.1f}     {display_id}")
+        else:
+            status = "✅ TEST"
+            testable += 1
+            display_id = model_id if len(model_id) <= 45 else model_id[:42] + "..."
+            print(f"{key:<15} {status:<10} {ram:>6.1f}     {display_id}")
+
+    print("-"*90)
+    print(f"\nSummary: {testable} testable, {skipped} skipped")
+    print("="*90)
+
+    # Always pass - this is just for display
+    assert True
@@ -0,0 +1,273 @@
+"""Streaming vs. Non-Streaming Parity Tests (Issue #20, ADR-011).
+
+Validates that streaming and non-streaming modes produce identical output.
+
+Issue #20 Background:
+- In 1.x, non-streaming had visible stop tokens while streaming did not
+- Root cause: Different code paths for stop token filtering
+- ADR-009 fixed Runner-level detection (eos_token_id → eos_token_ids)
+- These tests ensure parity is maintained across:
+  * MLXRunner direct usage
+  * Server API endpoints
+  * CLI commands
+
+Test Strategy:
+- Use 3 representative models (not full portfolio to save time)
+- Same prompt + max_tokens → byte-for-byte identical output
+- Test all three interfaces: Runner, Server, CLI
+- RAM-aware testing
+
+Opt-in via: pytest -m live_e2e
+Requires: HF_HOME set to model cache
+"""
+
+from __future__ import annotations
+
+import pytest
+from typing import Dict, Any
+
+try:
+    import httpx
+except ImportError:
+    httpx = None
+
+# Import test utilities
+from .server_context import LocalServer
+from .sse_parser import collect_sse_content
+from .test_utils import (
+    should_skip_model,
+    TEST_PROMPT,
+    MAX_TOKENS,
+)
+# portfolio_models fixture is provided by conftest.py
+
+# Opt-in markers
+pytestmark = [
+    pytest.mark.live_e2e,
+    pytest.mark.slow,
+    pytest.mark.skipif(
+        httpx is None,
+        reason="httpx required for E2E tests"
+    )
+]
+
+
+# Representative test models for parity validation
+# Uses hardcoded subset (not full portfolio) to keep test time reasonable
+PARITY_TEST_MODELS = {
+    # "mxfp4": Skipped - Reasoning model (gpt-oss) has batch/stream inconsistency
+    #   Batch output: Raw reasoning text
+    #   Stream output: Adds **[Reasoning]** headers via StreamingReasoningParser
+    #   Known issue, will be fixed in ADR-010 implementation
+    "qwen25": {
+        "id": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+        "ram_needed_gb": 1.0,
+        "description": "Qwen 2.5 (self-conversation prevention)"
+    },
+    "llama32": {
+        "id": "mlx-community/Llama-3.2-3B-Instruct-4bit",
+        "ram_needed_gb": 4.0,
+        "description": "Llama 3.2 (control baseline)"
+    }
+}
+
+
+class TestRunnerStreamingParity:
+    """MLXRunner direct streaming vs. batch parity.
+
+    Tests are parametrized over PARITY_TEST_MODELS (3 models).
+    Each test runs independently for clean isolation.
+    """
+
+    @pytest.mark.live_e2e
+    @pytest.mark.parametrize("parity_model_key", list(PARITY_TEST_MODELS.keys()))
+    def test_runner_streaming_batch_identical(self, _use_real_mlx_modules, parity_model_key):
+        """Validate MLXRunner streaming and batch produce identical output.
+
+        Parametrized test (one instance per parity test model).
+
+        Issue #20: Previously, batch output had visible stop tokens while
+        streaming did not. This validates the ADR-009 fix at Runner level.
+
+        Requires real MLX modules (not stubs) since we use MLXRunner directly.
+        """
+        from mlxk2.core.runner import MLXRunner
+
+        model_info = PARITY_TEST_MODELS[parity_model_key]
+        model_id = model_info["id"]
+
+        # RAM gating
+        should_skip, skip_reason = should_skip_model(parity_model_key, PARITY_TEST_MODELS)
+        if should_skip:
+            pytest.skip(skip_reason)
+
+        print(f"\nTesting {parity_model_key}: {model_id}")
+
+        with MLXRunner(model_id, verbose=False) as runner:
+            # Batch generation (temperature=0 for deterministic output)
+            batch_output = runner.generate_batch(
+                prompt=TEST_PROMPT,
+                max_tokens=MAX_TOKENS,
+                temperature=0.0
+            )
+
+            # Streaming generation (temperature=0 for deterministic output)
+            stream_tokens = []
+            for token in runner.generate_streaming(
+                prompt=TEST_PROMPT,
+                max_tokens=MAX_TOKENS,
+                temperature=0.0
+            ):
+                stream_tokens.append(token)
+            stream_output = "".join(stream_tokens)
+
+            # Validate byte-for-byte parity
+            assert batch_output == stream_output, (
+                f"Streaming/batch parity failure (Issue #20 regression)\n"
+                f"Model: {model_id}\n"
+                f"Batch ({len(batch_output)} chars):  {batch_output!r}\n"
+                f"Stream ({len(stream_output)} chars): {stream_output!r}"
+            )
+
+            print(f"✓ {parity_model_key}: Parity verified ({len(batch_output)} chars)")
+
+
+class TestServerStreamingParity:
+    """Server API streaming vs. batch parity.
+
+    Tests are parametrized over PARITY_TEST_MODELS (3 models).
+    Each test runs independently for clean isolation.
+    """
+
+    @pytest.mark.live_e2e
+    @pytest.mark.parametrize("parity_model_key", list(PARITY_TEST_MODELS.keys()))
+    def test_server_api_streaming_batch_identical(self, parity_model_key):
+        """Validate Server API streaming and batch produce identical output.
+
+        Parametrized test (one instance per parity test model).
+
+        Tests parity at HTTP API level (closest to production usage).
+        """
+        model_info = PARITY_TEST_MODELS[parity_model_key]
+        model_id = model_info["id"]
+
+        # RAM gating
+        should_skip, skip_reason = should_skip_model(parity_model_key, PARITY_TEST_MODELS)
+        if should_skip:
+            pytest.skip(skip_reason)
+
+        print(f"\nTesting {parity_model_key}: {model_id}")
+
+        with LocalServer(model_id, port=8765) as server_url:
+            # Batch request (temperature=0 for deterministic output)
+            batch_response = httpx.post(
+                f"{server_url}/v1/chat/completions",
+                json={
+                    "model": model_id,
+                    "messages": [{"role": "user", "content": TEST_PROMPT}],
+                    "max_tokens": MAX_TOKENS,
+                    "temperature": 0.0,
+                    "stream": False
+                },
+                timeout=30.0
+            )
+            assert batch_response.status_code == 200
+            batch_data = batch_response.json()
+            batch_output = batch_data["choices"][0]["message"]["content"]
+
+            # Streaming request (temperature=0 for deterministic output)
+            with httpx.stream(
+                "POST",
+                f"{server_url}/v1/chat/completions",
+                json={
+                    "model": model_id,
+                    "messages": [{"role": "user", "content": TEST_PROMPT}],
+                    "max_tokens": MAX_TOKENS,
+                    "temperature": 0.0,
+                    "stream": True
+                },
+                timeout=30.0
+            ) as stream_response:
+                assert stream_response.status_code == 200
+                stream_output = collect_sse_content(stream_response)
+
+            # Validate parity
+            assert batch_output == stream_output, (
+                f"Server API parity failure (Issue #20 regression)\n"
+                f"Model: {model_id}\n"
+                f"Batch ({len(batch_output)} chars):  {batch_output!r}\n"
+                f"Stream ({len(stream_output)} chars): {stream_output!r}"
+            )
+
+            print(f"✓ {parity_model_key}: Parity verified ({len(batch_output)} chars)")
+
+
+class TestCrossInterfaceParity:
+    """Parity across different interfaces (Runner vs Server)."""
+
+    @pytest.mark.live_e2e
+    def test_runner_vs_server_consistency(self, _use_real_mlx_modules):
+        """Validate MLXRunner and Server API produce consistent output.
+
+        Tests that direct Runner usage and Server HTTP API yield
+        the same results (validates no server-specific transformations).
+
+        Requires real MLX modules (not stubs) since we use MLXRunner directly.
+        """
+        from mlxk2.core.runner import MLXRunner
+
+        # Use smallest model for faster testing
+        test_model_key = "qwen25"
+        model_info = PARITY_TEST_MODELS[test_model_key]
+        model_id = model_info["id"]
+
+        # RAM check
+        should_skip, skip_reason = should_skip_model(test_model_key, PARITY_TEST_MODELS)
+        if should_skip:
+            pytest.skip(skip_reason)
+
+        print(f"\nTesting cross-interface parity: {model_id}")
+
+        # Runner output (temperature=0 for deterministic output)
+        with MLXRunner(model_id, verbose=False) as runner:
+            runner_output = runner.generate_batch(
+                prompt=TEST_PROMPT,
+                max_tokens=MAX_TOKENS,
+                temperature=0.0
+            )
+
+        print(f"Runner output ({len(runner_output)} chars): {runner_output!r}")
+
+        # Server output (temperature=0 for deterministic output)
+        with LocalServer(model_id, port=8780) as server_url:
+            response = httpx.post(
+                f"{server_url}/v1/chat/completions",
+                json={
+                    "model": model_id,
+                    "messages": [{"role": "user", "content": TEST_PROMPT}],
+                    "max_tokens": MAX_TOKENS,
+                    "temperature": 0.0,
+                    "stream": False
+                },
+                timeout=30.0
+            )
+            assert response.status_code == 200
+            server_output = response.json()["choices"][0]["message"]["content"]
+
+        print(f"Server output ({len(server_output)} chars): {server_output!r}")
+
+        # Note: Runner and Server may differ due to chat template application
+        # Server applies chat template, Runner uses raw prompt
+        # This test validates that both:
+        # 1. Produce clean output (no visible stop tokens)
+        # 2. Are internally consistent (streaming = batch for each)
+
+        # Validate no stop tokens in either
+        stop_tokens = ["<|end|>", "<|eot_id|>", "<|im_end|>"]
+        runner_found = [t for t in stop_tokens if t in runner_output]
+        server_found = [t for t in stop_tokens if t in server_output]
+
+        assert not runner_found, f"Runner has visible stop tokens: {runner_found}"
+        assert not server_found, f"Server has visible stop tokens: {server_found}"
+
+        print(f"✓ Cross-interface consistency verified (both clean)")
@@ -0,0 +1,47 @@
+"""Shared utilities for live E2E tests (ADR-011).
+
+Provides:
+- Portfolio discovery functions (reused from test_stop_tokens_live.py)
+- RAM gating utilities
+- Common test constants
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+from typing import Dict, Any, Tuple
+
+# Import portfolio discovery infrastructure from test_stop_tokens_live.py
+_parent_dir = Path(__file__).parent.parent
+sys.path.insert(0, str(_parent_dir))
+
+try:
+    from test_stop_tokens_live import (
+        discover_mlx_models_in_user_cache,
+        get_safe_ram_budget_gb,
+        get_system_ram_gb,
+        should_skip_model,
+        TEST_MODELS,
+    )
+finally:
+    sys.path.remove(str(_parent_dir))
+
+
+# Re-export for convenience
+__all__ = [
+    "discover_mlx_models_in_user_cache",
+    "get_safe_ram_budget_gb",
+    "get_system_ram_gb",
+    "should_skip_model",
+    "TEST_MODELS",
+    "TEST_PROMPT",
+    "MAX_TOKENS",
+    "TEST_TEMPERATURE",
+]
+
+
+# Standard test constants (shared across all E2E tests)
+TEST_PROMPT = "Write one sentence about cats."
+MAX_TOKENS = 50
+TEST_TEMPERATURE = 0.0  # Deterministic sampling for reproducible tests
@@ -165,81 +165,74 @@ def get_safe_ram_budget_gb() -> float:


 def discover_mlx_models_in_user_cache() -> List[Dict[str, Any]]:
-    """Discover MLX chat models in user HF cache (Category 2: read-only).
+    """Discover MLX chat models via mlxk list --json (production command).

-    Uses production infrastructure (mlxk2.operations.common) to filter:
+    Uses production CLI instead of duplicating cache scanning logic.
+    Leverages official JSON API (docs/json-api-schema.json modelObject).
+
+    Filters for:
    - Framework: MLX only (not GGUF/PyTorch)
-    - Health: healthy only (not broken/incomplete)
+    - Health: healthy only
    - Runtime: runtime_compatible only (mlx-lm can load)
    - Type: chat models only (for stop token testing)
-    - RAM: estimated from file sizes (filtering in should_skip_model)

    Returns:
        List of dicts with keys: model_id, ram_needed_gb, snapshot_path, weight_count
+        Note: snapshot_path and weight_count set to None (not needed for tests)
    """
-    hf_home = os.environ.get("HF_HOME")
-    if not hf_home:
+    import subprocess
+    import json
+
+    # Check HF_HOME is set (required for mlxk list)
+    env = os.environ.copy()
+    if not env.get("HF_HOME"):
        return []

-    hub_path = Path(hf_home) / "hub"
-    if not hub_path.exists():
+    try:
+        # Call production mlxk list command
+        result = subprocess.run(
+            [sys.executable, "-m", "mlxk2.cli", "list", "--json"],
+            capture_output=True,
+            text=True,
+            timeout=30,
+            env=env  # Pass environment with HF_HOME
+        )
+
+        if result.returncode != 0:
+            return []
+
+        # Parse JSON response (docs/json-api-schema.json)
+        data = json.loads(result.stdout)
+
+        # Extract models array from response
+        models = data.get("data", {}).get("models", [])
+
+        # Filter per schema modelObject fields
+        discovered = []
+        for model in models:
+            # Filter: MLX + healthy + runtime_compatible + chat
+            if (model.get("framework") == "MLX" and
+                model.get("health") == "healthy" and
+                model.get("runtime_compatible") is True and
+                model.get("model_type") == "chat"):
+
+                # RAM estimation: size_bytes * 1.2 overhead
+                size_bytes = model.get("size_bytes", 0)
+                ram_gb = (size_bytes / (1024**3)) * 1.2 if size_bytes else 0
+
+                discovered.append({
+                    "model_id": model["name"],  # Per schema: name is the model ID
+                    "ram_needed_gb": ram_gb,
+                    "snapshot_path": None,      # Not provided by list, not needed
+                    "weight_count": None        # Not provided by list, not needed
+                })
+
+        return discovered
+
+    except Exception:
+        # Robust: return empty list on any error (keeps tests runnable)
        return []

-    # Use production infrastructure for filtering
-    from mlxk2.operations.common import build_model_object
-    from mlxk2.core.cache import cache_dir_to_hf
-
-    discovered = []
-
-    # Scan models--org--name directories
-    for model_dir in hub_path.glob("models--*--*"):
-        try:
-            # Parse model_id from directory name
-            model_id = cache_dir_to_hf(model_dir.name)
-
-            # Find latest snapshot
-            snapshots_dir = model_dir / "snapshots"
-            if not snapshots_dir.exists():
-                continue
-
-            snapshot_dirs = [d for d in snapshots_dir.iterdir() if d.is_dir()]
-            if not snapshot_dirs:
-                continue
-
-            # Most recent snapshot (by mtime)
-            latest_snapshot = max(snapshot_dirs, key=lambda p: p.stat().st_mtime)
-
-            # Use production filter logic (health + runtime + framework + type)
-            model_obj = build_model_object(model_id, model_dir, latest_snapshot)
-
-            # Filter: MLX + healthy + runtime_compatible + chat only
-            if (model_obj.get("framework") != "MLX" or
-                model_obj.get("health") != "healthy" or
-                model_obj.get("runtime_compatible") is not True or
-                model_obj.get("model_type") != "chat"):
-                continue  # Skip non-chat/unhealthy/incompatible models
-
-            # Estimate RAM (safetensors file sizes + 20% overhead)
-            weight_files = list(latest_snapshot.glob("*.safetensors"))
-            if not weight_files:
-                continue
-
-            total_bytes = sum(f.stat().st_size for f in weight_files if f.is_file())
-            ram_gb = (total_bytes / (1024**3)) * 1.2
-
-            discovered.append({
-                "model_id": model_id,
-                "ram_needed_gb": ram_gb,
-                "snapshot_path": latest_snapshot,
-                "weight_count": len(weight_files)
-            })
-
-        except Exception:
-            # Skip broken models silently (keep portfolio discovery robust)
-            continue
-
-    return discovered
-

 # Test models from ADR-009 with RAM requirements
 # RAM estimates from TESTING.md: "RAM-Aware Model Selection Strategy"