diff --git a/.claude/agents/code-reviewer.md b/.claude/agents/code-reviewer.md
new file mode 100644
index 0000000..13bba63
--- /dev/null
+++ b/.claude/agents/code-reviewer.md
@@ -0,0 +1,34 @@
+---
+name: code-reviewer
+description: Use this agent when you need to review recently written code for quality, best practices, potential bugs, and improvements. This agent should be called after completing a logical chunk of code development, such as implementing a new function, class, or feature. Examples: <example>Context: The user has just implemented a new function and wants it reviewed. user: "I just wrote this function to calculate prime numbers: def is_prime(n): if n < 2: return False; for i in range(2, int(n**0.5) + 1): if n % i == 0: return False; return True" assistant: "Let me use the code-reviewer agent to analyze this implementation for correctness and potential improvements." <commentary>Since the user has written code and wants it reviewed, use the code-reviewer agent to provide detailed feedback on the prime number function.</commentary></example> <example>Context: User has completed a class implementation and wants feedback. user: "Here's my new UserManager class with authentication methods" assistant: "I'll use the code-reviewer agent to review your UserManager class implementation for security best practices and code quality." <commentary>The user has implemented a class and needs review, so use the code-reviewer agent to examine the authentication logic and overall design.</commentary></example>
+tools: MultiEdit, Write, NotebookEdit, Grep, LS, Read
+model: sonnet
+color: blue
+---
+
+You are an expert code reviewer with deep knowledge across multiple programming languages, frameworks, and software engineering best practices. Your role is to provide thorough, constructive code reviews that help developers write better, more maintainable, and more secure code.
+
+When reviewing code, you will:
+
+1. **Analyze Code Quality**: Examine the code for readability, maintainability, and adherence to language-specific conventions and best practices. Look for proper naming conventions, appropriate code organization, and clear logic flow.
+
+2. **Identify Potential Issues**: Scan for bugs, logic errors, edge cases that aren't handled, potential security vulnerabilities, performance bottlenecks, and resource management issues (memory leaks, unclosed resources, etc.).
+
+3. **Assess Architecture and Design**: Evaluate whether the code follows solid design principles (SOLID, DRY, KISS), has appropriate separation of concerns, uses suitable design patterns, and maintains good abstraction levels.
+
+4. **Check Error Handling**: Verify that the code properly handles exceptions, validates inputs, provides meaningful error messages, and fails gracefully when appropriate.
+
+5. **Review Testing Considerations**: Identify areas that need testing, suggest test cases for edge conditions, and evaluate whether the code is written in a testable manner.
+
+6. **Provide Specific Recommendations**: Offer concrete, actionable suggestions for improvement with code examples when helpful. Prioritize recommendations by impact and importance.
+
+7. **Consider Context**: Take into account the project's coding standards, technology stack, performance requirements, and any specific constraints mentioned in project documentation (like CLAUDE.md files).
+
+Your review format should include:
+- **Summary**: Brief overall assessment
+- **Strengths**: What the code does well
+- **Issues Found**: Categorized by severity (Critical, Major, Minor)
+- **Recommendations**: Specific improvements with examples
+- **Additional Considerations**: Testing, documentation, or architectural suggestions
+
+Be constructive and educational in your feedback. Explain the 'why' behind your suggestions to help the developer learn. When code is well-written, acknowledge the good practices used. Always maintain a professional, helpful tone that encourages improvement rather than criticism.
diff --git a/.gitignore b/.gitignore
index caeaa1a..7ac97e4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,9 +5,13 @@ test_env*/
 test_results*.log
 mypy_*.log
 ruff_*.log
-__pycache__/
+*/__pycache__/*
+__pycache__
+mlx_knife/*
 *.pyc
 .DS_Store
+.claude/
+mymodel_test_workspace/
 build/
 dist/
 *.egg-info/
@@ -15,4 +19,9 @@ CLAUDE.md
 TODO_REAL_TESTS.md
 server.log
 install_*.log
-mymodel_test_workspace/
\ No newline at end of file
+.claude/
+openwebui311/bin/
+.gitignore
+
+# Test artifacts (generated reports)
+*_report.json
diff --git a/CHANGELOG.md b/CHANGELOG.md
index f9cedb4..56ae9d3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,60 +1,358 @@
 # Changelog
 
-## [1.1.1] - 2025-09-14
+## 2.0.0 — 2025-11-06
 
-### 🆕 **Major New Features**
-- **MXFP4 Quantization Support**: Full compatibility with MLX ≥0.29.0 and MLX-LM ≥0.27.0
-  - Support for FP4 quantized models (e.g., `gpt-oss-20b-MXFP4-Q8`)
-  - Tested with `gpt-oss-20b-MXFP4-Q8` from mlx-community
+**Stable Release**: MLX Knife 2.0 replaces 1.x as the primary version. Full feature parity with 1.1.1 achieved plus major enhancements.
 
-- **GPT-OSS Reasoning Model Support**: Pattern-based implementation for MXFP4 models (Issue #32 - partial)
-  - Real-time streaming with `**[Reasoning]**` and `**[Answer]**` sections
-  - Reverse-engineered from native `<|channel|>analysis<|message|>...` token patterns
-  - Uses `<|return|>` as stop token, `<|end|>` as reasoning/answer separator
-  - Token-by-token streaming parser in `reasoning_utils.py`
-  - Model detection based on `gpt-oss` in model name/path
-  - Intelligent control token filtering during streaming
-  - Consistent behavior between CLI (`mlxk run`) and server API
-  - **`--hide-reasoning` flag**: Show only final answers without reasoning steps
-    - Usage: `mlxk run gpt-oss-model "prompt" --hide-reasoning`
-    - Works in both streaming and batch modes
-  - **Note**: GPT-OSS specific implementation, not generic reasoning model support
+### License Change
 
-### 🔧 **Core Improvements**
-- **Enhanced Streaming Parser**: New generic streaming parser architecture
-  - Token-by-token processing with live formatting
-  - Configurable patterns for different model types
-  - Single source of truth in `reasoning_utils.py`
+- **MIT → Apache License 2.0**: Better patent protection, industry-standard licensing
+- See [MIGRATION.md](MIGRATION.md) for details on license change and user impact
 
-- **Enhanced Show Command**: Improved quantization display for MXFP4 models
-  - New `get_quantization_info()` function for detailed quantization parsing
-  - Compact display format: "mode: X, Y-bit, group_size: Z"
-  - MLX version requirement warnings: "Advanced mode 'X' (requires MLX ≥0.29.0, MLX-LM ≥0.27.0)"
-  - Better handling of complex quantization schemes in config.json
+### Highlights
 
-- **Stop Token Management**: Refined stop token handling (Issue #32)
-  - GPT-OSS models correctly use `<|return|>` as stop token, not `<|end|>`
-  - `<|end|>` treated as reasoning separator for proper flow
-  - Model-specific stop token detection
+- **Full 1.x Feature Parity**: All commands from 1.1.1 available (`list`, `show`, `pull`, `rm`, `run`, `server`, `health`)
+- **JSON API**: Machine-readable output for automation (`--json` flag on all commands)
+- **Enhanced Error Handling**: Structured errors with request IDs, logging levels, JSON logs
+- **Runtime Compatibility Checks**: Pre-flight validation prevents loading incompatible models
+- **Improved Stop Token Detection**: Multi-EOS support (MXFP4, Qwen, Llama)
+- **Better Human Output**: Improved formatting, relative timestamps, runtime status
 
-### 🐛 **Bug Fixes**
-- **GPT-OSS Token Handling**: Fixed streaming end token detection for GPT-OSS models (partial Issue #32)
-- **Consistent Output**: Server and CLI now produce identical formatted responses for GPT-OSS
-- **Control Token Filtering**: Eliminated unwanted GPT-OSS control tokens in final output
+### Package Changes
 
-### 🏗️ **Architecture & Dependencies**
-- **MLX Compatibility**: Updated for MLX ≥0.29.0 and MLX-LM ≥0.27.0
-- **Modular Design**: Reasoning logic centralized for maintainability
-- **Enhanced Test Coverage**: Comprehensive end token and reasoning model tests
+- **Package name**: `mlx-knife` (unchanged from 1.x)
+- **Primary command**: `mlxk` (replaces `mlxk2` from beta)
+- **Aliases**: `mlxk-json`, `mlxk2` (backwards compatibility)
 
-### 📋 **Requirements**
-- MLX ≥0.29.0 (required for MXFP4 support)
-- MLX-LM ≥0.27.0 (required for FP4 quantization)
+### Breaking Changes
 
-### 🔮 **Future Development Notes**
-- Pattern-based reasoning architecture designed for extensibility
-- `reasoning_utils.py` structured to support additional reasoning model types  
-- Current implementation: GPT-OSS only, other models (DeepSeek R1, QwQ) remain open
+- **Lock file handling**: `mlxk rm` requires `--force` flag when models have active locks (safety improvement)
+- See [MIGRATION.md](MIGRATION.md) for complete migration guide from 1.x
+
+### Installation
+
+```bash
+# PyPI (recommended)
+pip install mlx-knife
+
+# GitHub release
+pip install https://github.com/mzau/mlx-knife/releases/download/v2.0.0/mlx_knife-2.0.0-py3-none-any.whl
+
+# Upgrade from 1.x
+pip install --upgrade mlx-knife
+```
+
+### Testing
+
+- 297 passed, 20 skipped (317 total tests)
+- Python 3.9-3.13 compatibility verified
+- Apple Silicon (M1/M2) tested
+
+---
+
+## 2.0.0-beta.6 — 2025-10-22
+
+### Fixed
+- **Stop token detection for multi-EOS models** (Issue #32, ADR-009): MXFP4 and Qwen models no longer generate visible stop tokens (`<|end|>`) or chat template markers in output
+- **Private/org MLX model detection** (Issue #37): `mlxk run` now correctly detects MLX models outside `mlx-community/*` namespace
+- **Commit-pinned compatibility checks**: Models with `@commit_hash` syntax now correctly validated before inference
+- **Packaging dependencies** (P0): `pip install -e .` now installs all required dependencies (`mlx-lm`, `mlx`, `fastapi`, etc.) via `pyproject.toml`
+
+### Documentation
+- Simplified installation instructions in README.md and TESTING.md (consistent `pip install -e ".[dev,test]"` recommendation)
+
+### Testing
+- 297 passed, 20 skipped (317 total)
+- Added 6 new tests: 4 stop token validation tests (opt-in), 2 compatibility check tests
+
+## 2.0.0-beta.5 — 2025-10-20
+
+**Enhanced Error Handling & Logging (ADR-004)**: Unified error envelope, structured logging with JSON support, and request correlation.
+
+**Legacy Model Format Detection**: Models with outdated weight file formats are detected and marked as runtime-incompatible (Issue #37).
+
+### Added
+
+- **Error envelope and structured logging** (ADR-004 Phase 1):
+  - Unified error envelope for CLI/Server: `{"status": "error", "error": {"type", "message", "detail", "retryable"}, "request_id"}`
+  - Request correlation via `request_id` (UUID4) in all server responses and logs
+  - HTTP status mapping: 400 (validation), 403 (access denied), 404 (not found), 500 (internal), 503 (shutdown)
+  - Structured logging with INFO/WARN/ERROR/DEBUG levels (replaces ad-hoc print statements)
+  - Optional JSON logs via `MLXK2_LOG_JSON=1` for machine-readable output
+  - **Log-level control**: `--log-level` (debug/info/warning/error) controls MLXKLogger, root logger, and Uvicorn access logs
+  - **`--log-json` CLI flag**: User-friendly alternative to `MLXK2_LOG_JSON=1` environment variable
+  - **Uvicorn JSON formatting**: Access logs (`GET /v1/models`, etc.) also formatted as JSON when `--log-json` is used
+  - **Root logger JSON formatting**: External libraries (mlx-lm, transformers) also log as JSON in JSON mode
+  - Automatic redaction of sensitive data (HF tokens, user paths)
+  - Error rate limiting (max 1 error per 5s for duplicate errors)
+  - New modules: `mlxk2/errors.py`, `mlxk2/logging.py`, `mlxk2/context.py`
+  - FastAPI middleware: Request ID injection, custom exception handler
+  - **User documentation**: README.md "Logging & Debugging" section (log levels, JSON format, redaction examples)
+  - Test coverage: 22 new tests in `test_adr004_error_logging.py`
+
+- **Legacy format detection in runtime compatibility check** (Issue #37):
+  - Gate 2 in `check_runtime_compatibility()`: Validates weight file naming conventions
+  - Detects legacy patterns: `weights.*.safetensors` (e.g., `weights.00.safetensors`), `pytorch_model-*.safetensors`
+  - Accepts modern patterns: `model.safetensors`, `model-XXXXX-of-YYYYY.safetensors`
+  - Clear error message: `"Legacy format not supported by mlx-lm"`
+- **Pre-flight check in `run` command**:
+  - Validates runtime compatibility before attempting model load
+  - Prevents cryptic mlx-lm errors: `"ERROR:root:No safetensors found in..."`
+  - Returns user-friendly error: `"Model 'X' is not compatible: Legacy format not supported by mlx-lm"`
+  - Best-effort check: gracefully skips if model not in cache (preserves test compatibility)
+
+### Changed
+- **Runtime compatibility validation extended**:
+  - Gate 1: Framework check (MLX vs GGUF/PyTorch) - from Beta.4
+  - Gate 2: **NEW** - Weight file format check (modern vs legacy patterns)
+  - Gate 3: Model type support check (mlx-lm compatibility) - from Beta.4
+- **CLI description**: "HuggingFace model management for MLX" (removed "JSON-first" and version number)
+- **README reorganization**: Better section flow, merged duplicate sections, removed beta-specific content (550 lines)
+
+### Fixed
+- **Legacy format detection** (Issue #37, bug):
+  - Models with legacy weight file formats (`weights.*.safetensors`, `pytorch_model-*.safetensors`) now correctly detected as runtime-incompatible
+  - Health output: `healthy` (file integrity OK) but `runtime_compatible: false`
+  - `reason` field describes incompatibility: `"Legacy format not supported by mlx-lm"`
+  - Human output: `healthy*` in compact mode, `healthy | no | Legacy format...` in verbose mode
+  - Pre-flight check in `run` command prevents cryptic mlx-lm errors
+- **CLI error handling** (regression since 19a6667): Running `mlxk2` without arguments now shows help text (like git/docker) instead of JSON error, `--json` flag properly respected for automation
+- **Code quality**: Removed 7 unused imports, ruff checks pass
+
+### Implementation
+- `mlxk2/operations/health.py`:
+  - `check_runtime_compatibility()` Gate 2 implementation (lines 272-304)
+  - Regex patterns for legacy format detection
+  - Mixed legacy/modern: prefers modern if both present
+- `mlxk2/operations/run.py`:
+  - Pre-flight runtime compatibility check (lines 45-89)
+  - Clear error messages before mlx-lm loading
+
+### Testing
+- **Current Status**: 293 passed, 14 skipped, 1 warning (urllib3/LibreSSL)
+- **New Tests** (25 total):
+  - `tests_2.0/test_adr004_error_logging.py` (22 tests):
+    - Error envelope structure and serialization
+    - Error type to HTTP status mapping (8 error types validated)
+    - Request ID generation and propagation (UUID4 validation, context nesting)
+    - Log redaction (HF tokens, home directory paths)
+    - Structured logging (plain text vs JSON modes, log levels, rate limiting)
+  - `tests_2.0/test_legacy_formats.py` (3 tests):
+    - `test_weights_numeric_safetensors_is_runtime_incompatible`: Validates `weights.00.safetensors` detection
+    - `test_pytorch_model_numeric_safetensors_is_runtime_incompatible`: Validates `pytorch_model-*.safetensors` detection
+    - `test_modern_model_safetensors_passes_legacy_gate`: Ensures modern formats are not rejected
+- **Regression**: All existing tests pass (zero breaking changes)
+
+### Known Issues
+- **Missing tests for Issue #36** (Beta.4 gap):
+  - No dedicated tests for Gate 1 (framework check)
+  - No dedicated tests for Gate 3 (model_type support)
+  - Runtime compatibility tested indirectly via Issue #37 tests and schema validation
+  - TODO: Add explicit tests for Beta.4 runtime compatibility feature
+
+### User Experience Example
+```bash
+# Before (Beta.4): Cryptic mlx-lm error
+$ mlxk2 run TinyLlama-1.1B-Chat-v1.0-4bit "Hello"
+ERROR:root:No safetensors found in /Volumes/.../snapshots/01a7088...
+
+# After (Beta.5): Clear error message
+$ mlxk2 run TinyLlama-1.1B-Chat-v1.0-4bit "Hello"
+Error: Model 'mlx-community/TinyLlama-1.1B-Chat-v1.0-4bit' is not compatible: Legacy format not supported by mlx-lm
+
+# Health status shows details
+$ mlxk2 show TinyLlama-1.1B-Chat-v1.0-4bit
+Health: healthy (files OK, runtime incompatible)
+Reason: Legacy format not supported by mlx-lm
+```
+
+### Notes
+- Legacy models are file-complete (healthy integrity) but use outdated naming conventions incompatible with modern mlx-lm
+- Pre-flight check improves UX by catching incompatibility before expensive model loading
+
+---
+
+## 2.0.0-beta.4 — 2025-10-18
+
+**Health Check Enhancement**: Separate integrity and runtime compatibility validation (Issue #36).
+
+### Changed
+- **JSON API 0.1.5 specification**:
+  - Added `runtime_compatible: boolean` field to `modelObject` (always present)
+  - Added `reason: string | null` field to `modelObject` (describes first problem found)
+  - `list`/`show` JSON output performs both integrity and runtime compatibility checks
+  - Gate logic: Runtime check requires integrity check first; `reason` shows first problem (integrity > runtime priority)
+- **Health check concepts documented**:
+  - Integrity Check (`health` field): File-level validation (required files, no LFS pointers, valid JSON)
+  - Runtime Compatibility Check (`runtime_compatible` field): MLX framework + architecture validation with mlx-lm
+  - Framework detection: GGUF/PyTorch models marked as runtime-incompatible
+  - Architecture detection: Unsupported model types (e.g., `qwen3_next` with mlx-lm < 0.28.0) detected
+  - Respects `MODEL_REMAPPING` for aliased architectures (e.g., `mistral` → `llama`)
+
+### Implementation Status
+- ✅ **Phase 1 Complete**: JSON API Specification 0.1.5
+  - `docs/json-api-schema.json` updated with new fields
+  - `docs/json-api-specification.md` extended with health check concepts and examples
+- ✅ **Phase 2 Complete**: JSON Implementation
+  - `mlxk2/spec.py` bumped to 0.1.5
+  - `mlxk2/operations/health.py`: `check_runtime_compatibility()` with gate logic
+  - `mlxk2/operations/common.py`: `build_model_object()` always computes `runtime_compatible` + `reason`
+  - mlx-lm API compatibility: Supports both 0.27.x (`mlx_lm.utils._get_classes`) and 0.28.x APIs
+  - Log suppression: mlx-lm ERROR logs redirected to `reason` field only
+- ✅ **Phase 3 Complete**: Human Output Specification
+  - Compact mode: `healthy` / `healthy*` / `unhealthy` (single column)
+  - Verbose mode: "Integrity" | "Runtime" | "Reason" (split columns)
+  - ASCII-only output (no UTF-8 symbols for parsing compatibility)
+  - README.md fully documented with examples and design philosophy
+  - JSON examples verified for consistency with schema and code
+- ✅ **Phase 4 Complete**: Human Output Implementation in `mlxk2/output/human.py`
+
+### Dependencies
+- **mlx-lm requirement updated**: `>=0.27.0` → `>=0.28.3`
+  - Now uses official mlx-lm 0.28.3 release with Python 3.9 compatibility fixes for `qwen3_next`
+  - Adds support for newer architectures (Klear, qwen3_next, etc.)
+  - Git pin removed in favor of stable PyPI release
+
+### Validation
+- ✅ All 256 tests pass (9 skipped)
+- ✅ Runtime compatibility correctly detects:
+  - GGUF/PyTorch models → `runtime_compatible: false` (framework mismatch)
+  - Supported MLX models → `runtime_compatible: true`
+  - Unsupported architectures → `runtime_compatible: false` with descriptive `reason`
+  - Klear-46B verified working with mlx-lm 0.28.2
+
+
+### Notes
+- Human output columns controlled by CLI flags (documentation in README.md, separate from JSON spec)
+- This addresses the root cause discovered in Issue #36: GGUF models show "healthy" but are not executable with mlx-lm
+
+## 2.0.0-beta.3 — 2025-09-18
+
+**Feature Complete**: Full 1.1.1 parity achieved with Clone implementation (ADR-007 Phase 1) and APFS filesystem detection fixes.
+
+### Added
+- **Clone command implementation** (MAJOR):
+  - Complete `mlxk2 clone` with ADR-007 Phase 1: Same-Volume APFS strategy
+  - APFS Copy-on-Write optimization for instant cloning
+  - Isolated temp cache with user cache safety
+  - Health check integration via `health_from_cache`
+  - Feature-gated behind `MLXK2_ENABLE_ALPHA_FEATURES=1`
+- **JSON API 0.1.4 specification**:
+  - Clone operation schema and documentation
+  - Complete schema validation coverage for all 10 JSON commands
+  - Schema tests for `list`, `show`, `health`, `pull`, `rm`, `clone`, `version`, `push`, `run`, `server`
+
+### Fixed
+- **APFS filesystem detection**: SMB/network mounts now correctly detected as Non-APFS
+- **Push APFS warnings**: Non-APFS cache setups now display filesystem warnings
+
+### Testing
+- **Comprehensive test coverage**: 254/254 tests passing, 11 skipped
+- **Clone operation tests**: 43 tests covering APFS, volume detection, health integration
+- **Live validation**: 3 live clone + push tests with real HuggingFace models
+
+## 2.0.0-beta.3-local — 2025-09-14
+
+**Feature Complete Beta**: 1.x parity achieved. All core functionality implemented with clean experimental separation.
+
+### Added
+- **Run command implementation** (MAJOR):
+  - Complete `mlxk2 run` with interactive and single-shot modes
+  - Streaming and batch generation with parameter controls (`--temperature`, `--top-p`, `--max-tokens`)
+  - Chat template integration and conversation history tracking
+  - Interrupt handling (Ctrl-C) with graceful recovery and session reset
+  - Enhanced run with future features (system prompts, reasoning model support)
+- **MLXRunner core engine** (ported from 1.x):
+  - `mlxk2.core.runner` package with modular architecture
+  - Dynamic token limits (full context for run, half-context for server)
+  - Stop token filtering and reasoning model detection
+  - Thread-safe model loading, memory management, and cleanup
+- **Server implementation**:
+  - OpenAI-compatible endpoints (`/v1/completions`, `/v1/chat/completions`, `/v1/models`, `/health`)
+  - SSE streaming with SIGINT-robust supervisor mode (deterministic shutdown/restart)
+  - Model hot-swapping and thread-safe memory management
+  - Half-context token limits for DoS protection
+- **Experimental feature separation**:
+  - Push command hidden behind `MLXK2_ENABLE_EXPERIMENTAL_PUSH=1` environment variable
+  - Clean beta/experimental boundaries for stable release classification
+
+### Changed
+- **Feature status**: All core commands now complete
+  - README/docs updated: Run status "Pending" → "Complete"
+  - Feature parity with 1.x stable releases achieved
+  - Stable version reference updated to 1.1.1
+- **Test architecture**:
+  - Default suite: **184 passed, 30 skipped** (stable features only)
+  - Experimental: **205 passed, 9 skipped** (with `MLXK2_ENABLE_EXPERIMENTAL_PUSH=1`)
+  - Clean separation ensures beta testing covers stable features only
+- **Runner architecture**:
+  - Modular design with focused helpers: `token_limits.py`, `chat_format.py`, `reasoning_format.py`, `stop_tokens.py`
+  - API compatibility preserved for existing integrations and test patches
+
+### Fixed
+- **Pull operation cache pollution (Issue #30)**:
+  - Added preflight access check with `preflight_repo_access()` to validate repository accessibility
+  - Prevents cache pollution from attempting downloads of gated/private/missing repos
+  - Surfaces clear "Access denied" guidance with `HF_TOKEN` hints before any download
+  - Robust error handling across different `huggingface_hub` versions
+- **Test stability**:
+  - Pull network timeout test fixed for environments without `HF_TOKEN`
+  - All push tests now properly gated behind environment variable (no unexpected failures)
+  - Default test runs require no external dependencies or credentials
+- **Documentation accuracy**:
+  - Feature status corrected across README/TESTING to reflect actual implementation
+  - Test count documentation updated to reflect stable vs experimental separation
+
+### Implementation Milestones
+- **Complete 1.x parity**: All core functionality (list, health, show, pull, rm, run, serve) fully implemented
+- **Production ready**: Comprehensive testing across Python 3.9-3.13 with isolated cache system
+- **Clean architecture**: Experimental features properly isolated, beta definition clarified
+- **GitHub issues resolved**: Run implementation, interactive mode, streaming support, feature parity
+
+### Tests & Docs
+- **Comprehensive test coverage**: 31+ tests for run command (interactive, parameters, error handling)
+- **TESTING.md**: Clear guidance on stable (184) vs experimental (+21) test runs
+- **Multi-Python verification**: All tests passing across supported Python versions
+- **Skip breakdown documented**: 21 push tests, 1 live test, 8 other opt-in tests
+
+### Notes
+- 2.0.0-beta.3 represents **complete feature parity** with 1.x stable releases
+- Ready for production use as comprehensive 1.x alternative
+- Experimental features cleanly separated for future development
+
+## 2.0.0-alpha.3 — 2025-09-08
+
+Port Issue #31 (lenient MLX detection) to 2.0; refine human list behavior.
+
+Hard split: 1.x code and tests have been removed from this branch to avoid confusion and license duality. Use the `main` branch for 1.x (MIT).
+
+### Added
+- Detection helpers (README front‑matter + tokenizer):
+  - Framework=MLX when README front‑matter `tags` includes `mlx` or `library_name: mlx`, in addition to `mlx-community/*`.
+  - Type=chat when tokenizer has `chat_template`, or name hints (`instruct`/`chat`), or `config.model_type == 'chat'`.
+  - Unified `build_model_object(...)` used by `list` and `show` to ensure consistent fields.
+- Tests:
+  - Offline: front‑matter and tokenizer detection for both `list` and `show`.
+  - Human output: verifies default/verbose/all filtering semantics.
+  - Live (opt-in): `tests_2.0/live/test_list_human_live.py` checks human list variants against a real HF cache (marker `-m live_list`).
+  - Push (offline): branch-missing tolerance and retry on "Invalid rev id" with `--create`.
+
+### Changed
+- Human list (default): shows only MLX chat models (safer for run/server selection).
+- Human list `--verbose`: shows all MLX models (chat + base).
+- Human list `--all`: shows all frameworks (MLX, GGUF, PyTorch).
+- `show` uses the same detection helpers as `list`; respects `HF_HOME` via `get_current_model_cache()`.
+
+### Docs
+- SECURITY.md: clarified experimental push scope and network behavior (explicit only; no background traffic).
+- README.md: added “Privacy & Network” bullet; updated version strings to alpha.3.
+ - README.md: noted hard split — 1.x lives on `main` (MIT), this branch is 2.x (Apache‑2.0).
+
+### Notes
+- No JSON API schema changes; spec remains 0.1.3.
+ 
+### Fixed
+- Push: tolerate missing target branches; with `--create`, proactively create the branch and retry the upload once. No‑op uploads still create the branch when `--create` is provided.
 
 ## [1.1.1-beta.2] - 2025-09-06
 
@@ -92,6 +390,26 @@
 
 Note: GitHub tag/version uses `1.1.1-beta.2`. PyPI release uses PEP 440 `1.1.1b2`.
 
+## 2.0.0-alpha.2 — 2025-09-05
+
+Experimental `push` (upload only) and documentation/testing refinements.
+
+### Added
+- `push` (experimental, M0): Upload a local folder to Hugging Face using `upload_folder`.
+  - Safety: `--private` required in alpha.
+  - Quiet JSON: With `--json` (without `--verbose`) suppress progress bars/console logs; hub logs are captured in `data.hf_logs`.
+  - No-op detection: Prefer hub signal (“No files have been modified… Skipping…”). Sets `no_changes: true`, clears `commit_sha/commit_url`, and sets `uploaded_files_count: 0`.
+  - Offline preflight: `--check-only` analyzes the local workspace and returns `data.workspace_health` (index/weights/LFS/partials) without network.
+  - Dry-run planning: `--dry-run` computes a plan vs remote (uses `list_repo_files`), returns `dry_run: true`, `dry_run_summary {added, modified:null, deleted}`, and sample `added_files`/`deleted_files` (up to 20). Honors default ignores and merges `.hfignore`.
+  - Uploaded file count: Remains `null` when hub does not return per-file operations; no heuristic guessing.
+
+### Docs
+- TESTING.md: Added “Reference: Push CLI and JSON”, `--dry-run` examples, and a mini matrix (default vs markers/opt-in).
+- CLAUDE.md: Updated Current Focus/Decisions + session summary for push quiet mode, no-op, `--dry-run`.
+
+### Tests
+- Offline push tests added/extended, including dry-run planning; live push remains opt-in via `wet`/`live_push` markers and required env vars.
+
 ## [1.1.1-beta.1] - 2025-09-01
 
 ### Fix: Strict Health Completeness for Multi‑Shard Models (Issue #27)
@@ -109,6 +427,13 @@ Note: GitHub tag/version uses `1.1.1-beta.2`. PyPI release uses PEP 440 `1.1.1b2
 
 Note: GitHub tag/version uses `1.1.1-beta.1`. PyPI release uses PEP 440 `1.1.1b1`.
 
+## 2.0.0-alpha.1 — 2025-08-31
+
+- New JSON-first CLI (`mlxk2`, `mlxk-json`); `--json` for machine-readable output (new vs 1.0.0).
+- Human output by default: improved formatting, new Type column, relative Modified; MLX-only compact view with `--all`, `--health`, `--verbose` flags.
+- Stricter health checks for sharded models (Issue #27); robust model resolution (fuzzy, `@hash`); `rm` cleans whole model and locks.
+- Packaging/tooling: dynamic versioning; multi-Python test script; Python 3.9–3.13; timezone-aware datetimes.
+- **Not included yet: server and run** (use 1.x).
 
 ## [1.1.0] - 2025-08-26 - **STABLE RELEASE** 🚀
 
@@ -187,90 +512,31 @@ Note: GitHub tag/version uses `1.1.1-beta.1`. PyPI release uses PEP 440 `1.1.1b1
   - **Root Cause**: `generate_batch()` lacked End-Token filtering present in `generate_streaming()`
   - **Fix**: Ported filtering logic with new `_filter_end_tokens_from_response()` method
   - **Affected**: `mlxk run model "prompt" --no-stream` and Server API `"stream": false`
-  - **Impact**: Professional clean output - no visible `</s>`, `<|im_end|>`, `<|end|>` tokens
-  - **Test Coverage**: 47/48 comprehensive tests validate fix across all model architectures
+  - **Impact**: No more end tokens appearing in the final output in non-streaming mode
 
-### Test Infrastructure Improvements 🧪
-- **New Test Suite**: `tests/integration/test_end_token_issue.py` with 48 systematic tests
-- **RAM-Aware Testing**: Automatic model selection based on available system memory
-- **Flaky Test Fix**: Improved server lifecycle management with proper port cleanup
-- **Blocking Read Fix**: Fixed timeout issues in server startup validation tests
-- **Test Count**: 132/132 standard tests + 48 server tests (180 total)
+### Enhanced
+- Better default for `--max-tokens`: `None` → model-aware limits
+- Improved consistency between streaming and non-streaming generation
+- Clearer server logs indicating active token policies
 
-### Documentation Updates 📚
-- **TESTING.md**: New server test procedures, updated test counts (132/132), comprehensive server test guide
-- **Test Categories**: Clear separation of standard tests vs resource-intensive server tests
-- **Server Test Documentation**: RAM requirements, timing expectations, model compatibility
-
-### Architecture Quality 🏗️
-- **End-Token Consistency**: Streaming and non-streaming pipelines now identical in behavior
-- **Clean Code**: Unified filtering logic eliminates code duplication between pipelines  
-- **Regression Prevention**: Comprehensive test coverage prevents future End-Token issues
-- **Professional Output**: All models and modes produce clean, professional responses
-- **Test Stability**: Eliminated flaky tests and timeouts for reliable CI/CD
+### Technical
+- 15 new tests across server and CLI to validate token policies
+- Internal refactoring for token handling to avoid duplication
 
 ## [1.1.0-beta1] - 2025-08-21
 
-### Major Features 🚀
-- **Issues #15 & #16**: Dynamic Model-Aware Token Limits
-  - Eliminated hardcoded 500/2000 token defaults with intelligent model-based limits
-  - **Phi-3-mini**: 4096 context → 2048 server tokens, 4096 interactive (8x improvement)
-  - **Qwen2.5-30B**: 262,144 context → 131,072 server tokens, 262,144 interactive (524x improvement!)
-  - Context-aware policies: Interactive mode uses full context, server mode uses context/2 for DoS protection
-  - Automatic adaptation to new models with larger context windows (future-proof)
+### Added
+- Dynamic model-aware token limits (context-length sensitive)
+- CLI `--max-tokens` default changed to `None` (was 2000)
+- Server leverages the same dynamic limits
 
-### Enhanced Web Client 🌐  
-- **Model Token Capacity Display**: Shows "Ready with Mistral-7B (32,768 tokens)" in header
-- **Enhanced `/v1/models` API**: Now exposes `context_length` field for model capabilities
-- **Button State Management**: Clear Chat properly disabled during streaming with CSS styling
-- **Streaming Status Tracking**: Added `isStreaming` flag with "Generating response..." feedback
+### Improved
+- End-token filtering consistency across streaming and non-streaming modes
+- Robustness in model loading and memory management
 
-### Interactive Mode Improvements 💡
-- **Smart CLI Defaults**: `mlxk run <model> "prompt"` automatically uses optimal token limits per model
-- **No Configuration Needed**: Users benefit immediately without changing usage patterns
-- **Explicit Control Preserved**: `--max-tokens` arguments still respected and capped at model context
-- **Clean Type Safety**: Proper `Optional[int]` handling eliminates fragile CLI guessing
-
-### Technical Architecture 🏗️
-- **`get_model_context_length()` function**: Extracts context length from model configs with multiple fallback keys
-- **Enhanced MLXRunner**: `get_effective_max_tokens()` method for context-aware token limiting
-- **Server API Updates**: All endpoints use model-aware limits with DoS protection
-- **Unified Token Logic**: Single source of truth through MLXRunner eliminates duplicate code
-- **Backward Compatible**: All existing CLI arguments and APIs work unchanged
-
-### Performance Impact 📊
-- **Modern Models Unleashed**: Large-context models can now use their full capabilities
-- **Real-World Benefits**: No more artificial 500-token truncation for 100K+ context models  
-- **Smart Server Limits**: Automatic DoS protection while maximizing usable context
-- **Zero Magic Numbers**: Clean architecture with clear `None` vs explicit value semantics
-
-### Testing & Quality Assurance ✅
-- **Comprehensive Coverage**: 131/131 tests passing (expansion from 114 tests)
-- **20 new unit tests**: Covering CLI None-handling, model context extraction, effective token calculation
-- **5 server integration tests**: Real-world validation with actual MLX models
-- **Extreme Model Testing**: Validated with models from 1B to 30B parameters, up to 256K context
-- **Edge Case Handling**: Unknown models, missing configs, CLI argument combinations
-
-### Issue #14 Model Compatibility Validation
-**Chat Self-Conversation Fix tested across model spectrum:**
-
-| Model | Size | RAM (GB) | Context | Status | Architecture |
-|-------|------|----------|---------|--------|-------------|
-| **Llama-3.2-1B-Instruct-4bit** | 1B | 2 | 131,072 | ✅ PASSED | Llama |
-| **Llama-3.2-3B-Instruct-4bit** | 3B | 4 | 131,072 | ✅ PASSED | Llama |
-| **Phi-3-mini-4k-instruct-4bit** | 4B | 5 | 4,096 | ✅ PASSED | Phi-3 |
-| **Mistral-7B-Instruct-v0.2-4bit** | 7B | 8 | 32,768 | ✅ PASSED | Mistral |
-| **Mixtral-8x7B-Instruct-v0.1-4bit** | 8x7B | 16 | 32,768 | ✅ PASSED | Mixtral MoE |
-| **Mistral-Small-3.2-24B-Instruct-2506-4bit** | 24B | 20 | 32,768 | ✅ PASSED | Mistral |
-| **Qwen3-30B-A3B-Instruct-2507-4bit** | 30B | 24 | 262,144 | ✅ PASSED | Qwen |
-
-**Validation Results**: 7/7 models passed - comprehensive coverage from 1B to 30B parameters across all major MLX architectures ensures robust chat stop token handling.
-
-### Beta Status Notes ⚠️
-- **Core Functionality**: Solid foundation with comprehensive test coverage
-- **Known Limitation**: Server deadlock possible under extreme concurrent model loading stress
-- **Workaround**: Avoid simultaneous heavy model operations (normal usage unaffected)  
-- **Real-World Ready**: Significant improvements ready for community testing and feedback
+### Tests
+- 114/114 tests passing
+- Server tests behind `@pytest.mark.server` (opt-in)
 
 ## [1.0.4] - 2025-08-19
 
@@ -290,7 +556,7 @@ Note: GitHub tag/version uses `1.1.1-beta.1`. PyPI release uses PEP 440 `1.1.1b1
   - 🔄 Smart model switching: Choice to keep or clear chat history when switching models
   - 🌐 Responsive design: Full viewport height utilization, optimized screen space usage
   - 🎯 Clear UX: "Clear Chat" instead of ambiguous "Clear" button
-  - 🏴󠁧󠁢󠁥󠁮󠁧󠁿 English dialogs: Custom modal dialogs replace German OS dialogs
+  - 🏴 English dialogs: Custom modal dialogs replace German OS dialogs
 
 ### Added
 - **Automated Server Testing Infrastructure**:
@@ -411,3 +677,21 @@ Note: GitHub tag/version uses `1.1.1-beta.1`. PyPI release uses PEP 440 `1.1.1b1
 
 ## Known Issues
 - See GitHub Issues for tracking
+ 
+## 2.0.0‑beta.3 (local)
+
+- Server robustness and API polish
+  - Supervisor default: Uvicorn runs as subprocess in its own process group; Ctrl‑C terminates deterministically and allows immediate restart.
+  - HTTP mapping: 404 for unknown/failed model loads; 503 during shutdown; preserve HTTPException codes from helpers.
+  - Streaming (SSE):
+    - Happy path: initial chunk, per‑token chunks, final chunk, then `[DONE]`.
+    - Interrupt path: on `KeyboardInterrupt` emit clear interrupt marker and close promptly.
+  - Token limits: server mode uses half of context length; explicit `max_tokens` respected.
+  - Noise reduction: chat streaming debug prints gated behind `MLXK2_DEBUG`.
+
+- Testing
+  - Added focused server API tests for `/v1/models`, 404/503 mapping, SSE happy/interrupt, and server‑side token limit propagation.
+  - Global suppression of macOS Python 3.9 `urllib3` LibreSSL warning in tests; runtime already suppressed.
+
+- Docs
+  - README/TESTING touch‑ups pending flip; CLAUDE.md tracks SSE UX roadmap (anti‑buffering headers, optional heartbeats, status/interrupt endpoints).
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index d0e077d..704eb44 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,6 +6,19 @@ First off, thank you for considering contributing to MLX Knife! It's people like
 
 We're a small team passionate about making MLX models accessible and easy to use on Apple Silicon. We welcome contributions from everyone who shares this vision.
 
+## 2.0 Stable – Contributor Notes
+
+- **Code path:** `mlxk2/` (entry points: `mlxk`, `mlxk-json`, `mlxk2`)
+- **Default output:** Human-friendly tables/text; pass `--json` for machine-readable JSON API
+- **Full feature parity:** All commands available (`list`, `health`, `show`, `pull`, `rm`, `run`, `serve`)
+- **Tests:** Primary suite is `tests_2.0/` (see `pytest.ini`)
+- **Human output options:**
+  - `list`: `--all` (all frameworks), `--health` (add column), `--verbose` (full org/model names)
+  - Compact default: MLX-only, compact names (strip `mlx-community/`), no Framework column
+- **Cache safety:** Tests use isolated temp caches; read-only ops are safe; coordinate `pull`/`rm` when using a shared user cache
+- **Spec discipline:** JSON schema/spec changes require a version bump in `mlxk2/spec.py` (see docs/)
+
+
 ## How Can I Contribute?
 
 ### Reporting Bugs
@@ -31,8 +44,8 @@ Enhancement suggestions are tracked as GitHub issues. When creating an enhanceme
 
 1. Fork the repository and create your branch from `main`
 2. If you've added code, add tests that cover your changes
-3. Ensure the test suite passes locally: `pytest tests/`
-4. Make sure your code follows the existing style: `ruff check mlx_knife/ --fix`
+3. Ensure the test suite passes locally: `pytest tests_2.0/ -v`
+4. Make sure your code follows the existing style: `ruff check mlxk2/ --fix`
 5. Write a clear commit message
 6. Open a Pull Request with a clear title and description
 
@@ -43,18 +56,18 @@ Enhancement suggestions are tracked as GitHub issues. When creating an enhanceme
 git clone https://github.com/mzau/mlx-knife.git
 cd mlx-knife
 
-# Install in development mode with all dependencies
-pip install -e ".[dev,test]"
+# Install in development mode
+pip install -e .
 
 # Download a test model (required for full test suite)
 mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit
 
-# Run tests
-pytest
+# Run tests (2.0 default)
+pytest tests_2.0/ -v
 
-# Check code style
-ruff check mlx_knife/
-mypy mlx_knife/
+# Check code style (2.0)
+ruff check mlxk2/
+mypy mlxk2/
 
 # Test with a real model
 mlxk run Phi-3-mini "Hello world"
@@ -66,18 +79,18 @@ Understanding what goes where:
 
 ```
 Repository structure:
-├── mlx_knife/              # Python package (→ PyPI)
-├── tests/                  # Test suite
-├── simple_chat.html        # Web interface (GitHub only)
-├── README.md               # User documentation  
-├── CONTRIBUTING.md         # This file
-├── TESTING.md              # Testing guide
-├── pyproject.toml          # Build configuration
-└── requirements.txt        # Dependencies
+├── mlxk2/                       # 2.0 implementation (→ PyPI via mlxk-json)
+├── tests_2.0/                   # 2.0 test suite
+├── docs/                        # Documentation / ADRs
+├── README.md                    # User documentation
+├── CONTRIBUTING.md              # This file
+├── TESTING.md                   # Testing guide
+├── pyproject.toml               # Build configuration (dynamic version, optional test deps)
+└── requirements.txt             # Dev/test dependencies
 ```
 
 **What goes where:**
-- **PyPI Package**: Only `mlx_knife/` + build files (`pyproject.toml`, `requirements.txt`)
+- **PyPI Package**: Only `mlxk2/` + `pyproject.toml` (optional dependencies excluded from release wheel)
 - **GitHub Repository**: Everything else (documentation, tests, web interface)
 
 This helps ensure contributors commit files to the right place and understand the package vs. repository distinction.
@@ -116,9 +129,9 @@ For detailed testing options, troubleshooting, and advanced workflows, see **[TE
 Please ensure all tests pass locally:
 ```bash
 # Complete test workflow
-ruff check mlx_knife/ --fix    # Fix code style
-mypy mlx_knife/                 # Check types
-pytest tests/                   # Run all tests
+ruff check mlxk2/ --fix         # Fix code style
+mypy mlxk2/                     # Check types
+pytest -v                       # Run all 2.0 tests
 ```
 
 Since we don't have CI/CD (MLX requires Apple Silicon), we rely on contributors to verify their changes locally. Please mention in your PR:
@@ -155,8 +168,8 @@ Mention your Python version in the PR description.
    - Update documentation if needed
 
 3. **Before submitting:**
-   - Run the full test suite locally: `pytest tests/`
-   - Run code quality checks: `ruff check mlx_knife/ --fix`
+   - Run the full test suite locally: `pytest -v`
+   - Run code quality checks: `ruff check mlxk2/ --fix`
    - Test with YOUR Python version (3.9+ required)
    - Update README.md if you've added features
 
@@ -166,6 +179,35 @@ MLX Knife has comprehensive test coverage. For detailed testing documentation in
 
 **When adding new tests**: Please update the test structure documentation in **[TESTING.md](TESTING.md)** if you add new test files or categories.
 
+### Spec Version Discipline (JSON API)
+
+If you change the JSON API spec or schema, bump the spec version and keep code/tests in sync.
+
+- Spec files: `docs/json-api-specification.md`, `docs/json-api-schema.json`
+- Version constant: `mlxk2/spec.py` → `JSON_API_SPEC_VERSION`
+- Guard script: `scripts/check-spec-bump.sh`
+
+Usage examples:
+
+```bash
+# Local check against main
+scripts/check-spec-bump.sh origin/main
+
+# Bypass for editorial-only changes
+SPEC_BUMP_BYPASS=1 scripts/check-spec-bump.sh origin/main
+```
+
+CI suggestion (GitHub Actions step):
+
+```bash
+- name: Check JSON API spec bump
+  run: |
+    git fetch origin main --depth=1
+    scripts/check-spec-bump.sh origin/main
+```
+
+Bypass tokens (commit message): `[no-spec-bump]` or `[skip-spec-bump]` for formatting-only edits.
+
 ## Code Style
 
 - We use `ruff` for formatting and linting
@@ -196,7 +238,19 @@ Feel free to open an issue with the "question" label or start a discussion. We'r
 
 ## License
 
-By contributing, you agree that your contributions will be licensed under the MIT License.
+**Important:** MLX Knife 2.0+ is licensed under the **Apache License, Version 2.0**.
+
+By contributing to MLX Knife, you agree that:
+1. Your contributions will be licensed under the Apache License, Version 2.0
+2. You have the right to contribute the code under these terms
+3. You grant the project maintainers a perpetual, worldwide, non-exclusive, royalty-free license to use, reproduce, modify, and distribute your contributions
+
+**Legacy 1.x versions** (MIT License) are maintained in the `1.x-legacy` branch for reference only. All new contributions go to the main branch (Apache 2.0).
+
+We recommend including a Developer Certificate of Origin (DCO) "Signed-off-by" line in your commits:
+```bash
+git commit -s -m "Your commit message"
+```
 
 ---
 
diff --git a/LICENSE b/LICENSE
index 2ec4020..1e32dfc 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,21 +1,201 @@
-MIT License
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
 
-Copyright (c) 2025 The BROKE team 🦫
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
 
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
+1. Definitions.
 
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
+"License" shall mean the terms and conditions for use, reproduction,
+and distribution as defined by Sections 1 through 9 of this document.
 
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
\ No newline at end of file
+"Licensor" shall mean the copyright owner or entity authorized by
+the copyright owner that is granting the License.
+
+"Legal Entity" shall mean the union of the acting entity and all
+other entities that control, are controlled by, or are under common
+control with that entity. For the purposes of this definition,
+"control" means (i) the power, direct or indirect, to cause the
+direction or management of such entity, whether by contract or
+otherwise, or (ii) ownership of fifty percent (50%) or more of the
+outstanding shares, or (iii) beneficial ownership of such entity.
+
+"You" (or "Your") shall mean an individual or Legal Entity
+exercising permissions granted by this License.
+
+"Source" form shall mean the preferred form for making modifications,
+including but not limited to software source code, documentation
+source, and configuration files.
+
+"Object" form shall mean any form resulting from mechanical
+transformation or translation of a Source form, including but not
+limited to compiled object code, generated documentation, and
+conversions to other media types.
+
+"Work" shall mean the work of authorship, whether in Source or
+Object form, made available under the License, as indicated by a
+copyright notice that is included in or attached to the work
+(an example is provided in the Appendix below).
+
+"Derivative Works" shall mean any work, whether in Source or Object
+form, that is based on (or derived from) the Work and for which the
+editorial revisions, annotations, elaborations, or other modifications
+represent, as a whole, an original work of authorship. For the purposes
+of this License, Derivative Works shall not include works that remain
+separable from, or merely link (or bind by name) to the interfaces of,
+the Work and Derivative Works thereof.
+
+"Contribution" shall mean any work of authorship, including
+the original version of the Work and any modifications or additions
+to that Work or Derivative Works thereof, that is intentionally
+submitted to Licensor for inclusion in the Work by the copyright owner
+or by an individual or Legal Entity authorized to submit on behalf of
+the copyright owner. For the purposes of this definition, "submitted"
+means any form of electronic, verbal, or written communication sent
+to the Licensor or its representatives, including but not limited to
+communication on electronic mailing lists, source code control systems,
+and issue tracking systems that are managed by, or on behalf of, the
+Licensor for the purpose of discussing and improving the Work, but
+excluding communication that is conspicuously marked or otherwise
+designated in writing by the copyright owner as "Not a Contribution."
+
+"Contributor" shall mean Licensor and any individual or Legal Entity
+on behalf of whom a Contribution has been received by Licensor and
+subsequently incorporated within the Work.
+
+2. Grant of Copyright License. Subject to the terms and conditions of
+this License, each Contributor hereby grants to You a perpetual,
+worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+copyright license to reproduce, prepare Derivative Works of,
+publicly display, publicly perform, sublicense, and distribute the
+Work and such Derivative Works in Source or Object form.
+
+3. Grant of Patent License. Subject to the terms and conditions of
+this License, each Contributor hereby grants to You a perpetual,
+worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+(except as stated in this section) patent license to make, have made,
+use, offer to sell, sell, import, and otherwise transfer the Work,
+where such license applies only to those patent claims licensable
+by such Contributor that are necessarily infringed by their
+Contribution(s) alone or by combination of their Contribution(s)
+with the Work to which such Contribution(s) was submitted. If You
+institute patent litigation against any entity (including a
+cross-claim or counterclaim in a lawsuit) alleging that the Work
+or a Contribution incorporated within the Work constitutes direct
+or contributory patent infringement, then any patent licenses
+granted to You under this License for that Work shall terminate
+as of the date such litigation is filed.
+
+4. Redistribution. You may reproduce and distribute copies of the
+Work or Derivative Works thereof in any medium, with or without
+modifications, and in Source or Object form, provided that You
+meet the following conditions:
+
+(a) You must give any other recipients of the Work or
+Derivative Works a copy of this License; and
+
+(b) You must cause any modified files to carry prominent notices
+stating that You changed the files; and
+
+(c) You must retain, in the Source form of any Derivative Works
+that You distribute, all copyright, patent, trademark, and
+attribution notices from the Source form of the Work,
+excluding those notices that do not pertain to any part of
+the Derivative Works; and
+
+(d) If the Work includes a "NOTICE" text file as part of its
+distribution, then any Derivative Works that You distribute must
+include a readable copy of the attribution notices contained
+within such NOTICE file, excluding those notices that do not
+pertain to any part of the Derivative Works, in at least one
+of the following places: within a NOTICE text file distributed
+as part of the Derivative Works; within the Source form or
+documentation, if provided along with the Derivative Works; or,
+within a display generated by the Derivative Works, if and
+wherever such third-party notices normally appear. The contents
+of the NOTICE file are for informational purposes only and
+do not modify the License. You may add Your own attribution
+notices within Derivative Works that You distribute, alongside
+or as an addendum to the NOTICE text from the Work, provided
+that such additional attribution notices cannot be construed
+as modifying the License.
+
+You may add Your own copyright statement to Your modifications and
+may provide additional or different license terms and conditions
+for use, reproduction, or distribution of Your modifications, or
+for any such Derivative Works as a whole, provided Your use,
+reproduction, and distribution of the Work otherwise complies with
+the conditions stated in this License.
+
+5. Submission of Contributions. Unless You explicitly state otherwise,
+any Contribution intentionally submitted for inclusion in the Work
+by You to the Licensor shall be under the terms and conditions of
+this License, without any additional terms or conditions.
+Notwithstanding the above, nothing herein shall supersede or modify
+the terms of any separate license agreement you may have executed
+with Licensor regarding such Contributions.
+
+6. Trademarks. This License does not grant permission to use the trade
+names, trademarks, service marks, or product names of the Licensor,
+except as required for reasonable and customary use in describing the
+origin of the Work and reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty. Unless required by applicable law or
+agreed to in writing, Licensor provides the Work (and each
+Contributor provides its Contributions) on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+implied, including, without limitation, any warranties or conditions
+of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+PARTICULAR PURPOSE. You are solely responsible for determining the
+appropriateness of using or redistributing the Work and assume any
+risks associated with Your exercise of permissions under this License.
+
+8. Limitation of Liability. In no event and under no legal theory,
+whether in tort (including negligence), contract, or otherwise,
+unless required by applicable law (such as deliberate and grossly
+negligent acts) or agreed to in writing, shall any Contributor be
+liable to You for damages, including any direct, indirect, special,
+incidental, or consequential damages of any character arising as a
+result of this License or out of the use or inability to use the
+Work (including but not limited to damages for loss of goodwill,
+work stoppage, computer failure or malfunction, or any and all
+other commercial damages or losses), even if such Contributor
+has been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability. While redistributing
+the Work or Derivative Works thereof, You may choose to offer,
+and charge a fee for, acceptance of support, warranty, indemnity,
+or other liability obligations and/or rights consistent with this
+License. However, in accepting such obligations, You may act only
+on Your own behalf and on Your sole responsibility, not on behalf
+of any other Contributor, and only if You agree to indemnify,
+defend, and hold each Contributor harmless for any liability
+incurred by, or claims asserted against, such Contributor by reason
+of your accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
+
+APPENDIX: How to apply the Apache License to your work.
+
+To apply the Apache License to your work, attach the following
+boilerplate notice, with the fields enclosed by brackets "[]"
+replaced with your own identifying information. (Don't include
+the brackets!)  The text should be enclosed in the appropriate
+comment syntax for the file format. We also recommend that a
+file or class name and description of purpose be included on the
+same "printed page" as the copyright notice for easier
+identification within third-party archives.
+
+Copyright [2025] [The BROKE team]
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/MIGRATION.md b/MIGRATION.md
new file mode 100644
index 0000000..d2dd28d
--- /dev/null
+++ b/MIGRATION.md
@@ -0,0 +1,285 @@
+# Migration Guide: MLX Knife 1.x → 2.0
+
+This guide helps you transition from MLX Knife 1.x (MIT License) to 2.0 (Apache License 2.0).
+
+## License Change: MIT → Apache 2.0
+
+**Important:** MLX Knife 2.0 changes the license from MIT to Apache License 2.0.
+
+### What This Means for Users
+
+**Practical Impact:**
+- ✅ Still **free and open source**
+- ✅ Still **commercial use allowed**
+- ✅ Still **modification allowed**
+- ⚠️ **Attribution required** (include NOTICE file in distributions)
+- ⚠️ **Patent grant** (explicit patent protection for users)
+
+**Key Differences:**
+| Aspect | MIT (1.x) | Apache 2.0 (2.0+) |
+|--------|-----------|-------------------|
+| Use | ✅ Free | ✅ Free |
+| Modify | ✅ Allowed | ✅ Allowed |
+| Commercial | ✅ Allowed | ✅ Allowed |
+| Attribution | Optional | **Required** (NOTICE file) |
+| Patent Grant | Implicit | **Explicit** (better protection) |
+
+### Why Apache 2.0?
+
+1. **Better Patent Protection:** Explicit patent grant protects users from patent litigation
+2. **Industry Standard:** Used by major projects (Kubernetes, TensorFlow, Apache projects)
+3. **Clear Contribution Terms:** Explicit contributor licensing for future contributions
+
+### For Users: What You Need to Do
+
+**If you just use MLX Knife CLI:**
+- ✅ Nothing! Just upgrade: `pip install --upgrade mlx-knife`
+- The license change doesn't affect CLI usage
+
+**If you distribute MLX Knife in your product:**
+- ⚠️ Include the `LICENSE` file from the mlx-knife package
+- ⚠️ Include the `mlxk2/NOTICE` file in your distribution
+- Example: If bundling mlx-knife in a Mac app, include both files in your "Licenses" folder
+
+**If you're a 1.x user and prefer MIT:**
+- 🔒 Version 1.1.1 remains available under MIT License
+- Install specific version: `pip install mlx-knife==1.1.1`
+- Security updates will focus on 2.x (Apache 2.0)
+
+## Behavior Changes
+
+### `rm` Command: Lock File Handling
+
+The `rm` command now handles HuggingFace lock files more safely:
+
+**1.x Behavior:**
+```bash
+$ mlxk rm Phi-3-mini
+Delete entire model mlx-community/Phi-3-mini-4k-instruct-4bit? [y/N] y
+Model mlx-community/Phi-3-mini-4k-instruct-4bit completely deleted.
+Clean up cache files? [Y/n] y
+Cache files cleaned up (3 files).
+```
+
+**2.0 Behavior:**
+```bash
+# Models with active locks require --force
+$ mlxk rm Phi-3-mini
+Error: Model has active locks. Use --force to override.
+
+# With --force, deletion + lock cleanup happen automatically
+$ mlxk rm Phi-3-mini --force
+rm: mlx-community/Phi-3-mini-4k-instruct-4bit — deleted: Deleted entire model mlx-community/Phi-3-mini-4k-instruct-4bit
+```
+
+**Why the change?**
+- **Safety:** Prevents accidental deletion of models that may be in use
+- **Simplicity:** One confirmation instead of two separate prompts
+- **Clarity:** Explicit `--force` makes automation intent clear
+- **Automatic cleanup:** Lock files are cleaned silently (no separate prompt)
+
+**Note:** Lock file cleanup count is available in `--json` output (`lock_files_cleaned` field).
+
+**Migration for scripts:**
+- Interactive usage: No change needed (you'll get a clear error message if locks exist)
+- Automation: Add `--force` if your scripts delete models programmatically
+
+## New Features in 2.0
+
+### 1. JSON API for Automation
+
+All commands now support `--json` for machine-readable output:
+
+```bash
+mlxk list --json
+mlxk show Phi-3-mini --json
+mlxk rm Phi-3-mini --force --json
+```
+
+**Example:**
+```bash
+# Extract model names for scripting
+mlxk list --json | jq -r '.data.models[] | .name'
+
+# Check lock cleanup count
+mlxk rm test-model --force --json | jq '.data.lock_files_cleaned'
+```
+
+### 2. Enhanced Error Handling & Logging
+
+- **Structured errors** with request IDs for debugging
+- **Log levels:** `--log-level debug|info|warning|error`
+- **JSON logs:** `--log-json` or `MLXK2_LOG_JSON=1`
+- **Auto-redaction:** HF tokens and user paths automatically hidden
+
+```bash
+# Debug mode
+mlxk run Phi-3-mini "test" --log-level debug
+
+# JSON logs for production
+mlxk server --log-json --log-level info
+```
+
+### 3. Runtime Compatibility Checks
+
+Pre-flight validation catches issues before model loading:
+
+```bash
+$ mlxk show Phi-3-mini
+Model: Phi-3-mini-4k-instruct-4bit
+Health: healthy
+Runtime: compatible
+
+$ mlxk show legacy-model
+Model: legacy-model
+Health: healthy (files OK)
+Runtime: incompatible
+Reason: Legacy format not supported by mlx-lm
+```
+
+### 4. Better Stop Token Detection
+
+Fixed issues with multi-EOS models:
+- No more visible stop tokens (`<|end|>`)
+- No more "self-conversation" (model continuing after response)
+- Works with MXFP4, Qwen, Llama models
+
+### 5. Improved Human Output Formatting
+
+- Shorter model names (strip `mlx-community/` prefix by default)
+- Relative timestamps ("2 days ago")
+- Better alignment and readability
+- Use `--verbose` for full names and details
+
+## Command Compatibility
+
+| Command | 1.x | 2.0 | Notes |
+|---------|-----|-----|-------|
+| `mlxk list` | ✅ | ✅ | Improved formatting, add `--verbose` for full names |
+| `mlxk show <model>` | ✅ | ✅ | Added `runtime_compatible` field |
+| `mlxk pull <model>` | ✅ | ✅ | Better error messages |
+| `mlxk rm <model>` | ✅ | ⚠️ | Lock files require `--force` (safer) |
+| `mlxk run <model>` | ✅ | ✅ | Better stop token handling |
+| `mlxk server` | ✅ | ✅ | Added `--log-level`, `--log-json` |
+| `mlxk health` | ✅ | ✅ | Added runtime compatibility checks |
+
+## Package & Command Names
+
+- **PyPI Package:** `mlx-knife` (unchanged)
+- **Primary Command:** `mlxk` (unchanged)
+- **Aliases:** `mlxk-json`, `mlxk2` (for backwards compatibility)
+
+```bash
+# All three commands are identical
+mlxk --version       # → 2.0.0
+mlxk-json --version  # → 2.0.0
+mlxk2 --version      # → 2.0.0
+```
+
+## Installation & Upgrade
+
+### Upgrade from 1.x
+
+```bash
+# Simple upgrade
+pip install --upgrade mlx-knife
+
+# Verify version
+mlxk --version  # Should show: mlxk 2.0.0
+```
+
+### Fresh Installation
+
+```bash
+# Install from PyPI
+pip install mlx-knife
+
+# Or from GitHub release
+pip install https://github.com/mzau/mlx-knife/releases/download/v2.0.0/mlx_knife-2.0.0-py3-none-any.whl
+```
+
+### Staying on 1.x (MIT License)
+
+```bash
+# Pin to 1.x version
+pip install mlx-knife==1.1.1
+
+# Or in requirements.txt
+mlx-knife==1.1.1
+```
+
+## Data & Cache Compatibility
+
+✅ **Your model cache is 100% compatible**
+
+- Same HuggingFace cache: `~/.cache/huggingface/hub`
+- All 1.x models work in 2.0 immediately
+- No re-download required
+- No migration needed
+
+## Testing Before Upgrade
+
+```bash
+# Test 2.0 in a virtual environment
+python3 -m venv test-mlxk2
+source test-mlxk2/bin/activate
+pip install mlx-knife
+
+# Verify your workflow
+mlxk list
+mlxk run YourFavoriteModel "test prompt"
+mlxk rm test-model --force  # Note: --force for locks
+
+# If satisfied, upgrade
+deactivate
+pip install --upgrade mlx-knife
+```
+
+## Upgrade Checklist for Automation Scripts
+
+If you have scripts using mlxk:
+
+- [ ] **Add `--force` to `mlxk rm` commands** (if deleting programmatically)
+- [ ] **Replace output parsing with `--json`** (don't parse human output)
+- [ ] **Test in virtual environment first**
+- [ ] **Update error handling** for structured error responses
+- [ ] **Consider `--log-json`** for production logging
+
+## FAQ
+
+**Q: Will my 1.x scripts break?**
+A: Only `mlxk rm` scripts that delete models with active locks without `--force`. This is a safety improvement.
+
+**Q: Why does `rm` need `--force` for locks?**
+A: To prevent accidental deletion of models that may be in use. Locks indicate active downloads or usage.
+
+**Q: Can I still use interactive deletion?**
+A: Yes! Without `--force`, you'll get clear error messages about locks and can decide whether to use `--force`.
+
+**Q: Do I need to re-download models?**
+A: No! All cached models from 1.x work immediately.
+
+**Q: What about the web chat interface?**
+A: `simple_chat.html` works with both 1.x and 2.0 (OpenAI API unchanged).
+
+**Q: Can I contribute to 1.x?**
+A: 1.x is in maintenance mode. New contributions go to 2.x (Apache 2.0).
+
+**Q: Where's the 1.x source code?**
+A: The `1.x-legacy` branch contains the final MIT version (1.1.1).
+
+## Need Help?
+
+- **Issues:** https://github.com/mzau/mlx-knife/issues
+- **Discussions:** https://github.com/mzau/mlx-knife/discussions
+- **Documentation:** https://github.com/mzau/mlx-knife
+
+## Timeline
+
+- **1.1.1 (MIT):** Final 1.x release - September 2025
+- **2.0.0 (Apache 2.0):** Stable release - November 2025
+- **1.x Support:** Security fixes only
+
+---
+
+**The BROKE Team** 🦫
diff --git a/README.md b/README.md
index 095397d..c75b3c2 100644
--- a/README.md
+++ b/README.md
@@ -1,28 +1,17 @@
-# <img src="https://github.com/mzau/mlx-knife/raw/main/broke-logo.png" alt="BROKE Logo" width="60" style="vertical-align: middle;"> MLX Knife
+# <img src="https://github.com/mzau/mlx-knife/raw/main/broke-logo.png" alt="BROKE Logo" width="60" style="vertical-align: middle;"> MLX-Knife 2.0
 
 <p align="center">
-  <img src="https://github.com/mzau/mlx-knife/raw/main/mlxk-demo.gif" alt="MLX Knife Demo" width="1000">
+  <img src="https://github.com/mzau/mlx-knife/raw/main/mlxk-demo.gif" alt="MLX Knife Demo" width="900">
 </p>
 
-A lightweight, ollama-like CLI for managing and running MLX models on Apple Silicon. **CLI-only tool designed for personal, local use** - perfect for individual developers and researchers working with MLX models.
+**Current Stable Version: 2.0.0**
 
-> **Note**: MLX Knife is designed as a command-line interface tool only. While some internal functions are accessible via Python imports, only CLI usage is officially supported.
-
-**Current Version**: 1.1.1 (September 2025) - **STABLE RELEASE** 🚀
-- Features in 1.1.1 — MXFP4 support and GPT-OSS reasoning models:
-  - Full MXFP4 quantization support (MLX ≥0.29.0, MLX-LM ≥0.27.0),
-  - GPT-OSS reasoning model formatting with `--hide-reasoning` flag,
-  - Enhanced quantization display in `show` command,
-  - Tested with `gpt-oss-20b-MXFP4-Q8` from mlx-community.
-  - Details: see CHANGELOG.md. Install with `pip install mlx-knife`.
-- **Reliable Test System**: 166/166 tests passing across Python 3.9–3.13  
-- **Python 3.9-3.13**: Full compatibility verified across all Python versions
-- **Key Issues Resolved**: Issues #21, #22, #23 fixed and thoroughly tested
-
-[![GitHub Release](https://img.shields.io/github/v/release/mzau/mlx-knife)](https://github.com/mzau/mlx-knife/releases)
-[![MLX](https://img.shields.io/badge/MLX-macOS-green.svg)](https://github.com/ml-explore/mlx-lm)
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![GitHub Release](https://img.shields.io/badge/version-2.0.0-green.svg)](https://github.com/mzau/mlx-knife/releases)
+[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![Apple Silicon](https://img.shields.io/badge/Apple%20Silicon-green.svg)](https://support.apple.com/en-us/HT211814)
+[![MLX](https://img.shields.io/badge/MLX-Latest-orange.svg)](https://github.com/ml-explore/mlx)
+
 
 ## Features
 
@@ -31,309 +20,549 @@ A lightweight, ollama-like CLI for managing and running MLX models on Apple Sili
 - **Model Information**: Detailed model metadata including quantization info
 - **Download Models**: Pull models from HuggingFace with progress tracking
 - **Run Models**: Native MLX execution with streaming and chat modes
-- **Health Checks**: Verify model integrity and completeness
+- **Health Checks**: Verify model integrity and MLX runtime compatibility
 - **Cache Management**: Clean up and organize your model storage
-
-### Local Server & Web Interface
-- **OpenAI-Compatible API**: Local REST API with `/v1/chat/completions`, `/v1/completions`, `/v1/models`
-- **Web Chat Interface**: Built-in HTML chat interface with markdown rendering  
-- **Single-User Design**: Optimized for personal use, not multi-user production environments
-- **Conversation Context**: Full chat history maintained for follow-up questions
-- **Streaming Support**: Real-time token streaming via Server-Sent Events
-- **Configurable Limits**: Set default max tokens via `--max-tokens` parameter
-- **Model Hot-Swapping**: Switch between models per conversation
-- **Tool Integration**: Compatible with OpenAI-compatible clients (Cursor IDE, etc.)
-
-### Run Experience
-- **Direct MLX Integration**: Models load and run natively without subprocess overhead
-- **Real-time Streaming**: Watch tokens generate with proper spacing and formatting
-- **Interactive Chat**: Full conversational mode with history tracking
-- **Memory Insights**: See GPU memory usage after model loading and generation
-- **Dynamic Stop Tokens**: Automatic detection and filtering of model-specific stop tokens
-- **Customizable Generation**: Control temperature, max_tokens, top_p, and repetition penalty
-- **Context-Managed Memory**: Context manager pattern ensures automatic cleanup and prevents memory leaks
-- **Exception-Safe**: Robust error handling with guaranteed resource cleanup
-
-## Installation
-
-### Via PyPI (Recommended)
-```bash
-pip install mlx-knife
-```
+- **Privacy & Network**: No background network or telemetry; only explicit Hugging Face interactions when you run pull or the experimental push.
 
 ### Requirements
-- macOS with Apple Silicon (M1/M2/M3)
+- macOS with Apple Silicon
 - Python 3.9+ (native macOS version or newer)
 - 8GB+ RAM recommended + RAM to run LLM
 
 ### Python Compatibility
 MLX Knife has been comprehensively tested and verified on:
 
-✅ **Python 3.9.6** (native macOS) - Primary target  
-✅ **Python 3.10-3.13** - Fully compatible  
+✅ **Python 3.9.6** (native macOS) - Primary target
+✅ **Python 3.10-3.13** - Fully compatible
 
-All versions include full MLX model execution testing with real models.
 
-### Install from Source
+
+## Installation
+
+### Via PyPI (Recommended)
 
 ```bash
-# Clone the repository
+# Install stable release from PyPI
+pip install mlx-knife
+
+# Verify installation
+mlxk --version  # → mlxk 2.0.0
+```
+
+### Development Installation
+
+```bash
+# Clone and install from source
 git clone https://github.com/mzau/mlx-knife.git
 cd mlx-knife
 
-# Install in development mode
-pip install -e .
-
-# Or install normally
-pip install .
-
-# Install with development tools (ruff, mypy, tests)
+# Install with all development dependencies (required for testing and code quality)
 pip install -e ".[dev,test]"
+
+# Verify installation
+mlxk --version  # → mlxk 2.0.0
+
+# Run tests and quality checks (before committing)
+pytest -v
+ruff check mlxk2/ --fix
+mypy mlxk2/
 ```
 
-### Install Dependencies Only
+**Note:** For minimal user installation without dev tools: `pip install -e .`
+
+### Migrating from 1.x
+
+If you're upgrading from MLX Knife 1.x, see [MIGRATION.md](MIGRATION.md) for important information about the license change (MIT → Apache 2.0) and behavior changes.
 
-```bash
-pip install -r requirements.txt
-```
 
 ## Quick Start
 
-### CLI Usage
 ```bash
-# List all MLX models in your cache
+# List models (human-readable)
 mlxk list
+mlxk list --health
+mlxk list --verbose --health
 
-# Show detailed info about a model
-mlxk show Phi-3-mini-4k-instruct-4bit
-
-# Download a new model
-mlxk pull mlx-community/Mistral-7B-Instruct-v0.3-4bit
-
-# Run a model with a prompt
-mlxk run Phi-3-mini "What is the capital of France?"
-
-# GPT-OSS reasoning model with formatted output
-mlxk run gpt-oss-20b-MXFP4-Q8 "Explain quantum computing"
-
-# Hide reasoning steps, show only final answer (GPT-OSS models)
-mlxk run gpt-oss-20b-MXFP4-Q8 "What is 2+2?" --hide-reasoning
-
-# Start interactive chat
-mlxk run Phi-3-mini
-
-# Check model health
+# Check cache health
 mlxk health
+
+# Show model details
+mlxk show "mlx-community/Phi-3-mini-4k-instruct-4bit"
+
+# Pull a model
+mlxk pull "mlx-community/Llama-3.2-3B-Instruct-4bit"
+
+# Run interactive chat
+mlxk run "Phi-3-mini" -c
+
+# Start OpenAI-compatible server
+mlxk serve --port 8080
 ```
 
-### Web Chat Interface
 
-MLX Knife includes a built-in web interface for easy model interaction:
+## Commands
+
+| Command | Description |
+|---------|-------------|
+| `server`/`serve` | OpenAI-compatible API server; SIGINT-robust (Supervisor); SSE streaming |
+| `run` | Interactive and single-shot model execution with streaming/batch modes |
+| `list` | Model discovery with JSON output |
+| `health` | Corruption detection and cache analysis |
+| `show` | Detailed model information with --files, --config |
+| `pull` | HuggingFace model downloads with corruption detection |
+| `rm` | Model deletion with lock cleanup and fuzzy matching |
+| 🔒 `push` | **Alpha feature** - Upload to HuggingFace Hub; requires `MLXK2_ENABLE_ALPHA_FEATURES=1` |
+| 🔒 `clone` | **Alpha feature** - Model workspace cloning; requires `MLXK2_ENABLE_ALPHA_FEATURES=1` |
+
+
+
+## JSON API
+
+> **📋 Complete API Specification**: See [JSON API Specification](docs/json-api-specification.md) for comprehensive schema, error codes, and examples.
+
+All commands support both human-readable and JSON output (`--json` flag) for automation and scripting, enabling seamless integration with CI/CD pipelines and cluster management systems.
+
+### Command Structure
+
+All commands support JSON output via `--json` flag:
 
 ```bash
-# Start the OpenAI-compatible API server
-mlxk server --port 8000 --max-tokens 4000
-
-# Get web chat interface from GitHub
-curl -O https://raw.githubusercontent.com/mzau/mlx-knife/main/simple_chat.html
-
-# Open web chat interface in your browser
-open simple_chat.html
+mlxk list --json | jq '.data.models[].name'
+mlxk health --json | jq '.data.summary'
+mlxk show "Phi-3-mini" --json | jq '.data.model'
 ```
 
-**Features:**
-- **No installation required** - Pure HTML/CSS/JS
-- **Real-time streaming** - Watch tokens appear as they're generated
-- **Model selection** - Choose any MLX model from your cache
-- **Conversation history** - Full context for follow-up questions
-- **Markdown rendering** - Proper formatting for code, lists, tables
-- **Mobile-friendly** - Responsive design works on all devices
+**Response Format:**
+```json
+{
+    "status": "success|error",
+    "command": "list|health|show|pull|rm|clone|version|push|run|server",
+    "data": { /* command-specific data */ },
+    "error": null | { "type": "...", "message": "..." }
+}
+```
 
-### Local API Server Integration
+### Examples
 
-The MLX Knife server provides OpenAI-compatible endpoints for **local development and personal use**:
+#### List Models
+```bash
+mlxk list --json
+# Output:
+{
+  "status": "success",
+  "command": "list",
+  "data": {
+    "models": [
+      {
+        "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+        "hash": "a5339a41b2e3abcdef1234567890ab12345678ef",
+        "size_bytes": 4613734656,
+        "last_modified": "2024-10-15T08:23:41Z",
+        "framework": "MLX",
+        "model_type": "chat",
+        "capabilities": ["text-generation", "chat"],
+        "health": "healthy",
+        "runtime_compatible": true,
+        "reason": null,
+        "cached": true
+      }
+    ],
+    "count": 1
+  },
+  "error": null
+}
+```
+
+#### Health Check
+```bash
+mlxk health --json
+# Output:
+{
+  "status": "success",
+  "command": "health",
+  "data": {
+    "healthy": [
+      {
+        "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+        "status": "healthy",
+        "reason": "Model is healthy"
+      }
+    ],
+    "unhealthy": [],
+    "summary": { "total": 1, "healthy_count": 1, "unhealthy_count": 0 }
+  },
+  "error": null
+}
+```
+
+#### Show Model Details
+```bash
+mlxk show "Phi-3-mini" --json --files
+# Output (simplified):
+{
+  "status": "success",
+  "command": "show",
+  "data": {
+    "model": {
+      "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+      "hash": "a5339a41b2e3abcdefgh1234567890ab12345678",
+      "size_bytes": 4613734656,
+      "framework": "MLX",
+      "model_type": "chat",
+      "capabilities": ["text-generation", "chat"],
+      "last_modified": "2024-10-15T08:23:41Z",
+      "health": "healthy",
+      "runtime_compatible": true,
+      "reason": null,
+      "cached": true
+    },
+    "files": [
+      {"name": "config.json", "size": "1.2KB", "type": "config"},
+      {"name": "model.safetensors", "size": "2.3GB", "type": "weights"}
+    ],
+    "metadata": null
+  },
+  "error": null
+}
+```
+
+### Hash Syntax Support
+
+All commands support `@hash` syntax for specific model versions:
 
 ```bash
-# Start local server (single-user, no authentication)
-mlxk server --host 127.0.0.1 --port 8000
-
-# Test with curl
-curl -X POST "http://localhost:8000/v1/chat/completions" \
-  -H "Content-Type: application/json" \
-  -d '{"model": "Phi-3-mini-4k-instruct-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
-
-# Integration with development tools (community-tested):
-# - Cursor IDE: Set API URL to http://localhost:8000/v1
-# - LibreChat: Configure as custom OpenAI endpoint  
-# - Open WebUI: Add as local OpenAI-compatible API
-# - SillyTavern: Add as OpenAI API with custom URL
+mlxk health "Qwen3@e96" --json     # Check specific hash
+mlxk show "model@3df9bfd" --json   # Short hash matching
+mlxk rm "Phi-3@e967" --json --force  # Delete specific version
 ```
 
-**Note**: Tool integrations are community-tested. Some tools may require specific configuration or have compatibility limitations. Please report issues via GitHub.
+### Integration Examples
 
-## Command Reference
-
-### Available Commands
-
-#### `list` - Browse Models
+#### Broke-Cluster Integration
 ```bash
-mlxk list                    # Show chat-capable MLX models (strict view)
-mlxk list --verbose          # Show MLX models with full paths
-mlxk list --all              # Show all models with framework and TYPE
-mlxk list --all --verbose    # All models with full paths
-mlxk list --health           # Include health status
-mlxk list Phi-3              # Filter by model name
-mlxk list --verbose Phi-3    # Show detailed info (same as show)
+# Get available model names for scheduling
+MODELS=$(mlxk list --json | jq -r '.data.models[].name')
+
+# Check cache health before deployment
+HEALTH=$(mlxk health --json | jq '.data.summary.healthy_count')
+if [ "$HEALTH" -eq 0 ]; then
+    echo "No healthy models available"
+    exit 1
+fi
+
+# Download required models
+mlxk pull "mlx-community/Phi-3-mini-4k-instruct-4bit" --json
 ```
 
-#### `show` - Model Details
+#### CI/CD Pipeline Usage
 ```bash
-mlxk show <model>            # Display model information
-mlxk show <model> --files    # Include file listing
-mlxk show <model> --config   # Show config.json content
+# Verify model integrity in CI
+mlxk health --json | jq -e '.data.summary.unhealthy_count == 0'
+
+# Clean up CI artifacts
+mlxk rm "test-model-*" --json --force
+
+# Pre-warm cache for deployment
+mlxk pull "production-model" --json
 ```
 
-#### `pull` - Download Models
+#### Model Management Automation
 ```bash
-mlxk pull <model>            # Download from HuggingFace
-mlxk pull <org>/<model>      # Full model path
+# Find models by pattern
+LARGE_MODELS=$(mlxk list --json | jq -r '.data.models[] | select(.name | contains("30B")) | .name')
+
+# Show detailed info for analysis
+for model in $LARGE_MODELS; do
+    mlxk show "$model" --json --config | jq '.data.model_config'
+done
 ```
 
-#### `run` - Execute Models
-```bash
-mlxk run <model> "prompt"              # Single prompt (minimal output)
-mlxk run <model> "prompt" --verbose    # Show loading, memory, and stats
-mlxk run <model>                       # Interactive chat
-mlxk run <model> "prompt" --no-stream  # Batch output
-mlxk run <model> --max-tokens 1000     # Custom length
-mlxk run <model> --temperature 0.9     # Higher creativity
-mlxk run <model> --no-chat-template    # Raw completion mode
-mlxk run <model> --hide-reasoning      # Hide reasoning (GPT-OSS models only)
-```
 
-#### `rm` - Remove Models
-```bash
-mlxk rm <model>              # Delete model with cache cleanup confirmation  
-mlxk rm <model>@<hash>       # Delete specific version (removes entire model)
-mlxk rm <model> --force      # Skip confirmations, auto-cleanup cache files
-```
+## Human Output
 
-**Features:**
-- Removes entire model directory (not just snapshots)
-- Cleans up orphaned HuggingFace lock files  
-- Handles corrupted models gracefully
-- Smart prompting (only asks about cache cleanup if needed)
+MLX Knife provides rich human-readable output by default (without `--json` flag).
 
-#### `health` - Check Integrity
-```bash
-mlxk health                  # Check all models
-mlxk health <model>          # Check specific model
-```
-
-#### `server` - Start API Server
-```bash
-mlxk server                           # Start on localhost:8000
-mlxk server --port 8001               # Custom port
-mlxk server --host 0.0.0.0 --port 8000  # Allow external access
-mlxk server --max-tokens 4000         # Set default max tokens (default: 2000)
-mlxk server --reload                  # Development mode with auto-reload
-```
-
-### Command Aliases
-After installation, these commands are equivalent:
-- `mlxk` (recommended)
-- `mlx-knife`
-- `mlx_knife`
-
-## Configuration
-
-### Cache Location
-By default, models are stored in `~/.cache/huggingface/hub`. Configure with:
+### Basic Usage
 
 ```bash
-# Set custom cache location
-export HF_HOME="/path/to/your/cache"
-
-# Example: External SSD
-export HF_HOME="/Volumes/ExternalSSD/models"
+mlxk list
+mlxk list --health
+mlxk health
+mlxk show "mlx-community/Phi-3-mini-4k-instruct-4bit"
 ```
 
-### Model Name Expansion
-Short names are automatically expanded for MLX models:
-- `Phi-3-mini-4k-instruct-4bit` → `mlx-community/Phi-3-mini-4k-instruct-4bit`
-- Models already containing `/` are used as-is
+### List Filters
 
-## Advanced Usage
+- `list`: Shows MLX chat models only (compact names, safe default)
+- `list --verbose`: Shows all MLX models (chat + base) with full org/names and Framework column
+- `list --all`: Shows all frameworks (MLX, GGUF, PyTorch)
+- Flags are combinable: `--all --verbose`, `--all --health`, `--verbose --health`
 
-### Generation Parameters
+### Health Status Display (--health flag)
+
+The `--health` flag adds health status information to the output:
+
+**Compact mode** (default, `--all`):
+- Shows single "Health" column with values:
+  - `healthy` - File integrity OK and MLX runtime compatible
+  - `healthy*` - File integrity OK but not MLX runtime compatible (use `--verbose` for details)
+  - `unhealthy` - File integrity failed or unknown format
+
+**Verbose mode** (`--verbose --health`):
+- Splits into "Integrity" and "Runtime" columns:
+  - **Integrity:** `healthy` / `unhealthy`
+  - **Runtime:** `yes` / `no` / `-` (dash = gate blocked by failed integrity)
+  - **Reason:** Explanation when problems detected (wrapped at 26 chars for readability)
+
+**Examples:**
 
 ```bash
-# Creative writing (high temperature, diverse output)
-mlxk run Mistral-7B "Write a story" --temperature 0.9 --top-p 0.95
+# Compact health view
+mlxk list --health
+# Output:
+# Name                    | Hash    | Size   | Modified | Type | Health
+# Llama-3.2-3B-Instruct   | a1b2c3d | 2.1GB  | 2d ago   | chat | healthy
+# Qwen2-7B-Instruct       | 1a2b3c4 | 4.8GB  | 3d ago   | chat | healthy*
 
-# Precise tasks (low temperature, focused output)
-mlxk run Phi-3-mini "Extract key points" --temperature 0.3 --top-p 0.9
+# Verbose health view with details
+mlxk list --verbose --health
+# Output:
+# Name                    | Hash    | Size   | Modified | Framework | Type | Integrity | Runtime | Reason
+# Llama-3.2-3B-Instruct   | a1b2c3d | 2.1GB  | 2d ago   | MLX       | chat | healthy   | yes     | -
+# Qwen2-7B-Instruct       | 1a2b3c4 | 4.8GB  | 3d ago   | PyTorch   | chat | healthy   | no      | Incompatible: PyTorch
 
-# Long-form generation
-mlxk run Mixtral-8x7B "Explain quantum computing" --max-tokens 2000
-
-# Reduce repetition
-mlxk run model "prompt" --repetition-penalty 1.2
+# All frameworks with health status
+mlxk list --all --health
+# Output:
+# Name                    | Hash    | Size   | Modified | Framework | Type    | Health
+# Llama-3.2-3B-Instruct   | a1b2c3d | 2.1GB  | 2d ago   | MLX       | chat    | healthy
+# llama-3.2-gguf-q4       | b2c3d4e | 1.8GB  | 3d ago   | GGUF      | unknown | healthy*
+# broken-download         | -       | 500MB  | 1h ago   | Unknown   | unknown | unhealthy
 ```
 
-### Working with Specific Commits
+**Design Philosophy:**
+- `unhealthy` is a catch-all for anything not understood/supported (broken downloads, unknown formats, creative HuggingFace structures)
+- `healthy` guarantees the model will work with `mlxk2 run`
+- `healthy*` means files are intact but MLX runtime can't execute them (e.g., GGUF/PyTorch models, incompatible model_type, or mlx-lm version too old)
+
+Note: JSON output is unaffected by these human-only filters and always includes full health/runtime data.
+
+
+## Logging & Debugging
+
+MLX Knife 2.0 provides structured logging with configurable output formats and levels.
+
+### Log Levels
+
+Control verbosity with `--log-level` (server mode):
 
 ```bash
-# Use specific model version
-mlxk show model@commit_hash
-mlxk run model@commit_hash "prompt"
+# Default: Show startup, model loading, and errors
+mlxk serve --log-level info
+
+# Quiet: Only warnings and errors
+mlxk serve --log-level warning
+
+# Silent: Only errors
+mlxk serve --log-level error
+
+# Verbose: All logs including HTTP requests
+mlxk serve --log-level debug
 ```
 
-### Non-MLX Model Handling
+**Log Level Behavior:**
+- `debug`: All logs + Uvicorn HTTP access logs (`GET /v1/models`, etc.)
+- `info`: Application logs (startup, model switching, errors) + HTTP access logs
+- `warning`: Only warnings and errors (no startup messages, no HTTP access logs)
+- `error`: Only error messages
+
+### JSON Logs (Machine-Readable)
+
+Enable structured JSON output for log aggregation tools:
 
-The tool automatically detects framework compatibility:
 ```bash
-# Attempting to run PyTorch model
-mlxk run bert-base-uncased
-# Error: Model bert-base-uncased is not MLX-compatible (Framework: PyTorch)!
-# Use MLX-Community models: https://huggingface.co/mlx-community
+# JSON logs (recommended - CLI flag)
+mlxk serve --log-json
+
+# JSON logs (alternative - environment variable)
+MLXK2_LOG_JSON=1 mlxk serve
 ```
 
-## Troubleshooting
+**Note:** `--log-json` also formats Uvicorn access logs as JSON for consistent output.
 
-### Model Not Found
+**JSON Format:**
+```json
+{"ts": 1760830072.96, "level": "INFO", "msg": "MLX Knife Server 2.0 starting up..."}
+{"ts": 1760830073.14, "level": "INFO", "msg": "Switching to model: mlx-community/...", "model": "..."}
+{"ts": 1760830074.52, "level": "ERROR", "msg": "Model type bert not supported.", "logger": "root"}
+```
+
+**Fields:**
+- `ts`: Unix timestamp
+- `level`: Log level (INFO, WARN, ERROR, DEBUG)
+- `msg`: Log message (HF tokens and user paths automatically redacted)
+- `logger`: Source logger (`mlxk2` = application, `root` = external libraries like mlx-lm)
+- Additional fields: `model`, `request_id`, `detail`, `duration_ms` (context-dependent)
+
+### Security: Automatic Redaction
+
+**Sensitive data is automatically removed from logs:**
+- HuggingFace tokens (`hf_...`) → `[REDACTED_TOKEN]`
+- User home paths (`/Users/john/...`) → `~/...`
+
+**Example:**
 ```bash
-# If model isn't found, try full path
-mlxk pull mlx-community/Model-Name-4bit
+# Original (unsafe):
+Using token hf_AbCdEfGhIjKlMnOpQrStUvWxYz123456 from /Users/john/models
 
-# List available models
-mlxk list --all
+# Logged (safe):
+Using token [REDACTED_TOKEN] from ~/models
 ```
 
-### Performance Issues
-- Ensure sufficient RAM for model size
-- Close other applications to free memory
-- Use smaller quantized models (4-bit recommended)
+### Environment Variables
+
+- `MLXK2_LOG_JSON=1`: Enable JSON log format (alternative to `--log-json` flag)
+- `MLXK2_LOG_LEVEL`: Override log level (used internally for subprocess mode)
+
+
+## HuggingFace Cache Safety
+
+MLX-Knife 2.0 respects standard HuggingFace cache structure and practices:
+
+### Best Practices for Shared Environments
+- **Read operations** (`list`, `health`, `show`) always safe with concurrent processes
+- **Write operations** (`pull`, `rm`) coordinate during maintenance windows
+- **Lock cleanup** automatic but avoid during active downloads
+- **Your responsibility:** Coordinate with team, use good timing
+
+### Example Safe Workflow
+```bash
+# Check what's in cache (always safe)
+mlxk list --json | jq '.data.count'
+
+# Maintenance window - coordinate with team
+mlxk rm "corrupted-model" --json --force
+mlxk pull "replacement-model" --json
+
+# Back to normal operations
+mlxk health --json | jq '.data.summary'
+```
+
+
+## Hidden Alpha Features: `clone` and `push`
+
+### `clone` - Model Workspace Creation
+
+`mlxk clone` is a hidden alpha feature. Enable with `MLXK2_ENABLE_ALPHA_FEATURES=1`. It creates a local workspace from a cached model for modification and development.
+
+- Creates isolated workspace from cached models
+- Supports APFS copy-on-write optimization on same-volume scenarios
+- Includes health check integration for workspace validation
+- Use case: Fork-modify-push workflows
+
+Example:
+```bash
+# Enable alpha features
+export MLXK2_ENABLE_ALPHA_FEATURES=1
+
+# Clone model to workspace
+mlxk clone org/model ./workspace
+```
+
+### `push` - Upload to Hub
+
+`mlxk push` is a hidden alpha feature. Enable with `MLXK2_ENABLE_ALPHA_FEATURES=1`. It uploads a local folder to a Hugging Face model repository using `huggingface_hub/upload_folder`.
+
+- Requires `HF_TOKEN` (write-enabled).
+- Default branch: `main` (explicitly override with `--branch`).
+- Safety: `--private` is required to avoid accidental public uploads.
+- No validation or manifests. Basic hard excludes are applied by default: `.git/**`, `.DS_Store`, `__pycache__/`, common virtualenv folders (`.venv/`, `venv/`), and `*.pyc`.
+- `.hfignore` (gitignore-like) in the workspace is supported and merged with the defaults.
+- Repo creation: use `--create` if the target repo does not exist; harmless on existing repos. Missing branches are created during upload.
+- JSON output: includes `commit_sha`, `commit_url`, `no_changes`, `uploaded_files_count` (when available), `local_files_count` (approx), `change_summary` and a short `message`.
+- Quiet JSON by default: with `--json` (without `--verbose`) progress bars/console logs are suppressed; hub logs are still captured in `data.hf_logs`.
+- Human output: derived from JSON; add `--verbose` to include extras such as the commit URL or a short message variant. JSON schema is unchanged.
+- Local workspace check: use `--check-only` to validate a workspace without uploading. Produces `workspace_health` in JSON (no token/network required).
+- Dry-run planning: use `--dry-run` to compute a plan vs remote without uploading. Returns `dry_run: true`, `dry_run_summary {added, modified:null, deleted}`, and sample `added_files`/`deleted_files`.
+- Testing: see TESTING.md ("Push Testing (2.0)") for offline tests and opt-in live checks with markers/env.
+- Intended for early testers only. Carefully review the result on the Hub after pushing.
+- Responsibility: **You are responsible for complying with Hugging Face Hub policies and applicable laws (e.g., copyright/licensing) for any uploaded content.**
+
+Example:
+```bash
+# Enable alpha features
+export MLXK2_ENABLE_ALPHA_FEATURES=1
+
+# Use push command
+mlxk push --private ./workspace org/model --create --commit "init"
+```
+
+These features are not final and may change or be removed in future releases.
+
+
+## Testing
+
+The 2.0 test suite runs by default (pytest discovery points to `tests_2.0/`):
+
+```bash
+# Run 2.0 tests (default)
+pytest -v
+
+# Explicitly run legacy 1.x tests (not maintained on this branch)
+pytest tests/ -v
+
+# Test categories (2.0 example):
+# - ADR-002 edge cases
+# - Integration scenarios
+# - Model naming logic
+# - Robustness testing
+
+# Current status: all current 2.0 tests pass (some optional schema tests may be skipped without extras)
+```
+
+**Test Architecture:**
+- **Isolated Cache System** - Zero risk to user data
+- **Atomic Context Switching** - Production/test cache separation
+- **Mock Models** - Realistic test scenarios
+- **Edge Case Coverage** - All documented failure modes tested
+
+
+## Compatibility Notes
+
+- Streaming note: Some UIs buffer SSE; verify real-time with `curl -N`. Server sends clear interrupt markers on abort.
 
-### Streaming Issues
-- Some models may have spacing issues - this is handled automatically
-- Use `--no-stream` for batch output if needed
 
 ## Contributing
 
-Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.
+This branch follows the established MLX-Knife development patterns:
 
-## Security
+```bash
+# Run quality checks
+python test-multi-python.sh  # Tests across Python 3.9-3.13
+./run_linting.sh             # Code quality validation
 
-For security concerns, please see [SECURITY.md](SECURITY.md) or contact us at broke@gmx.eu.
+# Key files:
+mlxk2/                       # 2.0.0 implementation
+tests_2.0/                   # 2.0 test suite
+docs/ADR/                    # Architecture decision records
+```
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines.
+
+
+## Support & Feedback
+
+- **Issues**: [GitHub Issues](https://github.com/mzau/mlx-knife/issues)
+- **Discussions**: [GitHub Discussions](https://github.com/mzau/mlx-knife/discussions)
+- **API Specification**: [JSON API Specification](docs/json-api-specification.md)
+- **Documentation**: See `docs/` directory for technical details
+- **Security Policy**: See [SECURITY.md](SECURITY.md)
 
-MLX Knife runs entirely locally - no data is sent to external servers except when downloading models from HuggingFace.
 
 ## License
 
-MIT License - see [LICENSE](LICENSE) file for details
+Apache License 2.0 — see `LICENSE` (root) and `mlxk2/NOTICE`.
 
-Copyright (c) 2025 The BROKE team 🦫
 
 ## Acknowledgments
 
@@ -345,6 +574,6 @@ Copyright (c) 2025 The BROKE team 🦫
 
 <p align="center">
   <b>Made with ❤️ by The BROKE team <img src="broke-logo.png" alt="BROKE Logo" width="30" style="vertical-align: middle;"></b><br>
-  <i>Version 1.1.1 | September 2025</i><br>
+  <i>Version 2.0.0 | November 2025</i><br>
   <a href="https://github.com/mzau/broke-cluster">🔮 Next: BROKE Cluster for multi-node deployments</a>
 </p>
diff --git a/SECURITY.md b/SECURITY.md
index 33f18cb..9fce10e 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -2,7 +2,7 @@
 
 ## Overview
 
-MLX Knife is designed to run locally on your Apple Silicon Mac. It prioritizes user privacy and security by keeping all model execution local. The only network activity is downloading models from HuggingFace (a trusted source).
+MLX Knife is designed to run locally on your Apple Silicon Mac. It prioritizes user privacy and security by keeping all model execution local. Network activity is limited to explicit interactions with Hugging Face: downloading models (pull) and, in 2.0 alpha, an opt‑in alpha upload (push) when you run it explicitly. No background network traffic.
 
 ## Security Model
 
@@ -11,13 +11,16 @@ MLX Knife is designed to run locally on your Apple Silicon Mac. It prioritizes u
 - ✅ Downloads models only from HuggingFace (trusted repository)
 - ✅ API server binds to localhost by default
 - ✅ No telemetry or usage tracking
-- ✅ No external API calls (except HuggingFace for downloads)
+- ✅ No external API calls (except explicit Hugging Face interactions: downloads via pull; optional upload via experimental push)
+- ✅ Can upload a local workspace to Hugging Face only when you explicitly run `mlxk2 push` (alpha feature, opt‑in)
 
 ### What MLX Knife Doesn't Do
-- ❌ No data is sent to external servers
+- ❌ No data is sent to external servers automatically or in the background
 - ❌ No model outputs are logged or transmitted
 - ❌ No user tracking or analytics
 - ❌ No automatic updates or phone-home features
+  
+  Note: The alpha `push` command will upload files from a user‑selected local folder to Hugging Face only when you run it explicitly and provide credentials. It never runs implicitly.
 
 ## Reporting Security Vulnerabilities
 
@@ -76,7 +79,46 @@ mlxk server --host 0.0.0.0 --port 8000
 ### File System Access
 - **Cache Location**: `~/.cache/huggingface/hub` or `$HF_HOME`
 - **Permissions**: Standard user permissions apply
-- **Cleanup**: Use `mlxk rm <model>` to safely remove models
+- **Cleanup**: Use `mlxk rm <model>` to safely remove models; avoid manual deletion in the user cache
+
+### Hugging Face Cache Integrity
+- Separate contexts: use an isolated test cache for automated tests; keep the user cache for manual/production work
+- HF_HOME: set explicitly for user work if needed; tests should not override user HF_HOME by default
+- Safe operations: reads (`list`, `health`, `show`) are always safe; coordinate writes (`pull`, `rm`) in maintenance windows
+- Test safeguards: the test suite places a sentinel in the test cache and enforces deletion guards to prevent accidental user-cache modification
+
+### Alpha Push (`mlxk2 push`)
+
+The 2.0 alpha introduces an alpha upload capability. Treat it as opt‑in, with explicit user control.
+
+#### Scope and defaults
+- Upload‑only: pushes a specified local folder to a Hugging Face model repo via `huggingface_hub.upload_folder`.
+- Requires `HF_TOKEN`; in alpha, `--private` is required to reduce accidental exposure.
+- Default branch is `main` (overridable with `--branch`). No manifests or content validation yet.
+- Honors default ignore patterns and merges project `.hfignore` when present (e.g., excludes `.git/`, `.venv/`, `__pycache__/`, `.DS_Store`).
+
+#### Privacy and boundaries
+- Only files under the path you provide are considered; push does not scan your global caches or home directory.
+- No prompts, logs, or runtime telemetry are uploaded.
+- No background activity: nothing is sent unless you invoke `mlxk2 push`.
+
+#### Safety controls
+- Preflight without network: `--check-only` analyzes the local folder for obvious issues (e.g., missing shards, LFS pointers).
+- Plan without committing: `--dry-run` lists prospective adds/deletes vs remote (no upload performed).
+- Use restricted tokens and test repos when validating; prefer `--private` and organization/user repos you control.
+
+#### Risks and mitigations
+- Risk: Accidental upload of sensitive files included in the folder.
+  - Mitigate with a minimal, dedicated workspace, `.hfignore`, and `--check-only`/`--dry-run` before pushing.
+- Risk: Pushing incomplete or corrupted weights.
+  - Mitigate by reviewing `workspace_health` from `--check-only` and model card requirements before uploading.
+
+#### User responsibility
+**You are responsible for complying with Hugging Face Hub policies and applicable laws (e.g., copyright/licensing) for any uploaded content.** Review all content before uploading and ensure you have appropriate rights to distribute the models and associated files.
+
+#### Network and logging
+- Network egress targets only Hugging Face over HTTPS; no third‑party endpoints.
+- In `--json` mode, hub logs may be captured in output for diagnostics; they are not transmitted elsewhere by MLX Knife.
 
 ## Security Best Practices
 
@@ -96,11 +138,13 @@ mlxk server --host 0.0.0.0 --port 8000
 
 ## Supported Versions
 
-| Version | Supported          |
+We provide security updates for these versions:
+
+| Version | Security Support   |
 | ------- | ------------------ |
-| 1.1.1   | :white_check_mark: |
-| 2.0.0-beta.3 | :white_check_mark: (Development) |
-| < 1.1.1 | :x:                |
+| 2.0.0-beta.3 | :white_check_mark: Current development |
+| 1.1.1   | :white_check_mark: Current stable |
+| < 1.1.1 | :x: Upgrade recommended |
 
 ## Additional Resources
 
@@ -110,4 +154,4 @@ mlxk server --host 0.0.0.0 --port 8000
 
 ---
 
-**Remember**: Security is everyone's responsibility. If something doesn't feel right, please report it! 🦫
\ No newline at end of file
+**Remember**: Security is everyone's responsibility. If something doesn't feel right, please report it! 🦫
diff --git a/TESTING.md b/TESTING.md
index 8e33a4e..684e2ef 100644
--- a/TESTING.md
+++ b/TESTING.md
@@ -2,131 +2,486 @@
 
 ## Current Status
 
-✅ **166/166 tests passing** (September 2025) - **STABLE RELEASE 1.1.1** 🚀  
-✅ **Apple Silicon verified** (M1/M2/M3)  
-✅ **Python 3.9-3.13 compatible**  
-✅ **Stable for development use** - comprehensive testing with real model execution
+✅ **297/317 tests passing** (November 2025) — 2.0.0 Stable Release; 20 skipped (opt-in)
+✅ **Test environment:** macOS 14.x, M2 Max, Python 3.9-3.13
+✅ **Production verified & reported:** M1, M1 Max, M2 Max in real-world use
+✅ **License:** Apache 2.0 (was MIT in 1.x)
 ✅ **Isolated test system** - user cache stays pristine with temp cache isolation
 ✅ **3-category test strategy** - optimized for performance and safety
 
-## Quick Start
+### Skipped Tests Breakdown (20 total, standard run without HF_HOME)
+- **4 Live Stop Tokens tests** - Stop token validation with real models (requires `pytest -m live_stop_tokens`, ADR-009)
+- **1 Live Run test** - Private/org model detection (requires `pytest -m live_run`, Issue #37)
+- **3 Live Clone tests** - APFS same-volume clone workflow (requires `MLXK2_LIVE_CLONE=1`)
+- **1 Live List test** - Tests against user cache (requires HF_HOME with models)
+- **1 Live Push test** - Real HuggingFace push (requires `MLXK2_LIVE_PUSH=1`)
+- **7 Issue #27 tests** - Real-model health validation (requires HF_HOME or MLXK2_USER_HF_HOME setup)
+- **3 Additional opt-in tests** - Various live validation scenarios
+
+## Quick Start (2.0 Default)
 
 ```bash
-# Install with test dependencies
-pip install -e ".[test]"
+# Install package + development tools (required for ruff/mypy/pytest)
+pip install -e ".[dev,test]"
 
-# Download test model (optional - most tests use isolated cache)
-mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit
+# Download test model (optional; most 2.0 tests use isolated cache)
+# Only needed for opt-in live tests or local experiments
+# mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit
 
-# Run all tests
-pytest
+# Run 2.0 tests (default discovery: tests_2.0/)
+pytest -v  # Runs ~300 tests (isolated, no live downloads)
 
-# Fast unit tests only
-pytest tests/unit/
+# Optional: Enable alpha push and clone tests
+MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -v  # Activates alpha features (clone/push)
+
+# Live tests (opt-in; not part of default suite):
+# - Live stop tokens (ADR-009 - requires models in HF_HOME):
+#   pytest -m live_stop_tokens
+#   # Tests: MXFP4, Qwen 2.5, Llama 3.2 stop token behavior
+# - Live run (requires models in HF_HOME):
+#   pytest -m live_run
+#   # Tests: Issue #37 private/org model detection
+# - Live push (requires alpha features + env):
+#   export MLXK2_ENABLE_ALPHA_FEATURES=1
+#   export MLXK2_LIVE_PUSH=1
+#   export HF_TOKEN=...; export MLXK2_LIVE_REPO=org/model; export MLXK2_LIVE_WORKSPACE=/abs/path
+#   pytest -m live_push
+# - Live clone (ADR-007 Phase 1 - requires alpha features + env + same volume):
+#   export MLXK2_ENABLE_ALPHA_FEATURES=1
+#   export MLXK2_LIVE_CLONE=1
+#   export HF_TOKEN=...
+#   export MLXK2_LIVE_CLONE_MODEL="mlx-community/small-model"
+#   export MLXK2_LIVE_CLONE_WORKSPACE="/path/on/same/volume/as/HF_HOME"  # APFS + same volume required
+#   pytest -m live_clone
+# - Live list (uses your HF_HOME; requires at least one MLX chat + one MLX base in cache):
+#   export HF_HOME=/path/to/huggingface/cache
+#   pytest -m live_list
 
 # Before committing
-ruff check mlx_knife/ --fix && mypy mlx_knife/ && pytest
+ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v
 ```
 
+Notes
+- Reference environment: venv39 (Apple‑native Python 3.9) is the recommended dev base.
+- Extras `[dev,test]` install ruff/mypy (code quality) and pytest/jsonschema (testing).
+- For release smoke across multiple Python versions: `./test-multi-python.sh` (logs: `test_results_3_9.log`, `test_results_3_10.log`, ...).
+- The macOS Python 3.9 LibreSSL warning from urllib3 is suppressed in tests via `pytest.ini`, and at runtime via package init.
+
 ## Why Local Testing?
 
-MLX Knife requires **Apple Silicon hardware** and **real MLX models** for comprehensive testing:
+MLX Knife tests fall into three categories for 2.0:
 
-- **Hardware Requirement**: MLX framework only runs on Apple Silicon (M1/M2/M3)
-- **Model Requirement**: Tests use actual models (4GB+) for realistic validation
-- **Industry Standard**: Local testing is normal for MLX projects
-- **Quality Assurance**: Real hardware testing ensures actual functionality
+- **Stable CLI/JSON tests (default)**: Run on any supported Python on macOS; no model inference required; use an isolated HF cache (no network). **206 tests**
+- **Alpha features (opt-in)**: Hidden alpha features like `push` and `clone` require environment variables to enable. **+21 tests**
+- **Live/Inference tests (opt-in)**: Network-dependent or requiring real models/cache setup. **Various markers/env vars**
 
-This approach ensures our tests reflect real-world usage, not mocked behavior.
+**Default test run** covers all stable 2.0 features without experimental or live dependencies.
 
 ## Test Structure
 
+### 2.0 Test Structure
+
+Legend
+- spec/: JSON API spec/contract validation; stays in sync with docs/schema.
+- live/: Opt‑in tests requiring env/config; skipped by default.
+- stubs/: Lightweight MLX/MLX‑LM replacements used only in unit/spec tests.
+- conftest.py: Isolated HF cache (temp), safety sentinel, core fixtures/helpers.
+- conftest_runner.py: Runner‑focused fixtures/mocks for generation tests.
+- *.py.disabled: Intentionally disabled suites (WIP/expanded scenarios, not run).
+
 ```
-tests/
-├── conftest.py                         # Shared fixtures and utilities
-├── integration/                        # System-level integration tests (78 tests)
-│   ├── test_core_functionality.py          # Basic CLI operations (isolated cache)
-│   ├── test_health_checks.py               # Model corruption detection (isolated cache)
-│   ├── test_lock_cleanup_bug.py            # Issue #23: Lock cleanup (isolated cache)
-│   ├── test_process_lifecycle.py           # Process management (isolated cache)
-│   ├── test_real_model_lifecycle.py        # Full model lifecycle (isolated cache)
-│   ├── test_run_command_advanced.py        # Run command edge cases (isolated cache)
-│   ├── test_server_functionality.py        # Server lifecycle tests
-│   ├── test_end_token_issue.py             # Issue #20: End-token filtering (@server)
-│   ├── test_issue_14.py                    # Issue #14: Chat self-conversation (@server)
-│   └── test_issue_15_16.py                 # Issues #15/#16: Dynamic token limits (@server)
-└── unit/                              # Module-level unit tests (88 tests)
-    ├── test_cache_utils.py                 # Cache management & Issue #21/#23 tests
-    ├── test_cli.py                         # CLI argument parsing
-    ├── test_health_multishard.py           # Strict multi-shard/index health (Issue #27)
-    ├── test_mlx_runner_memory.py           # Memory management tests
-    └── test_model_card_detection.py        # Issue #31: README/tokenizer hints for framework/type
+tests_2.0/
+├── __init__.py
+├── conftest.py                        # Isolated test cache (HF_HOME override), safety sentinel, core fixtures
+├── conftest_runner.py                 # Runner-specific fixtures/mocks
+├── stubs/                             # Minimal mlx/mlx_lm stubs for unit/spec tests
+│   ├── mlx/
+│   │   └── core.py
+│   └── mlx_lm/
+│       ├── __init__.py
+│       ├── generate.py
+│       └── sample_utils.py
+├── spec/                              # JSON API spec/contract validation
+│   ├── test_cli_commands_json_flag.py         # CLI JSON flag behavior
+│   ├── test_cli_version_output.py             # Version command JSON shape
+│   ├── test_code_outputs_validate_against_schema.py  # Code outputs validate against schema
+│   ├── test_push_error_matches_schema.py      # Push error output matches schema
+│   ├── test_push_output_matches_schema.py     # Push success output matches schema
+│   ├── test_spec_doc_examples_validate.py     # Docs examples validate against JSON schema
+│   └── test_spec_version_sync.py              # Code/docs version consistency check
+├── live/                              # Opt-in live tests (markers)
+│   ├── test_clone_live.py                     # Live clone flow (requires MLXK2_LIVE_CLONE, HF_TOKEN)
+│   ├── test_list_human_live.py                # Live list/health against user cache (requires HF_HOME)
+│   └── test_push_live.py                      # Live push flow (requires MLXK2_LIVE_PUSH, HF_TOKEN)
+├── test_cli_push_args.py              # Push CLI args and JSON error/output handling (offline)
+├── test_clone_operation.py            # Clone operations with APFS optimization
+├── test_ctrl_c_handling.py            # SIGINT handling during run/interactive flows
+├── test_detection_readme_tokenizer.py # README/tokenizer-based framework detection
+├── test_edge_cases_adr002.py          # Naming/health edge cases (ADR-002)
+├── test_health_multifile.py           # Multi-file health completeness (index vs pattern)
+├── test_human_output.py               # Human rendering of list/health views
+├── test_integration.py                # Model resolution and health integration
+├── test_interactive_mode.py           # Interactive CLI mode prompts/history/streaming
+├── test_interruption_recovery.py      # Recovery semantics after interruption (flag reset)
+├── test_issue_27.py                   # Health policy exploration with real models (marker: issue27)
+├── test_issue_30_preflight.py         # Preflight for gated/private/not-found repos (Issue #30)
+├── test_json_api_list.py              # JSON API list contract (shape/fields)
+├── test_json_api_show.py              # JSON API show contract (base/files/config)
+├── test_legacy_formats.py             # Legacy model format detection (Issue #37)
+├── test_model_naming.py               # Conversion rules, bijection, parsing
+├── test_push_dry_run.py               # Push dry-run diff planning (added/modified/deleted)
+├── test_push_extended.py              # Extended push: no-op vs commit, branch/retry, .hfignore
+├── test_push_minimal.py               # Minimal push scenarios (offline)
+├── test_push_workspace_check.py       # Push check-only: workspace validation without network
+├── test_robustness.py                 # Robustness for rm/pull/disk/timeout/concurrency
+├── test_run_complete.py               # End-to-end run command (stream/batch/params)
+├── test_runner_core.py                # MLXRunner core generation/memory/stop tokens
+├── test_runtime_compatibility_reason_chain.py  # Runtime compatibility reason field decision chain (Issue #36)
+├── test_server_api_minimal.py         # Minimal OpenAI-compatible server endpoints (SSE, JSON)
+├── test_server_api.py.disabled        # Disabled server API tests (WIP/expanded scenarios)
+├── test_server_models_and_errors.py   # Server model loading and error handling
+├── test_server_streaming_minimal.py   # Server SSE streaming functionality
+├── test_server_token_limits_api.py    # Server token limit enforcement
+└── test_token_limits.py               # Dynamic token calculation; server vs run policies
 ```
 
+Note: Live tests are opt-in via markers (`-m live_push`, `-m live_list`) and environment. Default `pytest` discovery runs only the offline suite above.
 
-## 3-Category Test Strategy (MLX Knife 1.1.0+)
+### MLX/MLX‑LM Stubs (fast offline tests)
+- Purpose: Unit/spec tests run platform‑neutral and without real MLX/MLX‑LM runtime.
+- Mechanics: `tests_2.0/conftest.py` prepends `tests_2.0/stubs/` to `sys.path`, so `import mlx`/`mlx_lm` resolve to minimal stubs.
+- Effect: Fast, deterministic tests without GPU/large RAM footprint; live/heavy path remains opt‑in.
+- Production: CLI/server still use the real packages; stubs are not installed.
+- **Stub Limitations:** Tests requiring real mlx-lm integration (e.g., `_get_classes` API) use `@requires_mlx_lm` marker and skip on CI. Guideline: Avoid mocks when Mock:Production LOC ratio >30% (maintenance burden > value). [Note: Details will be documented in Issue #36 closing comment]
 
-MLX Knife uses a **3-category test strategy** to balance test isolation, performance, and user cache protection:
+## Push Testing (2.0)
 
-### 🏠 CATEGORY 1: ISOLATED CACHE (Most Tests)
-**✅ User cache stays pristine** - Tests use temporary isolated caches with automatic cleanup
+This section summarizes what our test suite covers for the experimental `push` feature and what still requires live/manual checks.
 
-**Implemented Tests (78 tests):**
-- ✅ `test_real_model_lifecycle.py` - Full model lifecycle with `tiny-random-gpt2` (~12MB download)
-- ✅ `test_core_functionality.py` - Basic CLI operations with `patch_model_cache` isolation  
-- ✅ `test_process_lifecycle.py` - Process management with isolated cache + MODEL_CACHE patching
-- ✅ `test_run_command_advanced.py` - Run command edge cases with `mock_model_cache` in isolation
-- ✅ `test_lock_cleanup_bug.py` - Lock cleanup testing with temporary MODEL_CACHE override
-- ✅ `test_health_checks.py` - Mock corruption testing with isolated `temp_cache_dir`
+### Reference: Push CLI and JSON
 
-**Technical Pattern:**
+- Usage: `mlxk2 push <local_dir> <org/model> --private [--create] [--branch main] [--commit <msg>] [--check-only] [--json] [--verbose]`
+- Args:
+  - `--private` (required in alpha): Safety gate to avoid public uploads.
+  - `--create`: Create the repository if it does not exist (model repo).
+- `--branch`: Target branch, default `main`. Missing branches are tolerated; with `--create`, the branch is proactively created (and upload retried once if the hub initially rejects the revision).
+  - `--commit`: Commit message, default `"mlx-knife push"`.
+  - `--check-only`: Analyze workspace locally; no network call; returns `data.workspace_health`.
+  - `--dry-run`: Compare local workspace to the remote branch and summarize changes without uploading (requires repo read access).
+  - `--json`: Print JSON response; in JSON mode, logs/progress are suppressed by default.
+  - `--verbose`: Human mode — append details (e.g., commit URL). In JSON mode, only toggles console log verbosity; the JSON payload is unchanged.
+
+- JSON fields (`data`):
+  - `repo_id: string` — target `org/model`.
+  - `branch: string` — target branch.
+  - `commit_sha: string|null` — commit id; null when `no_changes:true` or on noop.
+  - `commit_url: string|null` — link to commit; null when no commit created.
+  - `repo_url: string` — `https://huggingface.co/<org/model>`.
+  - `uploaded_files_count: int|null` — number of changed files; set to `0` on `no_changes:true`.
+  - `local_files_count: int|null` — approximate local file count scanned.
+  - `no_changes: boolean` — true when hub reports an empty commit (preferred signal) or no file operations are detected.
+  - `created_repo: boolean` — true when repo was created (with `--create`).
+  - `change_summary: {added:int, modified:int, deleted:int}` — optional; derived from hub response when available.
+  - `message: string|null` — short human hint; mirrors hub on no‑op.
+  - `hf_logs: string[]` — buffered hub log lines (not printed in JSON mode unless `--verbose`).
+  - `experimental: true` and `disclaimer: string` — feature state markers.
+  - `workspace_health: {...}` — present only with `--check-only`:
+    - `healthy: bool`, `anomalies: []`, `config`, `weights.index`, `weights.pattern_complete`, etc.
+  - `dry_run: true` — present only with `--dry-run`.
+  - `dry_run_summary: {added:int, modified:int, deleted:int}` — present with `--dry-run`.
+  - `would_create_repo: bool` / `would_create_branch: bool` — planning hints when target does not exist.
+
+- Error types (`error.type`):
+  - `dependency_missing` — `huggingface-hub` not installed.
+  - `auth_error` — missing `HF_TOKEN` (unless `--check-only`).
+  - `workspace_not_found` — local_dir missing/not a directory.
+  - `repo_not_found` — repo missing without `--create`.
+  - `upload_failed` — hub returned an error (e.g., 403/permission).
+  - `push_operation_failed` — unexpected internal failure wrapper.
+
+- Exit codes: success → `0`; any `status:error` → `1`.
+
+Notes on output verbosity and behavior
+- JSON is quiet by default: only the final JSON object is printed. Use `--verbose` to allow hub logs/progress to reach the console (the JSON payload remains unchanged). For assertions, prefer `data.hf_logs`.
+- Human mode is chatty by default: progress + one‑liner summary. `--verbose` appends the commit URL when present.
+- No‑changes detection: If the hub reports “No files have been modified… Skipping to prevent empty commit.”, JSON sets `no_changes: true`, `uploaded_files_count: 0`, and nulls `commit_sha`/`commit_url`. Human shows “— no changes”. This hub signal is preferred over inferring from file lists.
+ - `--dry-run` human output: prints a concise plan line `dry-run: +A ~M -D` (modifications are an approximation and may be `~?` in rare cases).
+ - Branch creation with `--create`: Even if the push is a no‑op, the target branch is created upfront.
+
+Examples (expected)
+- No‑op re‑push (JSON): `commit_sha: null`, `commit_url: null`, `uploaded_files_count: 0`, `no_changes: true`, `message` mirrors hub text, `hf_logs` contains hub lines.
+- Commit (JSON): `commit_sha`/`commit_url` populated; `uploaded_files_count == sum(change_summary.values())`; `message` summarizes counts.
+
+- Dry-run (existing repo/branch, no changes) — JSON:
+  ```json
+  {
+    "status": "success",
+    "command": "push",
+    "error": null,
+    "data": {
+      "repo_id": "org/model",
+      "branch": "main",
+      "commit_sha": null,
+      "commit_url": null,
+      "repo_url": "https://huggingface.co/org/model",
+      "uploaded_files_count": 0,
+      "local_files_count": 11,
+      "no_changes": true,
+      "created_repo": false,
+      "message": "Dry-run: no changes",
+      "experimental": true,
+      "disclaimer": "Alpha feature (upload only). No validation/filters; review results on the Hub.",
+      "dry_run": true,
+      "dry_run_summary": {"added": 0, "modified": null, "deleted": 0},
+      "change_summary": {"added": 0, "modified": 0, "deleted": 0},
+      "would_create_repo": false,
+      "would_create_branch": false,
+      "added_files": [],
+      "deleted_files": []
+    }
+  }
+  ```
+
+- Dry-run (existing repo/branch, changes present) — JSON:
+  ```json
+  {
+    "status": "success",
+    "command": "push",
+    "error": null,
+    "data": {
+      "repo_id": "org/model",
+      "branch": "main",
+      "commit_sha": null,
+      "commit_url": null,
+      "repo_url": "https://huggingface.co/org/model",
+      "uploaded_files_count": 0,
+      "local_files_count": 11,
+      "no_changes": false,
+      "created_repo": false,
+      "message": "Dry-run: +2 ~? -1",
+      "experimental": true,
+      "disclaimer": "Alpha feature (upload only). No validation/filters; review results on the Hub.",
+      "dry_run": true,
+      "dry_run_summary": {"added": 2, "modified": null, "deleted": 1},
+      "change_summary": {"added": 2, "modified": 0, "deleted": 1},
+      "would_create_repo": false,
+      "would_create_branch": false,
+      "added_files": ["new.txt", "weights/model.safetensors"],
+      "deleted_files": ["old.txt"]
+    }
+  }
+  ```
+
+- Dry-run — Human output:
+  ```
+  push (experimental): org/model@main — dry-run: no changes
+  push (experimental): org/model@main — dry-run: +2 ~? -1
+  ```
+
+Spec/Schema
+- The JSON API spec version and schema live in `mlxk2/spec.py` and `docs/json-api-specification.md`. The docs schema includes support for `command: "push"` and its fields. Keep tests in sync with those sources of truth.
+
+**Automated (offline)**
+- **Token/Workspace errors:** Missing `HF_TOKEN` and missing workspace produce proper JSON errors.
+- **CLI args (JSON mode):** Missing positional args emit JSON errors rather than usage text.
+- **Schema shape:** Push success/error outputs validate against `docs/json-api-schema.json`.
+- **No-op push:** Detects `no_changes: true`, sets `uploaded_files_count: 0`, carries hub message into JSON (`message`/`hf_logs`), and human output shows "no changes" without duplicate logs.
+- **Commit path:** Extracts `commit_sha`, `commit_url`, `change_summary` (+/~/−), correct `uploaded_files_count`; human `--verbose` includes URL.
+- **Repo/Branch handling:** Missing repo requires `--create`; with `--create` sets `created_repo: true`. Missing branch is tolerated; upload attempts proceed. With `--create`, the branch is proactively created and the upload is retried once if the hub rejects the revision (e.g., “Invalid rev id”).
+- **Ignore rules:** `.hfignore` is merged with default ignores and forwarded to the hub.
+
+Files:
+- `tests_2.0/test_cli_push_args.py` (CLI errors and JSON outputs)
+- `tests_2.0/test_push_extended.py` (no-op vs commit, branch/repo, .hfignore, human; includes retry on invalid revision with `--create`)
+- `tests_2.0/spec/test_push_output_matches_schema.py` (schema success path)
+
+Run (venv39):
+- `source venv39/bin/activate && pip install -e .`
+- `pytest -q tests_2.0/test_cli_push_args.py tests_2.0/test_push_extended.py`
+- `pytest -q tests_2.0/spec/test_push_output_matches_schema.py`
+- Targeted retry test: `pytest -q tests_2.0/test_push_extended.py::test_push_retry_creates_branch_on_upload_revision_error`
+
+**Live (opt-in / wet)**
+- Purpose: sanity-check real HF behavior (auth, no-op vs commit, URLs).
+- Defaults: Live tests are skipped. Enable with env vars and markers.
+- Env:
+  - `MLXK2_LIVE_PUSH=1`
+  - `HF_TOKEN` (write-enabled)
+  - `MLXK2_LIVE_REPO='org/model'`
+  - `MLXK2_LIVE_WORKSPACE='/abs/path/to/workspace'`
+- Command:
+  - `pytest -q -m wet tests_2.0/live/test_push_live.py`
+  - or `pytest -q -m live_push`
+
+## Pull/Preflight (Issue #30)
+
+Goal: Gated/private/not‑found repos must not pollute the cache and should fail fast.
+
+- Behavior (2.0):
+  - Preflight uses `huggingface_hub.HfApi.model_info()` (metadata only; no download).
+  - Gated/Forbidden/Unauthorized/NotFound → `access_denied` before download; clear hint to set `HF_TOKEN`.
+  - Network timeouts/unspecific HTTP errors in preflight → degrade to a warning; allow the download layer (to surface meaningful error/timeout paths).
+  - Tokens: prefer `HF_TOKEN` (legacy `HUGGINGFACE_HUB_TOKEN` is read, but not promoted).
+  - Tests use isolated caches; the user cache is never touched.
+
+- Relevant tests: `tests_2.0/test_issue_30_preflight.py`
+  - `test_preflight_private_model_without_token`
+  - `test_preflight_nonexistent_model`
+  - `test_preflight_integration_in_pull`
+  - `test_preflight_prevents_cache_pollution`
+
+- Quick checks:
+  - `pytest -q tests_2.0/test_issue_30_preflight.py`
+  - CLI: `unset HF_TOKEN HUGGINGFACE_HUB_TOKEN; mlxk-json pull meta-llama/Llama-2-7b-hf --json`
+
+## Runner: Interruption & Recovery
+
+- Semantics (2.0): A new generation resets `_interrupted = False` at the start (recovery behavior). A previous Ctrl‑C does not block the next generation.
+- Streaming:
+  - During an active generation, the runner yields a line `"[Generation interrupted by user]"` and stops.
+  - Token diffing in streaming is robust against minimal mocks (no StopIteration due to short `decode` sequences).
+- Batch:
+  - Resets the flag at the start of a new generation; filters stop tokens; chat stop tokens optional via `use_chat_stop_tokens=True`.
+- Relevant tests:
+  - `tests_2.0/test_ctrl_c_handling.py` (SIGINT, interruption behavior, interactive)
+  - `tests_2.0/test_interruption_recovery.py` (resetting the flag for new generations)
+  - `tests_2.0/test_runner_core.py` (consistency/batch/streaming, error handling)
+
+## Server Minimal Tests
+
+- Dependencies: `httpx`, `fastapi`, `uvicorn`, `pydantic` (via `[test]`).
+- Scope: OpenAI‑compatible endpoints (minimal smoke); no real models required.
+- Optional for local verification; in CI currently “nice to have” (Backlog, not part of the 2.0 Guide).
+
+## Known Warnings
+
+- urllib3 LibreSSL notice on macOS Python 3.9
+  - Message: “urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3' …”
+  - Status: Harmless for our usage; suppressed in production code (see `mlxk2/__init__.py`, `warnings.filterwarnings(...)`).
+  - Tests: May still appear in pytest summary if third‑party dependencies import `urllib3` before our package.
+  - Optional suppression in tests: add to `pytest.ini`:
+
+    ```ini
+    filterwarnings =
+        ignore:urllib3 v2 only supports OpenSSL 1.1.1+
+    ```
+- Notes:
+  - Live test does not use `--create` (safety). If the repo does not exist, create it once manually.
+  - Manual create example: `mlxk2 push --private --create "$MLXK2_LIVE_WORKSPACE" "$MLXK2_LIVE_REPO" --json`
+
+**Manual Checklist (Live)**
+- **Create repo (first time):** `--private --create` → expect `created_repo: true`, private repo on HF.
+- **No-op re-push:** identical workspace → `no_changes: true`, `uploaded_files_count: 0`, concise human "no changes".
+- **Commit after change:** edit a small file → push shows `commit_sha`, `commit_url`, `change_summary` matches expectations.
+- **.hfignore behavior:** add ignores (e.g., `.idea/`, `.vscode/`, `*.ipynb`) → verify excluded on HF.
+- Optional errors: invalid token or missing rights → JSON `error` (`upload_failed` / auth error), clear message.
+
+Human vs JSON:
+- Human output is derived from JSON only; hub logs are not printed directly.
+- Use `--verbose` with human output to append the commit URL or short message; JSON content stays the same structurally.
+
+## Manual MLX Chat Model Smoke Test (2.0)
+
+Goal: Pull a small MLX chat model, verify classification, prepare a local workspace, validate it offline, and push to a private repo while preserving chat intent. This helps issuers validate iOS‑focused workflows.
+
+Model choice (example)
+- `mlx-community/Qwen2.5-0.5B-Instruct-4bit` (small, chat‑oriented)
+
+Steps
+- Pull (venv39):
+  - `mlxk2 pull mlx-community/Qwen2.5-0.5B-Instruct-4bit`
+- Verify in cache:
+  - `mlxk2 list --health "Qwen2.5-0.5B-Instruct-4bit"`
+  - Expect: Framework MLX, Type chat, capabilities include chat
+- Prepare local workspace from cache (dereference symlinks):
+  - Ensure `HF_HOME` points to your HF cache (optional, but recommended)
+  - Compute cache path: `$HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit`
+  - Find latest snapshot hash under `snapshots/`
+  - Copy to workspace and dereference symlinks:
+    - `rsync -aL "$HF_HOME/models--mlx-community--Qwen2.5-0.5B-Instruct-4bit/snapshots/<HASH>/" ./mymodel_test_workspace/`
+- Recommended README front‑matter (to preserve intent on push):
+  - Include YAML with tags and pipeline tag, e.g.
+    - `tags: [mlx, chat]`
+    - `pipeline_tag: text-generation`
+    - `base_model: <upstream_base>`
+  - Keep model name containing `Instruct` or `chat` to aid chat detection
+- Offline validation (no network):
+  - `mlxk2 push --check-only ./mymodel_test_workspace <org/model> --json`
+  - Expect: `workspace_health.healthy: true`; ensure tokenizer present (`tokenizer.json` or `tokenizer.model`) and at least one non‑empty weight file
+- Push to private repo:
+  - `mlxk2 push --private --create ./mymodel_test_workspace <org/model> --json`
+  - Re‑push without changes should show `no_changes: true`
+- Post‑push verification:
+  - `mlxk2 list --all --health <org/model>`
+  - Current limitation: Framework may show `PyTorch` for non‑`mlx-community` orgs due to conservative detection. This does not affect content; future M1 will parse model card tags (`mlx`) to classify MLX across orgs.
+
+Notes
+- Ensure tokenizer files exist (tokenizer.json/tokenizer.model) and optional generation_config.json for runnable chat contexts.
+- Avoid pushing unwanted files; use `.hfignore` for project‑specific filters.
+
+## 2.0 Test Strategy
+
+MLX Knife 2.0 uses a **3-category test strategy** with enhanced isolation and sentinel protection:
+
+### 🏠 CATEGORY 1: ISOLATED CACHE (Default Tests - ~230+ tests)
+**✅ User cache stays pristine** - Tests use `isolated_cache` fixture with sentinel protection
+
+**Current 2.0 Test Files:**
+- ✅ `test_json_api_*.py` - JSON API contract validation
+- ✅ `test_human_output.py` - Human output formatting
+- ✅ `test_health_multifile.py` - Multi-file health completeness
+- ✅ `test_push_*.py` - Push operations (offline, dry-run, workspace check)
+- ✅ `test_clone_operation.py` - Clone operations with APFS optimization
+- ✅ `test_run_complete.py` / `test_runner_core.py` - Run command and MLX generation
+- ✅ `test_server_*_minimal.py` - Basic server API endpoints
+- ✅ `spec/test_*.py` - Schema validation and spec compliance
+
+**Technical Pattern (2.0):**
 ```python
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestBasicLifecycle:
-    def test_something(self, temp_cache_dir, patch_model_cache):
-        with patch_model_cache(temp_cache_dir / "hub"):
-            # Test operates in complete isolation
-            # User cache never touched, automatic cleanup
+def test_something(isolated_cache):
+    # Test operates in complete isolation with sentinel protection
+    # isolated_cache fixture ensures TEST_SENTINEL exists
+    # MLX stubs enable platform-neutral testing without real MLX
+    assert_is_test_cache(isolated_cache)  # Safety check
+    # Test implementation here
 ```
 
-**Benefits:** 
+**Benefits:**
 - ✅ **Clean User Cache**: No test artifacts or broken models ever
-- ✅ **Parallel Testing**: No cache conflicts between test runs  
+- ✅ **Parallel Testing**: No cache conflicts between test runs
 - ✅ **Reproducible**: No dependency on existing models in user cache
-- ✅ **Fast CI**: Small models (12MB vs 4GB) for most tests
+- ✅ **Platform Neutral**: MLX stubs enable testing without real MLX hardware
+- ✅ **Sentinel Protection**: `TEST_SENTINEL` prevents accidental user cache modification
 
-### 🏥 CATEGORY 2: USER CACHE (Framework Diversity)
-**📋 Reserved for future** - Real model diversity that cannot be mocked
+### 🌐 CATEGORY 2: LIVE TESTS (Network/User Cache - Opt-in)
+**🔒 Require explicit environment setup** - Located in `live/` directory
 
-**Future Framework Validation Tests:**
-- Multiple framework detection (MLX + PyTorch + Tokenizer-only models)
-- Health check diversity testing with naturally corrupted models
-- Cross-framework model compatibility validation
+**Live Test Files:**
+- 🔒 `live/test_push_live.py` - Real HuggingFace push operations
+- 🔒 `live/test_clone_live.py` - APFS same-volume clone workflows
+- 🔒 `live/test_list_human_live.py` - Tests against user cache models
+- 🔒 `test_issue_27.py` - Real multi-shard model health validation (marker: `issue27`)
 
-**Currently**: All health/framework tests use `mock_model_cache` and are Category 1 (isolated)
+**Markers:** `live_push`, `live_clone`, `live_list`, `wet` (umbrella), `issue27`
 
-### 🖥️ CATEGORY 3: SERVER CACHE (Performance Tests)  
-**🔒 Large models, user cache expected** - Marked with `@pytest.mark.server`
+### 🖥️ CATEGORY 3: SERVER TESTS (2.0 Minimal)
+**✅ Basic server functionality** - Lightweight API validation
 
-**Server Tests (Excluded from default `pytest`):**
-- 🔒 `test_issue_14.py` - Chat self-conversation regression tests
-- 🔒 `test_issue_15_16.py` - Dynamic token limit validation  
-- 🔒 `test_end_token_issue.py` - End-token filtering consistency
-- 🔒 `test_server_functionality.py` - OpenAI API compliance (basic tests only)
+**Server Test Files:**
+- ✅ `test_server_api_minimal.py` - Basic OpenAI-compatible endpoints
+- ✅ `test_server_streaming_minimal.py` - SSE streaming functionality
+- ✅ `test_server_models_and_errors.py` - Model loading and error handling
+- ✅ `test_server_token_limits_api.py` - Token limit enforcement
 
-**Technical Pattern:**
-```python
-@pytest.mark.server  # Excluded from default pytest
-def test_server_feature(mlx_server, model_name: str):
-    # Uses real models in user cache
-    # Requires significant RAM and time
-```
+**Characteristics (2.0):**
+- ✅ **Included by default** - Part of standard test suite
+- 🏠 **Uses isolated cache** - Same safety as Category 1
+- ⚡ **Fast execution** - Uses MLX stubs, no real model loading
+- 🎯 **API compliance focus** - OpenAI compatibility validation
 
-**Characteristics:**
-- 🔒 **Not run by default** - Must use `pytest -m server`
-- 💾 **RAM-aware** - Auto-skip models exceeding available memory
-- ⏱️ **Longer execution** - 20-40 minutes for full suite
-- 🎯 **Model diversity** - Tests across different model sizes/architectures
+**Run specifically:** `pytest -k server -v` (optional, included in default anyway)
+
+**Note:** Heavy server tests with real models documented in "Future" section above
 
 ## Test Prerequisites
 
@@ -134,13 +489,93 @@ def test_server_feature(mlx_server, model_name: str):
 
 1. **Apple Silicon Mac** (M1/M2/M3)
 2. **Python 3.9 or newer**
-3. **Test dependencies installed**:
+3. **Test dependencies installed** (includes jsonschema for Spec tests):
    ```bash
-   pip install -e ".[test]"
+   pip install -e .[test]
    ```
 
+Notes:
+- Spec validation requires `jsonschema`. Installing `.[test]` ensures it is available.
+- Without `jsonschema`, Spec example validation is skipped (you will see one extra SKIPPED test).
+- With `jsonschema` installed, expect one additional PASS in the `-m spec` and `tests_2.0/` totals.
+
 **That's it!** Most tests (Category 1) use isolated caches and download small test models automatically (~12MB).
 
+### Enabling Issue #27 Tests (optional)
+
+Quick start (minimal)
+- Best practice: set your HF cache to an external volume before pytest: `export HF_HOME=/Volumes/your-ssd/huggingface/cache`.
+- Select a model: `export MLXK2_ISSUE27_MODEL="org/model"`.
+  - Tip: choose an upstream repo that provides an index file (`model.safetensors.index.json` or `pytorch_model.bin.index.json`) to avoid SKIPs.
+- Optional: if your cache has no index file for this repo, enable isolated index bootstrap (index‑only, no shards): `export MLXK2_BOOTSTRAP_INDEX=1`.
+- Run: `pytest tests_2.0/test_issue_27.py -v`.
+
+Notes
+- Tests read from your user cache and copy a minimal subset into an isolated test cache.
+- Network is only used when `MLXK2_BOOTSTRAP_INDEX=1` and the index file is not present locally.
+
+- Set your user cache:
+  - EITHER set `MLXK2_USER_HF_HOME=/absolute/path/to/your/huggingface/cache`
+  - OR set `HF_HOME=/absolute/path/to/your/huggingface/cache` before running pytest — the test harness preserves this original value and exposes it to the Issue #27 helpers while still isolating `HF_HOME` for the code under test.
+- Select a specific upstream model that includes an index file (strongly recommended):
+  - `export MLXK2_ISSUE27_MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"`
+  - or another upstream PyTorch repo that contains `model.safetensors.index.json` or `pytorch_model.bin.index.json`.
+  - Note: Many `mlx-community/...` conversions do not ship the upstream safetensors index; prefer the original upstream repo to avoid SKIPs.
+- Minimize copy size (optional):
+  - `export MLXK2_SUBSET_COUNT=1`  (Default 1; erhöht ggf. Shard‑Anzahl)
+  - `export MLXK2_MIN_FREE_MB=512` (Default 512 MB Sicherheitsmarge)
+- Run the focused tests: `PYTHONPATH=. pytest tests_2.0/test_issue_27.py -v`
+
+Optional bootstrap (opt-in, minimal workflow):
+- Minimal preconditions to run all Issue #27 tests without SKIPs:
+  - Select models to test:
+    - Healthy check model (read-only): `export MLXK2_ISSUE27_MODEL="org/model"` (should be present and healthy in your user cache; single-shard small models are ideal, e.g., `sshleifer/tiny-gpt2`).
+    - Index tests model (optional, can be different): `export MLXK2_ISSUE27_INDEX_MODEL="org/model-with-index"` (upstream repo that lists an index; not required to be fully downloaded locally).
+- Ensure your user cache root is set via `MLXK2_USER_HF_HOME` (or provide it via `HF_HOME` before pytest; the harness maps it across).
+  - Enable index bootstrap: `export MLXK2_BOOTSTRAP_INDEX=1` (fetches only index files into the ISOLATED test cache; never modifies your user cache).
+  - Then: `pytest tests_2.0/test_issue_27.py -v`
+  - Note: Network is only needed if your user cache does not already contain an index file for the chosen repo. If the index exists in your cache, the tests copy it into the isolated cache and no network is required.
+
+If you still see SKIPs:
+- “No safetensors index found” → The chosen model snapshot lacks an index file. Pick a model that has `model.safetensors.index.json` (or `pytorch_model.bin.index.json`).
+- “Not enough free space” → Free disk space; tests create a subset copy into an isolated temp cache.
+- “User model not found” → Verify your model exists in the user cache and `MLXK2_USER_HF_HOME` points to the `.../huggingface/cache` root.
+
+Quick helper to list index‑bearing models in your user cache:
+
+```bash
+find "$MLXK2_USER_HF_HOME/hub" -type f \
+  \( -name 'model.safetensors.index.json' -o -name 'pytorch_model.bin.index.json' \) \
+| sed 's#.*/hub/models--\(.*\)/snapshots/.*#\1#; s#--#/#g' | sort -u
+```
+
+With a suitable model (i.e., one that includes an upstream safetensors index) present and `MLXK2_USER_HF_HOME` set, the Issue #27 tests should run without SKIPs.
+
+### When Issue #27 real‑model tests make sense
+
+Purpose
+- These tests validate the strict health policy against real upstream Hugging Face repositories that ship multi‑shard safetensors with a `model.safetensors.index.json`. They complement the deterministic unit tests by exercising real‑world layouts.
+
+Run them when
+- Your user cache contains at least one upstream PyTorch repo with a safetensors index (not MLX/GGUF conversions). Good candidates:
+  - `mistralai/Mistral-7B-Instruct-v0.2` or `-v0.3`
+  - `Qwen/Qwen1.5-7B-Chat`, `Qwen/Qwen2-7B-Instruct`
+  - `teknium/OpenHermes-2.5-Mistral`
+  - Gated: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-3-8B-Instruct`, `google/gemma-7b-it`
+- You want to sanity‑check index‑based completeness, shard deletion/truncation, and LFS pointer detection against real artifacts.
+
+They are not useful when
+- Your cache only has MLX Community models (no `model.safetensors.index.json`) or GGUF models — the index‑based tests will skip by design. In that case, rely on `tests_2.0/test_health_multifile.py` for deterministic coverage.
+
+- Resource considerations
+- Disk: tests copy a minimal subset of files into an isolated cache (index + 1 smallest shard, oder 1 Pattern‑Shard). Optional Tuning:
+  - `export MLXK2_SUBSET_COUNT="1"` (Default 1; erhöhe bei Bedarf)
+  - `export MLXK2_MIN_FREE_MB="512"` (Default 512 MB; erhöhe bei knappem Platz)
+- Network: if you need to fetch a candidate model first, prefer downloading only `config.json`, `model.safetensors.index.json`, and 1–2 small shards to keep it light.
+
+Summary
+- If you have a suitable upstream PyTorch chat/instruct model with an index in your user cache, enable the env vars above and run `tests_2.0/test_issue_27.py` for an extra layer of real‑model assurance. Otherwise, the deterministic tests already validate the policy thoroughly.
+
 ### Optional Setup (Server Tests Only)
 
 For server tests (`@pytest.mark.server` - **excluded by default**):
@@ -154,6 +589,22 @@ mlxk pull mlx-community/Mistral-7B-Instruct-v0.3-4bit
 
 **Note**: Server tests are excluded from default `pytest` and require manual execution with `pytest -m server`.
 
+## Environment & Caches
+
+To keep results reproducible and caches safe on Apple Silicon:
+
+- Preferred Python/venv: Apple‑native 3.9 in a dedicated env
+  - Example: `python3.9 -m venv venv39 && source venv39/bin/activate && pip install -e .[test]`
+- User cache (persistent): shared, real cache for manual ops and certain advanced/server tests
+  - Example (external SSD): `export HF_HOME="/Volumes/SomeExternalSSD/models"`
+  - Safe ops: `list`, `health`, `show`; Coordinate `pull`/`rm` (maintenance window)
+- Test cache (isolated/default): ephemeral via fixtures; default `pytest` runs must not force the user cache
+  - Category 1 tests use temporary caches and should not depend on `HF_HOME`
+  - Only server/advanced tests may require user cache and are excluded by default (`-m server`)
+  - Deletion safety: tests set `MLXK2_STRICT_TEST_DELETE=1` so delete ops fail if not in test cache
+
+In PRs, please state your Python version and whether you used the user cache or isolated test caches.
+
 ## Test Commands
 
 ### Basic Test Execution
@@ -218,13 +669,16 @@ pytest -k "process_lifecycle or zombie" -v
 
 # Run health check tests only
 pytest -k "health" -v
+
+# Only JSON API contract/spec tests
+pytest -m spec -v
 ```
 
 ### Timeout and Performance
 
 ```bash
-# Set custom timeout (default: 300s, method=thread)
-pytest --timeout=60 --timeout-method=thread
+# Set custom timeout (default: 300s)
+pytest --timeout=60
 
 # Show slowest tests
 pytest --durations=10
@@ -256,54 +710,26 @@ pytest tests/integration/test_server_functionality.py -v
 - **Models**: Multiple 4-bit quantized models (1B-30B parameters)
 - **Coverage**: Streaming vs non-streaming consistency, token limits, API compliance
 
-### Memory Gating for Large Models
-
-- The integration tests avoid loading oversized models by estimating RAM usage based on model size and quantization.
-- Quantization detection uses common markers in the model name (e.g., `-4bit`, `q4`, `int4`) and, when available, details from `mlxk show <model>`.
-- Two estimation maps are used: one for 4‑bit and one conservative for FP16/BF16.
-- Safety margin: By default, tests use a RAM safety factor to keep headroom.
-  - Configure via `MLXK_TEST_RAM_SAFETY` (float in `0.1..1.0`).
-  - Examples:
-    - `MLXK_TEST_RAM_SAFETY=0.8` (default in some tests): use ~80% of available RAM.
-    - `MLXK_TEST_RAM_SAFETY=1.0`: use up to available RAM (minus 4 GB guard).
-  - This allows FP16 models to be included when they truly fit in memory.
-  
-- Unknown size fallback: tests call `mlxk show <model>` and parse `Size:` and `Quantization:` for more accurate estimates (prevents `unknown → 999GB`).
-  
-- Advanced tuning (optional):
-  - `MLXK_TEST_DISK_TO_RAM_FACTOR`: base factor for converting disk size (GB) to RAM estimate (default: 0.6).
-  - `MLXK_TEST_FACTOR_4BIT`: override factor for 4‑bit models (falls back to `MLXK_TEST_DISK_TO_RAM_FACTOR`).
-  - `MLXK_TEST_FACTOR_FP16`: override factor for FP16/BF16 models (falls back to `MLXK_TEST_DISK_TO_RAM_FACTOR`).
-
-### Robust Server Process Cleanup
-
-- Server tests install a process guard in their managers (not session-wide) and clean up `mlxk server` processes on Ctrl-C, SIGTERM, or teardown.
-- Implementation: `tests/support/process_guard.py`; installed explicitly in server managers.
-- Test code registers processes automatically:
-  - `MLXKnifeServerManager`/`MLXKnifeServer` call `register_popen(...)` when starting `mlxk server`.
-  - The generic `mlx_knife_process` fixture also registers its subprocesses.
-- Environment toggles:
-  - `MLXK_TEST_DISABLE_PROCESS_GUARD=1` disables guard registration (not recommended).
-  - `MLXK_TEST_KILL_ZOMBIES_AT_START=1` sweeps stale servers at session start.
-  - `MLXK_TEST_DETACH_PGRP=1` (advanced): detach runner into its own process group to isolate from stray group-kills.
-
 ## Python Version Compatibility
 
-### Verification Results (September 2025)
+### Verification Results (October 2025)
 
-**✅ 166/166 tests passing** - All standard tests validated on Apple Silicon with isolated cache system  
-**🆕 1.1.1-beta.3** - MXFP4 quantization support and GPT-OSS reasoning model integration
+**✅ 297/317 tests passing** - All standard tests validated on Apple Silicon with enhanced isolation
 
-| Python Version | Status | Tests Passing |
-|----------------|--------|---------------|
-| 3.9.6 (macOS)  | ✅ Verified | 166/166 |
-| 3.10.x         | ✅ Verified | 166/166 |
-| 3.11.x         | ✅ Verified | 166/166 |
-| 3.12.x         | ✅ Verified | 166/166 |
-| 3.13.x         | ✅ Verified | 166/166 |
+| Python Version | Status | Tests Passing | Skipped |
+|----------------|--------|---------------|---------|
+| 3.9.6 (macOS)  | ✅ Verified | 297/317 | 20 |
+| 3.10.x         | ✅ Verified | 297/317 | 20 |
+| 3.11.x         | ✅ Verified | 297/317 | 20 |
+| 3.12.x         | ✅ Verified | 297/317 | 20 |
+| 3.13.x         | ✅ Verified | 297/317 | 20 |
 
-All versions tested with isolated cache system.
-Real MLX execution verified separately with server/run commands.
+**Note:** 20 skipped tests are opt-in (live tests, alpha features). Skipped count may vary by environment:
+- Without `HF_TOKEN`: +1 skip (live push test)
+- Without `MLXK2_ENABLE_ALPHA_FEATURES=1`: +3 skips (alpha feature tests)
+- Without `jsonschema`: +1 skip (spec validation test)
+
+All versions tested with `isolated_cache` system and MLX stubs for platform-neutral execution.
 
 ### Manual Multi-Python Testing
 
@@ -316,7 +742,7 @@ If you have multiple Python versions installed, you can verify compatibility:
 # Or manually test specific versions
 python3.9 -m venv test_39
 source test_39/bin/activate
-pip install -e ".[test]"
+pip install -e . && pip install pytest
 pytest
 deactivate && rm -rf test_39
 ```
@@ -328,8 +754,8 @@ deactivate && rm -rf test_39
 MLX Knife includes comprehensive code quality tools:
 
 ```bash
-# Install development dependencies  
-pip install -e ".[dev]"
+# Install development tools
+pip install ruff mypy
 
 # Automatic code formatting and linting
 ruff check mlx_knife/ --fix
@@ -341,6 +767,44 @@ mypy mlx_knife/
 ruff check mlx_knife/ --fix && mypy mlx_knife/ && pytest
 ```
 
+## Mini‑Matrix: What runs by default vs markers
+
+| Target | How to Run | Markers / Env | Includes | Network |
+|---|---|---|---|---|
+| Default 2.0 suite | `pytest -v` | — | JSON‑API (list/show/health), Human‑Output, Model‑Resolution, Health‑Policy, Push Offline (`--check-only`, `--dry-run`), Spec/Schema checks | No |
+| Spec‑only | `pytest -m spec -v` | `spec` | Schema/contract tests, version sync, docs example validation | No |
+| Exclude Spec | `pytest -m "not spec" -v` | `not spec` | Everything except spec/schema checks | No |
+| Push (alpha, opt‑in) | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -k push -v` | Env: `MLXK2_ENABLE_ALPHA_FEATURES=1` | Push offline tests (`--check-only`, `--dry-run`); push command hidden by default | No |
+| Live Push (opt‑in) | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m live_push -v` | `live_push` (subset of `wet`) + Env: `MLXK2_ENABLE_ALPHA_FEATURES=1`, `MLXK2_LIVE_PUSH=1`, `HF_TOKEN`, `MLXK2_LIVE_REPO`, `MLXK2_LIVE_WORKSPACE` | JSON push against the real Hub; on errors the test SKIPs (diagnostic) | Yes |
+| Live List (opt‑in) | `pytest -m live_list -v` | `live_list` (subset of `wet`) + Env: `HF_HOME` (user cache with models) | Tests list/health against user cache models | No (uses local cache) |
+| Clone (alpha, opt‑in) | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -k clone -v` | Env: `MLXK2_ENABLE_ALPHA_FEATURES=1` | Clone offline tests (Pull+Copy+Cleanup workflow, APFS optimization); clone command hidden by default | No |
+| Live Clone (ADR-007) | `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m live_clone -v` | `live_clone` + Env: `MLXK2_ENABLE_ALPHA_FEATURES=1`, `MLXK2_LIVE_CLONE=1`, `HF_TOKEN`, `MLXK2_LIVE_CLONE_MODEL`, `MLXK2_LIVE_CLONE_WORKSPACE` | Real clone workflow: pull→temp cache→APFS same-volume clone→workspace (ADR-007 Phase 1 constraints: same volume + APFS required) | Yes |
+| Live Stop Tokens (opt‑in, ADR-009) | `pytest -m live_stop_tokens -v` | `live_stop_tokens` + Env: `HF_HOME` (user cache with MXFP4/Qwen/Llama models) | Issue #32: Validates multi-EOS token stop behavior with real models (MXFP4 no visible `<|end|>`, Qwen no self-conversation, Llama baseline) | No (uses local cache) |
+| Live Run (opt‑in) | `pytest -m live_run -v` | `live_run` + Env: `MLXK2_USER_HF_HOME` or `HF_HOME` (user cache with `mlx-community/Phi-3-mini-4k-instruct-4bit`) | Regression tests for Issue #37: Validates private/org MLX model framework detection in run command (renames Phi-3 to simulate private-org model) | No (uses local cache) |
+| Issue #27 real‑model (opt‑in) | `pytest -m issue27 tests_2.0/test_issue_27.py -v` | Marker: `issue27`; Env (required): `MLXK2_USER_HF_HOME` or `HF_HOME` (user cache, read‑only). Env (optional): `MLXK2_ISSUE27_MODEL`, `MLXK2_ISSUE27_INDEX_MODEL`, `MLXK2_SUBSET_COUNT=0`. | Copies real models from user cache into isolated test cache; validates strict health policy on index‑based models (no network) | No (uses local cache) |
+| Server tests (included) | `pytest -k server -v` | — | Basic server API tests (minimal, uses MLX stubs) | No |
+
+Useful commands
+- Only Spec: `pytest -m spec -v`
+- Push tests (offline): `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -k "push and not live" -v`
+- Exclude Spec: `pytest -m "not spec" -v`
+- Live Push only: `MLXK2_ENABLE_ALPHA_FEATURES=1 MLXK2_LIVE_PUSH=1 HF_TOKEN=... MLXK2_LIVE_REPO=... MLXK2_LIVE_WORKSPACE=... pytest -m live_push -v`
+- Live Clone only: `MLXK2_ENABLE_ALPHA_FEATURES=1 MLXK2_LIVE_CLONE=1 HF_TOKEN=... MLXK2_LIVE_CLONE_MODEL=... MLXK2_LIVE_CLONE_WORKSPACE=... pytest -m live_clone -v`
+- Live List only: `HF_HOME=/path/to/user/cache pytest -m live_list -v`
+- Live Stop Tokens only (ADR-009): `HF_HOME=/path/to/user/cache pytest -m live_stop_tokens -v` (requires MXFP4, Qwen 2.5, Llama 3.2 models in cache)
+- Live Run only: `HF_HOME=/path/to/user/cache pytest -m live_run -v` (requires `mlx-community/Phi-3-mini-4k-instruct-4bit` in cache)
+- Issue #27 only: `MLXK2_USER_HF_HOME=/path/to/user/cache pytest -m issue27 tests_2.0/test_issue_27.py -v`
+- All live tests (umbrella): `MLXK2_ENABLE_ALPHA_FEATURES=1 pytest -m wet -v` (includes live_push, live_clone, live_list)
+
+Markers: wet vs specific live tests
+- `wet`: umbrella marker for any opt‑in "live" test that may require network, credentials, or user environment. Use to run all live tests.
+- `live_push`: narrow marker for push‑specific live tests only. Use to target push live checks without running other live suites.
+- `live_clone`: narrow marker for clone‑specific live tests only. Use to target ADR-007 Phase 1 real workflow validation.
+- `live_stop_tokens`: narrow marker for stop token validation tests with real models (ADR-009). Use to validate Issue #32 fix (multi-EOS models).
+- `live_run`: narrow marker for run command tests with real models. Use to validate Issue #37 framework detection regression fix (private/org MLX models).
+
+Note: Without the required env vars, live tests remain SKIPPED.
+
 ### Development Workflow
 
 Before committing changes:
@@ -413,7 +877,7 @@ pytest --timeout=60
 
 **Import errors:**
 ```bash
-pip install -e ".[test]"
+pip install -e . && pip install pytest
 ```
 
 **Process cleanup issues:**
@@ -448,145 +912,94 @@ When submitting PRs, please include:
    - Python version
    - Which model(s) you tested with
 
-2. **Test results summary**:
-   ```
-   Platform: macOS 14.5, M2 Pro
-   Python: 3.11.6
-   Model: Phi-3-mini-4k-instruct-4bit
-   Results: 150/150 tests passed
-   ```
+2. **Test results summary (2.0)** (example format):
+  ```
+  Platform: macOS 14.5, M2 Pro
+  Python: 3.9.6
+  Results: 297 passed, 20 skipped
+  ```
 
 3. **Any issues encountered** and how you resolved them
 
 ## Summary
 
-**MLX Knife 1.1.0 STABLE + 1.1.1-beta.3 Testing Status:**
+**MLX Knife 2.0 Testing Status:**
 
-✅ **Stable for development use** - 166/166 tests passing  
-✅ **Isolated Test System** - User cache stays pristine with temp cache isolation
-✅ **3-Category Strategy** - Optimized for performance and safety
-✅ **Multi-Python Support** - Python 3.9-3.13 verified  
-✅ **Code Quality** - ruff/mypy integration working  
-✅ **Real Model Testing** - Server/run commands validated with multiple models
-✅ **Memory Management** - Context managers prevent leaks  
-✅ **Exception Safety** - Context managers ensure cleanup  
-✅ **Cache Directory Fix** - Issue #21: Empty cache crash resolved
-✅ **LibreSSL Warning Fix** - Issue #22: macOS Python 3.9 warning suppression
-✅ **Lock Cleanup Fix** - Issue #23: Enhanced rm command with lock cleanup
+✅ **Feature Complete** - 300+ tests (2.0 Beta, see CHANGELOG.md for current release counts)
+✅ **Enhanced Isolation** - Sentinel protection with `isolated_cache` fixture
+✅ **3-Category Strategy** - Isolated/Live/Server tests optimized for 2.0
+✅ **Multi-Python Support** - Python 3.9-3.13 verified
+✅ **Platform Neutral** - MLX stubs enable testing without real MLX hardware
+✅ **Alpha Feature Separation** - Clean boundaries for beta/alpha functionality
+✅ **JSON API Validation** - Complete schema compliance testing
+✅ **Clone Implementation** - Full ADR-007 Phase 1 validation (APFS optimization)
+✅ **Push Operations** - Comprehensive offline testing (dry-run, workspace check)
 
-This comprehensive testing framework validates MLX Knife's **stability for development use** through isolated testing with automatic model downloads and separate real MLX validation.
+This testing framework validates MLX Knife 2.0's JSON-first architecture through comprehensive isolated testing with minimal live dependencies.
 
-## Server-Based Testing (Advanced)
+## Future: Real-Model Server Testing (TODO)
 
-Some tests require a running MLX Knife server with loaded models. These tests are marked with `@pytest.mark.server` and are **not run by default** with `pytest`.
+**Status:** Currently not implemented in 2.0, but valuable for comprehensive model validation
 
-### Why Separate Server Tests?
+### Rationale
+While 2.0 uses MLX stubs for fast testing, real-model server tests validate:
+- Model compatibility across different architectures (Llama, Mistral, Qwen, etc.)
+- Memory management with actual model weights
+- Generation quality and stop token behavior
+- Performance characteristics under load
 
-- **Test count varies** by loaded models (makes CI reporting inconsistent)
-- **Large memory requirements** - need different models for different RAM sizes  
-- **Longer execution time** - each model needs to load individually
-- **Manual setup required** - need to download appropriate models first
-  
-Note: If your shell prints a termination message after a successful run (e.g., "Terminated: 15" or "Killed: 9"), this can be caused by a stray SIGTERM/SIGKILL delivered to the test runner at teardown time by the environment. The suite installs a session handler that exits cleanly on SIGTERM to avoid this cosmetic noise. Disable for debugging with `MLXK_TEST_DISABLE_CATCH_TERM=1`.
+### RAM-Aware Model Selection Strategy
 
-### Prerequisites for Server Tests
-
-| System RAM | Recommended Models | Commands |
-|------------|-------------------|----------|
-| **16GB**   | Small models only | `mlxk pull mlx-community/Qwen2.5-0.5B-Instruct-4bit`<br>`mlxk pull mlx-community/Llama-3.2-1B-Instruct-4bit`<br>`mlxk pull mlx-community/Llama-3.2-3B-Instruct-4bit` |
-| **32GB**   | + Medium models | `mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit`<br>`mlxk pull mlx-community/Mistral-7B-Instruct-v0.2-4bit`<br>`mlxk pull mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit` |
-| **64GB**   | + Large models | `mlxk pull mlx-community/Mistral-Small-3.2-24B-Instruct-2506-4bit`<br>`mlxk pull mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit`<br>`mlxk pull mlx-community/Llama-3.3-70B-Instruct-4bit` |
-| **96GB+**  | + Huge models | `mlxk pull mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit` |
-
-### Running Server Tests
-
-**Issue #14 Regression Tests** (Chat Self-Conversation Bug):
-
-```bash
-# Set environment
-export HF_HOME=/path/to/your/cache
-
-# Smoke test first (see which models are available)
-python tests/integration/test_issue_14.py
-
-# Run server tests only (excluded from default pytest)
-pytest -m server -v
-
-# Run specific Issue #14 tests
-pytest tests/integration/test_issue_14.py -m server -v
-```
-
-**Expected Output:**
-```
-🦫 MLX Knife Issue #14 Test - Smoke Test
-==================================================
-📊 Safe models for this system: 6
-💾 System RAM: 64GB total, 40GB available
-
-  🎯 mlx-community/Mistral-7B-Instruct-v0.2-4bit
-     └─ Size: 7B, RAM needed: 8GB
-  🎯 mlx-community/Llama-3.2-3B-Instruct-4bit  
-     └─ Size: 3B, RAM needed: 4GB
-  [...]
-
-========== test session starts ==========
-tests/integration/test_issue_14.py::test_server_health[mlx_server] PASSED
-tests/integration/test_issue_14.py::test_issue_14_self_conversation_regression_original[mlx-community/Mistral-7B-Instruct-v0.2-4bit-7B-8] PASSED
-[...6 more model tests...]
-========== 7 passed in 45.23s ==========
-```
-
-### Additional Server Tests
-
-**Issues #15 & #16** - Dynamic Token Limits (Implemented in 1.1.0-beta1):
-```bash
-pytest tests/integration/test_issue_15_16.py -v
-```
-
-**Issue #20** - End-Token Filtering (Implemented in 1.1.0-beta2):
-```bash
-pytest tests/integration/test_end_token_issue.py -m server -v
-```
-
-### Troubleshooting Server Tests
-
-**Permission warnings are normal:**
-```
-WARNING: ⚠️  Cannot scan network connections (permission denied)
-INFO: 🔧 Falling back to process-based cleanup only
-```
-This is expected on macOS - the tests continue with process-based cleanup.
-
-**Memory issues:**
-- Tests automatically skip models exceeding 80% available RAM
-- Use smaller models if you see consistent memory failures  
-- Consider external SSD for model cache to reduce memory pressure
-
-**Server startup failures:**
-```bash
-# Debug server manually
-python -m mlx_knife.cli server --port 8000
-
-# Check model health  
-mlxk health
-
-# Verify environment
-echo $HF_HOME
-```
-
-### Adding New Server Tests
-
-When contributing server-based tests:
+**Methodology:** Automatically select test models based on available system RAM to ensure tests don't fail due to insufficient memory.
 
+**Model RAM Requirements (Rough Estimates):**
 ```python
-@pytest.mark.server
-def test_new_feature(mlx_server, model_name: str, size_str: str, ram_needed: int):
-    """Test new feature with MLX models.""" 
-    # Use mlx_server fixture for automatic server management
-    # Test implementation here
+MODEL_RAM_ESTIMATES = {
+    "0.5B-4bit": 1,      # ~1GB RAM needed
+    "1B-4bit": 2,        # ~2GB RAM needed
+    "3B-4bit": 4,        # ~4GB RAM needed
+    "7B-4bit": 8,        # ~8GB RAM needed
+    "8x7B-4bit": 32,     # ~32GB RAM needed (MoE)
+    "30B-4bit": 40,      # ~40GB RAM needed
+    "70B-4bit": 80,      # ~80GB RAM needed
+}
 ```
 
-1. **Mark with `@pytest.mark.server`** - excludes from default `pytest`
-2. **Use `mlx_server` fixture** - automatic server lifecycle management
-3. **Test RAM requirements** - use `get_safe_models_for_system()` helper
-4. **Document in TESTING.md** - add to this guide
+**Test Model Matrix by System RAM:**
+
+| System RAM | Test Models | Purpose |
+|------------|-------------|---------|
+| **16GB**   | Qwen2.5-0.5B-Instruct-4bit<br>Llama-3.2-1B-Instruct-4bit<br>Llama-3.2-3B-Instruct-4bit | Basic functionality, small model validation |
+| **32GB**   | + Phi-3-mini-4k-instruct-4bit<br>+ Mistral-7B-Instruct-v0.2-4bit<br>+ Mixtral-8x7B-Instruct-v0.1-4bit | Medium model validation, MoE architecture |
+| **64GB**   | + Qwen3-30B-A3B-Instruct-2507-4bit<br>+ Llama-3.3-70B-Instruct-4bit | Large model validation, context handling |
+| **96GB+**  | + Qwen3-Coder-480B-A35B-Instruct-4bit | Huge model validation, memory limits |
+
+### Implementation Approach (Future)
+
+**Test Structure:**
+```python
+@pytest.mark.server_real  # Future marker for real-model tests
+@pytest.mark.parametrize("model", get_safe_models_for_system())
+def test_model_generation_quality(model_name: str, ram_needed: int):
+    """Validate model generates appropriate responses."""
+    # Auto-skip if insufficient RAM
+    # Test actual generation quality
+    # Validate stop tokens work correctly
+    # Check memory cleanup
+```
+
+**Benefits:**
+- ✅ **Real-world validation** - Catches issues MLX stubs cannot
+- ✅ **Architecture diversity** - Tests across different model families
+- ✅ **Memory management** - Validates actual RAM usage patterns
+- ✅ **Performance benchmarking** - Real generation speed metrics
+- ✅ **RAM-aware** - Tests adapt to available system resources
+
+**Implementation Status:**
+- 🚧 **TODO for post-beta.4** - Requires real MLX integration in test environment
+- 📋 **Design preserved** - RAM-aware filtering logic documented for future use
+- 🎯 **Target**: Optional `pytest -m server_real` for comprehensive model validation
+
+---
+
+*MLX-Knife 2.0.0-beta.6*
diff --git a/docs/2.0-IMPLEMENTATION-GUIDE.md b/docs/2.0-IMPLEMENTATION-GUIDE.md
new file mode 100644
index 0000000..cdf8e9c
--- /dev/null
+++ b/docs/2.0-IMPLEMENTATION-GUIDE.md
@@ -0,0 +1,612 @@
+# 2.0 Server/Run Implementation Guide
+
+**Purpose**: Step-by-step guide for Sonnet sessions implementing server/run functionality  
+**Created**: 2025-09-10  
+**Target**: 2.0.0-beta.1-local through beta.3 (public)
+
+## Quick Reference for Sonnet
+
+### What You're Building
+- Port server/run functionality from 1.x (`main` branch) to 2.0 (`feature/2.0.0-alpha.1`)
+- Preserve 2.0's modular architecture (`mlxk2/core/`, `mlxk2/operations/`, `mlxk2/output/`)
+- Test-first approach using specifications in `docs/2.0-TEST-SPECIFICATIONS.md`
+
+### Key Files to Reference
+```bash
+# 1.x source files (use git show to view)
+git show main:mlx_knife/server.py          # FastAPI server implementation
+git show main:mlx_knife/mlx_runner.py      # MLX execution engine
+git show main:mlx_knife/reasoning_utils.py # Reasoning model support
+git show main:mlx_knife/cli.py             # CLI command definitions
+
+# 2.0 existing structure
+mlxk2/core/cache.py      # Extend with model detection
+mlxk2/operations/*.py    # Add run.py, serve.py, chat.py
+mlxk2/output/*.py        # Extend for streaming support
+mlxk2/cli.py            # Add new commands
+```
+
+## Implementation Steps
+
+### Step 1.0: Core Runner Implementation
+
+**File**: `mlxk2/core/runner.py`
+
+```python
+# Key components to port from mlx_runner.py:
+class MLXRunner:
+    """Core MLX model execution engine"""
+    
+    def __init__(self, model_name_or_path):
+        # Model loading logic
+        # Memory tracking
+        
+    def __enter__(self):
+        # Context manager entry
+        return self
+        
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        # CRITICAL: Cleanup even on exception
+        
+    def generate_streaming(self, prompt, **kwargs):
+        # Generator for token-by-token output
+        yield from self._generate_tokens(prompt, **kwargs)
+        
+    def generate_batch(self, prompt, **kwargs):
+        # Complete generation at once
+        return "".join(self.generate_streaming(prompt, **kwargs))
+```
+
+**Critical Requirements**:
+1. Context manager pattern for memory safety
+2. Separate streaming vs batch generation
+3. Stop token filtering (CHAT_STOP_TOKENS)
+4. Dynamic token limits based on model context
+
+### Step 1.1: Complete Run Command 
+
+**File**: `mlxk2/operations/run.py`
+
+```python
+from mlxk2.core.runner import MLXRunner
+
+def run_model(
+    model_spec: str,
+    prompt: Optional[str] = None,
+    stream: bool = True,
+    max_tokens: Optional[int] = None,
+    temperature: float = 0.7,
+    top_p: float = 0.9,
+    **kwargs
+):
+    """Execute model with prompt - supports both single-shot and interactive modes.
+    
+    Args:
+        model_spec: Model specification
+        prompt: Input prompt (None = interactive mode)
+        stream: Enable streaming output
+        max_tokens: Maximum tokens (None = full model context)
+        temperature: Sampling temperature
+        top_p: Top-p sampling parameter
+    """
+    with MLXRunner(model_spec) as runner:
+        # Interactive mode: no prompt provided
+        if prompt is None:
+            interactive_chat(runner, stream=stream, max_tokens=max_tokens, **kwargs)
+        else:
+            # Single-shot mode: prompt provided
+            single_shot_generation(runner, prompt, stream=stream, max_tokens=max_tokens, **kwargs)
+
+def interactive_chat(runner, stream=True, **kwargs):
+    """Interactive conversation mode with history tracking."""
+    print("Starting interactive chat. Type 'exit' or 'quit' to end.\n")
+    
+    conversation_history = []
+    
+    while True:
+        try:
+            user_input = input("You: ").strip()
+            
+            if user_input.lower() in ['exit', 'quit', 'q']:
+                print("\nGoodbye!")
+                break
+                
+            if not user_input:
+                continue
+                
+            # Add user message to conversation history
+            conversation_history.append({"role": "user", "content": user_input})
+            
+            # Format conversation using chat template
+            formatted_prompt = runner._format_conversation(conversation_history)
+            
+            # Generate response
+            print("\nAssistant: ", end="", flush=True)
+            
+            if stream:
+                # Streaming mode
+                response_tokens = []
+                for token in runner.generate_streaming(formatted_prompt, use_chat_template=False, **kwargs):
+                    print(token, end="", flush=True)
+                    response_tokens.append(token)
+                response = "".join(response_tokens).strip()
+            else:
+                # Batch mode
+                response = runner.generate_batch(formatted_prompt, use_chat_template=False, **kwargs)
+                print(response)
+            
+            # Add assistant response to history
+            conversation_history.append({"role": "assistant", "content": response})
+            print()  # Newline after response
+            
+        except KeyboardInterrupt:
+            print("\n\nChat interrupted. Goodbye!")
+            break
+        except Exception as e:
+            print(f"\n[ERROR] {e}")
+            continue
+
+def single_shot_generation(runner, prompt, stream=True, **kwargs):
+    """Single prompt generation."""
+    if stream:
+        for token in runner.generate_streaming(prompt, **kwargs):
+            print(token, end="", flush=True)
+        print()  # Final newline
+    else:
+        result = runner.generate_batch(prompt, **kwargs)
+        print(result)
+```
+
+**CLI Integration** (`mlxk2/cli.py`):
+```python
+# Run command parser
+run_parser = subparsers.add_parser("run", help="Run model with prompt")
+run_parser.add_argument("model", help="Model name to run")
+run_parser.add_argument("prompt", nargs="?", help="Input prompt (optional - triggers interactive mode if omitted)")
+run_parser.add_argument("--max-tokens", type=int, help="Maximum tokens to generate (default: full model context)")
+run_parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
+run_parser.add_argument("--top-p", type=float, default=0.9, help="Top-p sampling parameter")
+run_parser.add_argument("--no-stream", action="store_true", help="Disable streaming output (batch mode)")
+run_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+run_parser.add_argument("--verbose", action="store_true", help="Show detailed output")
+
+# Usage examples:
+# mlxk2 run model "prompt"                    # Single-shot streaming
+# mlxk2 run model "prompt" --no-stream        # Single-shot batch
+# mlxk2 run model                             # Interactive streaming  
+# mlxk2 run model --no-stream                 # Interactive batch
+```
+
+**Key Changes from Basic to Complete:**
+- ✅ **Interactive mode**: `prompt` parameter is now optional
+- ✅ **Conversation history**: Tracks full chat context
+- ✅ **Stream control**: `--no-stream` works in both modes
+- ✅ **Full context tokens**: No arbitrary limits for run command
+- ✅ **Chat template integration**: Uses model's native conversation format
+
+### Step 1.2: Beta.1 Completion
+
+**Complete the remaining Beta.1 requirements:**
+
+#### 1.2.1: Full Context Token Limits
+
+**File**: `mlxk2/core/runner.py`
+
+```python
+def _calculate_dynamic_max_tokens(self, server_mode: bool = False) -> int:
+    """Calculate dynamic max tokens based on model context and usage mode."""
+    if not self._context_length:
+        return 2048
+    
+    if server_mode:
+        # Server: half context for DoS protection
+        return self._context_length // 2
+    else:
+        # Run command: full context (user's own machine, be generous)
+        return self._context_length
+
+# Update generate_streaming and generate_batch to use:
+effective_max_tokens = max_tokens if max_tokens is not None else self._calculate_dynamic_max_tokens(server_mode=False)
+```
+
+#### 1.2.2: Ctrl-C Handling
+
+**Already implemented in our MLXRunner**: ✅
+- Signal handler in `__init__`
+- `_interrupted` flag checking during generation
+- Graceful interruption with user message
+
+#### 1.2.3: Interactive Mode Implementation
+
+### Server Model Caching (Hot‑Swap, kein Reload pro Prompt)
+
+Ziel: Die UX‑Verbesserung aus 1.1.1 beibehalten – der Server lädt Modelle nicht für jeden Prompt neu.
+
+- Mechanik:
+  - In `mlxk2/core/server_base.py` existiert ein globaler Runner‑Cache:
+    - `_model_cache: Dict[str, MLXRunner]` und `_current_model_path: Optional[str]`.
+    - `get_or_load_model(model_spec)`: gibt einen bestehenden `MLXRunner` zurück, falls bereits geladen; lädt nur bei Modellwechsel neu.
+    - Beim Wechsel wird der alte Runner unter Lock bereinigt (`runner.cleanup()`), dann der neue geladen (Hot‑Swap).
+  - Für den Server wird `MLXRunner(..., install_signal_handlers=False)` verwendet (keine Signal‑Handler‑Konflikte).
+- Verhalten:
+  - Gleiches Modell über mehrere Requests → kein Reload → zügige Antworten, stabile UX.
+  - Anderes Modell → altes Modell freigeben, neues laden (Hot‑Swap), weiterhin kein Reload pro Prompt.
+- Kontextlänge (Erinnerung):
+  - Run‑Command nutzt volle Kontextlänge; Server nutzt halbe Kontextlänge als DoS‑Schutz (`get_effective_max_tokens(..., server_mode=True)`).
+
+**File**: `mlxk2/operations/run.py` - Add missing methods:
+
+```python
+def _format_conversation(self, messages: List[Dict[str, str]]) -> str:
+    """Format conversation history into a prompt using chat template."""
+    if hasattr(self.tokenizer, 'chat_template') and self.tokenizer.chat_template:
+        try:
+            return self.tokenizer.apply_chat_template(
+                messages,
+                tokenize=False,
+                add_generation_prompt=True
+            )
+        except Exception:
+            # Fall back to legacy format
+            pass
+    
+    # Legacy Human:/Assistant: format
+    formatted_parts = []
+    for msg in messages:
+        role = msg["role"]
+        content = msg["content"]
+        if role == "system":
+            formatted_parts.append(f"System: {content}")
+        elif role == "user":
+            formatted_parts.append(f"Human: {content}")
+        elif role == "assistant":
+            formatted_parts.append(f"Assistant: {content}")
+    
+    return "\n\n".join(formatted_parts) + "\n\nAssistant: "
+```
+
+#### 1.2.4: Update CLI for Interactive Mode
+
+**File**: `mlxk2/cli.py`
+
+```python
+# Update run command argument parser
+run_parser.add_argument("prompt", nargs="?", help="Input prompt (optional - triggers interactive mode if omitted)")
+
+# Update run command handler
+elif args.command == "run":
+    result_text = run_model_enhanced(
+        model_spec=args.model,
+        prompt=args.prompt,  # Can be None for interactive mode
+        stream=not args.no_stream,
+        # ... other parameters
+    )
+```
+
+#### 1.2.5: Beta.1 Test Coverage
+
+**Files**: Complete test implementation for:
+- `tests_2.0/test_run_complete.py` - All run command scenarios
+- `tests_2.0/test_interactive_mode.py` - Conversation history and chat templates
+- `tests_2.0/test_token_limits.py` - Full context vs server context
+- `tests_2.0/test_ctrl_c_handling.py` - Interruption scenarios
+
+**Coverage Target**: 80% for run command functionality
+
+### Step 2.0: Server Implementation (Beta.2-local Core)
+
+**File**: `mlxk2/core/server_base.py`
+
+```python
+from fastapi import FastAPI
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+
+# OpenAI-compatible request/response models
+class ChatCompletionRequest(BaseModel):
+    model: str
+    messages: List[Dict[str, str]]
+    stream: Optional[bool] = False
+    max_tokens: Optional[int] = None
+    
+class ChatCompletionResponse(BaseModel):
+    choices: List[Dict]
+    model: str
+    usage: Dict
+```
+
+**File**: `mlxk2/operations/serve.py`
+
+```python
+def start_server(model=None, port=8000, host="127.0.0.1"):
+    """Start OpenAI-compatible API server"""
+    # 1. Create FastAPI app
+    # 2. Setup endpoints (/v1/chat/completions, /v1/models)
+    # 3. Handle streaming vs non-streaming with SSE
+    # 4. Model hot-swapping support
+    # 5. Half context token limits (DoS protection)
+```
+
+### Step 2.1: Beta.2 Parity Features
+
+#### 2.1.1: Reasoning Support (GPT-OSS/MXFP4)
+
+**CRITICAL**: This is already implemented in 1.1.1-beta.3 and must be ported for parity!
+
+**File**: `mlxk2/core/reasoning.py`
+
+```python
+# Port from mlx_knife/reasoning_utils.py (1.x main branch)
+class ReasoningExtractor:
+    """Extract reasoning from GPT-OSS/MXFP4 models"""
+    
+    PATTERNS = {
+        'gpt-oss': {
+            'reasoning': r'<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>',
+            'final': r'<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|$)',
+        }
+    }
+    
+class StreamingReasoningParser:
+    """Parse reasoning tokens in real-time"""
+    # Real-time token classification
+    # Format as **[Reasoning]** / **[Answer]**
+```
+
+**Integration**: 
+- Runner detects MXFP4/GPT-OSS models via `_is_reasoning_model()`
+- Formats output as **[Reasoning]** ... --- **[Answer]**
+- Server API includes reasoning in response metadata (optional)
+
+#### 2.1.2: Issue #30 - Gated Models Preflight
+
+**File**: `mlxk2/operations/pull.py`
+
+```python
+def preflight_repo_access(model_spec):
+    """Check repository access before download."""
+    try:
+        HfApi().model_info(repo_id, token=os.getenv("HUGGINGFACE_HUB_TOKEN"))
+    except HTTPError as e:
+        if e.response.status_code in [401, 403]:
+            return {"error": "Model requires authentication"}
+    return {"status": "accessible"}
+```
+
+## Testing Strategy
+
+### Test Organization
+```
+tests_2.0/
+├── test_runner_core.py         # Core MLXRunner tests
+├── test_run_command.py         # CLI run tests
+├── test_server_api.py          # OpenAI API compliance
+├── test_reasoning.py           # GPT-OSS reasoning
+└── test_chat_mode.py          # Interactive chat
+```
+
+### Test Fixtures to Use
+```python
+# From tests_2.0/conftest.py
+@pytest.fixture
+def temp_cache_dir():
+    """Isolated cache for testing"""
+    
+@pytest.fixture
+def mock_tiny_model():
+    """Minimal model for fast tests"""
+```
+
+## CRITICAL NOTES FOR SONNET
+
+### ⚠️ Open Issues to Fix During Port
+
+#### Issue #30: Gated Models Preflight Check [Beta.2]
+**Problem**: Pull von gated models startet Download, dann 403 → Cache pollution
+**Target**: 2.0.0-beta.2-local
+**Solution für 2.0**:
+```python
+# In mlxk2/operations/pull.py
+def preflight_repo_access(model_spec):
+    try:
+        HfApi().model_info(repo_id, token=os.getenv("HUGGINGFACE_HUB_TOKEN"))
+    except HTTPError as e:
+        if e.response.status_code in [401, 403]:
+            # Fail fast BEVOR Download
+            return {"error": "Model requires authentication. Please accept terms and set HUGGINGFACE_HUB_TOKEN"}
+```
+
+#### Ctrl-C Handling [Beta.1] (Nicht als Issue dokumentiert)
+**Problem**: Run/Server blockiert während Model-Generation, Ctrl-C funktioniert nicht
+**Target**: 2.0.0-beta.1-local (Core functionality!)
+**Solution für 2.0**:
+```python
+import signal
+import threading
+
+class MLXRunner:
+    def __init__(self):
+        self._interrupted = False
+        signal.signal(signal.SIGINT, self._handle_interrupt)
+    
+    def _handle_interrupt(self, signum, frame):
+        self._interrupted = True
+        # Generation-Loop checkt self._interrupted
+    
+    def generate_streaming(self):
+        for token in model.generate():
+            if self._interrupted:
+                yield "\n[Generation interrupted by user]"
+                break
+            yield token
+```
+
+### ⚠️ Model Loading & Caching
+**WICHTIG**: Der Server in 1.x cached Modelle im Memory. In 2.0:
+- Model-Cache global in `mlxk2/core/server_base.py`
+- NICHT bei jedem Request neu laden!
+- Hot-swapping = nur wenn anderes Modell requested
+
+### ⚠️ JSON vs Human Output (CLI-Ebene)
+**WICHTIG**: 2.0 hat BEIDE Output-Modi auf CLI-Ebene:
+- Default ohne `--json`: Human-readable output (wie 1.x)
+- Mit `--json`: JSON output auf stdout
+- Server API: Immer OpenAI-JSON Format (unabhängig von CLI)
+- Streaming: Technisch separate Implementierung (SSE für Server, direktes Token-Streaming für CLI)
+
+### ⚠️ Stop Tokens & Code-Sharing
+**DESIGN-PRINZIP**: Server baut maximal auf run-Funktionalität auf!
+```python
+# Runner implementiert die Core-Logik
+CHAT_STOP_TOKENS = ["\nHuman:", "\nAssistant:", "\nUser:", "\nYou:"]
+
+# Server nutzt Runner - KEINE Duplikation
+from mlxk2.core.runner import MLXRunner
+# Server ruft runner.generate_streaming() oder runner.generate_batch()
+```
+**VORTEIL**: Einmal richtig implementiert, überall korrekt
+
+### ⚠️ Test Models & RAM-aware Filtering
+**LOKALE TESTS**: RAM-aware Filtering aus 1.x BEIBEHALTEN!
+```python
+# Aus 1.x TESTING.md - diese Logik portieren:
+- 8GB Mac: Nur tiny models
+- 16GB Mac: Bis zu 7B models  
+- 32GB+ Mac: Alle models möglich
+```
+**GitHub CI**: Nicht möglich (keine Apple Silicon Runner)
+- Docs müssen klar sagen: "Lokale Tests only"
+- Badge "166/166 tests" bezieht sich auf lokale Ausführung
+
+## Common Pitfalls & Solutions
+
+### 1. Memory Leaks & Process Monitoring
+**Problem**: Model stays in memory after error / Zombie processes
+**Solution**: 
+- Context manager mit garantiertem cleanup in `__exit__`
+- Portiere Process-Monitoring aus 1.x beta.2:
+  - `test_server_functionality.py`: Server lifecycle tests
+  - Process guards gegen orphaned Python processes
+  - Automatic cleanup on Ctrl-C/SIGTERM
+
+### 2. Streaming vs Batch Inconsistency
+**Problem**: Different output between modes  
+**Solution**: Filter stop tokens in BOTH paths
+
+### 3. Token Limits
+**Problem**: Hardcoded limits truncate output  
+**Solution**: Dynamic limits aus 1.x (funktioniert gut!)
+```python
+# Von 1.x beibehalten:
+- max_tokens=None → Dynamische Limits basierend auf Model-Context
+- Explicit max_tokens → Respektieren
+- Formel aus 1.x mlx_runner.py übernehmen
+```
+**Mögliche Verbesserung**: Config-basierte Overrides für spezielle Modelle
+
+### 4. Model Path Resolution
+**Problem**: Can't find models in cache  
+**Solution**: Use existing `mlxk2/core/cache.py` resolution
+
+## Version Milestones
+
+### 2.0.0-beta.1-local
+**Step 1.0**: ✅ MLXRunner core engine  
+**Step 1.1**: ✅ Complete run command (single-shot + interactive)  
+**Step 1.2**: 🔄 Beta.1 completion
+- [ ] Full context token limits (no DoS protection)
+- [ ] Interactive mode implementation
+- [ ] CLI integration for interactive mode  
+- [ ] 80% test coverage
+- [x] **Ctrl-C handling** (already implemented)
+
+### 2.0.0-beta.2-local  
+**Goal**: 1.1.1-beta.3 parity + core stability
+**Step 2.0**: 🔄 Server implementation  
+**Step 2.1**: 🔄 Parity features (required for 1.x compatibility)
+- [ ] OpenAI-compatible API server
+- [ ] Half context token limits for server (DoS protection)
+- [ ] Model hot-swapping support  
+- [ ] SSE streaming endpoints
+- [ ] **Reasoning models (GPT-OSS/MXFP4)** ← ALREADY IN 1.1.1-beta.3!
+- [ ] Issue #30: Gated models preflight
+- [ ] Enhanced error handling and logging
+- [ ] Server lifecycle management (Ctrl-C, cleanup)
+- [ ] 90% test coverage
+
+### 2.0.0-beta.3 (public)
+**Goal**: Production-ready with 1.1.1-beta.3 complete parity
+- [ ] All core features stable and battle-tested
+- [ ] Performance optimized
+- [ ] Documentation complete
+- [ ] 95%+ test coverage
+- [ ] Integration testing with real-world scenarios
+
+### Beyond 2.0.0-beta.3 (Future Releases)
+**New features for post-beta.3 versions:**
+- **System Prompt CLI Support** (`--system` parameter) - not yet specified
+- Advanced reasoning model support (DeepSeek R1, QwQ, etc.)
+- Custom reasoning token markers (`--reasoning-start`, `--reasoning-end`)
+- Enhanced chat template system
+
+## Push Function Notes
+
+The `push` operation (experimental in alpha.3) remains functional throughout beta phases:
+- May receive fixes between beta versions
+- Minor enhancements possible
+- Not blocking for server/run implementation
+- Already working with user's workflow
+
+## Quick Commands for Development
+
+```bash
+# View 1.x implementation
+git show main:mlx_knife/server.py | less
+
+# Run 2.0 tests
+pytest tests_2.0/
+
+# Test specific functionality
+pytest tests_2.0/test_runner_core.py -v
+
+# Check coverage
+pytest tests_2.0/ --cov=mlxk2 --cov-report=term-missing
+
+# Create local beta tag (not pushed)
+git tag -a 2.0.0-beta.1-local -m "Initial server/run port"
+
+# Run local 2.0 version
+python -m mlxk2.cli run model "prompt"
+```
+
+## References for Each Step
+
+### Step 1.0 (Runner Core)
+- Source: `git show main:mlx_knife/mlx_runner.py`
+- Tests: `git show main:tests/unit/test_mlx_runner_memory.py`
+
+### Step 1.1 (Run Command)
+- Source: `git show main:mlx_knife/cli.py` (run_model function)
+- Tests: `git show main:tests/integration/test_run_command_advanced.py`
+
+### Step 2.0 (Server)
+- Source: `git show main:mlx_knife/server.py`
+- Tests: `git show main:tests/integration/test_server_functionality.py`
+
+### Step 3.0 (Reasoning)
+- Source: `git show main:mlx_knife/reasoning_utils.py`
+- Context: CLAUDE.md reasoning architecture section
+
+### Step 3.1 (Chat)
+- Source: Search for "interactive_chat" in main branch
+- Tests: Look for chat-related tests in integration
+
+## Success Criteria
+
+Each Sonnet session should:
+1. Write tests first (TDD)
+2. Implement minimal working version
+3. Verify tests pass
+4. Document any deviations from 1.x
+
+Remember: The goal is feature parity with 1.1.1-beta.3, not innovation. Port conservatively.
diff --git a/docs/2.0-TEST-SPECIFICATIONS.md b/docs/2.0-TEST-SPECIFICATIONS.md
new file mode 100644
index 0000000..6743fba
--- /dev/null
+++ b/docs/2.0-TEST-SPECIFICATIONS.md
@@ -0,0 +1,318 @@
+# 2.0 Server/Run Test Specifications
+
+**Purpose**: Abstract test specifications extracted from 1.x for implementation in 2.0  
+**Created**: 2025-09-10  
+**For**: Sonnet implementation sessions
+
+## Open Issues to Address
+
+### Issue #30: Gated Models Preflight
+- Test: Mock 403 response → Verify NO cache writes
+- Test: Clear error message with actionable guidance
+- Test: Successful auth → Normal pull flow
+
+### Ctrl-C Interruption Support
+- Test: Long generation → Ctrl-C → Clean interruption
+- Test: Server request → Ctrl-C → Graceful shutdown
+- Test: No zombie processes after interrupt
+
+## Core Principles
+
+1. **Test-First**: Write failing tests before implementation
+2. **Isolated Caches**: Use temp_cache_dir fixtures, never touch user cache
+3. **Abstract Contracts**: Test behaviors, not implementations
+4. **Model-Agnostic**: Use tiny test models where possible
+
+## Server API Contract Tests
+
+### 1. OpenAI Compatibility (`test_server_api_compliance.py`)
+
+```python
+class TestOpenAICompliance:
+    """Verify OpenAI API compatibility"""
+    
+    def test_models_endpoint(self):
+        # GET /v1/models
+        # Returns: {"data": [{"id": "model-name", "object": "model", ...}]}
+        
+    def test_chat_completions_basic(self):
+        # POST /v1/chat/completions
+        # Body: {"model": "...", "messages": [...], "stream": false}
+        # Returns: {"choices": [{"message": {"content": "..."}}]}
+        
+    def test_chat_completions_streaming(self):
+        # POST /v1/chat/completions with stream=true
+        # Returns: SSE stream with data: prefixed chunks
+        # Final: data: [DONE]
+        
+    def test_completions_endpoint(self):
+        # POST /v1/completions
+        # Body: {"model": "...", "prompt": "...", "stream": false}
+        # Returns: {"choices": [{"text": "..."}]}
+```
+
+### 2. Dynamic Token Management (`test_server_token_limits.py`)
+
+```python
+class TestDynamicTokens:
+    """Test model-aware token limits (Issue #15/16)"""
+    
+    def test_no_max_tokens_uses_dynamic(self):
+        # Given: Model with 8K context
+        # When: max_tokens=None in request
+        # Then: Server uses appropriate dynamic limit (~2000-4000)
+        
+    def test_respects_explicit_max_tokens(self):
+        # Given: Any model
+        # When: max_tokens=500 in request
+        # Then: Server respects explicit limit
+        
+    def test_large_context_models(self):
+        # Given: 30K+ context model
+        # When: max_tokens=None
+        # Then: Larger dynamic limit applied
+```
+
+### 3. Model Hot-Swapping (`test_server_model_switching.py`)
+
+```python
+class TestModelSwitching:
+    """Test model switching without restart"""
+    
+    def test_switch_between_models(self):
+        # Given: Server running with model A
+        # When: Request specifies model B
+        # Then: Model B loads, A unloads, response from B
+        
+    def test_concurrent_model_requests(self):
+        # Given: Multiple requests with different models
+        # Then: Proper queueing/switching without crashes
+```
+
+### 4. Stop Token Filtering (`test_server_stop_tokens.py`)
+
+```python
+class TestStopTokens:
+    """Test stop token handling (Issue #14, #20)"""
+    
+    def test_chat_stop_tokens_filtered(self):
+        # Given: Chat mode
+        # Then: "\nHuman:", "\nAssistant:" never in output
+        
+    def test_streaming_vs_batch_consistency(self):
+        # Given: Same prompt
+        # When: stream=true vs stream=false
+        # Then: Identical output (no extra tokens)
+```
+
+## Run Command Contract Tests
+
+### 1. Complete Run Command (`test_run_complete.py`)
+
+```python
+class TestRunBasic:
+    """Basic run command functionality"""
+    
+    def test_run_single_shot_streaming(self):
+        # mlxk run model "prompt"
+        # Returns: Generated text to stdout, token-by-token
+        
+    def test_run_single_shot_batch(self):
+        # mlxk run model "prompt" --no-stream
+        # Returns: Complete output at once
+        
+    def test_run_interactive_streaming(self):
+        # mlxk run model (no prompt)
+        # Triggers: Interactive chat mode with streaming responses
+        
+    def test_run_interactive_batch(self):
+        # mlxk run model --no-stream (no prompt)
+        # Triggers: Interactive chat mode with batch responses
+        
+    def test_run_full_context_tokens(self):
+        # mlxk run model "prompt"
+        # Uses: Full model context length (no DoS protection)
+        # Verify: max_tokens defaults to model's full context
+        
+    def test_conversation_history_tracking(self):
+        # Interactive mode maintains conversation context
+        # Each new input includes previous conversation
+        
+    def test_chat_template_integration(self):
+        # Uses model's native chat template for conversation formatting
+        # Falls back to Human:/Assistant: if no template available
+```
+
+### 2. Server Token Management (`test_server_tokens.py`)
+
+```python
+class TestServerTokens:
+    """Server-specific token limit behavior"""
+    
+    def test_server_half_context_protection(self):
+        # Server mode uses half model context for DoS protection
+        # Given: Model with 8K context
+        # Server: Uses max 4K tokens by default
+        # Run: Uses full 8K tokens by default
+        
+    def test_server_vs_run_token_limits(self):
+        # Verify different token policies:
+        # Run command: Full context (generous)
+        # Server API: Half context (defensive)
+```
+
+### 3. Reasoning Models (`test_reasoning_models.py`)
+
+```python
+class TestReasoningModels:
+    """GPT-OSS/MXFP4 reasoning support"""
+    
+    def test_gpt_oss_reasoning_detection(self):
+        # Model name contains "gpt-oss" or "mxfp4"
+        # Automatic reasoning extraction
+        
+    def test_reasoning_formatting(self):
+        # Output: **[Reasoning]** ... **[Answer]** ...
+        
+    def test_hide_reasoning_flag(self):
+        # mlxk run model "prompt" --hide-reasoning
+        # Shows only answer, no reasoning
+```
+
+### 4. Memory Management (`test_memory_safety.py`)
+
+```python
+class TestMemorySafety:
+    """Context manager and cleanup"""
+    
+    def test_context_manager_cleanup(self):
+        # Model loaded in context
+        # Automatic cleanup on exit/exception
+        
+    def test_exception_safety(self):
+        # Exception during generation
+        # Resources still cleaned up
+```
+
+## Show Command Enhancements
+
+### Quantization Display (`test_show_quantization.py`)
+
+```python
+class TestShowQuantization:
+    """Enhanced quantization info (beta.3)"""
+    
+    def test_mxfp4_detection(self):
+        # Config has quantization.mode = "mxfp4"
+        # Shows: "Advanced mode 'mxfp4' (requires MLX ≥0.29.0)"
+        
+    def test_gguf_variants(self):
+        # Multiple .gguf files
+        # Lists all variants with sizes
+        
+    def test_precision_display(self):
+        # Shows: int4, int8, gguf, etc.
+```
+
+## Test Data Requirements
+
+### ⚠️ CRITICAL: Test Model Strategy
+
+**NIEMALS** user cache für Tests verwenden! Immer `temp_cache_dir` fixture!
+
+### Minimal Test Models
+```yaml
+tiny-models:
+  - hf-internal-testing/tiny-random-gpt2  # 12MB, for basic tests
+  - local-mock-models/fake-mxfp4-model     # Mock config.json only
+  - local-mock-models/fake-reasoning-model # Mock with reasoning markers
+
+real-models-optional:  # For @pytest.mark.server tests only
+  - mlx-community/Phi-3-mini-4k-instruct-4bit
+  - gpt-oss-20b-MXFP4-Q8  # For reasoning tests
+```
+
+## Implementation Priority
+
+### Priority A: Beta.1 - Complete Run Command (CRITICAL - Must Have)
+1. `mlxk2/core/runner.py` - MLX execution engine ✅
+2. Single-shot run: `mlxk2 run model "prompt"` ✅
+3. Interactive run: `mlxk2 run model` (no prompt)
+4. Streaming and batch modes for both
+5. Full context token limits (no DoS protection)
+6. Conversation history tracking
+7. Chat template integration
+8. Ctrl-C handling
+
+### Priority B: Beta.2 - Server Implementation (HIGH - Should Have) 
+1. OpenAI-compatible API server
+2. Half context token limits for server (DoS protection)
+3. Model hot-swapping support
+4. SSE streaming in server endpoints
+5. Reasoning model support
+6. System prompt support
+
+### Priority C: Beta.3 - Advanced Features (MEDIUM - Could Have)
+1. Performance optimizations
+2. Enhanced error handling
+3. Advanced reasoning features
+4. Issue #30: Gated models preflight
+
+## Critical Implementation Notes
+
+### 1. Streaming Architecture
+```python
+# 1.x uses generator pattern - PRESERVE THIS
+def generate_streaming(prompt, **kwargs):
+    for token in model.generate(...):
+        yield token
+        
+# Server SSE format - MUST MATCH
+data: {"choices": [{"delta": {"content": "token"}}]}
+data: [DONE]
+```
+
+### 2. Stop Token Management
+```python
+# Priority order (from 1.x mlx_runner.py)
+CHAT_STOP_TOKENS = ["\nHuman:", "\nAssistant:", "\nUser:", "\nYou:"]
+
+# 1. Check model's native stop tokens first
+# 2. Add chat stop tokens as fallback
+# 3. Filter from output in both streaming and batch
+```
+
+### 3. Model Loading Pattern
+```python
+# Context manager pattern from 1.x - CRITICAL
+class MLXRunner:
+    def __enter__(self):
+        self.load_model()
+        return self
+        
+    def __exit__(self, ...):
+        self.cleanup()  # MUST cleanup even on exception
+```
+
+## Version Strategy
+
+### Local Git Tags (Not Published)
+- `2.0.0-beta.1-local` - Basic server/run port
+- `2.0.0-beta.2-local` - Full reasoning support
+
+### Public Release
+- `2.0.0-beta.3` - First public beta (fully tested)
+
+## Gotchas for Sonnet Sessions
+
+1. **Don't forget MLX version checks**: MXFP4 requires MLX ≥0.29.0
+2. **Test with isolated caches**: Never assume user has models
+3. **Preserve 1.x CLI interface**: Same commands, same flags
+4. **Keep modular boundaries**: Core vs Operations vs Output
+5. **Test streaming separately**: Different code paths
+
+## References
+
+- 1.x source: `git show main:mlx_knife/server.py`
+- 1.x tests: `git show main:tests/integration/test_server_functionality.py`
+- Test patterns: `tests_2.0/conftest.py` for fixtures
\ No newline at end of file
diff --git a/docs/ADR/ADR-001-json-api-strategy.md b/docs/ADR/ADR-001-json-api-strategy.md
new file mode 100644
index 0000000..1c1391f
--- /dev/null
+++ b/docs/ADR/ADR-001-json-api-strategy.md
@@ -0,0 +1,178 @@
+# ADR-001: MLX-Knife 2.0 Migration Path to JSON-First Architecture
+
+## Status
+**Accepted** - 2025-08-28
+
+**Implementation Status:**
+- ✅ Clean-room 2.0 implementation complete (Sessions 1-3)
+- ✅ JSON-first architecture validated
+- ✅ Parallel deployment strategy documented
+- ✅ Broke-cluster integration ready
+
+## Context
+
+MLX-Knife 1.1.0 has achieved stability with 150/150 tests passing, but faces architectural challenges:
+- `cache_utils.py` contains 1000+ lines causing ~4000 tokens per Claude interaction
+- Dual output format (human + JSON) would add complexity 
+- Refactoring existing code risks breaking stable functionality
+- broke-cluster project needs scriptable JSON API for automated model management
+
+## Decision
+
+We will create MLX-Knife 2.0 as a **clean-room implementation** with JSON-first architecture, maintaining the robust maintenance functions while simplifying the codebase.
+
+## Migration Path
+
+### Phase 1: Alpha Foundation 
+**Version: 2.0.0-alpha**
+- Feature-complete JSON-only implementation
+- All 5 commands: list, show, pull, rm, health
+- 100% test coverage (45/45 passing)
+- Clean modular architecture
+- No server/run functionality (JSON-only scope)
+
+### Phase 2: Beta Validation (6-8 weeks)
+**Version: 2.0.0-beta**
+- All alpha features with production-grade testing
+- Performance benchmarks with large caches
+- Robust broke-cluster integration validation
+- Still JSON-only (no server/run)
+
+### Phase 3: Feature Parity (Release Candidate)
+**Version: 2.0.0-rc**  
+- Add server functionality from 1.x
+- Add run/chat functionality
+- Full feature parity with MLX-Knife 1.x
+- Human-readable output via CLI layer 
+- All features JSON-first design
+- No dual output logic
+
+### Phase 4: Test Suite Migration (Week 5)
+**Version: 2.0.0-beta2**
+- New test suite for JSON output
+- Compatibility tests against 1.1.0
+- Edge case coverage (from ADR-002)
+- Target: 50-70 focused tests vs 150 in 1.x
+
+### Phase 5: Production Ready (Month 2)
+**Version: 2.0.0-rc1 → 2.0.0**
+- Documentation complete
+- Migration guide from 1.x
+- broke-cluster validated in production
+- Community feedback incorporated
+
+## Architecture Principles
+
+### 1. Module Structure
+```
+mlx-knife-2/
+├── mlxk2/
+│   ├── core/
+│   │   ├── cache.py       # Cache path management
+│   │   └── model_resolution.py  # Model discovery & resolution
+│   ├── operations/
+│   │   ├── list.py        # List operation
+│   │   ├── health.py      # Health validation
+│   │   ├── show.py        # Show details (50 lines)
+│   │   ├── pull.py        # Download models (100 lines)
+│   │   └── remove.py      # Delete models (50 lines)
+│   ├── output/
+│   │   └── json.py        # JSON serialization (50 lines)
+│   └── cli.py             # CLI entry point (100 lines)
+```
+
+### 2. Dependency Rules
+- No circular dependencies
+- Core modules are dependency-free
+- Operations depend on core only
+- CLI depends on operations and output
+- Maximum dependency depth: 3 levels
+
+### 3. Code Limits
+- No file exceeds 200 lines
+- No function exceeds 50 lines
+- No class exceeds 100 lines
+- Clear separation of concerns
+
+## Implementation Guidelines
+
+### JSON Output Schema
+All commands return consistent JSON structure:
+```json
+{
+  "status": "success|error",
+  "command": "list|show|pull|rm|health",
+  "data": { /* command specific */ },
+  "error": null | { "type": "...", "message": "..." }
+}
+```
+
+### Error Handling
+- All errors return valid JSON
+- Exit codes remain compatible with 1.x
+- Detailed error messages for debugging
+
+### Backward Compatibility
+- Same cache directory structure
+- Same model naming conventions
+- Can run parallel to 1.1.0
+- No shared state between versions
+
+## Testing Strategy
+
+### Alpha Testing (alpha0-alpha1)
+- Manual testing against known models
+- Comparison with 1.1.0 output
+- broke-cluster integration testing
+
+### Beta Testing (beta1-beta2)
+- Automated test suite
+- Edge case coverage from ADR-002
+- Performance benchmarks
+
+### Release Testing (rc1)
+- Full compatibility validation
+- Community beta testing
+- Production deployment in broke-cluster
+
+## Success Metrics
+
+1. **Code Reduction**: <1000 lines total (vs 3000+ in 1.x)
+2. **Token Efficiency**: <500 tokens per file for Claude
+3. **Test Coverage**: >90% for critical paths
+4. **Performance**: Same or better than 1.1.0
+5. **broke-cluster**: Successful production deployment
+
+## Risks and Mitigations
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| Missing edge cases | High | Extract from 1.x tests (ADR-002) |
+| User migration resistance | Medium | Maintain 1.x support, clear benefits |
+| Feature gaps | Low | Incremental feature addition |
+| Performance regression | Medium | Benchmark against 1.1.0 |
+
+## Consequences
+
+### Positive
+- Clean, maintainable codebase
+- 80% reduction in Claude token usage
+- Perfect for automation/scripting
+- Faster development cycles
+- Clear architecture
+
+### Negative
+- Breaking change for users
+- Temporary feature gaps
+- Parallel maintenance (short-term)
+- Learning curve for JSON output
+
+## Decision Outcome
+
+Proceed with clean-room 2.0.0 implementation following the phased approach, starting with alpha0 for immediate broke-cluster value.
+
+## References
+- Issue #8: Model caching
+- Issue #26: Embeddings API  
+- JSON Feature Request document
+- mlx-knife-refactoring-plan.md (rejected approach)
\ No newline at end of file
diff --git a/docs/ADR/ADR-002-edge-cases.md b/docs/ADR/ADR-002-edge-cases.md
new file mode 100644
index 0000000..296c811
--- /dev/null
+++ b/docs/ADR/ADR-002-edge-cases.md
@@ -0,0 +1,339 @@
+# ADR-002: Edge Cases Learned from MLX-Knife 1.x Test Suite
+
+## Status
+**Accepted, Implementation In Progress** - 2025-08-28
+
+**Implementation Status:**
+- ✅ Edge cases identified and catalogued
+- ✅ Test infrastructure with isolated cache established
+- ❌ 10/45 tests failing - edge case validation incomplete
+- 🎯 **Session 4 Goal**: Complete edge case implementation and validation
+
+## Context
+
+MLX-Knife 1.x has 150+ tests covering numerous edge cases discovered during development. These tests represent critical knowledge about real-world usage patterns, failure modes, and subtle requirements that must be preserved in 2.0.
+
+## Extracted Edge Cases by Category
+
+### 1. Model Name Resolution
+
+**Critical Cases:**
+- **Short name expansion**: "Phi-3" → "mlx-community/Phi-3-mini-4k-instruct-4bit"
+- **Hash disambiguation**: When multiple models match, allow `#abc123` suffix
+- **Partial matching**: "Llama" matches all Llama models (ambiguous)
+- **Empty/whitespace names**: Must handle gracefully
+- **Invalid characters**: Names with multiple slashes, special chars
+- **Name length limits**: HuggingFace has 96 character limit
+
+**Implementation Requirements:**
+```python
+def resolve_model_name(name: str) -> tuple[str, Optional[str]]:
+    # Returns (model_name, commit_hash)
+    # Handle: "Phi-3", "Phi-3#abc123", "mlx-community/Phi-3", etc.
+    # Max 96 chars validation
+    # Graceful fallback for unknowns
+```
+
+### 2. Cache Directory Management
+
+**Critical Cases:**
+- **Round-trip conversion**: HF name ↔ cache dir must be bijective
+- **Special characters**: Org names with hyphens, dots
+- **Missing snapshots directory**: Model without snapshots/
+- **Multiple snapshots**: Same model, different commits
+- **Empty model directories**: Leftover from failed downloads
+- **Orphaned lock files**: .lock files without corresponding models
+
+**Implementation Requirements:**
+```python
+def cache_path_operations():
+    # Must handle:
+    # - models--org--name format
+    # - snapshots/<hash>/ structure
+    # - refs/ for branch tracking
+    # - .lock cleanup on operations
+```
+
+### 3. Health Checking
+
+**Critical Cases:**
+- **LFS pointer files**: Detect Git LFS placeholders (not actual weights)
+- **Truncated safetensors**: Partial downloads appearing valid
+- **Missing config.json**: Model without configuration
+- **Missing tokenizer files**: No tokenizer_config.json
+- **Framework detection**: MLX vs PyTorch vs Tokenizer-only
+- **Symlink handling**: Don't follow dangerous symlinks
+- **Race conditions**: Health check during active download
+
+**Framework Detection Logic (TRICKY!):**
+```python
+def detect_framework(model_path, hf_name):
+    # Quick win: mlx-community models are always MLX
+    if "mlx-community" in hf_name:
+        return "MLX"
+    
+    # Check actual files
+    has_safetensors = any(path.glob("*/*.safetensors"))
+    has_pytorch = any(path.glob("*/pytorch_model.bin"))
+    has_config = any(path.glob("*/config.json"))
+    total_size = get_model_size(model_path)
+    
+    # Edge case: Tokenizer-only "models" (< 10MB)
+    if total_size < 10 * 1024 * 1024:  # 10MB threshold
+        return "Tokenizer"
+    
+    # Priority order matters!
+    if has_safetensors and has_config:
+        return "MLX"  # Assume safetensors = MLX
+    elif has_pytorch:
+        return "PyTorch"
+    else:
+        return "Unknown"
+
+# PROBLEM: This heuristic fails for:
+# - Non-mlx-community MLX models
+# - Mixed framework models
+# - Models with both .safetensors and .bin files
+```
+
+**For 2.0:** 
+- Health checks should work for ALL frameworks
+- Don't filter by framework in health command
+- Show framework in output but don't block operations
+
+**LFS Pointer Detection Pattern:**
+```python
+def is_lfs_pointer(file_path):
+    # Check for:
+    # - File size < 1KB for .safetensors
+    # - Content starts with "version https://git-lfs"
+    # - "oid sha256:" in first 200 bytes
+```
+
+## Implementation Outcome (2.0 alpha)
+
+- Multi‑shard completeness is enforced strictly:
+  - If a safetensors index exists, every referenced shard must exist and be non‑empty; any missing or zero‑byte shard is unhealthy.
+  - Without an index, shard patterns like `model‑XXXXX‑of‑YYYYY.safetensors` are detected and the complete 1..Y sequence is required; subsets are unhealthy. Conservative policy: pattern‑only sharded models are considered unhealthy even if they appear complete, unless an index is present.
+- Partial/temporary markers (e.g., `.partial.tmp`) mark snapshots as unhealthy.
+- LFS pointers are detected recursively (including index‑referenced shard files) and flagged as unhealthy.
+- Invalid or missing `config.json` results in unhealthy status.
+- Test coverage includes deterministic isolated cases and opt‑in real‑cache validations; both confirm no false OK for incomplete multi‑shard states.
+
+### 4. Delete Operations (rm command)
+
+**Critical Cases (Issue #23 regression):**
+- **Force flag behavior**: `-f` must skip ALL confirmations
+- **Interactive prompts**: Must respect user input exactly
+- **Lock file cleanup**: Remove .lock files with model
+- **Partial deletion recovery**: Handle interrupted deletes
+- **Permission errors**: Read-only files, system dirs
+- **Non-existent models**: Graceful error messages
+
+**Implementation Requirements:**
+```python
+def remove_model(name: str, force: bool = False):
+    # MUST respect force flag completely
+    # Clean .lock files ALWAYS
+    # Atomic operation or rollback
+```
+
+### 5. Server Mode Edge Cases
+
+**Critical Cases (Issues #14, #15, #16):**
+- **Token limits**: Respect model's actual context length
+- **Self-conversation bug**: Messages accumulating incorrectly
+- **Streaming vs non-streaming**: End tokens must match
+- **Concurrent requests**: Model loading race conditions
+- **Port conflicts**: Handle "address already in use"
+- **SIGTERM handling**: Clean shutdown (Issue #18 known limitation)
+- **Memory management**: Proper cleanup after each request
+
+**Token Limit Strategy:**
+```python
+def get_safe_token_limit(model_path: Path, is_server: bool):
+    # Extract from config.json:
+    # - max_position_embeddings (priority 1)
+    # - n_positions (priority 2) 
+    # - context_length (priority 3)
+    # Server mode: min(model_limit, 8192)  # DOS protection
+    # Interactive: model_limit or 4096 default
+```
+
+### 6. Download & Network Operations
+
+**Critical Cases:**
+- **Network timeouts**: Graceful handling, clear messages
+- **Partial downloads**: Resume or clean restart
+- **Invalid repo names**: Early validation before network call
+- **Rate limiting**: Respect HF rate limits
+- **Disk space**: Check before download starts
+- **Concurrent downloads**: Prevent duplicate downloads
+
+### 7. Process Lifecycle
+
+**Critical Cases:**
+- **Zombie processes**: Clean up on parent crash
+- **Resource leaks**: File handles, network connections
+- **Lock starvation**: Prevent infinite lock waiting
+- **Signal handling**: SIGINT, SIGTERM, SIGKILL
+- **Timeout handling**: Commands taking too long
+
+### 8. Test Isolation Requirements
+
+**Critical Cases:**
+- **Cache pollution**: Tests must NEVER touch user's ~/.cache/huggingface
+- **Temporary test cache**: Use isolated temp directory for ALL tests
+- **Parallel execution**: Tests must be independent
+- **Cleanup verification**: Ensure complete cleanup after each test
+- **Mock boundaries**: What to mock vs real
+- **Deterministic output**: Consistent across runs
+
+**Implementation Pattern:**
+```python
+# conftest.py - CRITICAL for 2.0 tests
+import tempfile
+import os
+from pathlib import Path
+
+@pytest.fixture
+def isolated_cache(monkeypatch):
+    """EVERY test MUST use this to avoid user cache pollution."""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        test_cache = Path(tmpdir) / "huggingface/hub"
+        test_cache.mkdir(parents=True)
+        
+        # Override environment for complete isolation
+        monkeypatch.setenv("HF_HOME", str(tmpdir / "huggingface"))
+        monkeypatch.setenv("TMPDIR", str(tmpdir))
+        
+        # Also patch any direct references in code
+        monkeypatch.setattr("mlxk2.core.cache.CACHE_ROOT", test_cache.parent)
+        monkeypatch.setattr("mlxk2.core.cache.MODEL_CACHE", test_cache)
+        
+        yield test_cache
+        
+        # Cleanup is automatic with TemporaryDirectory
+
+# EVERY test MUST use it:
+def test_list_models(isolated_cache):
+    # This test cannot pollute user cache
+    result = list_models()
+    assert result["models"] == []
+```
+
+## JSON-Specific Edge Cases for 2.0
+
+### 1. Output Consistency
+- **Error format**: Always valid JSON even on crash
+- **Partial results**: Stream vs complete JSON
+- **Unicode handling**: Proper escaping in JSON
+- **Large outputs**: Streaming JSON for big lists
+- **Number precision**: Float representation
+
+### 2. Backward Compatibility
+- **Exit codes**: Must match 1.x behavior
+- **Error messages**: Similar enough for scripts
+- **Model resolution**: Same fuzzy matching
+- **Path handling**: Same cache structure
+
+## Implementation Checklist for 2.0
+
+### Phase 1: Core Robustness (alpha0)
+- [ ] Model name validation (96 char limit)
+- [ ] Cache directory round-trip conversion
+- [ ] Basic health checks (file existence)
+- [ ] Force flag in rm command
+- [ ] JSON error handling
+
+### Phase 2: Advanced Edge Cases (alpha1)
+- [ ] LFS pointer detection
+- [ ] Hash disambiguation
+- [ ] Lock file cleanup
+- [ ] Partial match warnings
+- [ ] Network timeout handling
+
+### Phase 3: Server Integration (beta1)
+- [ ] Token limit extraction
+- [ ] Memory cleanup patterns
+- [ ] Streaming JSON support
+- [ ] Concurrent request handling
+
+## Testing Strategy for 2.0
+
+### Unit Tests (30-40 tests)
+Focus on pure functions:
+- Name resolution logic
+- Path conversions
+- JSON serialization
+- Error formatting
+
+### Integration Tests (20-30 tests)
+Real operations with mock cache:
+- Health checks on various states
+- Delete operations with locks
+- List with mixed frameworks
+- Error recovery paths
+
+### No Need to Port
+- UI/formatting tests (JSON-only now)
+- Server streaming format tests
+- Terminal color tests
+- Progress bar tests
+
+## Patterns to Preserve
+
+### 1. Fail-Fast with Clear Errors
+```python
+if len(model_name) > 96:
+    return {
+        "status": "error",
+        "error": {
+            "type": "ValidationError",
+            "message": f"Model name too long: {len(model_name)}/96"
+        }
+    }
+```
+
+### 2. Defensive File Operations
+```python
+# Always check exists before operations
+if not path.exists():
+    return None  # Don't throw, return None
+    
+# Always use Path, not strings
+path = Path(model_path)
+```
+
+### 3. Atomic Operations
+```python
+# Either complete fully or rollback
+try:
+    shutil.rmtree(model_path)
+    remove_lock_files(model_name)
+except Exception as e:
+    # Log but don't partially delete
+    pass
+```
+
+## Key Learnings
+
+1. **Users expect fuzzy matching** - "Phi" should find Phi models
+2. **Force flags must be absolute** - No prompts when -f is used
+3. **Lock files cause problems** - Always clean them up
+4. **LFS pointers fool naive checks** - Must detect explicitly
+5. **Token limits prevent crashes** - Respect model capabilities
+6. **Health checks save debugging time** - Worth the complexity
+7. **Network operations fail often** - Timeout and retry logic essential
+8. **Cache corruption is common** - Robust detection critical
+
+## Decision Outcome
+
+These edge cases represent hard-won knowledge from production usage. The 2.0 implementation MUST handle these cases correctly to maintain user trust and functionality, even while moving to JSON-only output.
+
+## References
+- Issue #14: Self-conversation bug
+- Issue #15/16: Token limit race conditions
+- Issue #18: Server signal handling
+- Issue #23: Force flag regression
+- Test suite: 150+ tests in tests/
diff --git a/docs/ADR/ADR-003-Server-Run-Port-to-2.0.md b/docs/ADR/ADR-003-Server-Run-Port-to-2.0.md
new file mode 100644
index 0000000..13c102e
--- /dev/null
+++ b/docs/ADR/ADR-003-Server-Run-Port-to-2.0.md
@@ -0,0 +1,215 @@
+# ADR-003: Server and Run Functionality Port from 1.x to 2.0
+
+**Status**: Accepted  
+**Date**: 2025-09-10  
+**Decision Makers**: mzau, Claude
+
+## Context
+
+The 2.0 branch (`feature/2.0.0-alpha.1`) currently lacks the server and run functionality that has been significantly enhanced in the 1.x branch through versions 1.1.1-beta.2 and 1.1.1-beta.3. This includes:
+
+1. **Server functionality** (1.x: `mlx_knife/server.py`):
+   - OpenAI-compatible REST API (`/v1/chat/completions`, `/v1/completions`)
+   - Real-time streaming support via SSE
+   - Model hot-swapping and caching
+   - Dynamic token limits based on model context length
+
+2. **Run functionality** (1.x: `mlx_knife/mlx_runner.py`):
+   - Direct MLX model execution with streaming
+   - Interactive chat mode with conversation history
+   - Memory management with context managers
+   - Stop token filtering and handling
+
+3. **Reasoning support** (1.x: `mlx_knife/reasoning_utils.py` - NEW in beta.3):
+   - GPT-OSS/MXFP4 reasoning model support
+   - Pattern-based reasoning extraction
+   - Formatted output with `**[Reasoning]**` / `**[Answer]**` sections
+   - `--hide-reasoning` flag for answer-only output
+
+4. **Enhanced features from beta.2/beta.3**:
+   - MXFP4 quantization support (requires MLX ≥0.29.0)
+   - Lenient MLX detection for private repos (Issue #31)
+   - README/tokenizer-based model type detection
+   - Strict health checks for multi-shard models (Issue #27)
+   - Enhanced `show` command with detailed quantization display:
+     - MXFP4 mode detection with version requirements
+     - GGUF variants listing with sizes
+     - Precision info extraction (int4, int8, gguf, etc.)
+
+The 2.0 architecture already includes:
+- Modular structure (`mlxk2/core/`, `mlxk2/operations/`, `mlxk2/output/`)
+- JSON-first API with schema versioning
+- Human output backend (despite docs suggesting JSON-only for beta)
+- Enhanced testing infrastructure with isolated caches
+
+## Decision
+
+We will port the server and run functionality from 1.x to 2.0 following a **test-driven, modular approach** that preserves the 2.0 architecture advantages while incorporating all 1.x enhancements.
+
+### Port Strategy
+
+*Note: "Week 1-4" bezeichnet die logische Reihenfolge, nicht reale Kalenderwochen*
+
+#### Week 1: Test Suite Extraction and Abstraction
+1. **Extract test specifications** from 1.x test suite:
+   - Server tests: `test_server_functionality.py`, `test_issue_14.py`, `test_issue_15_16.py`, `test_end_token_issue.py`
+   - Run tests: `test_run_command_advanced.py`, `test_mlx_runner_memory.py`
+   - Reasoning tests: Tests for GPT-OSS/MXFP4 formatting
+
+2. **Create abstract test specifications** in 2.0:
+   - Document expected behaviors, not implementation details
+   - Define API contracts and edge cases
+   - Create test matrices for different model types
+
+3. **Implement 2.0-native tests first**:
+   - Write tests against the expected 2.0 API
+   - Use 2.0's isolated cache infrastructure
+   - Ensure tests fail initially (red phase of TDD)
+
+#### Week 2: Modular Implementation
+1. **Core modules** (`mlxk2/core/`):
+   - `runner.py`: MLX model execution engine (from `mlx_runner.py`)
+   - `reasoning.py`: Reasoning extraction utilities (from `reasoning_utils.py`)
+   - `server_base.py`: FastAPI server foundation
+
+2. **Operations modules** (`mlxk2/operations/`):
+   - `run.py`: CLI run command implementation (inkl. Interactive Chat; kein separates `chat.py`)
+   - `serve.py`: Server startup and management (Supervisor als Default)
+
+3. **Output adaptors** (`mlxk2/output/`):
+   - Extend existing JSON/Human output for server responses
+   - Add streaming output support for both formats
+
+#### Week 3: Feature Integration
+1. **Port enhancements in priority order**:
+   - Basic run/server functionality (MVP for 2.0.0-beta.1)
+   - Reasoning support (GPT-OSS/MXFP4)
+   - Dynamic token limits
+   - Enhanced model detection (Issue #31)
+   - Strict health checks (already partially in 2.0)
+
+2. **Maintain backward compatibility**:
+   - Same CLI interface as 1.x
+   - Same OpenAI API endpoints
+   - Same web UI (update version strings)
+
+### Test-Driven Approach
+
+```python
+# Example: Abstract test specification for server
+class ServerAPIContract:
+    """Define expected server behaviors independent of implementation"""
+    
+    def test_chat_completions_streaming(self):
+        """Server must support streaming chat completions"""
+        # Given: A running server with a loaded model
+        # When: POST to /v1/chat/completions with stream=true
+        # Then: Receive SSE stream with data: prefixed chunks
+        
+    def test_model_hot_swapping(self):
+        """Server must support switching models without restart"""
+        # Given: Server running with model A
+        # When: Request with different model B
+        # Then: Model B loads and responds correctly
+        
+    def test_dynamic_token_limits(self):
+        """Server must respect model context limits"""
+        # Given: Model with 8K context
+        # When: No max_tokens specified
+        # Then: Use appropriate dynamic limit
+```
+
+### Implementation Mapping
+
+| 1.x Component | 2.0 Location | Notes |
+|--------------|--------------|-------|
+| `mlx_knife/server.py` | `mlxk2/core/server_base.py` + `mlxk2/operations/serve.py` | Split core from CLI |
+| `mlx_knife/mlx_runner.py` | `mlxk2/core/runner/` | Core execution engine (modularisiert als Paket) |
+| `mlx_knife/reasoning_utils.py` | `mlxk2/core/reasoning.py` | Pattern-based extraction |
+| `mlx_knife/cache_utils.py` additions | `mlxk2/core/cache.py` extensions | Model detection + quantization display |
+| Server CLI logic | `mlxk2/operations/serve.py` | Command implementation |
+| Run CLI logic | `mlxk2/operations/run.py` | Command implementation (inkl. Interactive) |
+
+## Consequences
+
+### Positive
+- **Test coverage maintained**: All 1.x test scenarios covered in 2.0
+- **Architecture preserved**: 2.0's modular structure enhanced, not compromised
+- **Feature parity**: 2.0.0-beta.1 will be feature-complete vs 1.1.1
+- **Clean separation**: Core logic separate from CLI/output concerns
+- **Future-proof**: Easier to add new output formats or APIs
+
+### Negative
+- **Development time**: Test-first approach takes longer initially
+- **Temporary duplication**: Some code exists in both branches during transition
+- **Complexity**: More files/modules than 1.x monolithic approach
+
+### Neutral
+- **Version jump to beta.1**: Justified by feature completeness and "human" backend
+- **Push feature**: Remains experimental/undefined as per current state
+- **License split**: Maintained (1.x MIT, 2.x Apache-2.0)
+
+## Implementation Checklist
+
+*Chronologische Reihenfolge - kann parallel oder iterativ bearbeitet werden*
+
+### Week 1: Test Infrastructure
+- [ ] Extract server test specifications from 1.x
+- [ ] Extract run/chat test specifications from 1.x
+- [ ] Create abstract test contracts in 2.0
+- [ ] Write failing tests for all core features
+
+### Week 2: Core Implementation
+- [ ] Implement `mlxk2/core/runner.py`
+- [ ] Implement `mlxk2/core/server_base.py`
+- [ ] Implement `mlxk2/core/reasoning.py`
+- [ ] Extend `mlxk2/core/cache.py` with detection
+
+### Week 3: Operations Layer
+- [ ] Implement `mlxk2/operations/run.py`
+- [ ] Implement `mlxk2/operations/chat.py`
+- [ ] Implement `mlxk2/operations/serve.py`
+- [ ] Update CLI in `mlxk2/cli.py`
+
+### Week 4: Integration & Polish
+- [x] Integrate output formatters (Human + JSON)
+- [x] Full 2.0 default test suite passing (containing server-minimaltests)
+- [x] Documentation updates (CLAUDE.md, TESTING.md)
+
+## Release Criteria for 2.0.0-beta.1
+
+Based on this port and existing 2.0 features:
+
+### Must Have (Beta.1)
+- ✅ JSON-first API (already in alpha.3)
+- ✅ Human output backend (already in alpha.3)
+- ✅ Enhanced model detection (already in alpha.3)
+- ✅ Server functionality with OpenAI API (Supervisor, SSE, Hot‑Swap)
+- ✅ Run command with streaming
+- ✅ Interactive chat mode
+- ✅ Basic reasoning support (GPT-OSS)
+- [ ] 90%+ test coverage
+
+### Should Have (Beta.2)
+- [ ] Full reasoning features (hide-reasoning flag)
+- [ ] Advanced token management
+- [ ] Performance optimizations
+- [ ] Extended test coverage (95%+)
+- [x] Issue #30 Preflight (premature integration)
+
+### Could Have (Future)
+- [ ] Custom reasoning token configuration
+- [ ] Multi-model server support
+- [ ] Push functionality (currently experimental)
+- [ ] Web UI (not part of 2.0‑port)
+
+### Not in Scope for Port
+- **System prompt CLI support** (`--system` parameter): This is a future enhancement not yet implemented in 1.x. Decision on this feature will be made after successful server & run functional parity with 1.1.1 is achieved. See CLAUDE.md for ongoing discussion.
+
+## References
+
+- CHANGELOG.md: Complete feature history of 1.1.1-beta.2 and beta.3
+- TESTING.md: 1.x test structure and categories
+- Issue #27: Strict health checks for multi-shard models
+- Issue #31: Lenient MLX detection for private repos
+- CLAUDE.md: Current context and TODOs
diff --git a/docs/ADR/ADR-004-Enhanced-Error-Logging.md b/docs/ADR/ADR-004-Enhanced-Error-Logging.md
new file mode 100644
index 0000000..8926f6d
--- /dev/null
+++ b/docs/ADR/ADR-004-Enhanced-Error-Logging.md
@@ -0,0 +1,66 @@
+# ADR-004: Enhanced Error Handling & Logging
+
+Status: Accepted (Implementation: beta.5+)
+
+Note: Error type taxonomy and rate-limiting parameters may be refined during implementation based on real-world usage patterns.
+
+Context
+- 2.0 currently has working error paths and minimal logs. We want a unified error envelope, structured logging, and consistent HTTP/CLI mapping without overcomplicating local workflows.
+
+Decision
+- Implement a unified error envelope and structured logging after beta.3, with opt-in JSON logs and basic redaction. Preserve current defaults for developer ergonomics.
+
+Scope (phase 1)
+- Error JSON (CLI/Server): {"status":"error","error":{"type","message","detail"?,"retryable"?}, "data"?}
+- Server HTTP mapping: 400/404/503 stable (already in place), graceful SSE error close.
+- Logging: INFO/WARN/ERROR (+DEBUG), optional JSON logs via env `MLXK2_LOG_JSON=1`; redact secrets.
+- Correlation: `request_id` (UUID4) included in responses and logs.
+
+Out of scope (for now)
+- Embeddings/other endpoints, distributed tracing, external log backends.
+
+Open Questions
+- Error.type taxonomy and granularity vs. stability.
+- Default log format (plain) vs. JSON ergonomics; env/flag naming.
+- Rate-limiting repeated errors; scope and counters.
+
+Acceptance (high level)
+- Tests assert error.type ↔ HTTP status mapping, presence/shape of `request_id`, SSE error termination, and redaction of tokens.
+
+Specification (phase 1)
+- Error envelope (CLI/Server consistent)
+  - JSON shape: {"status":"error","error":{"type": <enum>, "message": <str>, "detail": <obj|null>, "retryable": <bool|null>}, ...}
+  - Standardized type values: access_denied, model_not_found, ambiguous_match, download_failed, validation_error, push_operation_failed, server_shutdown, internal_error.
+  - Correlation: request_id/trace_id (UUID) included in responses and logs.
+
+- Logging (structured, level-based output)
+  - Levels: INFO (startup, model switch), WARN (preflight warnings, recoveries), ERROR (unhandled/500), DEBUG (enabled by --verbose).
+  - Formats: plain text by default; optional JSON logs via MLXK2_LOG_JSON=1 (fields: ts, level, msg, request_id, route, model, duration_ms).
+  - Redaction: filter sensitive data (HF_TOKEN, user-specific paths, access URLs).
+  - Rate limiting: suppress duplicate error floods (e.g., max 1/5s with counters).
+
+- Server specifics
+  - HTTP mapping: 503 during shutdown (_shutdown_event), 404 on model-load errors, 400 for invalid requests (e.g., multiple prompts in completions).
+  - Streaming errors: final SSE chunk carries error field, then [DONE]; interrupts emit a clear marker and close cleanly.
+  - Hot-swap logging: "Switching to model", "Model loaded", cleanup results (freed memory, optional).
+
+Rollout plan
+- Beta.3: ✅ Keep current behavior; add tests (done) and reduce noisy logs (done).
+- Beta.4 (KW 41 2024): Runtime Check (Issue #36) - separate bugfix, not part of ADR-004.
+- Beta.5+ (Q4 2024): ADR-004 Phase 1 implementation
+  - Add request_id generation and propagation
+  - Unified error envelope for HTTP errors
+  - Optional JSON logs via env `MLXK2_LOG_JSON=1`
+  - Minimal redaction (HF_TOKEN, paths)
+- Beta.5+ (follow-up): SSE error finalization parity across endpoints; rate-limit error floods.
+- 2.0.0 Final (Q1 2026): Production-ready with full error/logging infrastructure.
+
+- CLI operations
+  - Exit codes: success=0; any status:error → 1 (no special codes per type).
+  - --verbose: buffer hub/server logs in hf_logs[]; do not mix progress logs into JSON; human mode shows concise summary (+URL/commit with --verbose).
+  - Preflight (#30): preflight_warning as data field; WARN log-level; access_denied is a hard error.
+
+- Tests (coverage)
+  - Mapping tests: error.type ↔ HTTP status; request_id present; optional JSON logs.
+  - Streaming failure scenarios: interrupt and exception → proper finalization/marker.
+  - Redaction tests: HF_TOKEN never appears in logs/JSON in cleartext.
diff --git a/docs/ADR/ADR-005-Clone-Implementation-Beta3.md b/docs/ADR/ADR-005-Clone-Implementation-Beta3.md
new file mode 100644
index 0000000..6bfa4ad
--- /dev/null
+++ b/docs/ADR/ADR-005-Clone-Implementation-Beta3.md
@@ -0,0 +1,250 @@
+# ADR-005: mlxk2 clone Implementation for 2.0.0-beta.3
+
+## Status
+**DEPRECATED** - 2025-09-15
+
+**Superseded by:** ADR-006 (Clone Implementation - Revised Strategy)
+
+**Reason for Deprecation:** Critical findings revealed that HuggingFace Hub's `local_dir` parameter does not provide reliable cache isolation and can corrupt existing cache entries. The assumptions about cache isolation in this ADR were incorrect.
+
+## Context
+
+GitHub Issue #29 requests Ollama CLI-like "push" functionality for MLX Knife. The push function was successfully implemented in 2.0.0-alpha.1, but analysis revealed a critical workflow gap: there is no `clone` command to create writable workspaces from HuggingFace models.
+
+### Current Workflow Limitations
+
+**Missing Link in Author Workflow:**
+```bash
+# Desired workflow - currently incomplete
+mlxk2 clone org/model@revision ./workspace    # ❌ Missing
+mlxk2 health ./workspace                      # ✅ Exists
+mlxk2 push ./workspace org/my-model --private # ✅ Exists
+```
+
+**Two Key Use Cases Identified:**
+1. **Fork-Modify-Push:** `clone` existing HF model → edit → `push` to new repo
+2. **Author-Generated Models:** Native MLX training → workspace → `health` → `push`
+
+### Technical Analysis Results
+
+**MLX Model Compatibility:** ✅ No additional work needed
+- Native MLX models use identical structure to HuggingFace models (config.json + .safetensors)
+- Existing `_analyze_workspace()` in push.py already validates MLX-native models correctly
+- No .npz/.mlx extensions - MLX uses .safetensors with metadata={"format": "mlx"}
+
+**Implementation Effort:** Very Low (~2 hours)
+- Can reuse 90% of existing `pull.py` logic (snapshot_download)
+- Only difference: download to custom local_dir instead of HF cache
+- Test patterns already established for push (21 tests with offline/online/spec coverage)
+
+### JSON API Schema Impact
+
+**Required Changes for JSON API 0.1.4:**
+- Schema update: Add "clone" to command enum in `docs/json-api-schema.json:9`
+- Version bump: `mlxk2/spec.py` → `JSON_API_SPEC_VERSION = "0.1.4"`
+- Documentation update: `docs/json-api-specification.md` → Version 0.1.4
+- **No new schema definition needed** - clone reuses existing pull schema
+
+## Decision
+
+We will implement `mlxk2 clone` for 2.0.0-beta.3 to complete the GitHub Issue #29 feature request and provide a comprehensive workspace-based workflow, including full JSON API 0.1.4 compliance.
+
+## Implementation Plan
+
+### Phase 1: Core Implementation + JSON API (Session 1)
+- **Time Estimate:** 1-1.5 hours (simplified - no new schema needed)
+- **Files to Modify:**
+  - `mlxk2/operations/clone.py` - New file, ~80 lines (reuse pull.py patterns)
+  - `mlxk2/cli.py` - Add clone command integration
+  - `mlxk2/spec.py` - Version bump to 0.1.4
+  - `docs/json-api-schema.json` - Add "clone" to command enum only
+  - `docs/json-api-specification.md` - Version update + clone documentation
+  - Basic test coverage: CLI args, validation, JSON output schema
+
+### Phase 2: Complete Test Suite (Session 2)
+- **Time Estimate:** 1-2 hours
+- **Test Structure:** Mirror existing push test patterns from TESTING.md
+  - Offline tests: target directory validation, CLI argument parsing
+  - Online tests: live clone with opt-in env vars (MLXK2_LIVE_CLONE=1)
+  - Spec tests: JSON schema validation for clone command output (JSON API 0.1.4)
+  - Integration: Add to existing test matrix in TESTING.md
+
+### Phase 3: Issue #29 Feedback
+- Request user testing from feynon (Swift/iOS porting use case)
+- Validate workflow completeness for both identified use cases
+
+## Implementation Details
+
+### API Signature
+```bash
+mlxk2 clone <org>/<repo>[@<revision>] <target_dir> [options]
+```
+
+**Options:**
+- `--branch <branch>` - Clone specific branch/revision
+- `--json` - JSON output mode
+- `--quiet` - Suppress progress output
+
+### Critical Cache Behavior Requirements
+
+**IMPORTANT:** Session 1 initial implementation used `snapshot_download(local_dir=target)` which creates symlinks to HF cache. This violates the core requirements below.
+
+**Required Implementation:**
+```python
+snapshot_download(
+    repo_id=model_name,
+    local_dir=str(target_path),
+    local_dir_use_symlinks=False  # CRITICAL: Force actual file copies
+)
+```
+
+**Cache Isolation Validation:**
+- Clone target must contain real files, not symlinks
+- HF cache (`~/.cache/huggingface/hub/`) must not be populated during clone
+- Target directory should be completely self-contained workspace
+
+### JSON Response Schema (API 0.1.4)
+```json
+{
+  "status": "success|error",
+  "command": "clone",
+  "data": {
+    "model": "org/repo",
+    "download_status": "completed",
+    "message": "Cloned successfully to ./workspace",
+    "target_dir": "/abs/path/to/workspace"
+  },
+  "error": null
+}
+```
+
+**Note:** Clone reuses the existing `pull` schema. The `additionalProperties: true` allows `target_dir` field. Only schema change: command enum addition.
+
+### Code Reuse Strategy
+- Leverage `pull.py:snapshot_download()` logic
+- Reuse `push.py:_analyze_workspace()` for post-clone health validation
+- Maintain consistent error handling patterns with existing operations
+
+## JSON API Schema Updates
+
+### Required Schema Changes (docs/json-api-schema.json)
+
+**1. Command Enum Update (Line 9):**
+```json
+"command": {"type": "string", "enum": ["list", "show", "health", "pull", "rm", "version", "push", "run", "clone"]}
+```
+
+**2. No new schema definition needed:**
+- Clone reuses existing `pull` schema (lines 180-202)
+- `"additionalProperties": true` allows `target_dir` field
+- `"required": ["download_status", "message"]` covers clone requirements
+- Schema validation works automatically for clone commands
+
+### Specification Documentation Update
+
+**Version:** 0.1.4 (minimal bump for command enum change)
+**New Section:** Clone Command documentation with examples
+
+## Testing Strategy
+
+**Test Categories (following TESTING.md patterns):**
+- **Offline Tests:** ~10 tests (CLI validation, error handling, schema compliance)
+- **Online Tests:** ~3 opt-in tests with live HF repos (MLXK2_LIVE_CLONE=1)
+- **Spec Tests:** ~3 JSON schema validation tests (JSON API 0.1.4)
+
+**Environment Variables:**
+- `MLXK2_ENABLE_EXPERIMENTAL_CLONE=1` - Enable clone tests in CI
+- `MLXK2_LIVE_CLONE=1` - Enable live network tests (opt-in)
+
+**Schema Validation Testing:**
+- All clone responses validate against updated JSON schema 0.1.4
+- Test both success and error response structures
+- Verify backward compatibility with existing commands
+
+## Benefits
+
+1. **Completes Issue #29:** Provides full workspace-based model management workflow
+2. **Swift/iOS Friendly:** Clean JSON API suitable for cross-platform porting
+3. **Low Risk:** Reuses battle-tested components (snapshot_download, workspace analysis)
+4. **Fast Implementation:** Can be completed in 1-2 Claude sessions
+5. **Test Coverage:** Follows established patterns from push implementation
+6. **JSON API Compliance:** Full schema validation and version management
+
+## Security Classification and Risk Analysis
+
+### Clone vs Push: Fundamental Safety Difference
+
+**Clone Operation: LOW RISK**
+- **Read-only operation:** Downloads existing HF content to local workspace
+- **No publication risk:** Cannot create/modify remote repositories
+- **Local-only impact:** Only affects specified target directory
+- **Cache isolation:** Bypasses HF cache entirely - direct download to target
+- **Validation safeguards:** Refuses to overwrite non-empty directories
+- **Risk profile:** Similar to `pull` operation - safe for general use
+
+**Push Operation: HIGH RISK**
+- **Write operation:** Publishes content to HuggingFace Hub
+- **Publication risk:** Can accidentally expose private/sensitive data
+- **Global impact:** Creates permanent public records
+- **Requires authentication:** Uses HF_TOKEN with write permissions
+- **Experimental status:** Hidden behind `MLXK2_ENABLE_EXPERIMENTAL_PUSH=1`
+
+### Implementation Implications
+
+**Clone does NOT require experimental gating:**
+- No `MLXK2_ENABLE_EXPERIMENTAL_CLONE=1` flag needed
+- Can be enabled by default in 2.0.0-beta.3
+- Standard test integration (not opt-in only)
+- Live tests follow normal marker patterns (like `list`, `pull`)
+
+**Clone workspace isolation guarantees:**
+1. **No cache pollution:** Downloads directly to target_dir with `local_dir_use_symlinks=False`, never touches HF_HOME
+2. **No overwrite risk:** Validation ensures target directory is empty or non-existent
+3. **Explicit targeting:** User must specify exact target path
+4. **Atomic operation:** Either succeeds completely or fails cleanly
+5. **Real file copies:** Target contains actual files, not symlinks to cache (validated in tests)
+
+## Risks and Mitigations
+
+**Risk:** Directory conflicts and overwrite behavior
+**Mitigation:** Require explicit target directory, validate empty/non-existent before download
+
+**Risk:** Large model download interruption
+**Mitigation:** Leverage huggingface_hub's built-in resume_download=True
+
+**Risk:** Disk space exhaustion
+**Mitigation:** Pre-flight disk space check, clear error messages
+
+**Risk:** JSON API version compatibility
+**Mitigation:**
+- Follow established versioning patterns from existing commands
+- Complete schema validation test coverage
+- Document breaking changes clearly
+
+**Risk:** Test suite complexity
+**Mitigation:** Standard test integration (not experimental opt-in), proven patterns from pull tests
+
+## Timeline
+
+**Target:** 2.0.0-beta.3 release within 24 hours
+- Session 1: Core implementation + minimal schema update + basic tests (1.5-2 hours)
+- Session 2: Complete test suite + documentation (1-2 hours)
+- Issue #29 feedback request: Immediate after implementation
+
+## Success Criteria
+
+1. ✅ Complete workflow: `clone` → `health` → `push`
+2. ✅ Both use cases supported (fork-modify-push + author-generated)
+3. ✅ JSON API 0.1.4 compliance with full schema validation
+4. ✅ Test coverage matches push patterns (~15 total tests)
+5. ✅ Schema backwards compatibility maintained
+6. ✅ feynon feedback positive for Swift porting use case
+
+## References
+
+- GitHub Issue #29: https://github.com/ml-explore/mlx-knife/issues/29
+- TESTING.md: Push test patterns (21 tests, offline/online/spec structure)
+- ADR-001: JSON-first architecture principles
+- mlxk2/operations/push.py: Workspace analysis and health check patterns
+- docs/json-api-schema.json: Current schema definition (0.1.3)
+- docs/json-api-specification.md: Current specification (0.1.3)
\ No newline at end of file
diff --git a/docs/ADR/ADR-006-Clone-Implementation-Revised.md b/docs/ADR/ADR-006-Clone-Implementation-Revised.md
new file mode 100644
index 0000000..6ddde98
--- /dev/null
+++ b/docs/ADR/ADR-006-Clone-Implementation-Revised.md
@@ -0,0 +1,209 @@
+# ADR-006: mlxk2 clone Implementation - Revised Strategy
+
+## Status
+**Accepted** - 2025-09-15
+
+**Supersedes:** ADR-005 (deprecated due to incorrect HuggingFace cache assumptions)
+
+## Context
+
+GitHub Issue #29 requests clone functionality for MLX Knife 2.0. After implementing ADR-005, critical findings revealed that HuggingFace Hub's `local_dir` parameter does not provide true cache isolation and can corrupt existing cache entries.
+
+### Key Findings from ADR-005 Implementation
+
+**Problem: HuggingFace Cache Behavior is Unreliable**
+1. `snapshot_download(local_dir=target, local_dir_use_symlinks=False)` still interacts with global cache
+2. Global cache corruption observed (models showing 0.0 KB after clone operations)
+3. `local_dir_use_symlinks` parameter is deprecated but behavior remains unclear
+4. Documentation promises cache isolation but implementation differs
+
+**Evidence:**
+```bash
+# Before clone: Phi-3-mini shows 4.3 GB in cache
+mlxk list --health  # Shows healthy model
+
+# After clone with local_dir: Cache corrupted
+mlxk list --health  # Shows 0.0 KB - corrupted cache entry
+```
+
+### Revised Strategy: Pull + APFS Copy + Cleanup
+
+**Core Insight:** Instead of fighting HuggingFace Hub's undocumented cache behavior, leverage it robustly:
+
+1. **Pull to Cache** (battle-tested, reliable)
+2. **Copy Cache → Workspace** (APFS copy-on-write optimization)
+3. **Delete Cache Entry** (automatic cleanup)
+
+## Decision
+
+Implement `mlxk2 clone` using a **Pull + Copy + Cleanup** strategy that provides robust workspace creation without relying on HuggingFace Hub's unreliable `local_dir` behavior.
+
+## Implementation Strategy
+
+### Core Workflow
+```
+1. Hidden Pull      → Download to cache (existing reliable logic)
+2. Optional Health  → Validate model integrity before copy
+3. APFS Copy        → Copy cache → workspace (copy-on-write efficient)
+4. Cache Cleanup    → Delete cache entry (no user prompt needed)
+```
+
+### APFS Volume Optimization
+
+**Key Advantage:** On APFS volumes (standard on macOS), file copies use copy-on-write:
+- Initial copy: **No additional disk space** (metadata references only)
+- Space usage: Only when files are modified in workspace
+- Copy speed: Near-instantaneous for large models
+
+**Volume Detection:**
+```python
+def is_same_apfs_volume(cache_path, workspace_path):
+    # Check if both paths are on same APFS volume
+    # Optimize copy strategy accordingly
+```
+
+### API Signature (Unchanged)
+```bash
+mlxk2 clone <org>/<repo>[@<revision>] <target_dir> [options]
+```
+
+**Options:**
+- `--branch <branch>` - Clone specific branch/revision
+- `--json` - JSON output mode
+- `--quiet` - Suppress progress output
+- `--no-health-check` - Skip optional health validation
+
+### JSON Response Schema (API 0.1.4 - Unchanged)
+```json
+{
+  "status": "success|error",
+  "command": "clone",
+  "data": {
+    "model": "org/repo",
+    "clone_status": "completed",
+    "message": "Cloned successfully to ./workspace",
+    "target_dir": "/abs/path/to/workspace",
+    "cache_cleanup": true,
+    "health_check": true
+  },
+  "error": null
+}
+```
+
+## Implementation Details
+
+### Phase 1: Core Clone Logic
+```python
+def clone_operation(model_spec, target_dir, health_check=True):
+    # 1. Standard pull to cache
+    pull_result = pull_operation(model_spec)
+    if pull_result["status"] != "success":
+        return error_response("Pull failed", pull_result["error"])
+
+    # 2. Optional health check
+    if health_check:
+        health_result = health_check_cache(model_spec)
+        if not health_result["healthy"]:
+            return error_response("Model unhealthy", health_result)
+
+    # 3. Copy cache to workspace
+    cache_path = resolve_cache_path(model_spec)
+    copy_result = apfs_optimized_copy(cache_path, target_dir)
+    if not copy_result["success"]:
+        return error_response("Copy failed", copy_result["error"])
+
+    # 4. Cleanup cache entry
+    cleanup_result = remove_cache_entry(model_spec)
+
+    return success_response(copy_result, cleanup_result)
+```
+
+### Phase 2: APFS Optimization
+```python
+def apfs_optimized_copy(source_path, target_path):
+    """Copy with APFS copy-on-write optimization where possible."""
+    if is_same_apfs_volume(source_path, target_path):
+        # Use APFS-optimized copy (clonefile on macOS)
+        return apfs_clone_files(source_path, target_path)
+    else:
+        # Fall back to standard file copy
+        return standard_copy(source_path, target_path)
+```
+
+### Phase 3: Cache Management
+```python
+def remove_cache_entry(model_spec):
+    """Remove cache entry after successful workspace creation."""
+    cache_path = hf_to_cache_dir(model_spec)
+    if cache_path.exists():
+        shutil.rmtree(cache_path)
+    return {"cache_cleanup": True, "path": str(cache_path)}
+```
+
+## Benefits
+
+1. **Robust Behavior:** Uses proven pull logic, avoids HF cache edge cases
+2. **APFS Efficient:** No duplicate storage on same volume (copy-on-write)
+3. **Clean Workspaces:** No cache artifacts (.cache folders, symlinks)
+4. **Predictable:** No undocumented HF behavior dependencies
+5. **Testable:** Each phase can be tested independently
+
+## Security Classification
+
+**Clone Operation: LOW RISK** (unchanged)
+- Read-only operation with local file manipulation only
+- No remote publication risk
+- Workspace isolation maintained through file copying
+
+## Risk Analysis
+
+### Mitigated Risks (from ADR-005)
+- ✅ **Cache Corruption:** Eliminated by using standard pull path
+- ✅ **Undocumented Behavior:** No reliance on HF `local_dir` edge cases
+- ✅ **Symlink Issues:** Pure file copying, no symlinks
+
+### New Risks and Mitigations
+
+**Risk:** Double storage usage during copy process
+**Mitigation:** APFS copy-on-write optimization, volume detection
+
+**Risk:** Cache cleanup removes model unexpectedly
+**Mitigation:** Only cleanup after successful workspace creation
+
+**Risk:** Interrupted copy leaves partial workspace
+**Mitigation:** Atomic operations, rollback on failure
+
+## Testing Strategy
+
+### Test Categories
+1. **Pull Integration:** Verify pull-phase works correctly
+2. **Copy Operations:** Test APFS vs standard copying
+3. **Cache Management:** Validate cleanup behavior
+4. **Error Handling:** Test failure modes at each phase
+5. **JSON Schema:** API 0.1.4 compliance validation
+
+### Environment Variables
+- `MLXK2_ENABLE_EXPERIMENTAL_CLONE=1` - Enable clone tests in CI
+- `MLXK2_LIVE_CLONE=1` - Enable live network tests (opt-in)
+
+## Timeline
+
+**Target:** Complete within current session
+- Implementation: 1-2 hours (reuse existing pull logic)
+- Testing: 1 hour (focused on copy + cleanup logic)
+- Documentation: 30 minutes
+
+## Success Criteria
+
+1. ✅ **Reliable Clone:** No cache corruption, predictable behavior
+2. ✅ **APFS Optimized:** Minimal storage overhead on macOS
+3. ✅ **Clean Workspaces:** No cache artifacts in target directories
+4. ✅ **JSON API Compliance:** Full 0.1.4 schema validation
+5. ✅ **Robust Error Handling:** Graceful failure at each phase
+
+## References
+
+- **Supersedes:** ADR-005 (retained for historical reference)
+- GitHub Issue #29: Clone functionality request
+- HuggingFace Hub Documentation: `snapshot_download` behavior analysis
+- APFS Technical Reference: Copy-on-write filesystem optimization
\ No newline at end of file
diff --git a/docs/ADR/ADR-007-Clone-Implementation-Fixed.md b/docs/ADR/ADR-007-Clone-Implementation-Fixed.md
new file mode 100644
index 0000000..e8154b7
--- /dev/null
+++ b/docs/ADR/ADR-007-Clone-Implementation-Fixed.md
@@ -0,0 +1,534 @@
+# ADR-007: Clone Implementation Fixed Strategy
+
+**Status:** Accepted
+**Date:** 2025-01-16
+**Supersedes:** ADR-006 (Clone Implementation Revised)
+
+## Context
+
+The clone implementation following ADR-006 has critical bugs that make it unsuitable for release:
+
+1. **Destructive Cache Cleanup:** Always deletes user cache after copy, even when model pre-existed
+2. **Commit Hash Mismatch:** Can copy outdated snapshots when remote HEAD differs from cached version
+3. **Inconsistent Behavior:** User unexpectedly loses models from cache
+
+Analysis revealed that the Pull+Copy+Cleanup strategy is fundamentally flawed for a "clone" operation, which should be non-destructive by nature.
+
+## Decision
+
+### Phased Implementation Strategy
+
+**Core Principle:** Cache must be APFS (for optimization), workspace flexibility increases per phase.
+
+### Phase 1: Same-Volume APFS (2.0.0-beta.3)
+
+**Constraints:**
+- Cache: APFS required
+- Workspace: APFS required, same volume as cache
+- Optimization: Direct APFS copy-on-write
+
+**Workflow:**
+```
+1. Validate cache and workspace both on same APFS volume
+2. Create isolated temp cache on same volume as workspace
+3. Pull model to temp cache (isolated from user cache)
+4. APFS clone temp cache → workspace (instant, zero space initially)
+5. Delete temp cache (cleanup)
+```
+
+### Phase 2: Cross-Filesystem Support (eventually, when clone and push is non-Alpha)
+
+**Constraints:**
+- Cache: APFS required (for temp cache optimization)
+- Workspace: Any filesystem supported
+- Optimization: APFS CoW for temp cache, standard copy to workspace
+
+**Workflow:**
+```
+1. Validate cache on APFS (workspace can be any filesystem)
+2. Create isolated temp cache on APFS volume (cache volume)
+3. Pull model to temp cache via APFS optimization
+4. Copy temp cache → workspace (standard copy if cross-filesystem)
+5. Delete temp cache (cleanup)
+```
+
+## Implementation Matrix
+
+### Filesystem Compatibility Table
+
+| Cache FS | Workspace FS | Same Volume | Phase 1 Support | Phase 2 Support | Copy Method | Performance |
+|----------|--------------|-------------|------------------|------------------|-------------|-------------|
+| APFS | APFS | Yes | ✅ Supported | ✅ Supported | APFS CoW Direct | ⚡ Instant |
+| APFS | APFS | No | ❌ Error | ✅ Supported | Temp+Standard | 🐌 2x Copy |
+| APFS | HFS+ | No | ❌ Error | ✅ Supported | Temp+Standard | 🐌 2x Copy |
+| APFS | ExFAT | No | ❌ Error | ✅ Supported | Temp+Standard | 🐌 2x Copy |
+| APFS | NFS | No | ❌ Error | ⚠️ Warning | Temp+Network | 🐌🐌 Slow |
+| APFS | SMB/CIFS | No | ❌ Error | ⚠️ Warning | Temp+Network | 🐌🐌 Slow |
+| HFS+ | Any | Any | ❌ Error | ❌ Error | N/A | N/A |
+| NFS | Any | Any | ❌ Error | ❌ Error | N/A | N/A |
+| SMB/CIFS | Any | Any | ❌ Error | ❌ Error | N/A | N/A |
+
+### Data Flow Scenarios
+
+#### Scenario A: Phase 1 Optimal (Same APFS Volume)
+```
+User Cache (APFS): /Users/me/.cache/huggingface/hub/
+Target Workspace:   /Users/me/projects/mymodel/
+Temp Cache:         /Users/me/.mlxk2_temp_12345/
+
+Flow:
+[Remote] --pull--> [Temp Cache] --APFS CoW--> [Workspace]
+                        ↑              ↓
+                   Zero space      Instant copy
+```
+
+#### Scenario B: Phase 2 Cross-Filesystem
+```
+User Cache (APFS):  /Users/me/.cache/huggingface/hub/
+Target Workspace:   /Volumes/ProjectSSD/myapp/models/
+Temp Cache:         /Users/me/.mlxk2_temp_12345/
+
+Flow:
+[Remote] --pull--> [Temp Cache] --Standard Copy--> [Workspace]
+                        ↑              ↓
+                   APFS CoW       Full copy
+```
+
+#### Scenario C: Phase 2 Network Workspace (NFS/SMB)
+```
+User Cache (APFS):  /Users/me/.cache/huggingface/hub/
+Target Workspace:   /Volumes/NetworkShare/models/  (NFS or SMB/CIFS)
+Temp Cache:         /Users/me/.mlxk2_temp_12345/
+
+Flow:
+[Remote] --pull--> [Temp Cache] --Network Copy--> [Network Workspace]
+                        ↑              ↓
+                   Fast local     Slow network
+```
+
+### Response Matrix (Phase 1 Implementation)
+
+| Function | APFS Check Timing | Non-APFS Response | Response Type | JSON Example |
+|----------|------------------|-------------------|---------------|--------------|
+| `serve` | Never | Normal operation | Success | `{"status": "success", "command": "serve", ...}` |
+| `list` | Never | Normal operation | Success | `{"status": "success", "command": "list", ...}` |
+| `show` | Never | Normal operation | Success | `{"status": "success", "command": "show", ...}` |
+| `health` | Never | Normal operation | Success | `{"status": "success", "command": "health", ...}` |
+| `pull` | Never | Normal operation | Success | `{"status": "success", "command": "pull", ...}` |
+| **`push`** | On success (Alpha only) | **Add APFS hint to message** | ⚠️ Success + Warning | `{"status": "success", "data": {"message": "Push successful. Clone operations require APFS filesystem."}}` |
+| **`clone`** | On demand (lazy) | **Hard error, abort** | ❌ Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` |
+
+### Error Handling Matrix
+
+#### Phase 1 Error Matrix (Same-Volume APFS Only)
+
+| Scenario | Cache FS | Workspace FS | Same Volume | Error Type | Behavior | JSON Error | User Action |
+|----------|----------|--------------|-------------|------------|----------|------------|-------------|
+| ✅ **Supported** | APFS | APFS | Yes | None | Success | N/A | None |
+| ❌ **Cache Requirement** | HFS+ | Any | Any | CacheFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Migrate cache to APFS |
+| ❌ **Cache Requirement** | ExFAT | Any | Any | CacheFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Migrate cache to APFS |
+| ❌ **Cache Requirement** | NFS/SMB | Any | Any | CacheFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Use local APFS cache |
+| ❌ **Workspace Requirement** | APFS | HFS+ | No | WorkspaceFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Use APFS workspace |
+| ❌ **Workspace Requirement** | APFS | ExFAT | No | WorkspaceFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Use APFS workspace |
+| ❌ **Workspace Requirement** | APFS | NFS/SMB | No | WorkspaceFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Use APFS workspace |
+| ❌ **Volume Requirement** | APFS | APFS | No | VolumeError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Move workspace to cache volume |
+
+#### Phase 2 Error Matrix (Cross-Filesystem Support)
+
+| Scenario | Cache FS | Workspace FS | Same Volume | Error Type | Behavior | JSON Response | User Action |
+|----------|----------|--------------|-------------|------------|----------|---------------|-------------|
+| ✅ **Optimal** | APFS | APFS | Yes | None | Success (CoW) | `{"status": "success", "data": {"clone_status": "success", "copy_method": "apfs_cow"}}` | None |
+| ✅ **Standard** | APFS | APFS | No | None | Success (Standard) | `{"status": "success", "data": {"clone_status": "success", "copy_method": "standard_copy"}}` | None |
+| ✅ **Standard** | APFS | HFS+ | No | None | Success (Standard) | `{"status": "success", "data": {"clone_status": "success", "copy_method": "standard_copy"}}` | None |
+| ✅ **Standard** | APFS | ExFAT | No | None | Success (Standard) | `{"status": "success", "data": {"clone_status": "success", "copy_method": "standard_copy"}}` | None |
+| ⚠️ **Network Warning** | APFS | NFS | No | NetworkWarning | Warning + Proceed | `{"status": "success", "data": {"clone_status": "success", "copy_method": "network_copy", "warning": "Network filesystem detected. Copy will be slower."}}` | Expect slower performance |
+| ⚠️ **Network Warning** | APFS | SMB/CIFS | No | NetworkWarning | Warning + Proceed | `{"status": "success", "data": {"clone_status": "success", "copy_method": "network_copy", "warning": "Network filesystem detected. Copy will be slower."}}` | Expect slower performance |
+| ❌ **Cache Requirement** | HFS+ | Any | Any | CacheFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Migrate cache to APFS |
+| ❌ **Cache Requirement** | ExFAT | Any | Any | CacheFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Migrate cache to APFS |
+| ❌ **Cache Requirement** | NFS/SMB | Any | Any | CacheFilesystemError | Hard Error | `{"status": "error", "data": {"clone_status": "filesystem_error"}}` | Use local APFS cache |
+
+#### Error Message Examples
+
+**⚠️ JSON Protocol Disclaimer:**
+> All JSON response examples are provisional and based on specification v0.1.4. Field contents (e.g., `clone_status` values) and response structure may evolve during Phase 1 and Phase 2 implementation.
+
+**Phase 1 Errors:**
+```json
+{
+  "status": "error",
+  "command": "clone",
+  "data": {
+    "clone_status": "filesystem_error",
+    "target_dir": "/some/workspace"
+  },
+  "error": {
+    "type": "FilesystemError",
+    "message": "APFS cache required for clone operations."
+  }
+}
+```
+
+**Phase 2 Network Warnings:**
+```json
+{
+  "status": "success",
+  "command": "clone",
+  "data": {
+    "model": "microsoft/DialoGPT-small",
+    "target_dir": "/Volumes/NASShare/models/dialog",
+    "clone_status": "success",
+    "message": "Cloned to /Volumes/NASShare/models/dialog",
+    "expanded_name": "microsoft/DialoGPT-small"
+  },
+  "error": null
+}
+```
+
+### Performance Characteristics
+
+#### Phase 1 (Same APFS Volume)
+```
+4GB Model Clone Performance:
+- Temp cache creation: ~30 seconds (network download)
+- APFS CoW copy: ~0.1 seconds (metadata only)
+- Temp cleanup: ~0.5 seconds
+- Total time: ~30.6 seconds
+- Total space: ~4GB (only in workspace after CoW)
+```
+
+#### Phase 2 (Cross-Filesystem)
+```
+4GB Model Clone Performance:
+- Temp cache creation: ~30 seconds (network download)
+- Standard copy: ~60 seconds (4GB copy)
+- Temp cleanup: ~0.5 seconds
+- Total time: ~90.5 seconds
+- Peak space: ~8GB (temp + workspace during copy)
+```
+
+## Migration Strategy Between Phases
+
+### Phase 1 → Phase 2 Upgrade
+- **Breaking Change:** None (Phase 1 scenarios still work optimally)
+- **New Capability:** Cross-filesystem support added
+- **User Impact:** More flexible workspace placement
+- **Performance:** Same for existing use cases, degraded for new cross-FS cases
+
+### Implementation Flags
+```python
+# Alpha feature gate (existing)
+MLXK2_ENABLE_ALPHA_FEATURES=1  # Required for clone and push operations
+
+# Future Phase 2 flags (if needed)
+# MLXK2_CLONE_ALLOW_CROSS_FILESYSTEM=1
+```
+
+## Implementation Details
+
+### 1. Volume-Aware Temp Cache Creation
+
+```python
+def create_temp_cache_same_volume(target_workspace: Path) -> Path:
+    """Create temp cache on same APFS volume as target for CoW optimization."""
+
+    # Get target volume mount point via st_dev
+    target_volume = get_volume_mount_point(target_workspace)
+
+    # Create temp cache on same volume
+    temp_cache = target_volume / f".mlxk2_temp_{os.getpid()}_{random.randint(1000,9999)}"
+    temp_cache.mkdir(parents=True)
+
+    # SAFETY: Create sentinel file to prevent accidental user cache deletion
+    sentinel = temp_cache / ".mlxk2_temp_cache_sentinel"
+    sentinel.write_text(f"mlxk2_temp_cache_created_{int(time.time())}")
+
+    return temp_cache
+
+def cleanup_temp_cache_safe(temp_cache: Path) -> bool:
+    """Safely delete temp cache only if sentinel exists."""
+
+    # SAFETY: Only delete if sentinel exists
+    sentinel = temp_cache / ".mlxk2_temp_cache_sentinel"
+    if not sentinel.exists():
+        logger.warning(f"Refusing to delete {temp_cache} - no sentinel found")
+        return False
+
+    shutil.rmtree(temp_cache, ignore_errors=True)
+    return True
+
+def get_volume_mount_point(path: Path) -> Path:
+    """Find mount point (volume root) for given path via st_dev changes."""
+    abs_path = path.resolve()
+    current = abs_path
+
+    while current != current.parent:
+        try:
+            parent_stat = current.parent.stat()
+            current_stat = current.stat()
+
+            # Different st_dev = mount boundary
+            if parent_stat.st_dev != current_stat.st_dev:
+                return current
+        except (OSError, PermissionError):
+            pass
+        current = current.parent
+
+    return current  # Filesystem root
+```
+
+### 2. Shared APFS Filesystem Check
+
+```python
+def is_apfs_filesystem(path: Path) -> bool:
+    """Simple APFS check - returns True/False only.
+
+    Used by both clone (validation) and push (conditional warning).
+    """
+    try:
+        import subprocess
+        result = subprocess.run(['stat', '-f', '-c', '%T', str(path)],
+                              capture_output=True, text=True)
+        return result.stdout.strip() == 'apfs'
+    except subprocess.CalledProcessError:
+        return False  # Safe fallback
+
+def validate_apfs_filesystem(path: Path) -> None:
+    """Validate APFS requirement for clone operations.
+
+    Called lazily - only on first clone operation, not at CLI startup.
+    """
+    if not is_apfs_filesystem(path):
+        raise FilesystemError(
+            f"APFS required for clone operations. "
+            f"Path: {path}\n"
+            f"Solution: Use APFS volume or external APFS SSD."
+        )
+```
+
+### 3. Clone Operation Implementation
+
+```python
+def clone_operation(model_spec: str, target_dir: str) -> Dict[str, Any]:
+    """Clone with isolated temp cache strategy."""
+
+    target_path = Path(target_dir).resolve()
+
+    # 1. Validate APFS requirement
+    validate_apfs_filesystem(target_path.parent)
+
+    # 2. Create temp cache on same volume as target
+    temp_cache = create_temp_cache_same_volume(target_path)
+
+    try:
+        # 3. Pull to isolated temp cache
+        with patch_hf_home(temp_cache):
+            pull_result = pull_operation(model_spec)
+
+        if pull_result["status"] != "success":
+            return handle_pull_error(pull_result)
+
+        # 4. Resolve temp cache snapshot path
+        resolved_model = pull_result["data"]["model"]
+        temp_snapshot = resolve_latest_snapshot(temp_cache, resolved_model)
+
+        # 5. APFS clone to workspace (instant, CoW)
+        target_path.mkdir(parents=True, exist_ok=True)
+        clone_success = apfs_clone_directory(temp_snapshot, target_path)
+
+        if not clone_success:
+            return handle_clone_error()
+
+        # 6. Success - temp cache auto-cleanup via context manager
+        return {
+            "status": "success",
+            "command": "clone",
+            "data": {
+                "model": resolved_model,
+                "target_dir": str(target_path),
+                "clone_status": "completed",
+                "cache_preserved": True,  # User cache never touched
+                "copy_method": "apfs_cow"
+            }
+        }
+
+    finally:
+        # Cleanup temp cache
+        shutil.rmtree(temp_cache, ignore_errors=True)
+```
+
+### 4. User Experience: Push Workflow Warning
+
+```python
+def push_operation(...) -> Dict[str, Any]:
+    # ... normal push logic ...
+
+    # Conditional APFS hint based on cache filesystem
+    if not is_apfs_filesystem(get_hf_cache_dir()):
+        message = "Push successful. Clone operations require APFS filesystem."
+    else:
+        message = "Push successful."
+
+    result = {
+        "status": "success",
+        "command": "push",
+        "data": {
+            "repo_id": repo_id,
+            "branch": branch,
+            "message": message,
+            # ... existing fields ...
+        }
+    }
+
+    return result
+```
+
+### 5. APFS Copy-on-Write Implementation
+
+```python
+def apfs_clone_directory(source: Path, target: Path) -> bool:
+    """Clone directory using APFS copy-on-write via clonefile."""
+    try:
+        for item in source.rglob("*"):
+            if item.is_file():
+                relative_path = item.relative_to(source)
+                target_file = target / relative_path
+                target_file.parent.mkdir(parents=True, exist_ok=True)
+
+                # Use cp -c for clonefile (APFS CoW)
+                subprocess.run(['cp', '-c', str(item), str(target_file)],
+                             check=True, capture_output=True)
+        return True
+
+    except subprocess.CalledProcessError as e:
+        logger.error(f"APFS clone failed: {e}")
+        return False
+```
+
+## Pros and Cons
+
+### Pros
+
+1. **User Cache Preservation:** Never touches existing user cache
+2. **Consistency:** Always gets latest/specified model version
+3. **Performance:** APFS CoW provides instant copy with minimal space
+4. **Isolation:** Temp cache prevents pollution of user environment
+5. **Predictable:** Clone behaves like standard file copy operation
+6. **Robust:** Clear filesystem requirements with early validation
+
+### Cons
+
+1. **APFS Requirement:** Users on non-APFS setups need migration
+2. **Temporary Disk Usage:** Brief full model copy in temp cache before CoW
+3. **Implementation Complexity:** Volume detection and temp cache management
+4. **Platform Specific:** Relies on macOS/iOS APFS features
+
+## Migration from ADR-006
+
+### Breaking Changes
+
+1. **Cache Behavior:** User cache is preserved (not deleted after clone)
+2. **Filesystem Requirements:** APFS validation added
+3. **Performance Profile:** May use more temporary disk space
+
+### User Migration
+
+**Before (ADR-006):**
+```bash
+mlxk2 clone org/model ./workspace  # Deleted model from cache
+```
+
+**After (ADR-007):**
+```bash
+mlxk2 clone org/model ./workspace  # Preserves model in cache
+# User cache remains intact for other operations
+```
+
+### Error Handling
+
+**Non-APFS Cache:**
+```
+Error: Filesystem 'nfs' not supported
+MLX-Knife requires APFS for clone operations.
+
+Current path: /Volumes/NetworkShare/cache
+Solution: Use APFS volume:
+  export HF_HOME="/Users/you/.cache/huggingface"
+```
+
+## Testing Strategy
+
+### Unit Tests
+
+1. **Volume Detection:** Verify mount point resolution across scenarios
+2. **APFS Validation:** Test filesystem detection and error handling
+3. **Temp Cache Creation:** Validate same-volume placement
+4. **Copy-on-Write:** Test clonefile success and fallback behavior
+
+### Integration Tests
+
+1. **Cross-Volume Scenarios:** Cache on external APFS, workspace on internal
+2. **Large Model Performance:** Verify CoW benefits with multi-GB models
+3. **Error Recovery:** Temp cache cleanup on failures
+4. **Concurrent Access:** Multiple clone operations
+
+### Real-World Validation
+
+1. **External APFS SSDs:** Thunderbolt/USB-C attached storage
+2. **iOS Simulator:** Validate iOS filesystem assumptions
+3. **Network Limitations:** Ensure clear errors for unsupported setups
+
+## Implementation Timeline
+
+### Phase 1: Same-Volume APFS (2.0.0-beta.3)
+**Target:** Stable clone functionality with strict constraints
+- ✅ Volume detection utilities (`get_volume_mount_point`, `is_same_volume`)
+- ✅ APFS validation framework (`validate_apfs_filesystem`)
+- ✅ Temp cache management on same volume
+- ✅ APFS copy-on-write implementation (`apfs_clone_directory`)
+- ✅ Error handling for unsupported scenarios
+- ✅ Performance optimization for direct CoW path
+
+**Success Criteria:**
+- Clone works reliably when cache and workspace on same APFS volume
+- Clear error messages for unsupported filesystem combinations
+- Performance benchmarks show near-instant copy for large models
+
+### Phase 2: Cross-Filesystem Support (eventually, when clone and push is non-Alpha)
+**Target:** Flexible workspace placement with graceful degradation
+- 🔄 Cross-filesystem copy implementation
+- 🔄 Performance monitoring for different copy methods
+- 🔄 Network filesystem handling and warnings
+- 🔄 User experience improvements for mixed scenarios
+- 🔄 Configuration flags for behavior control
+
+**Success Criteria:**
+- Clone works across all supported filesystem combinations
+- Performance degradation is predictable and documented
+- User guidance for optimal setup configurations
+
+### Phase 3: Advanced Features (future, no version commitment)
+**Target:** Production hardening and edge case handling
+**Status:** Nice-to-have features, implement based on user demand
+
+- 🔄 Incremental clone support (delta updates)
+- 🔄 Resume capability for interrupted operations
+- 🔄 Bandwidth limiting for network operations
+- 🔄 Comprehensive logging and diagnostics
+- 🔄 Advanced caching strategies
+
+## Decision Rationale
+
+This strategy addresses the fundamental flaws in ADR-006 while leveraging the strengths of the Apple Silicon ecosystem. By requiring APFS and using isolated temp caches, we achieve:
+
+- **Correctness:** No data loss or inconsistent states
+- **Performance:** Copy-on-write optimization
+- **Simplicity:** Clear requirements and predictable behavior
+
+The APFS requirement is justified given MLX's Apple Silicon dependency and the target use case focus on iOS development.
+
+## Status
+
+- **Implementation:** To be started
+- **Testing:** Required before release
+- **Documentation:** Needs update for filesystem requirements
+- **Release:** Blocks 2.0.0-beta.3 until complete
\ No newline at end of file
diff --git a/docs/ADR/ADR-009-Stop-Token-Detection-Fix.md b/docs/ADR/ADR-009-Stop-Token-Detection-Fix.md
new file mode 100644
index 0000000..57ec973
--- /dev/null
+++ b/docs/ADR/ADR-009-Stop-Token-Detection-Fix.md
@@ -0,0 +1,356 @@
+# ADR-009: Stop Token Detection Fix
+
+**Status:** Accepted
+**Date:** 2025-10-21
+**Supersedes:** Issue #32 discussions (September 2025)
+**Affects:** Runner (Beta.6)
+**Related:** ADR-010 (Reasoning Content API - Future)
+
+---
+
+## Context
+
+### Problem Statement
+
+Issue #32 requests migration from model-specific workarounds to **generic stop token detection** using native chat templates and mlx-lm APIs.
+
+**Current State:**
+- ✅ MXFP4 works (via hardcoded `<|end|>` skip in `stop_tokens.py:49`)
+- ❌ Not state-of-the-art (model-specific "Gebastel")
+- ❌ Every new model needs custom pattern
+- ❌ Runner uses singular `eos_token_id` instead of `eos_token_ids` Set
+
+**Goal:**
+Use **mlx-lm TokenizerWrapper APIs** as primary mechanism, fall back to model-specific handling only when needed.
+
+### Root Cause Analysis
+
+**Runner Bug (mlxk2/core/runner/__init__.py:468, 589):**
+```python
+# CURRENT (checks only singular ID)
+if token_id == self.tokenizer.eos_token_id:
+    break
+
+# SHOULD BE (checks Set of IDs)
+if token_id in self.tokenizer.eos_token_ids:
+    break
+```
+
+**Why `eos_token_ids` is better:**
+- mlx-lm `TokenizerWrapper` normalizes `eos_token_id` → `eos_token_ids` (Set)
+- Handles models with multiple EOS tokens (e.g., Llama 3: `[128001, 128009]`)
+- Generic mechanism, no model-specific code needed
+
+**Example (MXFP4):**
+```python
+# HuggingFace config (upstream bug)
+tokenizer.eos_token_id = 200002  # Only <|return|>, missing 200007 (<|end|>)
+
+# But added_tokens_decoder has both:
+{
+  200002: "<|return|>",
+  200007: "<|end|>"
+}
+
+# Current workaround (stop_tokens.py:49):
+if token_content == '<|end|>':
+    continue  # Skip adding to stop_tokens
+
+# Hypothesis: 2-LOC fix may be sufficient
+# If not, fallback to add_eos_token():
+tokenizer.add_eos_token("<|end|>")  # Adds 200007 to eos_token_ids set
+```
+
+### Current Workarounds
+
+**Model-Specific Code:**
+1. `stop_tokens.py:49` - Hardcoded `<|end|>` skip for MXFP4
+2. `stop_tokens.py:92` - Hardcoded `<|return|>` add for gpt-oss
+3. `reasoning.py:22-33` - MXFP4 reasoning patterns
+
+**These work, but are not scalable for future models.**
+
+### Constraints
+
+1. **Generic First:** Use mlx-lm APIs, avoid model-specific code when possible
+2. **Pragmatic Fallback:** Keep model-specific handling if needed (not all models are perfect)
+3. **No Breaking Changes:** Existing models must continue working
+4. **Focus Models Only:** Test MXFP4, Qwen 2.5, Llama 3.2 (not all models)
+
+---
+
+## Decision
+
+### Test-Driven Fix Strategy
+
+**Step 1: Implement Real-Model Test Suite**
+
+Required before any code changes - we need empirical data to validate the fix.
+
+```python
+# tests_2.0/test_stop_tokens_live.py (see Test Strategy section below)
+```
+
+**Step 2: Baseline Measurement**
+
+Document current behavior with existing workarounds:
+- MXFP4: Does `<|end|>` appear in output? (expected: NO, via workaround)
+- Qwen 2.5: Does self-conversation occur? (expected: document pattern)
+- Llama 3.2: Does generation work correctly? (expected: YES)
+
+**Step 3: Apply 2-LOC Fix**
+
+```python
+# mlxk2/core/runner/__init__.py:468 (generate_streaming)
+if token_id in self.tokenizer.eos_token_ids:  # Changed: == to in
+    break
+
+# mlxk2/core/runner/__init__.py:589 (generate_batch)
+if token_id in self.tokenizer.eos_token_ids:  # Changed: == to in
+    break
+```
+
+**Step 4: Re-Test & Evaluate**
+
+Run test suite again. Three possible outcomes:
+
+| Outcome | Action |
+|---------|--------|
+| ✅ All tests pass | Remove obsolete workarounds, ship Beta.6 |
+| ⚠️ Some tests fail | Investigate: Need `add_eos_token()` integration? |
+| ❌ Tests still fail | Document findings, implement targeted fixes |
+
+**Step 5: Conditional Cleanup**
+
+```python
+# stop_tokens.py:49 - Remove IF tests pass without it
+if token_content == '<|end|>':
+    continue  # ← DELETE if generic fix works
+
+# stop_tokens.py:92 - Keep IF still needed
+if model_type == 'gpt-oss':
+    stop_tokens.add('<|return|>')  # Keep with comment: "Upstream config bug"
+```
+
+**Step 6 (Optional): add_eos_token() Integration**
+
+If tests reveal that `eos_token_ids` doesn't contain all necessary EOS tokens:
+
+```python
+# Option A: In stop_tokens.py:extract_stop_tokens() (after line 55)
+# When we find EOS-like tokens in added_tokens_decoder, register them:
+if token_content in ['<|end|>', '<|return|>']:  # Derived from added_tokens_decoder values we flag as EOS
+    tokenizer.add_eos_token(token_content)
+    # NOTE: Modifies tokenizer state, but needed for upstream config bugs
+
+# Option B: In runner/__init__.py:load_model() (after line 192)
+# Model-specific fixes after tokenizer load:
+if 'mxfp4' in str(model_path).lower():
+    self.tokenizer.add_eos_token("<|end|>")
+```
+
+**Decision Point:** Only implement Step 6 if empirical testing shows it's necessary.
+
+**Philosophy:**
+- **Test-driven** (measure before fixing)
+- **Generic first** (2-LOC fix should work for most models)
+- **Pragmatic fallback** (`add_eos_token()` only if needed)
+- **Not our job to fix all models** (focus on priority models)
+
+### Implementation Status (2025-10-21)
+
+**Steps 1-4: ✅ COMPLETED**
+- Real-model test suite implemented in `tests_2.0/test_stop_tokens_live.py` (4 tests, 3 models)
+- 2-LOC fix applied in `mlxk2/core/runner/__init__.py:468,590`
+- Empirical validation executed (see `stop_token_config_report.json`)
+- Results: Generic fix alone **not sufficient** - MXFP4 still requires `add_eos_token()` workaround
+
+**Step 5: ⏸️ SKIPPED**
+- Conditional cleanup deferred (workarounds still active)
+- Rationale: Step 6 became necessary, re-evaluate cleanup after stabilization
+
+**Step 6: 🔧 ACTIVE (Deterministic Guard)**
+- `add_eos_token()` implemented in `mlxk2/core/runner/stop_tokens.py:49-56`
+- **Implementation differs from "optional" plan:**
+  - Originally planned: "only if empirical tests show it's needed"
+  - Actually implemented: Unconditional call whenever `<|end|>` appears in config
+  - Rationale: Deterministic guard for MXFP4-class models (pragmatic workaround)
+- No tokenizer state mutation side-effects observed (callable check + exception guard)
+
+**Outstanding Work:**
+- Portfolio discovery not yet implemented (hard-coded 3 models in test suite)
+- Workaround cleanup evaluation (lines 49, 99 in `stop_tokens.py`)
+- Empirical validation scope expansion (currently 3 models, aim for full cache coverage)
+
+### Non-Goals (Beta.6)
+
+- ❌ Test all models (unrealistic)
+- ❌ Remove all workarounds (only remove obsolete ones)
+- ❌ Fix upstream HuggingFace configs (report issues, but don't block on them)
+- ❌ Reasoning API changes (see ADR-010)
+
+### Test Strategy
+
+**Real-Model Test Suite Required:**
+
+```python
+# tests_2.0/test_stop_tokens_live.py
+
+@pytest.mark.live_stop_tokens
+def test_mxfp4_stop_tokens():
+    """Verify <|end|> doesn't appear in output."""
+    runner = MLXRunner("mlx-community/gpt-oss-20b-MXFP4-Q8")
+    response = runner.generate_batch("Write one sentence about cats.", max_tokens=50)
+
+    assert "<|end|>" not in response  # Should be filtered
+    assert "<|return|>" not in response  # Should stop before this
+
+@pytest.mark.live_stop_tokens
+def test_qwen_self_conversation():
+    """Verify model stops before generating turn-taking markers (no self-conversation).
+
+    Self-conversation occurs when the model generates the next user turn prompt
+    instead of stopping after its own response. This manifests as chat template
+    role markers appearing in the output (e.g., "\\nUser:", "\\nHuman:").
+
+    Expected behavior: Model stops cleanly after its response, before any role markers.
+    """
+    runner = MLXRunner("mlx-community/Qwen2.5-0.5B-Instruct-4bit")
+
+    # Test with simple prompt that might trigger multi-turn continuation
+    response = runner.generate_batch("Hello", max_tokens=50)
+
+    # Assert no role markers from chat template appear in output
+    # (These would indicate the model is generating the next turn)
+    chat_turn_markers = [
+        '\nUser:', '\nHuman:', '\nYou:', '\nAssistant:',
+        '\n\nUser:', '\n\nHuman:', '\n\nYou:', '\n\nAssistant:',
+        '<|im_start|>user', '<|im_start|>assistant'  # Qwen-specific markers
+    ]
+
+    for marker in chat_turn_markers:
+        assert marker not in response, (
+            f"Self-conversation detected: Found '{marker}' in response. "
+            f"Model should stop before generating next turn."
+        )
+
+    # Baseline: Verify we got a non-empty response
+    assert response.strip(), "Response should not be empty"
+
+@pytest.mark.live_stop_tokens
+def test_llama_regression():
+    """Ensure Llama still works (control)."""
+    runner = MLXRunner("mlx-community/Llama-3.2-3B-Instruct-4bit")
+    response = runner.generate_batch("Hi", max_tokens=20)
+
+    assert response  # Should generate something
+    assert "<|eot_id|>" not in response  # Stop token filtered
+```
+
+**Test Phases:**
+1. **Baseline:** Document current behavior (with workarounds)
+2. **Generic Fix:** Apply 2-LOC change, test all 3 models
+3. **Cleanup:** Remove obsolete workarounds if tests pass
+
+**See:** `docs/ADR/appendix/ADR-009-test-plan.md` for details
+
+---
+
+## Consequences
+
+### Positive
+
+- ✅ **State-of-the-Art:** Uses mlx-lm APIs (same as reference implementation)
+- ✅ **Minimal Code Change:** 2 LOC fix (`==` → `in`, twice)
+- ✅ **Scalable:** New models automatically supported if configs are correct
+- ✅ **Pragmatic:** Model-specific code stays if needed (with clear comments)
+- ✅ **Non-Breaking:** Existing models continue working
+
+### Negative
+
+- ⚠️ **Upstream Bugs Remain:** HuggingFace configs may be incomplete
+- ⚠️ **Test Dependency:** Requires real models (~3GB download for CI)
+- ⚠️ **Partial Coverage:** Only focus models validated, not all models
+
+### Risks & Mitigation
+
+| Risk | Mitigation |
+|------|------------|
+| Generic approach breaks MXFP4 | Keep workaround if tests fail |
+| Unknown models have issues | Users report, we fix incrementally |
+| CI becomes slow (model downloads) | Use cached models, mark tests as slow |
+
+### Trade-offs
+
+**Accepted:**
+- Not testing all models (focus on priority models only)
+- Keeping some workarounds if needed (pragmatism over purity)
+- Incremental improvement (not perfect, but better than status quo)
+
+**Rejected:**
+- Testing all models (unrealistic)
+- Removing all workarounds blindly (risky)
+- Waiting for upstream fixes (blocks progress)
+
+---
+
+## Implementation Plan
+
+**Priority:** CRITICAL (Issue #32 open since September)
+
+**Tasks:**
+1. ✅ Research findings documented (this ADR)
+2. ⏳ Implement real-model test suite (`test_stop_tokens_live.py`)
+3. ⏳ Baseline measurement (document current behavior with all 3 models)
+4. ⏳ Apply 2-LOC fix (runner/__init__.py:468, 589)
+5. ⏳ Re-test & evaluate (compare before/after behavior)
+6. ⏳ Conditional: Implement `add_eos_token()` integration (ONLY if tests fail)
+7. ⏳ Conditional: Remove obsolete workarounds (ONLY if tests pass without them)
+8. ⏳ Update TESTING.md + CHANGELOG.md + close Issue #32
+
+**Estimated Effort:** 2-3 sessions (test suite implementation is non-trivial)
+**Blocker for:** 2.0.0 stable release
+
+**Key Decision Gate:** Step 5 → Step 6 (empirical testing determines if `add_eos_token()` is needed)
+
+---
+
+## References
+
+### mlx-lm APIs Used
+
+**TokenizerWrapper** (returned by `mlx_lm.load()`):
+```python
+# Property
+tokenizer.eos_token_ids -> set[int]  # All EOS token IDs
+
+# Method
+tokenizer.add_eos_token(token: str) -> None  # Add token to EOS set
+```
+
+**Source:**
+- `mlx_lm/tokenizer_utils.py:254` (TokenizerWrapper class)
+- `mlx_lm/generate.py:701` (usage example: `if token in tokenizer.eos_token_ids`)
+
+### Internal Documents
+
+- **Research Findings (historical background):** `docs/ADR/appendix/ADR-009-research-findings.md`
+- **Live Test Plan (authoritative):** `docs/ADR/appendix/ADR-009-test-plan.md`
+- **Historical Transcripts:** `docs/ADR/appendix/ADR-009-gpt-oss-interview.md`, `docs/ADR/appendix/ADR-009-september-reasoning-discussion.md`
+
+### External References
+
+- **mlx-lm Source:** https://github.com/ml-explore/mlx-lm
+- **HuggingFace MXFP4:** https://huggingface.co/mlx-community/gpt-oss-20b-MXFP4-Q8
+
+### Related Issues
+
+- **GitHub Issue #32:** Replace custom chat format with native Chat Templates
+- **Issue #20:** End-Token filtering (defense-in-depth)
+- **ADR-010:** Reasoning Content API (Phase 2)
+
+---
+
+**Next Review:** After test suite implementation
+**Decision Makers:** Project maintainer
+**Stakeholders:** Beta.6 testers, downstream users
diff --git a/docs/ADR/ADR-011-E2E-Live-Test-Architecture.md b/docs/ADR/ADR-011-E2E-Live-Test-Architecture.md
new file mode 100644
index 0000000..a98502b
--- /dev/null
+++ b/docs/ADR/ADR-011-E2E-Live-Test-Architecture.md
@@ -0,0 +1,249 @@
+# ADR-011: E2E Live Test Architecture
+
+**Status:** Proposed (Planned for Post-Beta.6 / Stable 2.0)
+**Date:** 2025-10-21
+**Supersedes:** 1.1.1 `test_end_token_issue.py` comprehensive testing
+**Affects:** Test Suite (Stable 2.0)
+**Related:** ADR-009 (Stop Token Detection - provides Portfolio Discovery infrastructure)
+
+---
+
+## Context
+
+### Problem Statement
+
+**1.1.1 Had Comprehensive E2E Testing:**
+- `test_end_token_issue.py` validated full model portfolio
+- Server/HTTP API endpoints tested
+- Streaming vs. Non-Streaming parity (Issue #20)
+- CLI integration (`run`, `show`)
+- RAM-aware portfolio testing
+
+**2.0 Beta Gaps:**
+- 95%+ unit tests with mocks/stubs
+- <5% live tests (3 hard-coded models in `test_stop_tokens_live.py`)
+- No E2E validation for Server/HTTP/CLI paths
+- No systematic portfolio coverage beyond stop tokens
+
+**Risk:**
+Without E2E live tests, we cannot validate production behavior before Stable release:
+- Server API correctness across model portfolio
+- Streaming vs. batch parity (Issue #20 regression)
+- CLI integration with real models
+- Real-world usage patterns
+
+---
+
+## Decision
+
+### E2E Live Test Suite for Stable 2.0
+
+**Reuses ADR-009 Infrastructure:**
+- Portfolio discovery (`discover_mlx_models_in_cache()`)
+- RAM gating (`get_safe_ram_budget_gb()`, `should_skip_model()`)
+- Test fixtures (`_use_real_mlx_modules`, `requires_hf_home`)
+
+**New Test Areas:**
+
+#### 1. Server/HTTP API Validation
+```python
+# tests_2.0/live/test_server_e2e.py
+
+@pytest.mark.live_e2e
+def test_server_streaming_portfolio():
+    """Validate /v1/chat/completions SSE streaming over portfolio."""
+    for model in discover_portfolio():
+        with LocalServer(model) as server:
+            response = requests.post(f"{server.url}/v1/chat/completions",
+                                    json={"stream": True, ...})
+            # Validate SSE format, stop tokens, no visible EOS
+```
+
+#### 2. Streaming vs. Non-Streaming Parity (Issue #20)
+```python
+# tests_2.0/live/test_streaming_parity.py
+
+@pytest.mark.live_e2e
+def test_streaming_nonstreaming_parity_portfolio():
+    """Validate streaming and non-streaming produce identical output (Issue #20)."""
+    for model in discover_portfolio():
+        runner = MLXRunner(model)
+        batch_output = runner.generate_batch(prompt, max_tokens=50)
+        stream_output = "".join(runner.generate_streaming(prompt, max_tokens=50))
+
+        # Issue #20: non-streaming previously had visible stop tokens
+        assert batch_output == stream_output
+```
+
+#### 3. CLI Integration
+```python
+# tests_2.0/live/test_cli_e2e.py
+
+@pytest.mark.live_e2e
+def test_run_command_portfolio():
+    """Validate mlxk run across portfolio."""
+    for model in discover_portfolio():
+        result = subprocess.run(
+            ["mlxk", "run", model.id, "--prompt", "Test"],
+            capture_output=True, text=True
+        )
+        assert result.returncode == 0
+        assert "<|end|>" not in result.stdout
+```
+
+### Safety Requirements
+
+**Read-Only Cache Access:**
+- No pull/rm operations during tests
+- Sentinel protection (`TEST-CACHE-SENTINEL` abort)
+- Reuses ADR-007 CoW constraints
+
+**RAM Gating:**
+- Progressive budget (40%-70%, already implemented in ADR-009)
+- Auto-skip models exceeding available RAM
+
+---
+
+## Dependencies
+
+**Requires ADR-009 (Beta.6):**
+- Portfolio discovery infrastructure
+- RAM gating logic
+- Test fixtures
+
+**Relationship:**
+- **ADR-009:** Develops portfolio infrastructure, tests Runner-level stop tokens
+- **ADR-011:** Reuses portfolio infrastructure, tests E2E APIs
+
+**No overlap:** ADR-009 = Runner tests, ADR-011 = E2E tests
+
+---
+
+## Implementation Plan
+
+**Priority:** HIGH (Required for Stable 2.0)
+**Timeline:** Post-Beta.6, before Stable release
+
+**Tasks:**
+1. ⏳ **Implement ADR-009 Portfolio Discovery** (prerequisite for E2E tests)
+   - `discover_mlx_models_in_cache()` helper
+   - RAM gating logic (`should_skip_model()`)
+2. ⏳ **Server E2E Tests** (`test_server_e2e.py`)
+   - HTTP API validation
+   - SSE streaming format
+3. ⏳ **Streaming Parity Tests** (`test_streaming_parity.py`)
+   - Issue #20 regression protection
+4. ⏳ **CLI Integration Tests** (`test_cli_e2e.py`)
+   - `mlxk run` validation
+   - Exit codes, error messages
+5. ⏳ **Documentation Updates**
+   - TESTING.md: E2E test coverage section
+
+---
+
+## Implementation Status (2025-10-21)
+
+**Status: NOT STARTED**
+
+All tasks above are pending. This ADR documents the **planned architecture** for E2E tests.
+
+**Current Reality:**
+- No E2E test suite exists (`tests_2.0/live/test_server_e2e.py` etc. not created)
+- Portfolio discovery not implemented (hard-coded 3 models in `test_stop_tokens_live.py:174`)
+- ADR-009 provides **test plan** for portfolio discovery, but implementation deferred
+
+**Blocker:**
+- Requires Portfolio Discovery implementation (ADR-009 Step 1, currently incomplete)
+
+**Next Steps:**
+1. Complete ADR-009 Portfolio Discovery (Beta.6 scope)
+2. Implement E2E test suite (Post-Beta.6, pre-Stable 2.0)
+
+**Estimated Effort:** 2-3 sessions (reuses ADR-009 infrastructure)
+
+---
+
+## Test Organization
+
+**File Structure:**
+```
+tests_2.0/
+├── test_stop_tokens_live.py       # ADR-009: Runner stop tokens + portfolio
+├── live/
+│   ├── test_server_e2e.py         # ADR-011: Server/HTTP
+│   ├── test_streaming_parity.py   # ADR-011: Issue #20
+│   └── test_cli_e2e.py            # ADR-011: CLI
+```
+
+**Markers:**
+```python
+@pytest.mark.live_e2e        # E2E tests (ADR-011)
+@pytest.mark.live_stop_tokens # Stop token tests (ADR-009)
+@pytest.mark.slow            # Both
+```
+
+**Run Strategy** (see TESTING.md for details):
+```bash
+pytest -m live_stop_tokens  # ADR-009 only
+pytest -m live_e2e          # ADR-011 only
+pytest                      # Unit tests (skips all live)
+```
+
+---
+
+## Consequences
+
+### Positive
+- ✅ Production confidence before Stable release
+- ✅ 1.1.1 test parity restored
+- ✅ Issue #20/#32 regression protection
+- ✅ Portfolio coverage (not limited to 3 models)
+- ✅ Reusable infrastructure from ADR-009
+
+### Negative
+- ⚠️ Portfolio tests may take 10-30 minutes (10-50 models)
+- ⚠️ Maintenance overhead if Server API changes
+
+### Trade-offs
+
+**Accepted:**
+- Live tests remain opt-in (see TESTING.md)
+- Portfolio limited to user's cache (not all HF models)
+
+**Rejected:**
+- Testing all HuggingFace Hub models (unrealistic)
+- Hard-coding model lists (not scalable)
+
+---
+
+## References
+
+### Related Issues
+- Issue #20: End token filtering (streaming vs. non-streaming)
+- Issue #32: Stop token detection (ADR-009)
+
+### Related ADRs
+- ADR-009: Stop Token Detection Fix (provides portfolio infrastructure)
+- ADR-007: Clone Implementation (CoW constraints)
+- ADR-004: Enhanced Error Handling (error envelope validation)
+
+### 1.1.1 Test Suite
+- `test_end_token_issue.py`: Original comprehensive test (reference)
+
+---
+
+## Success Criteria
+
+**Beta.6 → Stable Transition:**
+1. ✅ ADR-009 portfolio discovery implemented
+2. ✅ Server E2E tests cover ≥3 models (MXFP4, Qwen, Llama)
+3. ✅ Streaming parity validated (Issue #20)
+4. ✅ CLI integration tested
+5. ✅ Documentation updated
+
+**Definition of Done:**
+```bash
+pytest -m live_e2e -v  # All tests pass or skip gracefully
+```
+
+No failures - only passes or skips (RAM/availability).
diff --git a/docs/ADR/README.md b/docs/ADR/README.md
new file mode 100644
index 0000000..10e5403
--- /dev/null
+++ b/docs/ADR/README.md
@@ -0,0 +1,29 @@
+# Architecture Decision Records (ADRs)
+
+## Overview
+
+This directory contains Architecture Decision Records (ADRs) that document significant architectural and design decisions for the MLX-Knife project.
+
+## Active ADRs
+
+| ADR | Title | Status | Date |
+|-----|-------|--------|------|
+| [ADR-001](ADR-001-json-api-strategy.md) | JSON API Strategy & 2.0 Migration Path | Accepted | 2025-08-28 |
+| [ADR-002](ADR-002-edge-cases.md) | Edge Cases from 1.x Test Suite | Accepted | 2025-08-28 |
+| [ADR-003](ADR-003-Server-Run-Port-to-2.0.md) | Server and Run Functionality Port from 1.x to 2.0 | Accepted | 2025-09-10 |
+| [ADR-004](ADR-004-Enhanced-Error-Logging.md) | Enhanced Error Handling & Logging | Accepted  | 2025-10-19 |
+| [ADR-005](ADR-005-Clone-Implementation-Beta3.md) | Clone Implementation Beta3 | Superseded by ADR-007 | 2025-09-18 |
+| [ADR-006](ADR-006-Clone-Implementation-Revised.md) | Clone Implementation Revised | Superseded by ADR-007 | 2025-09-18 |
+| [ADR-007](ADR-007-Clone-Implementation-Fixed.md) | Clone Implementation Fixed Strategy | Accepted | 2025-09-18 |
+| [ADR-008](ADR-008-MLXModel-Package-Format.md) | MLXModel Package Format | Accepted | 2025-10-17 |
+| [ADR-009](ADR-009-Stop-Token-Detection-Fix.md) | Stop Token Detection Fix | Accepted | 2025-10-21 |
+| [ADR-010](ADR-010-Reasoning-Content-API.md) | Reasoning Content API | Draft | 2025-10-21 |
+| [ADR-011](ADR-011-E2E-Live-Test-Architecture.md) | E2E Live Test Architecture | Accepted | 2025-10-21 |
+
+## ADR Format
+
+Each ADR follows this structure:
+- **Status**: Proposed / Accepted / Rejected / Superseded
+- **Context**: Why this decision is needed
+- **Decision**: What we decided to do
+- **Consequences**: What happens as a result
diff --git a/docs/ADR/appendix/ADR-009-gpt-oss-interview.md b/docs/ADR/appendix/ADR-009-gpt-oss-interview.md
new file mode 100644
index 0000000..3fb8f36
--- /dev/null
+++ b/docs/ADR/appendix/ADR-009-gpt-oss-interview.md
@@ -0,0 +1,239 @@
+# ADR-009 Appendix: GPT-OSS Interview (Historical Transcript)
+
+**Status:** Archived discovery log – kept for provenance; not part of the current implementation scope
+
+```bash
+mlxk run gpt-oss-20b-MXFP4-Q8
+```
+# The Chat Asking gpt-oss :D
+
+Starting interactive chat. Type 'exit' or 'quit' to end.
+
+**You:** What is a Chat Template and how to use it with MLX/chat to retrieve the proper end-token for a chat bot ?                              
+
+**Assistant:** <|channel|>analysis<|message|>We need to explain what a Chat Template is, likely referring to chat templates in the context of MLX/chat? Maybe MLX refers to a library for machine learning e.g., "mlx" by Meta? Wait, there is "mlx" open-source library for machine learning from Meta. Or "MLX chat" might refer to a specific chat interface built on top of MLX? Alternatively, Chat Template could be in the context of LLMs and tokenization: It's a template that defines how to format a conversation, including role tags like system, user, assistant, and maybe includes end tokens. The question: "What is a Chat Template and how to use it with MLX/chat to retrieve the proper end-token for a chat bot?" So we need to explain that Chat Template is like prompt template specifying the format of conversation, used by tokenization and decoding to know which tokens signal end-of-message. In MLX/chat, you would use the ChatTemplate class to format input and get end token ID. Provide code example.
+
+Let's recall MLX chat: I think there's a library "mlx.chat" that provides utilities for LLM chat. The ChatTemplate might be a class to format messages into tokens, with placeholders like {role} and {content}. You can retrieve the end token via tokenizer.eos_token_id or chat_template.get_end_token().
+
+Let's search memory: In Hugging Face tokenizers, you have .eos_token_id. For chat templates, you might use "get_end_token" method.
+
+We can propose:
+
+- ChatTemplate is a tool that maps the chat history into token IDs, respecting role-specific prefixes and suffixes defined by the underlying model's tokenizer.
+
+- It ensures that you use the correct end-of-message token (e.g., for Llama2 it's 128001 or something). The chat template might provide a property `eos_token_id`.
+
+- In MLX/chat, you instantiate a ChatTemplate with the tokenizer and optionally a template string. Then you can call `chat_template.encode(messages)` to get tokens, and `chat_template.eos_token` gives the token ID for message termination.
+
+- Example:
+
+```python
+from mlx import chat
+
+tokenizer = chat.load_tokenizer('meta/llama2-7b')
+chat_template = chat.ChatTemplate(tokenizer)
+
+messages = [
+    {"role": "system", "content":"You are a helpful assistant."},
+    {"role": "user", "content":"Hello!"}
+]
+
+input_ids = chat_template.encode(messages)
+eos_id = chat_template.eos_token  # or tokenizer.eos_token_id
+
+output_ids = model.generate(input_ids, eos_token_id=eos_id)
+```
+
+- The proper end-token ensures that the model stops generating when it reaches the token that signals message termination, preventing truncation or over-generation.
+
+Also mention that some models use separate end-of-message tokens per role; the template may provide mapping like `message_end_ids` for user, assistant.
+
+Explain usage: ChatTemplate can be used to format prompts and retrieve end token. It may provide `ChatTemplate.get_end_token(role)`.
+
+Let's craft a detailed answer: definition, purpose, how to use with MLX/chat. Provide steps, code.
+
+Also mention difference between end-of-sequence (EOS) and message-end token. In some models like Llama 2, there's a separate "end of user message" (e.g., 128001) and "assistant message" (128000). ChatTemplate ensures correct usage.
+
+Let's produce final answer.<|end|><|start|>assistant<|channel|>final<|message|>## 1. What is a **Chat Template**?
+
+A *chat template* is a small piece of logic that knows how to turn a list of “messages” (system, user, assistant) into the token stream that an LLM expects **and** knows which token(s) signal the end of a message.
+
+*Why it matters*
+
+| What you feed the model | Why it needs a template |
+|------------------------|-------------------------|
+| `["Hello"]` (plain text) | The model doesn’t know where the user’s turn ends. |
+| `{"role":"user","content":"Hello"}` | The model needs the exact string that marks *the end of a user turn* (e.g., `������` or `<|assistant|>`). |
+| Multiple turns | The model needs to keep track of *who* is speaking so it can insert the right role‑prefix and end‑token. |
+
+A chat template does two things:
+
+1. **Formats** the conversation into a sequence of tokens that respects the model’s required role prefixes and separators.
+2. **Provides** the *end‑token id(s)* that you should pass to the generation routine so the model stops at the correct boundary.
+
+### Typical components of a chat template
+
+| Component | What it looks like in code | Example for Llama‑2 |
+|-----------|---------------------------|---------------------|
+| `role_prefix` | `"��<|assistant|>"`, `"��<|user|>"` | `"<s>"` or `<|assistant|>` |
+| `role_suffix` | `"��<|assistant|>"`, `"<|assistant|>"` | `<eos>` or `<|end_of_text|>` |
+| `eos_token_id` | Integer ID of the EOS token | 128001 for user, 128000 for assistant |
+| `system_prompt` | Optional system text that is prepended once | `"You are a helpful assistant."` |
+
+---
+
+## 2. Using the Chat Template in **MLX/chat**
+
+Meta’s `mlx` library ships a lightweight chat helper that abstracts away these details. Below is a step‑by‑step walkthrough of how to:
+
+1. Load a tokenizer and model.
+2. Build the chat template (or load an existing one).
+3. Encode a conversation to get input ids.
+4. Retrieve the proper *end‑token* id for generation.
+
+> **Note**: The examples below assume you have the `mlx` package installed (`pip install mlx`). Replace `"meta/llama2-7b"` with the model you actually want.
+
+### 2.1 Import and load
+
+```python
+from mlx import chat, model, tokenizer   # MLX’s sub‑packages
+
+# Load a pretrained tokenizer
+tok = tokenizer.load("meta/llama2-7b")
+
+# Load the model (weights are loaded lazily on first use)
+llama = model.load("meta/llama2-7b")
+```
+
+### 2.2 Create a ChatTemplate
+
+You can let MLX pick the *canonical* template for the model, or you can hand‑craft one.
+
+```python
+# Option 1: Let MLX infer the default template for this tokenizer/model
+chat_tmpl = chat.ChatTemplate(tok)
+
+# Option 2: Provide a custom template string (rarely needed)
+custom_tmpl_str = (
+    "<s>"
+    "### System:\n{system}\n"
+    "### User:\n{user}"
+    "\n### Assistant: "
+)
+chat_tmpl = chat.ChatTemplate(tok, template=custom_tmpl_str)
+```
+
+> **Tip**: `chat.ChatTemplate` automatically detects the role‑prefixes/suffixes defined in the tokenizer’s `token_to_id` mapping, so you usually just need to pass the tokenizer.
+
+### 2.3 Prepare a conversation
+
+```python
+messages = [
+    {"role": "system",  "content": "You are a helpful assistant."},
+    {"role": "user",    "content": "How do I bake a cake?"},
+    # The assistant will reply next
+]
+```
+
+### 2.4 Encode the conversation
+
+```python
+input_ids = chat_tmpl.encode(messages)
+print("Input token ids:", input_ids)
+```
+
+`encode()` internally:
+
+1. Inserts role prefixes (`��<|assistant|>`, etc.).
+2. Adds the *role‑specific* end‑token after each message.
+3. Returns a `numpy.ndarray` or `torch.Tensor` of ids.
+
+### 2.5 Retrieve the *proper* end‑token id
+
+The chat template knows which token ends a message of a given role. You can fetch it like this:
+
+```python
+# For the assistant turn – the token that marks *end of assistant message*
+assistant_eos_id = chat_tmpl.eos_token_ids["assistant"]
+print("Assistant EOS token id:", assistant_eos_id)
+
+# For the user turn – if you ever need to truncate or stop on a user message
+user_eos_id = chat_tmpl.eos_token_ids["user"]
+```
+
+If you just need the *generic* EOS token (the one that terminates a sequence), most models expose it as:
+
+```python
+generic_eos_id = tok.eos_token_id  # e.g., 128001 for Llama‑2
+```
+
+### 2.6 Generate a response
+
+Now feed everything into the model’s generate routine, making sure to pass the correct `eos_token_id`:
+
+```python
+# Append a placeholder for the assistant reply; some libraries do this automatically
+input_ids = input_ids.tolist()  # convert to Python list if needed
+
+# Generate tokens until the assistant EOS token appears
+generated_ids = llama.generate(
+    input_ids,           # initial conversation
+    max_new_tokens=200,
+    eos_token_id=assistant_eos_id,  # <-- this is the key
+)
+
+# Decode the full output (conversation + assistant reply)
+full_output = tok.decode(generated_ids, skip_special_tokens=False)
+print(full_output)
+```
+
+Because you passed the **assistant‑specific** EOS token, generation stops *exactly* when the model emits that token – you won’t get a partial answer or an extra “\<eos\>” that belongs to the user turn.
+
+---
+
+## 3. Why “proper end‑token” matters
+
+| Scenario | What can go wrong if you use the wrong EOS token? |
+|----------|---------------------------------------------------|
+| **You pass the generic EOS (`<eos>`)** | The model may stop too early (after a user message) or generate garbage after the assistant reply. |
+| **You pass the user EOS** | The model will stop *before* it finishes its answer, cutting off sentences. |
+| **You don’t pass any EOS** | The model might keep generating indefinitely until a hard max‑token limit is hit. |
+
+The chat template abstracts this logic so you never have to remember which integer corresponds to which role.
+
+---
+
+## 4. Quick‑Reference Cheat Sheet
+
+| Task | Code |
+|------|------|
+| Load tokenizer & model | `tok = tokenizer.load("meta/llama2-7b")`<br>`llm = model.load("meta/llama2-7b")` |
+| Build template | `tmpl = chat.ChatTemplate(tok)` |
+| Encode messages | `ids = tmpl.encode(messages)` |
+| Get assistant EOS id | `assistant_eos = tmpl.eos_token_ids["assistant"]` |
+| Generate reply | `out = llm.generate(ids, eos_token_id=assistant_eos)` |
+| Decode output | `text = tok.decode(out)` |
+
+---
+
+## 5. Common Pitfalls & FAQs
+
+| Question | Answer |
+|----------|--------|
+| *Does every LLM have a separate assistant‑EOS token?* | Not always. Models like GPT‑3 use the same EOS for all roles, but Llama‑2, Claude, and others have role‑specific tokens (e.g., `128001` for user, `128000` for assistant). |
+| *What if my tokenizer doesn’t expose `eos_token_id`?* | Use the chat template’s mapping: `tmpl.eos_token_ids[role]`. |
+| *Can I change the role prefixes?* | Yes – pass a custom template string to `ChatTemplate`. |
+| *Why does the assistant reply sometimes start with “<s>”?* | `<s>` is the “start‑of‑sequence” token that some models require before every generation. The template automatically inserts it. |
+| *Do I need to truncate the user’s message?* | The template already adds the correct EOS after each user turn. Just keep feeding the full conversation to `generate`. |
+
+---
+
+## 6. TL;DR
+
+- **Chat Template** = a helper that formats conversations into the exact token stream your LLM expects and tells you which *end‑token* marks the end of a particular role’s turn.
+- In **MLX/chat** you create one with `chat.ChatTemplate(tokenizer)`.
+- Encode your conversation: `ids = tmpl.encode(messages)`.
+- Retrieve the proper EOS id for the role you’re generating (assistant): `eos_id = tmpl.eos_token_ids["assistant"]`.
+- Pass that to the model’s `generate`: `llm.generate(ids, eos_token_id=eos_id)`.
+
+That’s it—now your chat bot will stop generating *exactly* when the model signals that its turn is finished. Happy chatting!
diff --git a/docs/ADR/appendix/ADR-009-research-findings.md b/docs/ADR/appendix/ADR-009-research-findings.md
new file mode 100644
index 0000000..c1a0013
--- /dev/null
+++ b/docs/ADR/appendix/ADR-009-research-findings.md
@@ -0,0 +1,295 @@
+# Chat Templates, Stop Tokens & Reasoning: Konsolidierte Erkenntnisse
+
+**Status:** Historical background – Stop-token findings inform ADR-009; reasoning/API notes preserved for future ADRs  
+**Dokumentiert:** 2025-09-08 (Initial), 2025-10-21 (Konsolidierung)  
+**Related:** Issue #32, ADR-004
+
+---
+
+## Executive Summary
+
+**Problem:** Models generieren andere Stop-Tokens als konfiguriert, Reasoning-Content nicht API-konform
+**Root Cause (gefunden 2025-10-21):** HuggingFace tokenizer configs unvollständig + unser Code nutzt falsche API
+**Lösung:** 2-Phasen Approach (Beta.6: Stop-Token-Fix, 2.1+: Reasoning-API)
+
+---
+
+## Die Kernfrage (September 2025)
+**Welches End-Token gilt für welches Modell?**
+
+## Was wir gelernt haben (September → Oktober)
+
+### 1. Chat Templates sind NICHT Protokolle
+- Chat Templates sind **Jinja2-Formatierungsanweisungen**
+- Sie konvertieren strukturierte Messages zu Token-Sequenzen
+- Sie replizieren das Format aus dem Training
+- Sie definieren NICHT das Stop-Verhalten
+
+### 2. End-Token Verwirrung
+
+#### MXFP4 Modell Beispiel:
+- **EOS Token**: `<|return|>` (tokenizer config)
+- **Generiert aber**: `<|end|>` nach Messages
+- **Problem**: `<|end|>` wird nicht als Stop-Token erkannt
+- **Test erwartet**: `<|end|>` sollte gefiltert werden
+
+#### Token-Typen:
+1. **Control Tokens** (aus Training):
+   - `<|end|>` - Message-Ende Marker (MXFP4)
+   - `<|im_end|>` - Message-Ende (Qwen)
+   
+2. **Stop Tokens** (Generation beenden):
+   - `<|return|>` (MXFP4)
+   - `</s>` (Llama)
+   - `<|endoftext|>` (GPT)
+
+3. **Template Tokens** (nur Formatierung):
+   - `<|start|>`, `<|message|>` etc.
+
+### 3. Das eigentliche Problem
+
+Modelle generieren verschiedene Tokens als "ich bin fertig":
+- Manche nutzen ihr definiertes EOS Token
+- Manche nutzen gelernte Pattern aus dem Training
+- Manche nutzen beides
+
+**MLX Knife muss wissen**: 
+- Was ist das offizielle EOS Token? (aus tokenizer config)
+- Was generiert das Modell tatsächlich? (empirisch)
+- Was sollte gefiltert werden? (beide?)
+
+### 4. Unsere bisherige Implementierung
+
+```python
+# Aktuell in mlx_runner.py:
+- Extrahiert EOS aus tokenizer
+- Sucht nach "end"-ähnlichen Tokens
+- ABER: Verpasst modell-spezifische Patterns wie <|end|>
+```
+
+### 5. Server-Test Failures
+
+- **MXFP4**: Generiert `<|end|>`, wird nicht gefiltert → Test fail
+- **Qwen3**: Self-conversation (vermutlich andere Ursache)
+
+## Offene Fragen
+
+1. Sollten wir ALLE "end-like" Tokens aus dem Training als Stop-Tokens behandeln?
+2. Oder nur die explizit als EOS definierten?
+3. Wie gehen andere Implementierungen (Ollama, vLLM) damit um?
+4. Brauchen wir modell-spezifische Stop-Token Listen?
+5. **Legacy-Modelle**: Was ist mit alten Modellen ohne Chat Templates?
+   - Sind sie mit der neuen Implementation kompatibel?
+   - Brauchen wir einen Fallback auf Human:/Assistant:?
+   - Oder verweigern wir Support für template-lose Modelle?
+
+## Legacy-Modell Kompatibilität
+
+### Aktuelle Implementation
+```python
+# mlx_runner.py _format_conversation():
+if use_chat_template and hasattr(self.tokenizer, 'chat_template') and self.tokenizer.chat_template:
+    # Use chat template
+else:
+    # Fallback to _legacy_format_conversation (Human:/Assistant:)
+```
+
+### Fragen zur Klärung:
+- Gibt es überhaupt MLX-Modelle ohne Chat Templates?
+- Wenn ja, funktioniert Human:/Assistant: für diese?
+- Sollten wir sie überhaupt unterstützen?
+
+## Nächste Schritte
+
+1. **Inventur**: Welche Modelle haben keine Chat Templates?
+2. **Empirisch testen**: Welche Tokens generieren die Modelle tatsächlich?
+3. **Stop-Token Strategie**: Klare Regeln definieren
+4. **Legacy-Strategie**: Fallback oder Deprecation?
+5. **Implementation**: Robuste Token-Erkennung
+6. **Tests anpassen**: Realistische Erwartungen
+
+## Neue Erkenntnisse (Oktober 2025)
+
+### Root Cause gefunden: HuggingFace + mlx_knife Code Bugs
+
+**MXFP4 Tokenizer Config (HuggingFace):**
+```json
+{
+  "eos_token": "<|return|>",       // ID 200002
+  "eos_token_id": 200002,           // SINGLE ID (falsch!)
+  "extra_special_tokens": {}        // Leer!
+}
+```
+
+**Was richtig wäre (wie Llama 3):**
+```json
+{
+  "eos_token_id": [200002, 200007]  // ARRAY: <|return|> UND <|end|>
+}
+```
+
+**Unser Code Bug:**
+```python
+# mlxk2/core/runner/__init__.py:468, 589
+if token_id == self.tokenizer.eos_token_id:  # SINGULAR (falsch!)
+    break
+```
+
+**mlx-lm macht es richtig:**
+```python
+# mlx_lm/generate.py:stream_generate()
+if token in tokenizer.eos_token_ids:  # SET (korrekt!)
+    break
+```
+
+### mlx-lm Architektur-Analyse
+
+**Pattern:** Keine model-spezifischen Workarounds in `mlx_lm/models/*.py`
+- `gpt_oss.py`, `qwen2.py`, `llama.py` - Reine Architektur (forward pass)
+- Stop-Token Handling: Nur in `generate.py` (generisch via tokenizer metadata)
+- API: `tokenizer.add_eos_token(token)` für Runtime-Additions
+
+**Erkenntnis:** mlx-lm vertraut auf korrekte HuggingFace configs. Broken configs → broken generation.
+
+### Reasoning-Token Analyse
+
+**OpenAI o1 / Responses API:**
+- Reasoning bleibt **hidden** (nur token count sichtbar)
+- Reasoning summaries via `reasoning.summary: "auto"`
+- Keine `reasoning_content` im Chat Completions API
+
+**DeepSeek R1 API:**
+```python
+response.choices[0].message.reasoning_content  # Separates Feld!
+response.choices[0].message.content            # Final answer
+```
+
+**Status Quo (mlx_knife):**
+- Inline filtering via `StreamingReasoningParser`
+- `hide_reasoning` Parameter (bereits vorhanden)
+- Marker-basiert: `<|channel|>analysis<|message|>...` → entfernt
+
+**Problem:** Nicht API-standard-konform, Client kann Reasoning nicht separat rendern
+
+## Roadmap: 2-Phasen Approach
+
+### Phase 1: Beta.6 - Stop Token Fix (BLOCKER)
+
+**Scope:** Generische Mechanismen implementieren (KEIN Workaround-Gefrickel)
+
+**Changes:**
+1. ✅ **Fix Runner Stop-Check:**
+   ```python
+   # Vorher (broken):
+   if token_id == self.tokenizer.eos_token_id:
+
+   # Nachher (correct):
+   if token_id in self.tokenizer.eos_token_ids:
+   ```
+
+2. ✅ **Add Stop Tokens via API:**
+   ```python
+   # In _extract_stop_tokens():
+   for stop_token in self._stop_tokens:
+       self.tokenizer.add_eos_token(stop_token)
+   ```
+
+3. ✅ **Defense-in-Depth behalten:**
+   - String-based filtering (Issue #20) bleibt als Fallback
+   - Reasoning parser bleibt wie ist
+
+**Non-Scope (Beta.6):**
+- ❌ KEINE Reasoning-API Changes (breaking)
+- ❌ KEINE HuggingFace Issues melden (noch nicht)
+- ❌ KEINE model-spezifischen Workarounds (erst nach Real-Model Tests)
+
+**Test Strategy:**
+- Real-Model Test Suite (MXFP4, Qwen3, Llama3.2)
+- Validate stop token detection
+- Measure before/after behavior
+
+### Phase 2: 2.0.1+ - Reasoning API (Enhancement)
+
+**Goal:** API-standard-konforme Reasoning-Unterstützung
+
+**Design:** DeepSeek-Style (Option B)
+```python
+# Response structure:
+{
+  "choices": [{
+    "message": {
+      "content": "Final answer",           # Existing
+      "reasoning_content": "CoT...",       # NEW
+      "role": "assistant"
+    }
+  }]
+}
+```
+
+**Streaming:**
+```python
+# SSE chunks:
+data: {"choices":[{"delta":{"content":"Hello"}}]}
+data: {"choices":[{"delta":{"reasoning":"step 1..."}}]}
+```
+
+**Client Benefits:**
+- Web UI kann Reasoning optional einblenden (wie GPT-5 chat)
+- Lokale Clients haben klare API-Struktur
+- Runner code als Vorlage für broke cluster
+
+**Implementation Tasks:**
+1. Extend `ChatCompletionResponse` model
+2. Modify `StreamingReasoningParser` → separate output streams
+3. Add `include_reasoning` request parameter
+4. Update server endpoints
+5. Write API docs + examples
+
+**Breaking Changes:**
+- Opt-in: Default `include_reasoning=false` (backward compat)
+- Existing clients funktionieren weiter
+
+## Issue #32 Status Update
+
+**Original Problem (September):** Hardcodiertes Human:/Assistant: Format
+- ✅ **Gelöst:** Chat Templates werden verwendet
+
+**Problem 1 (Oktober):** Stop-Token Detection
+- 🔄 **Beta.6:** Generischer Fix (eos_token_ids Set)
+- 📅 **Status:** Implementierung anstehend
+
+**Problem 2 (Future):** Reasoning API
+- 📋 **2.0.1+:** Separate `reasoning_content` field
+- 📅 **Status:** Konzept definiert, Implementation später
+
+## Offene Fragen (für später)
+
+1. **HuggingFace Issues melden?**
+   - MXFP4 tokenizer config fix (`eos_token_id` → array)
+   - Erst nach Validation mit Real-Model Tests
+
+2. **mlx-lm Enhancement vorschlagen?**
+   - Warning wenn chat_template tokens nicht in `eos_token_ids`
+   - Bessere Docs für `--extra-eos-token`
+   - Erst nach Beta.6 Validation
+
+3. **Legacy-Modelle ohne Chat Templates?**
+   - Inventur durchführen (gibt es überhaupt welche?)
+   - Fallback behalten oder deprecaten?
+
+## Referenzen
+
+- **September Research:** docs/GPT5_CHAT_REASONING_FINDINGS.md
+- **Model Interview:** docs/GPT-OSS_EXPLAINS_CHAT-TEMPLATES.md (Appendix)
+- **Upstream Findings:** docs/ISSUES/ISSUE-32-UPSTREAM-FINDINGS.md
+- **mlx-lm Source:** https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/generate.py
+- **DeepSeek API:** https://api-docs.deepseek.com/guides/reasoning_model
+- **OpenAI Responses API:** https://cookbook.openai.com/examples/responses_api/reasoning_items
+
+---
+
+**Next Session:**
+- [ ] Implement stop token fix (Phase 1)
+- [ ] Run Real-Model Test Suite (validation)
+- [ ] Create Issue for Phase 2 (Reasoning API)
+- [ ] Consider upstream issue reports (after validation)
diff --git a/docs/ADR/appendix/ADR-009-september-reasoning-discussion.md b/docs/ADR/appendix/ADR-009-september-reasoning-discussion.md
new file mode 100644
index 0000000..8696059
--- /dev/null
+++ b/docs/ADR/appendix/ADR-009-september-reasoning-discussion.md
@@ -0,0 +1,209 @@
+## ADR-009 Appendix: Reasoning Scope Notes (September 2025)
+
+**Status:** Archived exploration – captures early reasoning API considerations now owned by future ADRs
+
+## Meine Fragen im ChatGPT-Dialog (GPT5-Thinking):
+
+```plaintext
+wie sollte ein OpenAI kompatibler Server optimal mit gpt-oss umgehen?
+...
+ich frage, wie das mlx-knife handhaben soll - eine MLX version ist verfügbar mit MXFP4 unterstützung.
+````
+
+## Antwort
+
+**Kurz gesagt:**
+
+Erwarte Reasoning nie „heimlich“ (keine impliziten Chain-of-Thought-Leaks). Erzeuge Reasoning explizit durch das Modell (structured output / function call / separate pass) und mache die Freigabe/Anzeige kontrollierbar (flags, audit, encrypt). Diese Herangehensweise ist kompatibel mit dem, was etablierte Anbieter empfehlen (Responses API → reasoning summaries), und lässt sich lokal für MLX-Modelle sauber umsetzen.  ￼ ￼
+
+⸻
+
+1 — Designprinzipien (high level)
+	1.	Explizit statt implizit: Reasoning ist nur dann Teil der Ausgabe, wenn du das Modell explizit so aufforderst (Prompt oder separate Reasoning-Pass). Default: kein CoT.
+	2.	Strukturierte Antwort bevorzugen: Fordere das Modell an, eine JSON-Struktur zu liefern: { "answer": "...", "reasoning": "...", "score": ... }. Server parst und trennt sicher.
+	3.	Feinsteuerung per Flag: Client gibt include_reasoning=none|summary|full (oder reasoning=summary) an. Server liefert nur das, wozu Client berechtigt/angefragt hat.
+	4.	Streaming mit Kanal-Metadaten: Beim Streamen markiere Reasoning-Deltas ausdrücklich (z. B. delta.reasoning vs delta.content), statt sie als normalen Text zu mischen.
+	5.	Validierung & Audit: Prüfe/validiere JSON (jsonschema). Aufbewahrung von kompletten Reasoning-Traces optional & verschlüsselt (audit).
+	6.	Fallbacks: Wenn kein strukturiertes Format kommt, suche nach Marker-Patterns (z. B. <<REASONING>>...<</REASONING>>) oder versuche post-hoc einen reasoning_summary via zweitem Prompt zu erzeugen.
+
+(Praktisch relevant für MLX-Modelle / MXFP4 — mlx-knife kann Tokenizer/chat_template aus dem Modell verwenden, siehe MF-Modelcards und MLX-ecosystem).  ￼ ￼
+
+⸻
+
+2 — API-Contract (proposal)
+
+/v1/chat/completions (OpenAI-compatible) — zusätzliche, optionale Felder:
+
+Request (relevante Felder)
+
+{
+  "model":"mlx-community/gpt-oss-20b-MXFP4-Q4",
+  "messages":[ ... ],
+  "stream": true,
+  "reasoning": "none" | "summary" | "full",      // NEW: server honoriert das
+  "reasoning_encrypt": true | false,            // optional: server speichert verschlüsselt
+  "response_schema": { "type":"json", "schema": { ... } } // optional zwingt JSON-Ausgabe
+}
+
+Response (streamed SSE chunks): klassische chat.completion.chunk mit erweitertem Delta-Shape:
+
+// Normaler answer-delta
+{
+  "object":"chat.completion.chunk",
+  "model":"...",
+  "choices":[{"index":0,"delta":{"content":"Hello, "},"finish_reason":null}]
+}
+
+// Reasoning-delta (klar gekennzeichnet)
+{
+  "object":"chat.completion.chunk",
+  "model":"...",
+  "choices":[{"index":0,"delta":{"reasoning":"step1: consider X..."},"finish_reason":null}]
+}
+
+Am Stream-Ende wie gewohnt data: [DONE].
+(Alternative OpenAI-kompatible Variante: benutze choices[].delta mit delta.content und delta.meta.reasoning=true — aber ein dedicated delta.reasoning Feld ist klarer für Clients.)
+
+Wichtig: Wenn reasoning=none, streamet der Server ausschließlich delta.content. Wenn reasoning=summary, erzeugt der Server (während oder nach der Antwort) eine kompakte natural-language reasoning_summary und streamt diese in einem eigenen delta.reasoning_summary-Event oder liefert sie als separate API-Antwort.  ￼
+
+⸻
+
+3 — Prompting / Tokenizer-Handling (praktisch für MLX)
+	•	System prompt steuert Erzeugung: wenn du response_schema verlangst, erzeugt dein prompt explizit die JSON-Ausgabe. Nutze HF-chat_template und tokenizer.apply_chat_template(...) wie gehabt (wichtig für MLX-models).  ￼ ￼
+	•	Beispiel-System-Prompt (verkürzt):
+
+System: You must return a JSON object with keys "answer" (string) and "reasoning" (string). 
+If the client requested "reasoning": "none", put reasoning = "".
+Do not include extra text outside the JSON.
+
+	•	Wenn Streaming: instruct model to emit JSON in streamable chunks (choose safe delimiters) or use incremental JSON tokenization conventions (clients must reassemble). Consider response_schema to force well-formed JSON.
+
+⸻
+
+4 — Streaming: technische Details & client expectations
+	•	Server-side: empfange backend-tokenstream; klassifiziere jedes Token/chunk als answer vs reasoning basierend auf:
+	1.	explicit JSON keys (best), oder
+	2.	markers (<<REASONING>>), oder
+	3.	function-call events (siehe unten).
+	•	Client: erwartet SSE events mit delta.content (answer) und optional delta.reasoning (reasoning). Client UI zeigt standardmäßig answer inkrementell. reasoning wird verdeckt/optional angezeigt (z. B. “Show reasoning” button) oder in dev/debug mode automatisch expanded.
+
+Beispiel SSE event payload:
+
+data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The result is 42."}}]}
+
+data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"reasoning":"I computed 6*7 because..."} } ]}
+
+data: [DONE]
+
+
+⸻
+
+5 — Drei praktikable Implementationsmuster (mit Vor-/Nachteilen)
+
+A) Structured JSON output (empfohlen)
+	•	Wie: Prompt zwingt JSON {answer, reasoning}. Server json.loads() und trennt Felder.
+	•	Pro: Robust, leicht zu validieren, kein Heuristik-Chaos.
+	•	Contra: JSON-Stream-Parsing kann knifflig; erfordert gute prompt-engineering.
+
+B) Function-Call / Tool pattern
+	•	Wie: Fordere Modell, call einer Pseudo-Funktion report_reasoning({ ... }) auszuführen (ähnlich OpenAI function call). Server fängt function_call ab — behandelt arguments als reasoning.
+	•	Pro: Natürliche Trennung; streaming-fähig (arguments können über Chunks kommen).
+	•	Contra: Erfordert funktionale Unterstützung im Backend (MLX wrappers können das aber handhaben).  ￼
+
+C) Markers in plain text (Fallback)
+	•	Wie: Model schreibt <<REASONING>> ... <</REASONING>> vor/zwischen den Antworten. Server sucht Marker.
+	•	Pro: Einfach umzusetzen bei Models ohne JSON-Disziplin.
+	•	Contra: Brüchig (Modelle vergessen Marker).
+
+⸻
+
+6 — Security / Policy / UX Regeln (wichtig)
+	•	Default: reasoning=none. Reasoning nur auf explizite Nachfrage. Logge/versichere Consent, wenn Reasoning gespeichert wird (es kann sensible Interna enthalten).
+	•	Biete reasoning_summary (automatisch generiert) statt voller CoT als Standard; das ist oft genug für debug/trace und weniger riskant. OpenAI empfiehlt genau so eine Trennung (Responses API bietet reasoning summaries).  ￼ ￼
+
+⸻
+
+7 — Fallbacks & Heuristics (wenn Modell nicht kooperiert)
+	1.	Try parse JSON → success → separate.
+	2.	Else: look for markers <<REASONING>> → split.
+	3.	Else: run post-hoc prompt: Given the answer above, summarize the reasoning steps that justify it. → attach reasoning_summary. (Das ist sicherer als rohe CoT-Leaks.)
+	4.	Wenn client requested full CoT and model refuses or output undecodable → return warning + reasoning_unavailable flag.
+
+⸻
+
+8 — Mini-FastAPI-Sketch (Streaming + JSON-separation)
+
+Nachfolgend ein kurzes Beispiel wie mlx-knife serve das trennen könnte — konzeptionell, anpassbar an Deine Backend-Adapter (MLX/mlx_lm).
+
+# server_reasoning.py (sketch)
+from fastapi import FastAPI, Request
+from fastapi.responses import StreamingResponse
+import json, time
+
+app = FastAPI()
+
+# backend stream generator -> yields raw token chunks (strings)
+def backend_stream_generate(prompt, model_id):
+    # >>> adapt to mlx_lm streaming API
+    for token in ["{","\"answer\":","\"42\"",",","\"reasoning\":\"","I multiplied...","\"","}"]:
+        yield token
+        time.sleep(0.01)
+
+def assemble_stream(prompt, model_id, reasoning_mode):
+    """
+    Reads backend token stream and yields SSE events with typed deltas:
+     - delta.content for answer
+     - delta.reasoning for reasoning
+    Uses simple JSON reassembly for demo.
+    """
+    buf = ""
+    for tok in backend_stream_generate(prompt, model_id):
+        buf += tok
+        # try to parse JSON once complete-ish
+        try:
+            obj = json.loads(buf)
+            # we assume model returned full JSON obj
+            if "answer" in obj:
+                # stream the answer as one chunk (or split further)
+                yield json.dumps({"choices":[{"index":0,"delta":{"content":obj["answer"]}}]})
+            if reasoning_mode != "none" and "reasoning" in obj:
+                yield json.dumps({"choices":[{"index":0,"delta":{"reasoning": obj["reasoning"]}}]})
+            buf = ""  # reset
+        except json.JSONDecodeError:
+            # not yet complete: do nothing (or stream tokens as best_effort)
+            continue
+    # finish
+    yield "[DONE]"
+
+@app.post("/v1/chat/completions")
+async def chat(req: Request):
+    body = await req.json()
+    reasoning = body.get("reasoning","none")  # none|summary|full
+    prompt = "<built-from-messages>"  # build via HF tokenizer in real impl
+    model_id = body.get("model")
+    def sse():
+        for event in assemble_stream(prompt, model_id, reasoning):
+            if event == "[DONE]":
+                yield "data: [DONE]\n\n"
+            else:
+                yield f"data: {event}\n\n"
+    return StreamingResponse(sse(), media_type="text/event-stream")
+
+
+⸻
+
+9 — Concrete recommendations für mlx-knife (konkret & priorisiert)
+	1.	Implementiere --reasoning flag für mlx-knife run / serve (none/summary/full). Default none.
+	2.	Support response_schema/json output im run path — benutze HF chat_template + system prompt, parse JSON, separate fields. (Das ist die robusteste Methode für MLX runs — siehe HF model pages wie mlx-community/gpt-oss-20b-MXFP4-Q4 die tokenizer.chat_template beschreiben).  ￼
+	3.	Streaming: add field types to SSE (delta.reasoning), so UIs can decide whether to render reasoning inline.
+	4.	Tool/Function approach: wenn dein backend (mlx_lm) das kann, support function-call style report_reasoning(...) so you can get reasoning as a function result.
+	5.	Post-hoc summary: wenn full CoT not available/undesired, always offer --reasoning summary that triggers a small second pass to produce a concise reasoning summary. This is lightweight & safe.  ￼
+	6.	Logging & audit: store full CoT only with explicit opt-in and encryption.
+
+⸻
+
+Quellen / weiterlese (relevant)
+	•	MLX model example (MXFP4 model card, shows mlx-lm usage & chat_template).  ￼
+	•	MLX / mlx-lm docs (how to load/generate on Apple silicon).  ￼
+	•	OpenAI Responses API / Reasoning summaries (official guidance why reasoning should be a separate capability).  ￼ ￼
+	•	Community guidance re: how to handle raw CoT in Chat Completions (advice and conventions).  ￼
diff --git a/docs/ADR/appendix/ADR-009-test-plan.md b/docs/ADR/appendix/ADR-009-test-plan.md
new file mode 100644
index 0000000..3f31ac3
--- /dev/null
+++ b/docs/ADR/appendix/ADR-009-test-plan.md
@@ -0,0 +1,134 @@
+# ADR-009 Appendix: Test Plan
+
+**Status:** Active – authoritative live-test blueprint for ADR-009  
+**Related:** ADR-009 Stop Token Detection Fix  
+**Purpose:** Real-model validation strategy for Beta.6
+
+---
+
+## Test Models
+
+### Representative Models (Initial Validation)
+
+| Model | ID | Expected Issue | Purpose |
+|-------|----|----|---------|
+| **MXFP4** | `mlx-community/gpt-oss-20b-MXFP4-Q8` | `<|end|>` visible in output | Validate stop token fix |
+| **Qwen 2.5** | `mlx-community/Qwen2.5-0.5B-Instruct-4bit` | Self-conversation (?) | Validate chat template handling |
+| **Llama 3.2** | `mlx-community/Llama-3.2-3B-Instruct-4bit` | None (control) | Regression testing |
+
+**Note:** These 3 models serve as initial validation. Full portfolio testing (below) extends coverage to all MLX models in user cache.
+
+### Portfolio Discovery (Production Validation)
+
+Instead of hard-coded models, iterate over all MLX-compatible models in user cache:
+
+```python
+def discover_mlx_models_in_cache(hf_home: str) -> List[ModelInfo]:
+    """Scan HF_HOME/hub/models--*/snapshots/* for MLX models.
+
+    Filters:
+    - MLX-compatible: Has safetensors + config.json
+    - RAM-aware: Estimates model size, skips if exceeds budget
+
+    Returns: List of discovered models with metadata
+    """
+```
+
+**RAM Gating** (already implemented in `test_stop_tokens_live.py`):
+- Progressive budget: 40% (16GB), 50% (32GB), 60% (64GB), 70% (96GB+)
+- Auto-skip models exceeding available RAM
+- See `get_safe_ram_budget_gb()`, `should_skip_model()` helpers
+
+**Safety:**
+- Read-only cache access (no pull/rm)
+- Sentinel protection (`TEST-CACHE-SENTINEL`)
+- See ADR-007 for CoW constraints
+
+---
+
+## Test Phases
+
+### Phase 1: Baseline Measurement
+
+**Goal:** Document current broken behavior
+
+**Test Case:**
+```python
+prompt = "Write one sentence about cats."
+output = runner.generate_streaming(prompt, max_tokens=50)
+```
+
+**Collect:**
+- Full generated text
+- Token IDs (if accessible)
+- Stop condition (why stopped?)
+- Visible stop tokens
+
+**Expected Baseline Results:**
+- MXFP4: `<|end|>` appears in output ✗
+- Qwen: TBD (may self-converse) ?
+- Llama: Clean output ✓
+
+### Phase 2: Fix Validation
+
+**After implementing fix, same test case**
+
+**Expected After-Fix Results:**
+- MXFP4: No stop tokens visible ✓
+- Qwen: No self-conversation ✓
+- Llama: Still works (no regression) ✓
+
+### Phase 3: Empirical Mapping
+
+**Document tokenizer configs:**
+```python
+{
+  "model": "gpt-oss",
+  "configured_eos": ["<|return|>"],     # From tokenizer
+  "generated_tokens": ["<|end|>", ...], # Empirically observed
+  "workaround_needed": True/False
+}
+```
+
+---
+
+## Test Implementation
+
+**File:** `tests_2.0/test_stop_tokens_live.py`
+
+**Markers:**
+```python
+@pytest.mark.live_stop_tokens  # Requires models downloaded
+@pytest.mark.slow              # >1 min per model
+```
+
+**Run:**
+```bash
+# Baseline
+pytest tests_2.0/test_stop_tokens_live.py::test_baseline -v -m live_stop_tokens
+
+# After fix
+pytest tests_2.0/test_stop_tokens_live.py::test_validation -v -m live_stop_tokens
+```
+
+---
+
+## Success Criteria
+
+**Initial Validation (3 Models):**
+✅ **Phase 1 Complete:** Baseline measurements documented
+✅ **Phase 2 Complete:** All 3 models pass validation tests
+✅ **Phase 3 Complete:** Empirical mapping generated (test artifact: `stop_token_config_report.json`)
+
+**Portfolio Validation (All Models in Cache):**
+⏳ **Portfolio Discovery:** Planned (currently hard-coded 3-model `TEST_MODELS` dict)
+⏳ **Cache Iterator:** Planned (`discover_mlx_models_in_cache()` not yet implemented)
+⏳ **Dynamic Validation:** Planned (scale to all models in user cache, not just 3)
+
+---
+
+## Related Documentation
+
+- **ADR-009 Main:** Implementation details, 2-LOC fix, `add_eos_token()` fallback
+- **ADR-011:** E2E Live Test Architecture (Server/HTTP/CLI validation, reuses portfolio discovery)
+- **TESTING.md:** Live test execution, markers, environment setup
diff --git a/docs/MLX-Knife-2.0-Versioning-Strategy.md b/docs/MLX-Knife-2.0-Versioning-Strategy.md
new file mode 100644
index 0000000..9ad4db3
--- /dev/null
+++ b/docs/MLX-Knife-2.0-Versioning-Strategy.md
@@ -0,0 +1,213 @@
+# MLX-Knife 2.0 Versioning Strategy
+
+**Document Status:** Living (pre-release)  
+**Purpose:** Principles for versioning and deployment of MLX‑Knife 2.0 until stable
+
+## Versioning Schema
+
+### **2.0.0-alpha** (Feature-Complete for JSON-Only)
+**Scope:** Core JSON operations without server/run functionality
+
+**Features:**
+- ✅ All 5 Operations: `list`, `health`, `show`, `pull`, `rm`
+- ✅ JSON API fully implemented per specification
+- ✅ Core functionality working (broke-cluster compatible)
+- ⚠️ Experimental features MAY be present; they MUST be clearly labeled "experimental", safe by default, and must not break existing behavior
+- ❌ Pre-release level testing
+- ❌ No `server` or `run` commands
+
+**Quality Gate (alpha):**
+- Core operations functional in isolation
+- JSON schema stable and documented
+- Basic edge case handling
+
+**Target Users:**
+- Broke-cluster integration (POC environment)
+- Early adopters for JSON automation
+- Parallel deployment alongside 1.x
+
+### **2.0.0-beta** (Feature-Complete with Server/Run)
+**Scope:** All alpha features PLUS server/run functionality from 1.x
+
+**New in Beta:**
+- ✅ `server` command - OpenAI-compatible API from 1.x
+- ✅ `run` command - Interactive model execution from 1.x  
+- ✅ Reasoning model support (GPT-OSS/MXFP4)
+- ✅ Human output backend (already in alpha.3)
+- ✅ **100% test coverage** including server/run tests
+
+**Version Strategy:**
+- `2.0.0-beta.1-local` - Initial server/run port (git tag only)
+- `2.0.0-beta.2-local` - Full reasoning support (git tag only)
+- `2.0.0-beta.3` - First public beta release (PyPI)
+
+**Quality Gate:**
+- Feature parity with 1.1.1-beta.3
+- All server/run tests passing
+- Reasoning models working
+- Documentation complete
+
+**Target Users:**
+- Internal testing and validation
+- Beta.3: Public beta testers
+- Full MLX-Knife functionality seekers
+
+### **2.0.0-rc** (Feature-Complete vs 1.x)
+**Scope:** Full feature parity with MLX-Knife 1.x
+
+**New Features:**
+- ✅ `server` command - OpenAI-compatible API server
+- ✅ `run` command - Interactive model execution
+- ✅ `embed` command - Embedding generation (if merged from 1.x)
+- ✅ Human-readable output via CLI layer formatting
+
+**Quality Gate:**
+- All 1.x functionality replicated
+- Migration path documented
+- Performance parity or better
+- Server functionality validated
+
+**Target Users:**
+- Full 1.x replacement candidates
+- Users requiring both JSON and human output
+- Server-mode applications
+
+### **2.0.0-stable**
+**Scope:** Production-ready replacement for MLX-Knife 1.x
+
+**Requirements:**
+- ✅ All RC features stable and documented
+- ✅ Migration guide with examples
+- ✅ Community feedback incorporated
+- ✅ Long-term support commitment
+- ✅ Package management (pip/brew) ready
+
+**Target Users:**
+- All MLX-Knife users
+- General availability deployment
+
+## Deployment Strategy
+
+### Broke-Cluster POC Environment
+
+**Parallel Deployment Architecture:**
+```bash
+# System-wide: MLX-Knife 1.1.0 (stable server functionality)
+pip install mlx-knife==1.1.0
+
+# Local development: MLX-Knife 2.0.0-alpha (JSON management)
+pip install -e /path/to/mlx-knife  # Local install (current 2.0 feature branch)
+```
+
+**Usage Pattern:**
+```bash
+# Server operations: Use 1.x (stable, proven)
+mlxk server --model "Phi-3-mini" --port 8000
+
+# Management operations: Use 2.0.0-alpha (JSON automation)
+mlxk-json list --json | jq '.data.models[].name'
+mlxk-json health --json | jq '.data.summary'
+mlxk-json pull "new-model" --json
+```
+
+**Benefits:**
+- ✅ **Risk mitigation**: Server stability maintained with 1.x
+- ✅ **Feature validation**: JSON API tested in production environment  
+- ✅ **Gradual migration**: Teams can adopt 2.0 features incrementally
+- ✅ **Rollback safety**: Can disable 2.0 without affecting server operations
+
+### Package Naming Strategy
+
+**Development Phase:**
+- `mlx-knife` (1.1.0) - Stable production version
+- `mlxk2` / `mlxk-json` - Development 2.0.0-alpha local install (single long‑lived 2.0 branch; releases via annotated tags)
+
+**Production Phase:**
+- `mlx-knife` (2.0.0+) - New major version
+- `mlx-knife-v1` (1.1.0) - Legacy support if needed
+
+## Quality Gates Summary
+
+| Version | Test Coverage | Features | Server Mode | Production Ready |
+|---------|---------------|----------|-------------|------------------|
+| **alpha** | ~70% (mock issues) | JSON-only (5 ops) | ❌ | Limited |
+| **beta** | 100% | JSON-only (5 ops) | ❌ | Yes (JSON) |
+| **rc** | 100% | Full parity | ✅ | Yes (All) |
+| **stable** | 100% + community | Full parity | ✅ | Yes (LTS) |
+
+## Success Metrics
+
+### Alpha Success Criteria
+- [ ] Broke-cluster integration working
+- [ ] Core JSON operations stable
+- [ ] No user cache corruption in testing
+- [ ] JSON schema documentation complete
+
+### Beta Success Criteria  
+- [ ] 100% test pass rate
+- [ ] Performance benchmarks established
+- [ ] All ADR-002 edge cases handled
+- [ ] Production deployment successful
+
+### RC Success Criteria
+- [ ] Feature parity with 1.x achieved
+- [ ] Migration guide validated
+- [ ] Server mode performance acceptable
+- [ ] Community feedback positive
+
+### Stable Success Criteria
+- [ ] 6+ months beta stability
+- [ ] Multiple production deployments
+- [ ] Documentation comprehensive
+- [ ] Long-term support plan
+
+## Timeline Estimates
+
+**Current Status:** Active alpha cycle with tagged pre‑releases
+  - JSON CLI stable for broke‑cluster use
+
+**Projected Milestones:**
+- **2.0.0-alpha**: rolling alphas (tagged), experimental features allowed but clearly marked and safe by default
+- **2.0.0-beta**: 4-6 weeks (robust testing)
+- **2.0.0-rc**: 8-12 weeks (server/run implementation)  
+- **2.0.0-stable**: 16-20 weeks (community validation)
+
+## Risk Mitigation
+
+### HuggingFace Cache Compatibility (CRITICAL)
+
+**Apple MLX Team & HuggingFace Hub Integration:**
+- **~20+ MLX ecosystem users** depend on cache stability
+- **HuggingFace Hub attention** - changes monitored by upstream
+- **Cache structure**: MLX-Knife follows HuggingFace standards
+
+**Cache Safety Guidelines:**
+```markdown
+### Shared Cache Environment Best Practices
+- **Read operations** (`list`, `health`, `show`): Always safe with concurrent processes
+- **Write operations** (`pull`, `rm`): Coordinate with team during maintenance windows
+- **Lock cleanup**: Automatic in MLX-Knife, avoid during active HuggingFace downloads
+- **User responsibility**: Coordinate cache access, no special flags needed
+```
+
+### Parallel Deployment Risks
+- **Configuration conflicts**: Different cache paths, environment variables
+- **User confusion**: Clear naming and documentation required
+- **Maintenance burden**: Supporting two codebases temporarily
+
+### Mitigation Strategies
+- **Clear separation**: Different package names, installation paths
+- **Comprehensive docs**: Usage examples, best practices, cache guidelines
+- **Automated testing**: Both versions in CI/CD pipeline
+- **Community support**: Active communication about roadmap
+
+## Decision Authority
+
+**Architecture Decisions:** Development team consensus required
+**Version Releases:** Lead maintainer approval + community review
+**Breaking Changes:** Major version bump + migration period
+**Support Policy:** LTS for stable versions, best-effort for pre-release
+
+---
+
+This versioning strategy provides a clear path from current alpha-quality code to production-ready 2.0.0 while maintaining stability through parallel deployment with 1.x versions.
\ No newline at end of file
diff --git a/docs/json-api-schema.json b/docs/json-api-schema.json
new file mode 100644
index 0000000..f34ecad
--- /dev/null
+++ b/docs/json-api-schema.json
@@ -0,0 +1,284 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "$id": "https://example.com/mlxk-json-api.schema.json",
+  "title": "MLX-Knife 2.0 JSON API 0.1.5 (current)",
+  "type": "object",
+  "additionalProperties": false,
+  "properties": {
+    "status": {"type": "string", "enum": ["success", "error"]},
+    "command": {"type": "string", "enum": ["list", "show", "health", "pull", "rm", "clone", "version", "push", "run", "server"]},
+    "api_version": {"type": "string", "pattern": "^json-[0-9]+\\.[0-9]+\\.[0-9]+$"},
+    "data": {"type": ["object", "null"]},
+    "error": {
+      "type": ["object", "null"],
+      "properties": {
+        "type": {"type": "string"},
+        "message": {"type": "string"},
+        "matches": {"type": "array", "items": {"type": "string"}},
+        "available_hashes": {"type": "array", "items": {"type": "string"}}
+      },
+      "additionalProperties": true
+    }
+  },
+  "required": ["status", "command", "data", "error"],
+  "allOf": [
+    {"$ref": "#/definitions/byCommand"}
+  ],
+  "definitions": {
+    "hash40": {"type": "string", "pattern": "^[A-Za-z0-9]{40}$"},
+    "isoUtcZ": {"type": "string", "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"},
+    "healthEntry": {
+      "type": "object",
+      "properties": {
+        "name": {"type": "string"},
+        "status": {"type": "string", "enum": ["healthy", "unhealthy"]},
+        "reason": {"type": "string"}
+      },
+      "required": ["name", "status", "reason"],
+      "additionalProperties": false
+    },
+    "modelObject": {
+      "type": "object",
+      "additionalProperties": false,
+      "properties": {
+        "name": {"type": "string"},
+        "hash": {"anyOf": [{"$ref": "#/definitions/hash40"}, {"type": "null"}]},
+        "size_bytes": {"type": "integer", "minimum": 0},
+        "last_modified": {"$ref": "#/definitions/isoUtcZ"},
+        "framework": {"type": "string", "enum": ["MLX", "GGUF", "PyTorch", "Unknown"]},
+        "model_type": {"type": "string", "enum": ["chat", "embedding", "base", "unknown"]},
+        "capabilities": {
+          "type": "array",
+          "items": {"type": "string", "enum": ["text-generation", "chat", "embeddings", "completion"]}
+        },
+        "health": {"type": "string", "enum": ["healthy", "unhealthy"]},
+        "runtime_compatible": {"type": "boolean"},
+        "reason": {"type": ["string", "null"]},
+        "cached": {"type": "boolean"}
+      },
+      "required": [
+        "name", "hash", "size_bytes", "last_modified", "framework",
+        "model_type", "capabilities", "health", "runtime_compatible", "cached"
+      ],
+      "description": "Extended in 0.1.5: added runtime_compatible (boolean) and reason (string|null) fields. Both checks always performed in JSON mode. reason describes the first problem found (integrity or runtime). Gate logic: runtime check requires integrity check first. CLI flags (--health, --runtime-check) control human output columns only, not JSON data collection."
+    },
+    "fileEntry": {
+      "type": "object",
+      "additionalProperties": false,
+      "properties": {
+        "name": {"type": "string"},
+        "size": {"type": "string"},
+        "type": {"type": "string"}
+      },
+      "required": ["name", "size", "type"]
+    },
+    "byCommand": {
+      "allOf": [
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "version"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": false,
+                "properties": {
+                  "cli_version": {"type": "string"},
+                  "json_api_spec_version": {"type": "string", "pattern": "^[0-9]+\\.[0-9]+\\.[0-9]+$"}
+                },
+                "required": ["cli_version", "json_api_spec_version"]
+              }
+            }
+          },
+          "else": {}
+        },
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "list"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": false,
+                "properties": {
+                  "models": {"type": "array", "items": {"$ref": "#/definitions/modelObject"}},
+                  "count": {"type": "integer", "minimum": 0}
+                },
+                "required": ["models", "count"]
+              }
+            }
+          },
+          "else": {}
+        },
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "show"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": true,
+                "properties": {
+                  "model": {"$ref": "#/definitions/modelObject"},
+                  "metadata": {"type": ["object", "null"]},
+                  "files": {"type": "array", "items": {"$ref": "#/definitions/fileEntry"}},
+                  "config": {"type": ["object", "null"]}
+                },
+                "required": ["model"]
+              }
+            }
+          },
+          "else": {}
+        },
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "health"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": false,
+                "properties": {
+                  "healthy": {
+                    "type": "array",
+                    "items": {"$ref": "#/definitions/healthEntry"}
+                  },
+                  "unhealthy": {
+                    "type": "array",
+                    "items": {"$ref": "#/definitions/healthEntry"}
+                  },
+                  "summary": {
+                    "type": "object",
+                    "properties": {
+                      "total": {"type": "integer", "minimum": 0},
+                      "healthy_count": {"type": "integer", "minimum": 0},
+                      "unhealthy_count": {"type": "integer", "minimum": 0}
+                    },
+                    "required": ["total", "healthy_count", "unhealthy_count"],
+                    "additionalProperties": false
+                  }
+                },
+                "required": ["healthy", "unhealthy", "summary"]
+              }
+            }
+          },
+          "else": {}
+        },
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "pull"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": true,
+                "properties": {
+                  "model": {"type": ["string", "null"]},
+                  "download_status": {"type": "string"},
+                  "message": {"type": "string"},
+                  "expanded_name": {"type": ["string", "null"]}
+                },
+                "required": ["download_status", "message"]
+              }
+            }
+          },
+          "else": {}
+        },
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "rm"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": true,
+                "properties": {
+                  "model": {"type": ["string", "null"]},
+                  "action": {"type": "string"},
+                  "message": {"type": "string"}
+                },
+                "required": ["action"]
+              }
+            }
+          },
+          "else": {}
+        },
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "push"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": true,
+                "properties": {
+                  "repo_id": {"type": "string"},
+                  "branch": {"type": "string"},
+                  "commit_sha": {"type": ["string", "null"]},
+                  "repo_url": {"type": "string"},
+                  "uploaded_files_count": {"type": ["integer", "null"], "minimum": 0},
+                  "experimental": {"type": "boolean"},
+                  "disclaimer": {"type": "string"}
+                },
+                "required": ["repo_id", "branch", "repo_url"]
+              }
+            }
+          },
+          "else": {}
+        },
+        {
+          "if": {
+            "allOf": [
+              {"properties": {"status": {"const": "success"}}},
+              {"properties": {"command": {"const": "run"}}}
+            ]
+          },
+          "then": {
+            "properties": {
+              "data": {
+                "type": "object",
+                "additionalProperties": true,
+                "properties": {
+                  "model": {"type": "string"},
+                  "prompt": {"type": "string"},
+                  "response": {"type": "string"},
+                  "tokens_generated": {"type": "integer"},
+                  "generation_time_s": {"type": "number"}
+                }
+              }
+            }
+          },
+          "else": {}
+        }
+      ]
+    }
+  }
+}
diff --git a/docs/json-api-specification.md b/docs/json-api-specification.md
new file mode 100644
index 0000000..900275d
--- /dev/null
+++ b/docs/json-api-specification.md
@@ -0,0 +1,1085 @@
+# MLX-Knife 2.0 JSON API Specification
+
+**Specification Version:** 0.1.5
+**Status:** Alpha - Subject to change
+**Target:** MLX-Knife 2.0.0-beta.4
+
+> Based on [GitHub Issue #8](https://github.com/mzau/mlx-knife/issues/8) - Comprehensive JSON output support for all commands
+
+## Motivation
+
+MLX Knife is promoted as a "scriptable" tool, but formatted terminal output makes automation difficult. JSON output enables robust scripting integration and broke-cluster compatibility.
+
+## Health Check Concepts (0.1.5)
+
+MLX Knife distinguishes between two levels of model validation:
+
+### Integrity Check (`health` field)
+- **Purpose:** Verify that downloaded model files are complete and uncorrupted
+- **Scope:** File-level validation only
+- **Checks:**
+  - Required files present (config.json, weights, tokenizer files)
+  - No Git LFS pointers instead of actual files
+  - JSON files are valid JSON
+- **States:** `"healthy"` | `"unhealthy"`
+- **Always included:** In all `modelObject` instances
+
+### Runtime Compatibility Check (`runtime_compatible` field)
+- **Purpose:** Verify that model can be executed with `mlx-lm`
+- **Scope:** Framework and model architecture validation
+- **Checks:**
+  - Framework is MLX (GGUF/PyTorch models fail)
+  - Model architecture supported by current mlx-lm version
+  - Respects `MODEL_REMAPPING` (e.g., `mistral` → `llama`)
+- **States:** `true` | `false`
+- **Always included:** In all `modelObject` instances
+
+### Gate Logic & Reason Field
+- Runtime compatibility check **requires** integrity check first
+- If integrity check fails (`health: "unhealthy"`), runtime check is skipped (`runtime_compatible: false`)
+- `reason` field describes the **first problem found**:
+  - Integrity problems take precedence
+  - Runtime problems only shown if files are healthy
+  - `null` when both checks pass (`health: "healthy"` AND `runtime_compatible: true`)
+
+### Example Scenarios
+
+**Healthy MLX Model (Compatible):**
+```json
+/* Illustrative snippet - not a complete response */
+{
+  "health": "healthy",
+  "runtime_compatible": true,
+  "reason": null
+}
+```
+
+**GGUF Model (Files OK, Not Executable):**
+```json
+/* Illustrative snippet - not a complete response */
+{
+  "health": "healthy",
+  "runtime_compatible": false,
+  "reason": "Framework GGUF not executable with mlx-lm (requires MLX)"
+}
+```
+
+**Unsupported Architecture:**
+```json
+/* Illustrative snippet - not a complete response */
+{
+  "health": "healthy",
+  "runtime_compatible": false,
+  "reason": "Model architecture 'qwen3_next' requires mlx-lm >= 0.28.0 (current: 0.27.1)"
+}
+```
+
+**Incomplete Download (Runtime Check Skipped):**
+```json
+/* Illustrative snippet - not a complete response */
+{
+  "health": "unhealthy",
+  "runtime_compatible": false,
+  "reason": "config.json missing"
+}
+```
+
+## CLI Usage
+
+All commands require the `--json` flag for JSON output:
+
+```bash
+mlxk2 list --json                      # JSON output
+mlxk2 list                             # Human-readable output
+```
+
+### Version Reporting
+
+- CLI version (human):
+  - `mlxk2 --version`
+- CLI version (JSON):
+  - `mlxk2 --version --json`
+
+JSON output example:
+```json
+{
+  "status": "success",
+  "command": "version",
+  "data": {
+    "cli_version": "2.0.0-alpha",
+    "json_api_spec_version": "0.1.2"
+  },
+  "error": null
+}
+```
+
+Notes:
+- Regular command responses (e.g., `list`, `show`) do not include a separate protocol tag; the spec version is reported by the `version` command in `data.json_api_spec_version`.
+
+## Commands Overview
+
+All commands support consistent JSON output with standardized error handling and exit codes.
+
+### Core Schema Pattern
+
+```jsonc
+{
+  "status": "success" | "error",
+  "command": "list" | "show" | "health" | "pull" | "rm" | "clone" | "version" | "push" | "run" | "server",
+  "data": { /* command-specific data */ },
+  "error": null | { "type": "string", "message": "string" }
+}
+```
+
+## Common Model Object
+
+All commands that return model information use the same minimal model object.
+
+- `name`: Full HF name `org/model`.
+- `hash`: 40-char snapshot commit of the selected snapshot, or `null`.
+- `size_bytes`: Total size in bytes of files under the selected path (snapshot preferred, else model root).
+- `last_modified`: ISO-8601 UTC timestamp (with `Z`) of the selected path.
+- `framework`: "MLX" | "GGUF" | "PyTorch" | "Unknown".
+- `model_type`: "chat" | "embedding" | "base" | "unknown".
+- `capabilities`: e.g., ["text-generation", "chat"] or ["embeddings"].
+- `health`: "healthy" | "unhealthy" (always present).
+- `runtime_compatible`: `true` | `false` (0.1.5+, always present).
+- `reason`: `string | null` (0.1.5+, describes first problem found, null when both checks pass).
+- `cached`: true.
+
+Notes:
+- No human-readable `size` field; only `size_bytes`.
+- No human-readable "modified" field; `last_modified` is authoritative.
+- No absolute filesystem paths are exposed.
+- `runtime_compatible` and `reason` fields added in spec version 0.1.5 (Issue #36).
+
+### Supported Commands
+
+| Command | Description | JSON-Only in 2.0 | Alpha Feature |
+|---------|-------------|------------------|---------------|
+| `list` | List models with metadata and hash codes | ✅ | - |
+| `show` | Detailed model inspection with files/config | ✅ | - |
+| `health` | Check model integrity and corruption | ✅ | - |
+| `pull` | Download models from HuggingFace | ✅ | - |
+| `rm` | Delete models from cache | ✅ | - |
+| `clone` | Clone models to workspace directory | ✅ | `MLXK2_ENABLE_ALPHA_FEATURES=1` |
+| `push` | Upload a local folder to Hugging Face (experimental) | ✅ | `MLXK2_ENABLE_ALPHA_FEATURES=1` |
+| `run` | Execute model inference | ✅ | - |
+| `serve`/`server` | OpenAI-compatible API server | ✅ | - |
+
+**Note:** Commands marked with Alpha Feature require `MLXK2_ENABLE_ALPHA_FEATURES=1` environment variable to be available.
+
+## Model Discovery & Metadata
+
+### Model Type & Capabilities
+
+**Model Types:**
+- `"chat"` - Language models with chat/instruction capability
+- `"embedding"` - Embedding models for vector representations
+- `"completion"` - Base models for text completion (no chat template)
+- `"unknown"` - Cannot determine model type from config
+
+**Capabilities Array:**
+- `"text-generation"` - Can generate text
+- `"chat"` - Supports chat template/instruction format
+- `"embeddings"` - Can generate embeddings
+- `"completion"` - Text completion without chat format
+
+### `mlxk-json list [pattern] --json`
+
+**Basic Usage:**
+```bash
+mlxk-json list --json                        # All models with full validation
+mlxk-json list "mlx-community" --json        # Filter by pattern
+mlxk-json list "Llama" --json                # Fuzzy matching
+```
+
+**Behavior:**
+- Returns all cached models with complete metadata
+- Performs both integrity and runtime compatibility checks (0.1.5+)
+- Pattern filter is a case-insensitive substring match on `name`
+
+**JSON Schema:**
+```json
+{
+  "status": "success",
+  "command": "list",
+  "data": {
+    "models": [
+      {
+        "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+        "hash": "a5339a41b2e3abcdefgh1234567890ab12345678",
+        "size_bytes": 4613734656,
+        "last_modified": "2024-10-15T08:23:41Z",
+        "framework": "MLX",
+        "model_type": "chat",
+        "capabilities": ["text-generation", "chat"],
+        "health": "healthy",
+        "runtime_compatible": true,
+        "reason": null,
+        "cached": true
+      },
+      {
+        "name": "mlx-community/mxbai-embed-large-v1",
+        "hash": "b5679a5f90abcdef1234567890abcdef12345678",
+        "size_bytes": 1200000000,
+        "last_modified": "2024-10-20T10:30:15Z",
+        "framework": "MLX",
+        "model_type": "embedding",
+        "capabilities": ["embeddings"],
+        "health": "healthy",
+        "runtime_compatible": true,
+        "reason": null,
+        "cached": true
+      },
+      {
+        "name": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
+        "hash": "e96c7a5f90abcdef1234567890abcdef12345678",
+        "size_bytes": 16900000000,
+        "last_modified": "2024-09-20T14:15:22Z",
+        "framework": "GGUF",
+        "model_type": "chat",
+        "capabilities": ["text-generation", "chat"],
+        "health": "healthy",
+        "runtime_compatible": false,
+        "reason": "Framework GGUF not executable with mlx-lm (requires MLX)",
+        "cached": true
+      },
+      {
+        "name": "mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit",
+        "hash": "f1234a5f90abcdef1234567890abcdef12345678",
+        "size_bytes": 45000000000,
+        "last_modified": "2024-10-01T09:15:30Z",
+        "framework": "MLX",
+        "model_type": "chat",
+        "capabilities": ["text-generation", "chat"],
+        "health": "healthy",
+        "runtime_compatible": false,
+        "reason": "Model architecture 'qwen3_next' requires mlx-lm >= 0.28.0 (current: 0.27.1)",
+        "cached": true
+      },
+      {
+        "name": "corrupted/incomplete-download",
+        "hash": "c9876a5f90abcdef1234567890abcdef12345678",
+        "size_bytes": 2500000000,
+        "last_modified": "2024-09-15T12:00:00Z",
+        "framework": "MLX",
+        "model_type": "unknown",
+        "capabilities": [],
+        "health": "unhealthy",
+        "runtime_compatible": false,
+        "reason": "config.json missing",
+        "cached": true
+      }
+    ],
+    "count": 12
+  },
+  "error": null
+}
+```
+
+**Empty Cache:**
+```json
+{
+  "status": "success",
+  "command": "list",
+  "data": {
+    "models": [],
+    "count": 0
+  },
+  "error": null
+}
+```
+
+### `mlxk-json health [pattern] --json`
+
+**Usage:**
+```bash
+mlxk-json health --json                      # Check all models
+mlxk-json health "Phi-3" --json              # Check specific pattern
+mlxk-json health "Qwen3@e96" --json          # Check specific hash
+```
+
+**Healthy Models:**
+```json
+{
+  "status": "success",
+  "command": "health",
+  "data": {
+    "healthy": [
+      {
+        "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+        "status": "healthy",
+        "reason": "Model is healthy"
+      }
+    ],
+    "unhealthy": [],
+    "summary": {
+      "total": 1,
+      "healthy_count": 1,
+      "unhealthy_count": 0
+    }
+  },
+  "error": null
+}
+```
+
+**Unhealthy Models (Real Scenario):**
+```json
+{
+  "status": "success", 
+  "command": "health",
+  "data": {
+    "healthy": [],
+    "unhealthy": [
+      {
+        "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+        "status": "unhealthy",
+        "reason": "config.json missing"
+      },
+      {
+        "name": "corrupted/model", 
+        "status": "unhealthy",
+        "reason": "LFS pointers instead of files: model.safetensors"
+      }
+    ],
+    "summary": {
+      "total": 2,
+      "healthy_count": 0,
+      "unhealthy_count": 2
+    }
+  },
+  "error": null
+}
+```
+
+**Ambiguous Pattern:**
+```json
+{
+  "status": "error",
+  "command": "health", 
+  "data": null,
+  "error": {
+    "type": "ambiguous_match",
+    "message": "Multiple models match 'Llama'",
+    "matches": [
+      "mlx-community/Llama-3.2-1B-Instruct-4bit",
+      "mlx-community/Llama-3.2-3B-Instruct-4bit"
+    ]
+  }
+}
+```
+
+### `mlxk-json show <model> --json`
+
+**Usage:**
+```bash
+mlxk-json show "Phi-3-mini" --json               # Short name expansion
+mlxk-json show "mlx-community/Phi-3-mini" --json # Full name
+mlxk-json show "Qwen3@e96" --json                # Specific hash
+mlxk-json show "Phi-3-mini" --files --json       # Include file listing
+mlxk-json show "Phi-3-mini" --config --json      # Include config.json content
+```
+
+**Basic Model Information:**
+```json
+{
+  "status": "success",
+  "command": "show",
+  "data": {
+    "model": {
+      "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+      "hash": "a5339a41b2e3abcdefgh1234567890ab12345678",
+      "size_bytes": 4613734656,
+      "framework": "MLX",
+      "model_type": "chat",
+      "capabilities": ["text-generation", "chat"],
+      "last_modified": "2024-10-15T08:23:41Z",
+      "health": "healthy",
+      "runtime_compatible": true,
+      "reason": null,
+      "cached": true
+    },
+    "metadata": {
+      "model_type": "phi3",
+      "quantization": "4bit",
+      "context_length": 4096,
+      "vocab_size": 32064,
+      "hidden_size": 3072,
+      "num_attention_heads": 32,
+      "num_hidden_layers": 32
+    }
+  },
+  "error": null
+}
+```
+
+**With Files Listing (--files):**
+```json
+{
+  "status": "success",
+  "command": "show",
+  "data": {
+    "model": {
+      "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+      "hash": "a5339a41b2e3abcdefgh1234567890ab12345678",
+      "size_bytes": 4613734656,
+      "framework": "MLX",
+      "model_type": "chat",
+      "capabilities": ["text-generation", "chat"],
+      "last_modified": "2024-10-15T08:23:41Z",
+      "health": "healthy",
+      "runtime_compatible": true,
+      "reason": null,
+      "cached": true
+    },
+    "files": [
+      {"name": "config.json", "size": "1.2KB", "type": "config"},
+      {"name": "model.safetensors", "size": "2.3GB", "type": "weights"},
+      {"name": "model-00001-of-00002.safetensors", "size": "1.8GB", "type": "weights"},
+      {"name": "model-00002-of-00002.safetensors", "size": "200MB", "type": "weights"},
+      {"name": "tokenizer.json", "size": "2.1MB", "type": "tokenizer"},
+      {"name": "tokenizer_config.json", "size": "3.4KB", "type": "config"},
+      {"name": "special_tokens_map.json", "size": "588B", "type": "config"}
+    ],
+    "metadata": null
+  },
+  "error": null
+}
+```
+
+**With Config Content (--config):**
+```json
+{
+  "status": "success",
+  "command": "show",
+  "data": {
+    "model": {
+      "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+      "hash": "a5339a41b2e3abcdefgh1234567890ab12345678",
+      "size_bytes": 4613734656,
+      "framework": "MLX",
+      "model_type": "chat",
+      "capabilities": ["text-generation", "chat"],
+      "last_modified": "2024-10-15T08:23:41Z",
+      "health": "healthy",
+      "runtime_compatible": true,
+      "reason": null,
+      "cached": true
+    },
+    "config": {
+      "architectures": ["Phi3ForCausalLM"],
+      "model_type": "phi3",
+      "vocab_size": 32064,
+      "hidden_size": 3072,
+      "intermediate_size": 8192,
+      "num_hidden_layers": 32,
+      "num_attention_heads": 32,
+      "max_position_embeddings": 4096,
+      "rope_theta": 10000.0,
+      "quantization": {
+        "bits": 4,
+        "group_size": 64
+      }
+    },
+    "metadata": null
+  },
+  "error": null
+}
+```
+
+**Model Not Found:**
+```json
+{
+  "status": "error",
+  "command": "show",
+  "data": null,
+  "error": {
+    "type": "model_not_found",
+    "message": "No model found matching 'nonexistent-model'"
+  }
+}
+```
+
+**Ambiguous Match:**
+```json
+{
+  "status": "error",
+  "command": "show",
+  "data": null,
+  "error": {
+    "type": "ambiguous_match",
+    "message": "Multiple models match 'Llama'",
+    "matches": [
+      "mlx-community/Llama-3.2-1B-Instruct-4bit",
+      "mlx-community/Llama-3.2-3B-Instruct-4bit"
+    ]
+  }
+}
+```
+
+## Changes in 0.1.5 (Alpha)
+
+**Issue #36: Separate Integrity and Runtime Compatibility Checks**
+
+- Added `runtime_compatible: boolean` field to `modelObject`
+- Added `reason: string | null` field to `modelObject`
+- Both fields always present in JSON output
+- `runtime_compatible` checks:
+  - Framework must be MLX (GGUF/PyTorch fail)
+  - Model architecture must be supported by installed mlx-lm version
+  - Respects `MODEL_REMAPPING` for aliased architectures
+- Gate logic: Runtime check requires passing integrity check first
+- `reason` field describes first problem found (integrity > runtime priority)
+
+## Changes in 0.1.2 (Alpha)
+
+- Introduced a common minimal Model Object for consistency across commands.
+- Replaced human-readable `size` with machine-friendly `size_bytes`.
+- Removed human-readable `modified`; `last_modified` (ISO-8601 UTC) is authoritative.
+
+## Operations
+
+### `mlxk-json pull <model> --json`
+
+**Usage:**
+```bash
+mlxk-json pull "Phi-3-mini" --json               # Short name expansion
+mlxk-json pull "mlx-community/Phi-3-mini" --json # Full name
+mlxk-json pull "microsoft/DialoGPT-small" --json # Non-MLX model
+```
+
+**Successful Download:**
+```json
+{
+  "status": "success",
+  "command": "pull",
+  "data": {
+    "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+    "download_status": "success",
+    "message": "Successfully downloaded model",
+    "expanded_name": "mlx-community/Phi-3-mini-4k-instruct-4bit"
+  },
+  "error": null
+}
+```
+
+**Already Exists (Bug - doesn't detect corruption):**
+```json
+{
+  "status": "success",
+  "command": "pull",
+  "data": {
+    "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+    "download_status": "already_exists",
+    "message": "Model mlx-community/Phi-3-mini-4k-instruct-4bit already exists in cache",
+    "expanded_name": null
+  },
+  "error": null
+}
+```
+
+**Download Failed:**
+```json
+{
+  "status": "error",
+  "command": "pull",
+  "data": {
+    "model": "nonexistent/model",
+    "download_status": "failed",
+    "message": "",
+    "expanded_name": null
+  },
+  "error": {
+    "type": "download_failed",
+    "message": "Repository not found for url: https://huggingface.co/api/models/nonexistent/model"
+  }
+}
+```
+
+**Validation Error:**
+```json
+{
+  "status": "error",
+  "command": "pull", 
+  "data": {
+    "model": null,
+    "download_status": "error",
+    "message": "",
+    "expanded_name": null
+  },
+  "error": {
+    "type": "ValidationError",
+    "message": "Model name too long: 105/96 characters"
+  }
+}
+```
+
+**Ambiguous Match:**
+```json
+{
+  "status": "error",
+  "command": "pull",
+  "data": {
+    "model": null,
+    "download_status": "unknown",
+    "message": "",
+    "expanded_name": null
+  },
+  "error": {
+    "type": "ambiguous_match",
+    "message": "Multiple models match 'Llama'",
+    "matches": [
+      "mlx-community/Llama-3.2-1B-Instruct-4bit",
+      "mlx-community/Llama-3.2-3B-Instruct-4bit"
+    ]
+  }
+}
+```
+
+### `mlxk-json rm <model> [--force] --json`
+
+**Usage:**
+```bash
+mlxk-json rm "Phi-3-mini" --json                 # Direct deletion (no locks)
+mlxk-json rm "Phi-3-mini" --force --json         # Force deletion (ignores locks)
+mlxk-json rm "locked-model" --json               # Error: requires --force due to locks
+```
+
+**Successful Deletion:**
+```json
+{
+  "status": "success", 
+  "command": "rm",
+  "data": {
+    "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+    "action": "deleted",
+    "message": "Successfully deleted mlx-community/Phi-3-mini-4k-instruct-4bit"
+  },
+  "error": null
+}
+```
+
+**Model has Active Locks (requires --force):**
+```json
+{
+  "status": "error",
+  "command": "rm", 
+  "data": {
+    "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+    "locks_detected": true,
+    "lock_files": [".locks/model-lock-12345.lock"]
+  },
+  "error": {
+    "type": "locks_present",
+    "message": "Model has active locks. Use --force to override."
+  }
+}
+```
+
+**Model Not Found:**
+```json
+{
+  "status": "error",
+  "command": "rm",
+  "data": null,
+  "error": {
+    "type": "model_not_found",
+    "message": "No models found matching 'nonexistent-model'"
+  }
+}
+```
+
+**Ambiguous Pattern:**
+```json
+{
+  "status": "error",
+  "command": "rm",
+  "data": {
+    "matches": [
+      "mlx-community/Llama-3.2-1B-Instruct-4bit",
+      "mlx-community/Llama-3.2-3B-Instruct-4bit"
+    ]
+  },
+  "error": {
+    "type": "ambiguous_match", 
+    "message": "Multiple models match 'Llama'. Please specify which model to delete."
+  }
+}
+```
+
+**Permission Error:**
+```json
+{
+  "status": "error",
+  "command": "rm",
+  "data": {
+    "model": "mlx-community/Phi-3-mini-4k-instruct-4bit"
+  },
+  "error": {
+    "type": "PermissionError",
+    "message": "Permission denied: Cannot delete read-only files"
+  }
+}
+```
+
+### `mlxk-json clone <model> <target_dir> --json`
+
+**Requires:** `MLXK2_ENABLE_ALPHA_FEATURES=1`
+
+**Usage:**
+```bash
+mlxk-json clone "Phi-3-mini" ./workspace --json              # Clone to workspace directory
+mlxk-json clone "mlx-community/Phi-3-mini" ./my-model --json # Full name to custom directory
+mlxk-json clone "microsoft/DialoGPT-small" ./workspace --json # Non-MLX model
+```
+
+**Successful Clone:**
+```json
+{
+  "status": "success",
+  "command": "clone",
+  "data": {
+    "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+    "clone_status": "success",
+    "message": "Cloned to ./workspace",
+    "target_dir": "./workspace",
+    "expanded_name": "mlx-community/Phi-3-mini-4k-instruct-4bit"
+  },
+  "error": null
+}
+```
+
+**Target Directory Not Empty:**
+```json
+{
+  "status": "error",
+  "command": "clone",
+  "data": {
+    "model": null,
+    "clone_status": "error",
+    "target_dir": "./workspace"
+  },
+  "error": {
+    "type": "ValidationError",
+    "message": "Target directory './workspace' already exists and is not empty"
+  }
+}
+```
+
+**Clone Failed:**
+```json
+{
+  "status": "error",
+  "command": "clone",
+  "data": {
+    "model": "nonexistent/model",
+    "clone_status": "failed",
+    "target_dir": "./workspace"
+  },
+  "error": {
+    "type": "clone_failed",
+    "message": "Repository not found for url: https://huggingface.co/api/models/nonexistent/model"
+  }
+}
+```
+
+**Access Denied:**
+```json
+{
+  "status": "error",
+  "command": "clone",
+  "data": {
+    "model": "gated/model",
+    "clone_status": "access_denied",
+    "target_dir": "./workspace"
+  },
+  "error": {
+    "type": "access_denied",
+    "message": "Access denied: gated/private model 'gated/model'. Accept terms and set HF_TOKEN."
+  }
+}
+```
+
+**APFS Filesystem Error:**
+```json
+{
+  "status": "error",
+  "command": "clone",
+  "data": {
+    "model": "org/model",
+    "clone_status": "filesystem_error",
+    "target_dir": "./workspace"
+  },
+  "error": {
+    "type": "FilesystemError",
+    "message": "APFS required for clone operations."
+  }
+}
+```
+
+### `mlxk-json push <dir> <org/model> [--create] [--private] [--branch <b>] [--commit "..."] [--verbose] [--check-only] --json`
+
+**Requires:** `MLXK2_ENABLE_ALPHA_FEATURES=1`
+
+Behavior:
+- Requires `HF_TOKEN` env.
+- Default branch: `main` (subject to change).
+- Fails if repo missing unless `--create` is provided.
+- Sends folder as-is to the specified branch using `huggingface_hub.upload_folder`.
+ - `--verbose` affects only human output; JSON remains unchanged in structure.
+ - `--check-only` performs a local, content-oriented workspace validation and does not contact the Hub (no token required). Results are included under `data.workspace_health`.
+
+Successful Upload (with changes):
+```json
+{
+  "status": "success",
+  "command": "push",
+  "data": {
+    "repo_id": "org/model",
+    "branch": "main",
+    "commit_sha": "abcdef1234567890abcdef1234567890abcdef12",
+    "commit_url": "https://huggingface.co/org/model/commit/abcdef1",
+    "repo_url": "https://huggingface.co/org/model",
+    "uploaded_files_count": 3,
+    "local_files_count": 11,
+    "no_changes": false,
+    "created_repo": false,
+    "change_summary": {"added": 1, "modified": 2, "deleted": 0},
+    "message": "Push successful. Clone operations require APFS filesystem.",
+    "experimental": true,
+    "disclaimer": "Experimental feature (M0: upload only). No validation/filters; review on the Hub."
+  },
+  "error": null
+}
+```
+
+No Changes (no-op commit avoided):
+```json
+{
+  "status": "success",
+  "command": "push",
+  "data": {
+    "repo_id": "org/model",
+    "branch": "main",
+    "commit_sha": null,
+    "commit_url": null,
+    "repo_url": "https://huggingface.co/org/model",
+    "uploaded_files_count": 0,
+    "local_files_count": 11,
+    "no_changes": true,
+    "created_repo": false,
+    "message": "No files changed; skipped empty commit.",
+    "experimental": true,
+    "disclaimer": "Experimental feature (M0: upload only). No validation/filters; review on the Hub.",
+    "hf_logs": ["No files have been modified since last commit. Skipping to prevent empty commit."]
+  },
+  "error": null
+}
+```
+
+Check-only (no network):
+```json
+{
+  "status": "success",
+  "command": "push",
+  "data": {
+    "repo_id": "org/model",
+    "branch": "main",
+    "commit_sha": null,
+    "commit_url": null,
+    "repo_url": "https://huggingface.co/org/model",
+    "local_files_count": 11,
+    "no_changes": null,
+    "created_repo": false,
+    "message": "Check-only: no upload performed.",
+    "workspace_health": {
+      "files_count": 11,
+      "total_bytes": 289612345,
+      "config": {"exists": true, "valid_json": true, "path": "/path/to/config.json"},
+      "weights": {"count": 3, "formats": ["safetensors"], "index": {"has_index": true, "missing": []}, "pattern_complete": true},
+      "anomalies": [],
+      "healthy": true
+    },
+    "experimental": true,
+    "disclaimer": "Experimental feature (M0: upload only). No validation/filters; review on the Hub."
+  },
+  "error": null
+}
+```
+
+Missing Token:
+```json
+{
+  "status": "error",
+  "command": "push",
+  "data": {
+    "repo_id": "org/model",
+    "branch": "main",
+    "repo_url": "https://huggingface.co/org/model",
+    "uploaded_files_count": null,
+    "local_files_count": null,
+    "no_changes": null,
+    "created_repo": false,
+    "experimental": true,
+    "disclaimer": "Experimental feature (M0: upload only). No validation/filters; review on the Hub."
+  },
+  "error": {
+    "type": "auth_error",
+    "message": "HF_TOKEN not set"
+  }
+}
+```
+
+## Error Handling
+
+**All errors follow consistent format with detailed error types:**
+
+### Error Types
+
+**Validation Errors:**
+- `ValidationError` - Invalid input (96 char limit, empty names)
+- `ambiguous_match` - Multiple models match pattern
+- `model_not_found` - No models match pattern
+
+**Network Errors:**
+- `download_failed` - HuggingFace API errors, network timeouts
+- `NetworkError` - Connection issues
+
+**System Errors:**
+- `PermissionError` - File system permission denied
+- `OperationError` - Cache corruption, disk full
+- `InternalError` - Unexpected system errors
+
+**Error Response Schema:**
+```json
+{
+  "status": "error",
+  "command": "pull",
+  "data": { /* partial data if available */ },
+  "error": {
+    "type": "ValidationError",
+    "message": "Repository name exceeds HuggingFace Hub limit: 105/96 characters"
+  }
+}
+```
+
+### Real-World Error Examples
+
+**Cache Corruption (Health Check Bug):**
+```json
+{
+  "status": "success",
+  "command": "health", 
+  "data": {
+    "healthy": [],
+    "unhealthy": [{
+      "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+      "status": "unhealthy",
+      "reason": "config.json missing"
+    }],
+    "summary": {
+      "total": 1,
+      "healthy_count": 0,
+      "unhealthy_count": 1
+    }
+  },
+  "error": null
+}
+```
+
+**Pull Refuses Corrupted Model (Bug):**
+```json
+{
+  "status": "success",
+  "command": "pull",
+  "data": {
+    "download_status": "already_exists",
+    "message": "Model already exists in cache"
+  },
+  "error": null
+}
+```
+
+## Agent Integration Examples
+
+**Model Management Automation:**
+```bash
+# List all MLX models with hashes
+mlxk-json list --json | jq -r '.data.models[] | select(.framework=="MLX") | "\(.name)@\(.hash)"'
+
+# Get model hashes for pattern matching
+mlxk-json list "Qwen" --json | jq -r '.data.models[] | .hash'
+
+# Count models by framework
+mlxk-json list --json | jq '.data.models | group_by(.framework) | map({framework: .[0].framework, count: length})'
+
+# Health summary
+mlxk-json health --json | jq '.data.summary'
+
+# Find unhealthy models
+mlxk-json health --json | jq -r '.data.unhealthy[].name'
+
+# Filter by pattern
+mlxk-json list "Llama" --json | jq '.data.count'
+
+# Model sizes with hashes
+mlxk-json list --json | jq -r '.data.models[] | "\(.name)@\(.hash): \(.size_bytes)"'
+
+# Get detailed model info
+mlxk-json show "Phi-3-mini" --json | jq '.data.model'
+
+# List all files in a model
+mlxk-json show "Phi-3-mini" --files --json | jq -r '.data.files[] | "\(.name): \(.size)"'
+
+# Extract model config
+mlxk-json show "Phi-3-mini" --config --json | jq '.data.config.quantization'
+```
+
+**Automated Health Monitoring:**
+```bash
+#!/bin/bash
+# Check if any models are unhealthy
+unhealthy_count=$(mlxk-json health --json | jq '.data.summary.unhealthy_count')
+if [ "$unhealthy_count" -gt 0 ]; then
+  echo "Warning: $unhealthy_count unhealthy models found"
+  mlxk-json health --json | jq -r '.data.unhealthy[] | "UNHEALTHY: \(.name) - \(.reason)"'
+fi
+```
+
+**Batch Operations:**
+```bash
+# Pull multiple models
+for model in "Phi-3-mini" "Llama-3.2-1B"; do
+  echo "Pulling $model..."
+  mlxk-json pull "$model" --json | jq '.data.download_status'
+done
+
+# Clean up old models
+mlxk-json list --json | jq -r '.data.models[] | select(.size | test("GB")) | .name' | while read model; do
+  echo "Found large model: $model"
+done
+```
+
+## Design Principles
+
+- **No implementation details:** No cache paths, internal directories, or implementation specifics
+- **No user-specific data:** No usernames in paths or environment-dependent information  
+- **Consistent schema:** All commands follow same `status/command/data/error` structure
+- **Scriptable output:** Rich structured data optimized for `jq` and automation
+- **Backward compatible:** Exit codes remain unchanged for script compatibility
+
+## Exit Codes
+
+All commands use consistent exit codes for scripting:
+
+- `0` - Success
+- `1` - General error (validation, not found, etc.)
+- `2` - Network/download error
+- `3` - Permission/filesystem error
+
+## Version History
+
+- **2.0.0-alpha:** JSON-only implementation with `mlxk-json --json`
+- **2.0.0-alphha.1:** Full implementation with both JSON and human-readable output
+- **2.0.0-alphha.2:** Push function protocol extension (json-0.1.3)
diff --git a/docs/model-naming-specification.md b/docs/model-naming-specification.md
new file mode 100644
index 0000000..e5ab86e
--- /dev/null
+++ b/docs/model-naming-specification.md
@@ -0,0 +1,67 @@
+# MLX-Knife Model Naming Specification
+
+## Fundamental Mapping Rules
+
+### Basic Conversion
+**Universal conversion:** `--` ↔ `/` (all occurrences)
+
+**External → Internal:** `org/sub/model` becomes `models--org--sub--model`  
+**Internal → External:** `models--org--sub--model` becomes `org/sub/model`
+
+### Character Constraints (Clean Names)
+
+**External names (clean):**
+- ✅ Maximum **one `-`** consecutive (single dashes allowed)
+- ✅ `/` as path separators
+- ❌ Never `--` (double dashes forbidden)
+
+**Internal cache (clean):**  
+- ✅ Maximum **two `-`** consecutive (`--` as separators only)
+- ✅ Single `-` within names
+- ❌ Never `---` or more (triple+ dashes forbidden)
+
+### Why These Rules?
+
+```
+✅ Clean conversion:
+External: org-name/model-v1
+Internal: models--org-name--model-v1
+
+❌ Rule violation creates chaos:  
+External: org--invalid/model  (double dash = forbidden!)
+Internal: models--org----model  (quadruple dash = chaos!)
+```
+
+## Examples (Clean Names)
+
+| External | Internal Cache Directory |
+|----------|--------------------------|
+| `microsoft/DialoGPT-small` | `models--microsoft--DialoGPT-small` |
+| `org/sub/model` | `models--org--sub--model` |
+| `single-model` | `models--single-model` |
+
+## MLX-Knife Implementation: Tolerant Handling
+
+### Robustness Philosophy
+**"Be liberal in what you accept"** - MLX-Knife handles rule violations gracefully.
+
+### Error Handling for Corrupted Cache
+**When reading entries that violate rules:** Mechanical 1:1 conversion without validation
+
+```
+Cache: models--microsoft--DialogGPT---small  (3 dashes = rule violation)
+↓ Mechanical conversion: ALL "--" → "/" 
+External: microsoft/DialogGPT/-small  (empty path segment visible)
+```
+
+**Benefits:**
+- ✅ System remains functional (no crashes)
+- ⚠️ Problems become visible (user sees `DialogGPT/-small`)  
+- 🔍 User can identify and fix corrupted entries
+- 🛠️ No complex error handling required
+
+## Compatibility
+
+✅ **HuggingFace Hub:** Compatible with standard `org/model` format  
+✅ **Future-proof:** Supports deeper hierarchies like `org/sub/model`  
+✅ **Robust:** Converts corrupted cache entries without failing
\ No newline at end of file
diff --git a/mlx_knife/__init__.py b/mlx_knife/__init__.py
deleted file mode 100644
index 07f2230..0000000
--- a/mlx_knife/__init__.py
+++ /dev/null
@@ -1,42 +0,0 @@
-"""MLX Knife - HuggingFace-style cache management for MLX models.
-
-A lightweight, ollama-like CLI for managing and running MLX models on Apple Silicon.
-Provides native MLX execution with streaming output and interactive chat capabilities.
-"""
-
-# Suppress urllib3 LibreSSL warning on macOS system Python 3.9 (must be before any imports that use urllib3)
-import warnings
-
-warnings.filterwarnings('ignore', message='urllib3 v2 only supports OpenSSL 1.1.1+')
-
-__version__ = "1.1.1"
-__author__ = "The BROKE team"
-__email__ = "broke@gmx.eu"
-__license__ = "MIT"
-__description__ = "ollama-style CLI for MLX models on Apple Silicon"
-__url__ = "https://github.com/mzau/mlx-knife"
-
-# Version tuple for programmatic access (major, minor, patch)
-VERSION = (1, 1, 1)
-
-# Core functionality imports
-from .cache_utils import (
-    check_all_models_health,
-    check_model_health,
-    list_models,
-    rm_model,
-    show_model,
-)
-from .hf_download import pull_model
-from .mlx_runner import MLXRunner
-
-__all__ = [
-    "__version__",
-    "list_models",
-    "show_model",
-    "check_model_health",
-    "check_all_models_health",
-    "rm_model",
-    "pull_model",
-    "MLXRunner",
-]
diff --git a/mlx_knife/cache_utils.py b/mlx_knife/cache_utils.py
deleted file mode 100644
index 00a2b74..0000000
--- a/mlx_knife/cache_utils.py
+++ /dev/null
@@ -1,1072 +0,0 @@
-# mlx_knife/cache_utils.py
-
-import datetime
-import json
-import os
-import shutil
-import sys
-from pathlib import Path
-
-# Issue #31 hints reader
-from .model_card import read_readme_front_matter, tokenizer_has_chat_template
-
-DEFAULT_CACHE_ROOT = Path.home() / ".cache/huggingface"
-CACHE_ROOT = Path(os.environ.get("HF_HOME", DEFAULT_CACHE_ROOT))
-MODEL_CACHE = CACHE_ROOT / "hub"
-
-# Global variable to track if warning was shown
-_legacy_warning_shown = False
-
-# Check for models in legacy location and warn user
-_legacy_models = list(CACHE_ROOT.glob("models--*"))
-_is_test_env = "test_cache" in str(CACHE_ROOT) or "PYTEST_CURRENT_TEST" in os.environ
-if _legacy_models and not _legacy_warning_shown and not _is_test_env:
-    print(f"\n⚠️  Found {len(_legacy_models)} models in legacy location: {CACHE_ROOT}")
-    print(f"   Please move them to: {MODEL_CACHE}")
-    print(f"   Command: mv {CACHE_ROOT}/models--* {MODEL_CACHE}/")
-    print("   This warning will appear until models are moved.\n")
-    _legacy_warning_shown = True
-
-
-def hf_to_cache_dir(hf_name: str) -> str:
-    if hf_name.startswith("models--"):
-        return hf_name
-    if "/" in hf_name:
-        org, model = hf_name.split("/", 1)
-        return f"models--{org}--{model}"
-    else:
-        return f"models--{hf_name}"
-
-def cache_dir_to_hf(cache_name: str) -> str:
-    if cache_name.startswith("models--"):
-        remaining = cache_name[len("models--"):]
-        if "--" in remaining:
-            parts = remaining.split("--", 1)
-            return f"{parts[0]}/{parts[1]}"
-        else:
-            return remaining
-    return cache_name
-
-def expand_model_name(model_name):
-    if "/" in model_name:
-        return model_name
-    mlx_candidate = f"mlx-community/{model_name}"
-    mlx_cache_dir = MODEL_CACHE / hf_to_cache_dir(mlx_candidate)
-    if mlx_cache_dir.exists():
-        return mlx_candidate
-    common_mlx_patterns = [
-        "Llama-", "Qwen", "Mistral", "Phi-", "Mixtral", "phi-", "deepseek"
-    ]
-    for pattern in common_mlx_patterns:
-        if pattern in model_name:
-            return f"mlx-community/{model_name}"
-    return model_name
-
-def find_matching_models(pattern):
-    """Find models that match a partial pattern. Returns a list of (model_dir, hf_name) tuples."""
-    all_models = [d for d in MODEL_CACHE.iterdir() if d.name.startswith("models--")]
-    matches = []
-    
-    for model_dir in all_models:
-        hf_name = cache_dir_to_hf(model_dir.name)
-        # Check if the pattern appears in the model name (case insensitive)
-        if pattern.lower() in hf_name.lower():
-            matches.append((model_dir, hf_name))
-    
-    return matches
-
-def hash_exists_in_local_cache(model_name, commit_hash):
-    """Check if a specific commit hash exists in the local cache for a model.
-    
-    Supports both full hashes and short hash prefixes (local resolution only).
-    
-    Args:
-        model_name: Full model name (e.g., 'mlx-community/Phi-3-mini-4k-instruct-4bit')
-        commit_hash: Commit hash to check for (short or full)
-    
-    Returns:
-        Full hash if exists in local cache, None otherwise
-    """
-    base_cache_dir = MODEL_CACHE / hf_to_cache_dir(model_name)
-    if not base_cache_dir.exists():
-        return None
-    
-    snapshots_dir = base_cache_dir / "snapshots"
-    if not snapshots_dir.exists():
-        return None
-    
-    # Check for exact match first (full hash)
-    hash_dir = snapshots_dir / commit_hash
-    if hash_dir.exists():
-        return commit_hash
-    
-    # Check for short hash match (local resolution)
-    if len(commit_hash) < 40:
-        for snapshot_dir in snapshots_dir.iterdir():
-            if snapshot_dir.is_dir() and snapshot_dir.name.startswith(commit_hash):
-                return snapshot_dir.name  # Return full hash
-    
-    return None
-
-def resolve_single_model(model_spec):
-    """
-    Resolve a model spec to a single model, supporting fuzzy matching.
-    Returns (model_path, model_name, commit_hash) or (None, None, None) if failed.
-    Prints appropriate error messages for ambiguous matches.
-    """
-    # Parse the model spec (handles @commit_hash syntax)
-    model_name, commit_hash = parse_model_spec(model_spec)
-    
-    # Try exact match first
-    base_cache_dir = MODEL_CACHE / hf_to_cache_dir(model_name)
-    if base_cache_dir.exists():
-        return get_model_path(model_spec)
-    
-    # Extract the base name (without @commit_hash) for fuzzy matching
-    base_spec = model_spec.split('@')[0] if '@' in model_spec else model_spec
-    
-    # Try fuzzy matching
-    matches = find_matching_models(base_spec)
-    
-    if not matches:
-        print(f"No models found matching '{base_spec}'!")
-        return None, None, None
-    elif len(matches) == 1:
-        # Unambiguous match - use the found model with the original commit hash (if any)
-        found_model_dir, found_hf_name = matches[0]
-        if commit_hash:
-            resolved_spec = f"{found_hf_name}@{commit_hash}"
-        else:
-            resolved_spec = found_hf_name
-        return get_model_path(resolved_spec)
-    elif len(matches) > 1 and commit_hash:
-        # Issue #13: Hash-based disambiguation for ambiguous model names
-        for _model_dir, hf_name in matches:
-            resolved_hash = hash_exists_in_local_cache(hf_name, commit_hash)
-            if resolved_hash:
-                resolved_spec = f"{hf_name}@{resolved_hash}"
-                return get_model_path(resolved_spec)
-        
-        # Hash not found in any candidate model
-        print(f"Hash '{commit_hash}' not found in any model matching '{base_spec}'")
-        print("Available models:")
-        for _, hf_name in sorted(matches, key=lambda x: x[1]):
-            print(f"  {hf_name}")
-        return None, None, None
-    else:
-        # Multiple matches without hash - show error with suggestions
-        print(f"Multiple models match '{base_spec}'. Please be more specific:")
-        for _, hf_name in sorted(matches, key=lambda x: x[1]):
-            print(f"  {hf_name}")
-        return None, None, None
-
-def get_model_path(model_spec):
-    model_name, commit_hash = parse_model_spec(model_spec)
-    base_cache_dir = MODEL_CACHE / hf_to_cache_dir(model_name)
-    if not base_cache_dir.exists():
-        return None, model_name, commit_hash
-    if commit_hash:
-        hash_dir = base_cache_dir / "snapshots" / commit_hash
-        if hash_dir.exists():
-            return hash_dir, model_name, commit_hash
-        else:
-            return None, model_name, commit_hash
-    snapshots_dir = base_cache_dir / "snapshots"
-    if snapshots_dir.exists():
-        snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
-        if snapshots:
-            latest = max(snapshots, key=lambda x: x.stat().st_mtime)
-            return latest, model_name, latest.name
-    # Return base_cache_dir for corrupted models so rm_model can handle them
-    return base_cache_dir, model_name, commit_hash
-
-def parse_model_spec(model_spec):
-    if "@" in model_spec:
-        model_name, commit_hash = model_spec.rsplit("@", 1)
-        model_name = expand_model_name(model_name)
-        return model_name, commit_hash
-    model_name = expand_model_name(model_spec)
-    return model_name, None
-
-def get_model_size(model_path):
-    if not model_path.exists():
-        return "?"
-    total_size = 0
-    for file in model_path.rglob("*"):
-        if file.is_file():
-            total_size += file.stat().st_size
-    if total_size >= 1_000_000_000:
-        return f"{total_size / 1_000_000_000:.1f} GB"
-    elif total_size >= 1_000_000:
-        return f"{total_size / 1_000_000:.1f} MB"
-    else:
-        return f"{total_size / 1_000:.1f} KB"
-
-def get_model_modified(model_path):
-    if not model_path.exists():
-        return "?"
-    mtime = model_path.stat().st_mtime
-    now = datetime.datetime.now()
-    modified = datetime.datetime.fromtimestamp(mtime)
-    diff = now - modified
-    if diff.days > 0:
-        return f"{diff.days} days ago"
-    elif diff.seconds > 3600:
-        hours = diff.seconds // 3600
-        return f"{hours} hours ago"
-    else:
-        minutes = diff.seconds // 60
-        return f"{minutes} minutes ago"
-
-def detect_framework(model_path, hf_name):
-    """Detect model framework with lenient hints (Issue #31)."""
-    # 1) org hint
-    if "mlx-community" in hf_name:
-        return "MLX"
-
-    # 2) README front matter: tags contains 'mlx' OR library_name == 'mlx'
-    try:
-        tags, pipeline, lib = read_readme_front_matter(Path(model_path))
-        if (lib and lib.lower() == "mlx") or (tags and any((t or '').lower() == "mlx" for t in tags)):
-            return "MLX"
-    except Exception:
-        pass
-
-    # 3) Fallback by file types
-    snapshots_dir = Path(model_path) / "snapshots"
-    if not snapshots_dir.exists():
-        return "Unknown"
-    has_gguf = any(snapshots_dir.glob("*/*.gguf"))
-    has_safetensors = any(snapshots_dir.glob("*/*.safetensors"))
-    has_pytorch_bin = any(snapshots_dir.glob("*/pytorch_model.bin"))
-    has_config = any(snapshots_dir.glob("*/*.json"))
-    total_size = get_model_size(Path(model_path))
-    try:
-        size_mb = float(total_size.replace(" GB", "000").replace(" MB", "").replace(" KB", "0").replace(" ", ""))
-    except Exception:
-        size_mb = 0
-    if has_gguf:
-        return "GGUF"
-    if size_mb < 10:
-        return "Tokenizer"
-    if (has_safetensors and has_config) or has_pytorch_bin:
-        return "PyTorch"
-    return "Unknown"
-
-
-def detect_model_type(model_path, hf_name):
-    """Detect model type with priority hints (Issue #31)."""
-    # 1) tokenizer chat_template
-    try:
-        if tokenizer_has_chat_template(Path(model_path)):
-            return "chat"
-    except Exception:
-        pass
-
-    # 2) README hints
-    try:
-        tags, pipeline, _ = read_readme_front_matter(Path(model_path))
-        tset = {t.lower() for t in (tags or [])}
-        if pipeline == "text-generation" or any(k in tset for k in {"chat", "instruct"}):
-            return "chat"
-        if pipeline == "sentence-similarity" or any(k in tset for k in {"embedding", "embeddings"}):
-            return "embedding"
-    except Exception:
-        pass
-
-    # 3) Fallback by name
-    name = str(hf_name).lower()
-    if "instruct" in name or "chat" in name:
-        return "chat"
-    if "embed" in name:
-        return "embedding"
-    return "base"
-
-
-def get_quantization_info(model_path):
-    """Extract quantization information from model config."""
-    try:
-        config_path = Path(model_path) / "config.json"
-        if not config_path.exists():
-            return None
-        with open(config_path) as f:
-            cfg = json.load(f)
-        return cfg.get("quantization")
-    except Exception:
-        return None
-
-def get_model_hash(model_path):
-    snapshots_dir = model_path / "snapshots"
-    if not snapshots_dir.exists():
-        return "--------"
-    snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
-    if not snapshots:
-        return "--------"
-    latest = max(snapshots, key=lambda x: x.stat().st_mtime)
-    return latest.name[:8]
-
-def is_model_healthy(model_spec):
-    """Strict health check for 1.x (backport of #27 rules).
-
-    Rules:
-    - config.json must exist and be valid non-empty JSON object.
-    - If a safetensors or PyTorch index exists, all referenced shards must exist, be non-empty,
-      and not be Git LFS pointer files.
-    - Without an index: if multi-shard pattern files exist (model-XXXXX-of-YYYYY.*), require index (unhealthy without index).
-      Single-file weights (*.safetensors/*.bin/*.gguf) are allowed if non-empty and not LFS pointers.
-    - Any '.partial'/'partial' or '.tmp' artifacts anywhere => unhealthy.
-    - Recursive LFS pointer scan for suspiciously small files (<200B).
-    """
-
-    # Resolve model path: accept direct directory paths or model specs
-    candidate = Path(str(model_spec))
-    if candidate.exists() and candidate.is_dir():
-        model_path = candidate
-    else:
-        model_path, _, _ = resolve_single_model(model_spec)
-    if not model_path:
-        return False
-
-    # 1) config.json must be valid, non-empty dict
-    config_path = model_path / "config.json"
-    if not config_path.exists():
-        return False
-    try:
-        with open(config_path) as f:
-            config_data = json.load(f)
-        if not isinstance(config_data, dict) or not config_data:
-            return False
-    except (OSError, json.JSONDecodeError):
-        return False
-
-    # 2) Fail fast on partial/tmp markers anywhere in the snapshot
-    for p in model_path.rglob("*"):
-        name = p.name.lower()
-        if ".partial" in name or name.endswith(".partial") or name.endswith(".tmp") or "partial" in name:
-            return False
-
-    # Helper: detect Git LFS pointer file
-    def _is_lfs_pointer(fp: Path) -> bool:
-        try:
-            if fp.stat().st_size >= 200:
-                return False
-            with open(fp, "rb") as f:
-                head = f.read(200)
-                return b"version https://git-lfs.github.com/spec/v1" in head
-        except Exception:
-            return False
-
-    # Helper: verify referenced shards
-    def _verify_shards(files: list[Path]) -> bool:
-        if not files:
-            return False
-        for f in files:
-            try:
-                if (not f.exists()) or f.stat().st_size == 0:
-                    return False
-                if _is_lfs_pointer(f):
-                    return False
-            except Exception:
-                return False
-        return True
-
-    # 3) Index-aware checks (safetensors or PyTorch)
-    st_index = model_path / "model.safetensors.index.json"
-    pt_index = model_path / "pytorch_model.bin.index.json"
-    if st_index.exists() or pt_index.exists():
-        index_files = [p for p in [st_index, pt_index] if p.exists()]
-        for idx in index_files:
-            try:
-                with open(idx) as f:
-                    idx_data = json.load(f)
-                weight_map = idx_data.get("weight_map")
-                if not isinstance(weight_map, dict) or not weight_map:
-                    return False
-                referenced = sorted(set(weight_map.values()))
-                shard_paths = [model_path / r for r in referenced]
-                if not _verify_shards(shard_paths):
-                    return False
-            except (OSError, json.JSONDecodeError):
-                return False
-        # Also ensure no recursive LFS pointers elsewhere
-        ok, _ = check_lfs_corruption(model_path)
-        return ok
-
-    # 4) No index present — detect multi-shard pattern
-    #    If pattern shards exist, require index (unhealthy without index by policy parity with 2.0)
-    import re
-    shard_re = re.compile(r"model-([0-9]{5})-of-([0-9]{5})\.(safetensors|bin)")
-    pattern_files = []
-    for f in model_path.glob("*"):
-        if f.is_file():
-            m = shard_re.match(f.name)
-            if m:
-                pattern_files.append((f, int(m.group(1)), int(m.group(2))))
-    if pattern_files:
-        # Even if complete by pattern, absence of index => unhealthy
-        return False
-
-    # 5) Single-file weights fallback (includes GGUF)
-    weight_files = list(model_path.rglob("*.safetensors")) + list(model_path.rglob("*.bin")) + list(model_path.rglob("*.gguf"))
-    # Exclude known pattern shards from consideration (handled above)
-    filtered_weights = []
-    for f in weight_files:
-        name = f.name
-        if shard_re.match(name):
-            continue
-        filtered_weights.append(f)
-    if not filtered_weights:
-        return False
-    for wf in filtered_weights:
-        if wf.stat().st_size == 0 or _is_lfs_pointer(wf):
-            return False
-
-    # Final recursive LFS scan
-    ok, _ = check_lfs_corruption(model_path)
-    return ok
-
-def check_lfs_corruption(model_path):
-    """Recursively scan for Git LFS pointer files (suspiciously small files)."""
-    corrupted_files = []
-    for file_path in model_path.rglob("*"):
-        try:
-            if file_path.is_file() and file_path.stat().st_size < 200:
-                with open(file_path, 'rb') as f:
-                    header = f.read(200)
-                    if b'version https://git-lfs.github.com/spec/v1' in header:
-                        corrupted_files.append(str(file_path.relative_to(model_path)))
-        except Exception:
-            # Ignore unreadable files in corruption scan, keep conservative
-            continue
-    if corrupted_files:
-        return False, f"LFS pointers instead of files: {', '.join(corrupted_files)}"
-    return True, "No LFS corruption detected"
-
-def check_model_health(model_spec):
-    model_path, model_name, commit_hash = resolve_single_model(model_spec)
-    if not model_path:
-        # resolve_single_model already printed the appropriate error message
-        return False
-        
-    print(f"Checking model: {model_name}")
-    if commit_hash:
-        print(f"Hash: {commit_hash}")
-    
-    # Use the robust health check
-    if is_model_healthy(model_spec):
-        print("\n[OK] Model is healthy and usable!")
-        return True
-    else:
-        # Detailed diagnosis for WHY it's unhealthy
-        print("\n[ERROR] Model is corrupted. Detailed diagnosis:")
-        
-        # Check config.json
-        config_path = model_path / "config.json"
-        if not config_path.exists():
-            print("   - config.json missing")
-        else:
-            try:
-                with open(config_path) as f:
-                    config_data = json.load(f)
-                if not isinstance(config_data, dict) or len(config_data) == 0:
-                    print("   - config.json is empty or invalid")
-                else:
-                    print("   - config.json found and valid")
-            except (OSError, json.JSONDecodeError):
-                print("   - config.json exists but contains invalid JSON")
-        
-        # Check weight files (including gguf support like is_model_healthy)
-        weight_files = list(model_path.glob("*.safetensors")) + list(model_path.glob("*.bin")) + list(model_path.glob("*.gguf"))
-        if not weight_files:
-            weight_files = list(model_path.glob("**/*.safetensors")) + list(model_path.glob("**/*.bin")) + list(model_path.glob("**/*.gguf"))
-        
-        if weight_files:
-            total_size = sum(f.stat().st_size for f in weight_files)
-            size_mb = total_size / (1024 * 1024)
-            print(f"   - Model weights found ({len(weight_files)} files, {size_mb:.1f}MB)")
-        elif (model_path / "model.safetensors.index.json").exists():
-            # Check multi-file model
-            try:
-                with open(model_path / "model.safetensors.index.json") as f:
-                    index = json.load(f)
-                if 'weight_map' in index:
-                    referenced_files = set(index['weight_map'].values())
-                    existing_files = [f for f in referenced_files if (model_path / f).exists()]
-                    if existing_files:
-                        total_size = sum((model_path / f).stat().st_size for f in existing_files)
-                        size_mb = total_size / (1024 * 1024)
-                        print(f"   - Multi-file weights ({len(existing_files)}/{len(referenced_files)} files, {size_mb:.1f}MB)")
-                        if len(existing_files) < len(referenced_files):
-                            print("   - Incomplete multi-file model")
-                    else:
-                        print("   - Multi-file model index found but no weight files exist")
-                else:
-                    print("   - Multi-file model index is invalid")
-            except Exception as e:
-                print(f"   - Multi-file model index error: {e}")
-        else:
-            print("   - No model weights found (.safetensors, .bin, .gguf)")
-        
-        # Check LFS corruption
-        lfs_ok, lfs_msg = check_lfs_corruption(model_path)
-        if not lfs_ok:
-            print(f"   - {lfs_msg}")
-        else:
-            print(f"   - {lfs_msg}")
-        
-        # Show framework
-        framework = detect_framework(model_path.parent.parent, model_name)
-        print(f"   - Framework: {framework}")
-        
-        # Offer deletion for corrupted models
-        confirm = input("\nModel appears corrupted. Delete? [y/N] ")
-        if confirm.lower() == "y":
-            import errno
-            import shutil
-            try:
-                if commit_hash:
-                    # Delete specific hash/snapshot
-                    shutil.rmtree(model_path)
-                    print(f"Hash {commit_hash} deleted.")
-                else:
-                    # Delete entire model directory (go up from snapshots or use base_cache_dir)
-                    if model_path.name.startswith("models--"):
-                        # model_path is base_cache_dir (corrupted model case)
-                        shutil.rmtree(model_path)
-                    else:
-                        # model_path is snapshot dir
-                        model_base_dir = model_path.parent.parent
-                        shutil.rmtree(model_base_dir)
-                    print(f"Model {model_name} deleted.")
-            except PermissionError as e:
-                print(f"[ERROR] Permission denied: Cannot delete {e.filename}")
-                print("   Try running with appropriate permissions or manually delete the directory.")
-            except OSError as e:
-                if e.errno == errno.ENOTEMPTY:
-                    print(f"[ERROR] Directory not empty: {e.filename}")
-                    print("   Another process may be using this model.")
-                elif e.errno == errno.EACCES:
-                    print(f"[ERROR] Access denied: {e.filename}")
-                else:
-                    print(f"[ERROR] OS Error while deleting: {e}")
-            except Exception as e:
-                print(f"[ERROR] Unexpected error while deleting: {type(e).__name__}: {e}")
-        
-        return False
-
-def check_all_models_health():
-    models = [d for d in MODEL_CACHE.iterdir() if d.name.startswith("models--")]
-    if not models:
-        print("No models found in HuggingFace cache.")
-        return
-    print(f"Checking {len(models)} models for integrity...\n")
-    healthy_models = []
-    problematic_models = []
-    for model_dir in sorted(models, key=lambda x: x.stat().st_mtime, reverse=True):
-        hf_name = cache_dir_to_hf(model_dir.name)
-        model_hash = get_model_hash(model_dir)
-        print(f"{hf_name} ({model_hash})")
-        if is_model_healthy(hf_name):
-            healthy_models.append((hf_name, model_hash))
-            print("   [OK] Healthy\n")
-        else:
-            problematic_models.append((hf_name, model_hash))
-            print("   [ERROR] Problematic\n")
-    print("=" * 50)
-    print("Summary:")
-    print(f"[OK] Healthy models: {len(healthy_models)}")
-    print(f"[ERROR] Problematic models: {len(problematic_models)}")
-    if problematic_models:
-        print("\n[WARNING] Problematic models:")
-        for name, hash_id in problematic_models:
-            print(f"   - {name} ({hash_id})")
-        print("\nRepair tips:")
-        print("   python mlx_knife.cli pull <model-name>  # Re-download")
-        print("   python mlx_knife.cli rm <model-name>    # Delete")
-        print("   python mlx_knife.cli health <model-name> # Show details")
-    return len(problematic_models) == 0
-
-def list_models(show_all=False, framework_filter=None, show_health=False, single_model=None, verbose=False):
-    if single_model:
-        # Try exact match first
-        expanded_model = expand_model_name(single_model)
-        model_dir = MODEL_CACHE / hf_to_cache_dir(expanded_model)
-
-        if model_dir.exists():
-            models = [model_dir]
-        else:
-            # If exact match fails, do partial name matching
-            if not MODEL_CACHE.exists():
-                print(f"No models found matching '{single_model}' - cache directory doesn't exist yet.")
-                print("Use 'mlxk pull <model-name>' to download models first.")
-                return
-            all_models = [d for d in MODEL_CACHE.iterdir() if d.name.startswith("models--")]
-            matching_models = []
-            
-            for model_dir in all_models:
-                hf_name = cache_dir_to_hf(model_dir.name)
-                # Check if the pattern appears in the model name (case insensitive)
-                if single_model.lower() in hf_name.lower():
-                    matching_models.append(model_dir)
-            
-            if not matching_models:
-                print(f"No models found matching '{single_model}'!")
-                return
-            
-            models = matching_models
-    else:
-        if not MODEL_CACHE.exists():
-            print("No models found - cache directory doesn't exist yet.")
-            print("Use 'mlxk pull <model-name>' to download models first.")
-            return
-        models = [d for d in MODEL_CACHE.iterdir() if d.name.startswith("models--")]
-        if not models:
-            print("No models found in HuggingFace cache.")
-            return
-    if show_health:
-        if show_all:
-            print(f"{'NAME':<40} {'ID':<10} {'SIZE':<10} {'MODIFIED':<15} {'FRAMEWORK':<10} {'TYPE':<10} {'HEALTH':<8}")
-        else:
-            print(f"{'NAME':<40} {'ID':<10} {'SIZE':<10} {'MODIFIED':<15} {'HEALTH':<8}")
-    else:
-        if show_all:
-            print(f"{'NAME':<40} {'ID':<10} {'SIZE':<10} {'MODIFIED':<15} {'FRAMEWORK':<10} {'TYPE':<10}")
-        else:
-            print(f"{'NAME':<40} {'ID':<10} {'SIZE':<10} {'MODIFIED':<15}")
-    for m in sorted(models, key=lambda x: x.stat().st_mtime, reverse=True):
-        hf_name = cache_dir_to_hf(m.name)
-        size = get_model_size(m)
-        modified = get_model_modified(m)
-        model_hash = get_model_hash(m)
-        framework = detect_framework(m, hf_name)
-        model_type = detect_model_type(m, hf_name)
-        if framework_filter and framework.lower() != framework_filter:
-            continue
-        # Default (strict) list: show only MLX chat models
-        if not show_all and not framework_filter:
-            if framework != "MLX":
-                continue
-            if model_type != "chat":
-                continue
-        # Handle display name based on verbose flag
-        display_name = hf_name
-        if hf_name.startswith("mlx-community/") and not verbose:
-            # For MLX models, hide prefix unless verbose is set
-            display_name = hf_name[len("mlx-community/"):]
-        health_status = ""
-        if show_health:
-            health_status = "[OK]" if is_model_healthy(hf_name) else "[ERR]"
-            if show_all:
-                print(f"{display_name:<40} {model_hash:<10} {size:<10} {modified:<15} {framework:<10} {model_type:<10} {health_status:<8}")
-            else:
-                print(f"{display_name:<40} {model_hash:<10} {size:<10} {modified:<15} {health_status:<8}")
-        else:
-            if show_all:
-                print(f"{display_name:<40} {model_hash:<10} {size:<10} {modified:<15} {framework:<10} {model_type:<10}")
-            else:
-                print(f"{display_name:<40} {model_hash:<10} {size:<10} {modified:<15}")
-
-def run_model(model_spec, prompt=None, interactive=False, temperature=0.7,
-              max_tokens=500, top_p=0.9, repetition_penalty=1.1, stream=True,
-              use_chat_template=True, hide_reasoning=False, verbose=False):
-    """Run an MLX model with enhanced features.
-    
-    Args:
-        model_spec: Model specification (name[@hash])
-        prompt: Input prompt (if None and not interactive, enters interactive mode)
-        interactive: Force interactive mode
-        temperature: Sampling temperature
-        max_tokens: Maximum tokens to generate
-        top_p: Top-p sampling parameter
-        repetition_penalty: Penalty for repeated tokens
-        stream: Whether to stream output
-    """
-    model_path, model_name, commit_hash = resolve_single_model(model_spec)
-    if not model_path:
-        print(f"Use: mlxk pull {model_spec}")
-        sys.exit(1)
-
-    framework = detect_framework(model_path.parent.parent, model_name)
-    if framework != "MLX":
-        print(f"Model {model_name} is not MLX-compatible (Framework: {framework})!")
-        print("Use MLX-Community models: https://huggingface.co/mlx-community")
-        sys.exit(1)
-
-    # Try to use the enhanced runner (import module to allow monkeypatching in tests)
-    try:
-        from . import mlx_runner as _mr
-
-        _mr.run_model_enhanced(
-            model_path=str(model_path),
-            prompt=prompt,
-            interactive=interactive,
-            max_tokens=max_tokens,
-            temperature=temperature,
-            top_p=top_p,
-            repetition_penalty=repetition_penalty,
-            stream=stream,
-            use_chat_template=use_chat_template,
-            hide_reasoning=hide_reasoning,
-            verbose=verbose,
-        )
-    except ImportError:
-        # Fallback to subprocess if mlx_runner is not available
-        print("[WARNING] Enhanced runner not available, falling back to subprocess mode")
-        print(f"Running model: {model_name}")
-        if commit_hash:
-            print(f"Hash: {commit_hash}")
-        print(f"Cache path: {model_path}")
-
-        if interactive or prompt is None:
-            print("Interactive mode not supported in fallback mode")
-            prompt = prompt or "Hello"
-
-        print(f"Prompt: {prompt}\n")
-        os.system(f'python -m mlx_lm generate --model "{model_path}" --prompt "{prompt}"')
-
-def show_model(model_spec, show_files=False, show_config=False):
-    """Show detailed information about a specific model."""
-    model_path, model_name, commit_hash = resolve_single_model(model_spec)
-
-    if not model_path:
-        return False
-
-    # Basic information
-    print(f"Model: {model_name}")
-    print(f"Path: {model_path}")
-
-    if commit_hash:
-        print(f"Snapshot: {commit_hash}")
-    else:
-        # Show current snapshot hash
-        current_hash = model_path.name
-        print(f"Snapshot: {current_hash}")
-
-    # Size
-    size = get_model_size(model_path)
-    print(f"Size: {size}")
-
-    # Modified time
-    modified = get_model_modified(model_path)
-    print(f"Modified: {modified}")
-
-    # Framework / Type
-    framework = detect_framework(model_path.parent.parent, model_name)
-    model_type = detect_model_type(model_path.parent.parent, model_name)
-    print(f"Framework: {framework}")
-    print(f"Type: {model_type}")
-    
-    # Quantization info (if available)
-    quant_info = get_quantization_info(model_path)
-    if quant_info:
-        if isinstance(quant_info, dict):
-            # Show main quantization config (compact format)
-            main_config = []
-            if "mode" in quant_info:
-                main_config.append(f"mode: {quant_info['mode']}")
-            if "bits" in quant_info:
-                main_config.append(f"{quant_info['bits']}-bit")
-            if "group_size" in quant_info:
-                main_config.append(f"group_size: {quant_info['group_size']}")
-            
-            if main_config:
-                print(f"Quantization: {', '.join(main_config)}")
-                if "mode" in quant_info:
-                    print(f"  Advanced mode '{quant_info['mode']}' (requires MLX ≥0.29.0, MLX-LM ≥0.27.0)")
-        else:
-            print(f"Quantization: {quant_info}")
-
-    # Quantization and Precision info
-    config_path = model_path / "config.json"
-    quantization_info = None
-    precision_info = None
-    gguf_variants = []
-
-    if config_path.exists():
-        try:
-            with open(config_path) as f:
-                config_data = json.load(f)
-
-            # 1. Check for explicit quantization field (MLX style)
-            if "quantization" in config_data and isinstance(config_data["quantization"], dict):
-                quant = config_data["quantization"]
-                if "bits" in quant:
-                    quantization_info = f"{quant['bits']}-bit"
-                    precision_info = f"int{quant['bits']}"
-                if "group_size" in quant:
-                    quantization_info += f" (group_size: {quant['group_size']})"
-
-            # 2. Check torch_dtype (HuggingFace standard)
-            elif "torch_dtype" in config_data:
-                dtype = config_data["torch_dtype"]
-                precision_info = dtype
-                # Check if model name suggests quantization
-                name_lower = model_name.lower()
-                if "4bit" in name_lower or "-4b" in name_lower:
-                    quantization_info = "4-bit (inferred from name)"
-                elif "8bit" in name_lower or "-8b" in name_lower:
-                    quantization_info = "8-bit (inferred from name)"
-                else:
-                    quantization_info = "No quantization detected"
-
-            # 3. Special handling for GGUF files
-            gguf_files = sorted(list(model_path.glob("*.gguf")))
-            if gguf_files and not quantization_info:
-                # Collect all GGUF variants
-                gguf_variants = []
-                for f in gguf_files:
-                    name = f.name
-                    size_mb = f.stat().st_size / (1024 * 1024)
-
-                    # Parse quantization type from filename
-                    name_lower = name.lower()
-                    if "q2_k" in name_lower:
-                        variant_info = f"Q2_K (2-bit, {size_mb:.0f} MB)"
-                    elif "q3_k_s" in name_lower:
-                        variant_info = f"Q3_K_S (3-bit small, {size_mb:.0f} MB)"
-                    elif "q3_k_m" in name_lower:
-                        variant_info = f"Q3_K_M (3-bit medium, {size_mb:.0f} MB)"
-                    elif "q3_k_l" in name_lower:
-                        variant_info = f"Q3_K_L (3-bit large, {size_mb:.0f} MB)"
-                    elif "q3_k" in name_lower:
-                        variant_info = f"Q3_K (3-bit, {size_mb:.0f} MB)"
-                    elif "q4_0" in name_lower:
-                        variant_info = f"Q4_0 (4-bit, {size_mb:.0f} MB)"
-                    elif "q4_k_s" in name_lower:
-                        variant_info = f"Q4_K_S (4-bit small, {size_mb:.0f} MB)"
-                    elif "q4_k_m" in name_lower:
-                        variant_info = f"Q4_K_M (4-bit medium, {size_mb:.0f} MB)"
-                    elif "q4_k" in name_lower:
-                        variant_info = f"Q4_K (4-bit, {size_mb:.0f} MB)"
-                    elif "q5_0" in name_lower:
-                        variant_info = f"Q5_0 (5-bit, {size_mb:.0f} MB)"
-                    elif "q5_k_s" in name_lower:
-                        variant_info = f"Q5_K_S (5-bit small, {size_mb:.0f} MB)"
-                    elif "q5_k_m" in name_lower:
-                        variant_info = f"Q5_K_M (5-bit medium, {size_mb:.0f} MB)"
-                    elif "q5_k" in name_lower:
-                        variant_info = f"Q5_K (5-bit, {size_mb:.0f} MB)"
-                    elif "q6_k" in name_lower:
-                        variant_info = f"Q6_K (6-bit, {size_mb:.0f} MB)"
-                    elif "q8_0" in name_lower:
-                        variant_info = f"Q8_0 (8-bit, {size_mb:.0f} MB)"
-                    else:
-                        variant_info = f"{name} ({size_mb:.0f} MB)"
-
-                    gguf_variants.append(variant_info)
-
-                if len(gguf_variants) > 1:
-                    quantization_info = "Multiple GGUF variants available"
-                    precision_info = "gguf (see variants below)"
-                elif len(gguf_variants) == 1:
-                    quantization_info = gguf_variants[0].split(' (')[0]
-                    precision_info = "gguf"
-                else:
-                    quantization_info = "GGUF format (quantization unknown)"
-                    precision_info = "gguf"
-
-        except (OSError, json.JSONDecodeError, KeyError):
-            pass
-
-    # Display quantization and precision info
-    if quantization_info:
-        print(f"Quantization: {quantization_info}")
-    else:
-        print("Quantization: Unknown (no info in config)")
-
-    if precision_info:
-        print(f"Precision: {precision_info}")
-    else:
-        print("Precision: Unknown")
-
-    # Display GGUF variants if available
-    if gguf_variants and len(gguf_variants) > 1:
-        print("\nAvailable GGUF variants:")
-        for variant in gguf_variants:
-            print(f"   - {variant}")
-
-    # Health status
-    health_ok = is_model_healthy(model_name)
-    if health_ok:
-        print("Health: [OK]")
-    else:
-        print("Health: [ERROR] CORRUPTED")
-        # Check specific issues
-        issues = []
-        if not (model_path / "config.json").exists():
-            issues.append("config.json missing")
-
-        weight_files = list(model_path.glob("*.safetensors")) + list(model_path.glob("*.bin")) + list(model_path.glob("*.gguf"))
-        if not weight_files:
-            weight_files = list(model_path.glob("**/*.safetensors")) + list(model_path.glob("**/*.bin")) + list(model_path.glob("**/*.gguf"))
-        if not weight_files:
-            index_file = model_path / "model.safetensors.index.json"
-            if not index_file.exists():
-                issues.append("No model weights found")
-
-        lfs_ok, lfs_msg = check_lfs_corruption(model_path)
-        if not lfs_ok:
-            issues.append(lfs_msg)
-
-        if issues:
-            print("   Issues:")
-            for issue in issues:
-                print(f"   - {issue}")
-
-    # Show files if requested
-    if show_files:
-        print("\nFiles:")
-        files = []
-        for file in sorted(model_path.rglob("*")):
-            if file.is_file():
-                relative_path = file.relative_to(model_path)
-                file_size = file.stat().st_size
-                if file_size >= 1_000_000_000:
-                    size_str = f"{file_size / 1_000_000_000:.2f} GB"
-                elif file_size >= 1_000_000:
-                    size_str = f"{file_size / 1_000_000:.2f} MB"
-                elif file_size >= 1_000:
-                    size_str = f"{file_size / 1_000:.2f} KB"
-                else:
-                    size_str = f"{file_size} B"
-                files.append((str(relative_path), size_str))
-
-        # Print files in a nice table format
-        if files:
-            max_name_len = max(len(f[0]) for f in files)
-            for file_path, file_size in files:
-                print(f"   {file_path:<{max_name_len}}  {file_size:>10}")
-        else:
-            print("   No files found")
-
-    # Show config if requested
-    if show_config:
-        config_path = model_path / "config.json"
-        if config_path.exists():
-            print("\nConfig:")
-            try:
-                with open(config_path) as f:
-                    config_data = json.load(f)
-                print(json.dumps(config_data, indent=2))
-            except Exception as e:
-                print(f"   Error reading config: {e}")
-        else:
-            print("\nConfig: Not found")
-
-    return True
-
-def rm_model(model_spec, force=False):
-    original_spec = model_spec
-    
-    # First try to resolve using fuzzy matching
-    resolved_path, resolved_name, resolved_hash = resolve_single_model(model_spec)
-    
-    if not resolved_path:
-        # resolve_single_model already printed the error message for most cases
-        # But ensure we always provide feedback to the user
-        print(f"Model '{original_spec}' not found or corrupted.")
-        return
-        
-    # Use the resolved model name for deletion
-    model_name = resolved_name
-    commit_hash = resolved_hash
-    
-    
-    # Confirm on auto-expansion (if the resolved name is different from input)
-    base_spec = original_spec.split("@")[0] if "@" in original_spec else original_spec
-    if base_spec != model_name and "/" not in base_spec:
-        confirm = input(f"Delete '{model_name}' (matched from '{base_spec}')? [Y/n] ")
-        if confirm.lower() == "n":
-            print("Delete aborted.")
-            return
-    
-    base_cache_dir = MODEL_CACHE / hf_to_cache_dir(model_name)
-    # This should exist since resolve_single_model succeeded, but double-check
-    if not base_cache_dir.exists():
-        print(f"[ERROR] Model directory disappeared: {model_name}")
-        return
-    # Specific hash to delete?
-    if commit_hash:
-        hash_dir = base_cache_dir / "snapshots" / commit_hash
-        if not hash_dir.exists():
-            print(f"Hash {commit_hash} for model {model_name} not found!")
-            print("\nAvailable hashes:")
-            snapshots_dir = base_cache_dir / "snapshots"
-            if snapshots_dir.exists():
-                for snapshot in sorted(snapshots_dir.iterdir()):
-                    if snapshot.is_dir():
-                        print(f"  {snapshot.name[:8]}")
-            return
-        if force:
-            confirm_delete = True
-        else:
-            confirm = input(f"Delete hash {commit_hash} of model {model_name}? [y/N] ")
-            confirm_delete = confirm.lower() == "y"
-        
-        if confirm_delete:
-            # Issue #23 Fix: Delete entire model directory, not just the snapshot
-            # This prevents the double-execution problem where refs/ remain intact
-            shutil.rmtree(base_cache_dir)
-            print(f"{model_name}@{commit_hash} deleted")
-            
-            # Clean up associated lock files
-            try:
-                _cleanup_model_locks(model_name, force)
-            except Exception as e:
-                print(f"Warning: Could not clean up cache files: {e}")
-        else:
-            print("Aborted.")
-    else:
-        # Delete entire model
-        if force:
-            confirm_delete = True
-        else:
-            confirm = input(f"Delete entire model {model_name} ({base_cache_dir})? [y/N] ")
-            confirm_delete = confirm.lower() == "y"
-        
-        if confirm_delete:
-            shutil.rmtree(base_cache_dir)
-            print(f"Model {model_name} completely deleted.")
-            
-            # Clean up associated lock files
-            try:
-                _cleanup_model_locks(model_name, force)
-            except Exception as e:
-                print(f"Warning: Could not clean up cache files: {e}")
-        else:
-            print("Aborted.")
-
-
-def _cleanup_model_locks(model_name, force=False):
-    """Clean up HuggingFace lock files for a deleted model.
-    
-    Args:
-        model_name: The model name (e.g. 'microsoft/DialoGPT-small')
-        force: If True, delete without asking. If False, prompt user.
-    """
-    locks_dir = MODEL_CACHE / ".locks" / hf_to_cache_dir(model_name)
-    
-    if not locks_dir.exists():
-        return  # No locks to clean up
-    
-    # Count lock files
-    try:
-        lock_files = list(locks_dir.iterdir())
-        if not lock_files:
-            return  # Empty directory
-        
-        if force:
-            # Delete without asking
-            shutil.rmtree(locks_dir)
-            print(f"Cleaned up cache files ({len(lock_files)} files).")
-        else:
-            # Ask user
-            confirm = input("Clean up cache files? [Y/n] ")
-            if confirm.lower() != "n":
-                shutil.rmtree(locks_dir)
-                print(f"Cache files cleaned up ({len(lock_files)} files).")
-            else:
-                print("Cache files left intact.")
-                
-    except Exception as e:
-        print(f"Warning: Could not clean up cache files: {e}")
diff --git a/mlx_knife/cli.py b/mlx_knife/cli.py
deleted file mode 100644
index f2dea54..0000000
--- a/mlx_knife/cli.py
+++ /dev/null
@@ -1,135 +0,0 @@
-# mlx_knife/cli.py
-
-import argparse
-import sys
-
-from . import __version__
-from .cache_utils import (
-    check_all_models_health,
-    check_model_health,
-    list_models,
-    rm_model,
-    run_model,
-    show_model,
-)
-from .hf_download import pull_model
-from .server import run_server
-
-
-def main():
-    parser = argparse.ArgumentParser(
-        description="MLX Knife CLI (HuggingFace-style cache management for MLX models)"
-    )
-    parser.add_argument('--version', action='version', version=f'MLX Knife {__version__}')
-    subparsers = parser.add_subparsers(dest="cmd")
-
-    # list
-    list_p = subparsers.add_parser("list", help="List available models in cache")
-    list_p.add_argument("model", nargs="?", help="Specific model to list (optional)")
-    list_p.add_argument("--all", action="store_true", help="Show all models (not just MLX)")
-    list_p.add_argument("--framework", choices=["mlx", "pytorch", "tokenizer"], help="Filter by framework")
-    list_p.add_argument("--health", action="store_true", help="Show health status")
-    list_p.add_argument("--verbose", action="store_true", help="Show detailed information (requires model argument)")
-
-    # pull
-    pull_p = subparsers.add_parser("pull", help="Download a model from HuggingFace")
-    pull_p.add_argument("model_spec", help="Model[@hash] (e.g. mlx-community/Qwen2.5-0.5B-Instruct-4bit@a5339a41)")
-
-    # run
-    run_p = subparsers.add_parser("run", help="Run a model with prompt")
-    run_p.add_argument("model_spec", help="Model[@hash] (e.g. mlx-community/Qwen2.5-0.5B-Instruct-4bit@a5339a41)")
-    run_p.add_argument("prompt", nargs="?", default=None, help="Prompt text (if not provided, enters interactive mode)")
-    run_p.add_argument("--interactive", "-i", action="store_true", help="Force interactive dialog mode")
-    run_p.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature (default: 0.7)")
-    run_p.add_argument("--max-tokens", type=int, default=None, help="Maximum tokens to generate (default: model context length)")
-    run_p.add_argument("--top-p", type=float, default=0.9, help="Top-p sampling parameter (default: 0.9)")
-    run_p.add_argument("--repetition-penalty", type=float, default=1.1, help="Penalty for repeated tokens (default: 1.1)")
-    run_p.add_argument("--no-stream", action="store_true", help="Disable streaming output")
-    run_p.add_argument("--no-chat-template", action="store_true", help="Disable chat template formatting (use raw prompt)")
-    run_p.add_argument("--hide-reasoning", action="store_true", help="Hide reasoning section for reasoning models (show only final answer)")
-    run_p.add_argument("--verbose", "-v", action="store_true", help="Show detailed output (model loading, memory usage, token stats)")
-
-    # rm
-    rm_p = subparsers.add_parser("rm", help="Delete a model from cache")
-    rm_p.add_argument("model_spec", help="Model[@hash] (e.g. mlx-community/Qwen2.5-0.5B-Instruct-4bit@a5339a41)")
-    rm_p.add_argument("--force", action="store_true", help="Skip confirmation and clean up cache files automatically")
-
-    # health
-    health_p = subparsers.add_parser("health", help="Check model integrity")
-    health_p.add_argument("model_spec", nargs="?", help="Model[@hash] (optional)")
-    health_p.add_argument("--all", action="store_true", help="Check all models in cache")
-
-    # show
-    show_p = subparsers.add_parser("show", help="Show detailed information about a specific model")
-    show_p.add_argument("model_spec", help="Model[@hash] (e.g. mlx-community/Qwen2.5-0.5B-Instruct-4bit@a5339a41)")
-    show_p.add_argument("--files", action="store_true", help="List all files and sizes under the model path")
-    show_p.add_argument("--config", action="store_true", help="Print pretty-formatted config.json")
-
-    # server
-    server_p = subparsers.add_parser("server", help="Start OpenAI-compatible API server")
-    server_p.add_argument("--host", default="127.0.0.1", help="Server host (default: 127.0.0.1)")
-    server_p.add_argument("--port", type=int, default=8000, help="Server port (default: 8000)")
-    server_p.add_argument("--max-tokens", type=int, default=None, help="Default max tokens for completions (default: model-aware dynamic limits)")
-    server_p.add_argument("--reload", action="store_true", help="Enable auto-reload for development")
-    server_p.add_argument("--log-level", default="info", choices=["debug", "info", "warning", "error"], help="Log level (default: info)")
-
-    args = parser.parse_args()
-
-    if args.cmd == "list":
-        if args.model:
-            if args.verbose and not args.all and not args.framework and not args.health:
-                # Show detailed info for a specific model (same as show command)
-                show_model(args.model)
-            else:
-                # Show just the single model row
-                list_models(show_all=args.all, framework_filter=args.framework, show_health=args.health, single_model=args.model, verbose=args.verbose)
-        else:
-            # Normal list behavior - verbose works with MLX models too
-            list_models(show_all=args.all, framework_filter=args.framework, show_health=args.health, verbose=args.verbose)
-    elif args.cmd == "pull":
-        pull_model(args.model_spec)
-    elif args.cmd == "run":
-        run_model(
-            args.model_spec,
-            prompt=args.prompt,
-            interactive=args.interactive,
-            temperature=args.temperature,
-            max_tokens=args.max_tokens,
-            top_p=args.top_p,
-            repetition_penalty=args.repetition_penalty,
-            stream=not args.no_stream,
-            use_chat_template=not args.no_chat_template,
-            hide_reasoning=args.hide_reasoning,
-            verbose=args.verbose
-        )
-    elif args.cmd == "rm":
-        rm_model(args.model_spec, force=args.force)
-    elif args.cmd == "health":
-        if args.model_spec:
-            check_model_health(args.model_spec)
-        else:
-            # Default to checking all models if no specific model is provided
-            check_all_models_health()
-    elif args.cmd == "show":
-        show_model(args.model_spec, show_files=args.files, show_config=args.config)
-    elif args.cmd == "server":
-        # Validate server arguments
-        if args.max_tokens is not None and args.max_tokens <= 0:
-            print(f"Error: --max-tokens must be positive, got: {args.max_tokens}")
-            sys.exit(1)
-        if args.port <= 0 or args.port > 65535:
-            print(f"Error: --port must be between 1-65535, got: {args.port}")
-            sys.exit(1)
-
-        run_server(
-            host=args.host,
-            port=args.port,
-            max_tokens=args.max_tokens,
-            reload=args.reload,
-            log_level=args.log_level
-        )
-    else:
-        parser.print_help()
-
-if __name__ == "__main__":
-    main()
diff --git a/mlx_knife/hf_download.py b/mlx_knife/hf_download.py
deleted file mode 100644
index c0aa217..0000000
--- a/mlx_knife/hf_download.py
+++ /dev/null
@@ -1,141 +0,0 @@
-import json
-import os
-import subprocess
-import sys
-import tempfile
-
-try:
-    from .cache_utils import (
-        MODEL_CACHE,
-        hf_to_cache_dir,
-        is_model_healthy,
-        parse_model_spec,
-    )
-except ImportError:
-    from pathlib import Path
-    def parse_model_spec(x): return (x, None)
-    def hf_to_cache_dir(x): return x
-    if "HF_HOME" in os.environ:
-        MODEL_CACHE = Path(os.environ["HF_HOME"]) / "hub"
-    else:
-        MODEL_CACHE = Path(os.path.expanduser("~/.cache/huggingface/hub"))
-    def is_model_healthy(x): return False
-
-def describe_http_exception(exc):
-    if hasattr(exc, "response") and exc.response is not None:
-        status = getattr(exc.response, "status_code", None)
-        url = getattr(exc.response, "url", None)
-        if status == 401:
-            return f"[ERROR] Unauthorized (401): Check your HuggingFace token or login.\nURL: {url}"
-        elif status == 403:
-            return f"[ERROR] Forbidden (403): Access denied.\nURL: {url}"
-        elif status == 404:
-            return f"[ERROR] Not Found (404): Resource does not exist.\nURL: {url}"
-        elif status >= 500:
-            return f"[ERROR] Server Error ({status}): Problem on HuggingFace's side.\nURL: {url}\nTry again later."
-        else:
-            return f"[ERROR] HTTP Error {status}: {exc}\nURL: {url}"
-    return f"[ERROR] HTTP Error: {exc}"
-
-def configure_download_environment():
-    os.environ['HF_HUB_DOWNLOAD_THREADS'] = '1'
-    os.environ['HF_HUB_DOWNLOAD_CHUNK_SIZE'] = '524288'  # 512KB chunks for household-friendly downloads
-    os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = 'false'
-
-def pull_model(model_spec):
-    original_spec = model_spec
-    model_name, commit_hash = parse_model_spec(model_spec)
-    
-    # Validate HuggingFace Hub repository name length limit (Issue #6)
-    if len(model_name) > 96:
-        print(f"[ERROR] Repository name exceeds HuggingFace Hub limit: {len(model_name)}/96 characters")
-        print("Repository names longer than 96 characters cannot exist on HuggingFace Hub.")
-        print(f"Invalid name: '{model_name}'")
-        return False
-
-    if "/" not in original_spec.split("@")[0] and "/" in model_name:
-        confirm = input(f"Download '{model_name}'? [Y/n] ")
-        if confirm.lower() == "n":
-            print("Download cancelled.")
-            return
-
-    base_cache_dir = MODEL_CACHE / hf_to_cache_dir(model_name)
-    if commit_hash:
-        hash_dir = base_cache_dir / "snapshots" / commit_hash
-        if hash_dir.exists() and is_model_healthy(f"{model_name}@{commit_hash}"):
-            print("Model already exists")
-            return
-    else:
-        if base_cache_dir.exists() and is_model_healthy(model_name):
-            print("Model already exists")
-            return
-
-    print(f"Downloading {model_name}...")
-
-    # Build kwargs dict for the worker
-    kwargs_dict = {
-        "repo_id": model_name,
-        "local_dir_use_symlinks": False,
-        "max_workers": 1
-    }
-    if commit_hash:
-        kwargs_dict["revision"] = commit_hash
-    if "mlx-community" in model_name:
-        kwargs_dict["allow_patterns"] = [
-            "*.json", "*.txt", "*.safetensors", "*.md", "*.gitattributes", "LICENSE"
-        ]
-    if "mlx-community" not in model_name:
-        confirm = input(f"[WARNING] {model_name} is not an MLX model (may be >1GB). Continue? [y/N] ")
-        if confirm.lower() != "y":
-            print("Download cancelled.")
-            return
-
-    kwargs_str = json.dumps(kwargs_dict, indent=2)
-    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
-        f.write(kwargs_str)
-        kwargs_file = f.name
-
-    # Call the worker as subprocess with nice priority
-    worker_path = os.path.join(os.path.dirname(__file__), "throttled_download_worker.py")
-    try:
-        result = subprocess.run(
-            ['nice', '-n', '19', sys.executable, worker_path, kwargs_file],
-            check=False
-        )
-        if result.returncode == 0:
-            print("Download completed successfully.")
-        elif result.returncode in (10, 11, 12, 13, 14, 15):
-            # Already handled in worker, do NOT retry fallback
-            print("[WARNING] Fatal error encountered in throttled download, not attempting fallback.")
-            return
-        else:
-            print("[WARNING] Throttled download failed or was interrupted.")
-            print("Attempting fallback download with standard throttling...")
-            try:
-                import requests
-                from huggingface_hub import snapshot_download
-                configure_download_environment()
-                snapshot_download(**kwargs_dict)
-                print("Download completed successfully.")
-            except requests.exceptions.HTTPError as e:
-                print(describe_http_exception(e))
-                return
-            except requests.exceptions.ConnectionError:
-                print("[ERROR] Network connection error. Please check your internet connection and try again.")
-                return
-            except requests.exceptions.Timeout:
-                print("[ERROR] Download timed out. Please try again.")
-                return
-            except KeyboardInterrupt:
-                print("\n[WARNING] Download cancelled by user.")
-                return
-            except Exception as e:
-                print(f"[ERROR] Unexpected error during fallback download: {type(e).__name__}: {e}")
-                return
-    except KeyboardInterrupt:
-        print("\n[WARNING] Download cancelled by user.")
-        return
-    except ImportError:
-        print("huggingface-hub is not installed. Please install it with: pip install huggingface-hub")
-    except Exception as e:
-        print(f"[ERROR] Unexpected error: {type(e).__name__}: {e}")
diff --git a/mlx_knife/mlx_runner.py b/mlx_knife/mlx_runner.py
deleted file mode 100644
index 1408168..0000000
--- a/mlx_knife/mlx_runner.py
+++ /dev/null
@@ -1,1028 +0,0 @@
-# mlx_knife/mlx_runner.py
-"""
-Enhanced MLX model runner with direct API integration.
-Provides ollama-like run experience with streaming and interactive chat.
-"""
-
-import json
-import os
-import time
-from collections.abc import Iterator
-from pathlib import Path
-from typing import Dict, Optional
-
-import mlx.core as mx
-from mlx_lm import load
-from mlx_lm.generate import generate_step
-from mlx_lm.sample_utils import make_repetition_penalty, make_sampler
-
-from .reasoning_utils import ReasoningExtractor, StreamingReasoningParser
-
-
-def get_model_context_length(model_path: str) -> int:
-    """Extract max_position_embeddings from model config.
-    
-    Args:
-        model_path: Path to the MLX model directory
-        
-    Returns:
-        Maximum context length for the model (defaults to 4096 if not found)
-    """
-    config_path = os.path.join(model_path, "config.json")
-    
-    try:
-        with open(config_path) as f:
-            config = json.load(f)
-        
-        # Try various common config keys for context length
-        context_keys = [
-            "max_position_embeddings",
-            "n_positions",
-            "context_length",
-            "max_sequence_length",
-            "seq_len"
-        ]
-        
-        for key in context_keys:
-            if key in config:
-                return config[key]
-                
-        # If no context length found, return reasonable default
-        return 4096
-        
-    except (FileNotFoundError, json.JSONDecodeError, KeyError):
-        # Return default if config can't be read
-        return 4096
-
-
-class MLXRunner:
-    """Direct MLX model runner with streaming and interactive capabilities."""
-
-    def __init__(self, model_path: str, adapter_path: Optional[str] = None, verbose: bool = False):
-        """Initialize the runner with a model.
-        
-        Args:
-            model_path: Path to the MLX model directory
-            adapter_path: Optional path to LoRA adapter
-            verbose: Show detailed output
-        """
-        self.model_path = Path(model_path)
-        self.adapter_path = adapter_path
-        self.model = None
-        self.tokenizer = None
-        self._memory_baseline = None
-        self._stop_tokens = None  # Will be populated from tokenizer
-        self._message_end_tokens = None  # Message-end tokens (e.g., <|end|> for MXFP4)
-        self._chat_stop_tokens = None  # Chat-specific stop tokens
-        self._context_length = None  # Will be populated from model config
-        self._is_reasoning_model = False  # Whether model uses reasoning (MXFP4)
-        self._reasoning_start = None  # Reasoning start marker
-        self._reasoning_end = None  # Reasoning end marker
-        self._final_start = None  # Final answer start marker
-        self.verbose = verbose
-        self._model_loaded = False
-        self._context_entered = False  # Prevent nested context usage
-
-    def __enter__(self):
-        """Context manager entry - loads the model."""
-        if self._context_entered:
-            raise RuntimeError("MLXRunner context manager cannot be entered multiple times")
-        
-        self._context_entered = True
-        try:
-            self.load_model()
-            return self
-        except Exception:
-            # If load_model fails, ensure cleanup happens
-            self._context_entered = False
-            self.cleanup()
-            raise
-
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        """Context manager exit - cleans up the model."""
-        self._context_entered = False
-        self.cleanup()
-        return False  # Don't suppress exceptions
-
-    def load_model(self):
-        """Load the MLX model and tokenizer."""
-        if self._model_loaded:
-            if self.verbose:
-                print("Model already loaded, skipping...")
-            return
-
-        if self.verbose:
-            print(f"Loading model from {self.model_path}...")
-        start_time = time.time()
-
-        # Capture baseline memory before loading
-        try:
-            mx.clear_cache()
-        except Exception:
-            pass  # Continue even if cache clear fails
-        self._memory_baseline = mx.get_active_memory() / 1024**3
-
-        try:
-            # Load model and tokenizer
-            self.model, self.tokenizer = load(
-                str(self.model_path),
-                adapter_path=self.adapter_path
-            )
-
-            load_time = time.time() - start_time
-            current_memory = mx.get_active_memory() / 1024**3
-            model_memory = current_memory - self._memory_baseline
-
-            if self.verbose:
-                print(f"Model loaded in {load_time:.1f}s")
-                print(f"Memory: {model_memory:.1f}GB model, {current_memory:.1f}GB total")
-
-            # Extract stop tokens from tokenizer
-            self._extract_stop_tokens()
-            
-            # Extract context length from model config
-            self._context_length = get_model_context_length(str(self.model_path))
-            
-            if self.verbose:
-                print(f"Model context length: {self._context_length} tokens")
-                
-            self._model_loaded = True
-            
-        except Exception as e:
-            # Ensure partial state is cleaned up on failure
-            self.model = None
-            self.tokenizer = None
-            self._stop_tokens = None
-            self._model_loaded = False
-            # Clear any memory that might have been allocated
-            mx.clear_cache()
-            raise RuntimeError(f"Failed to load model from {self.model_path}: {e}") from e
-
-    def _extract_stop_tokens(self):
-        """Extract stop tokens from the tokenizer dynamically.
-        
-        This method identifies ALL tokens that should stop generation:
-        1. Official EOS token from tokenizer config
-        2. Message-end tokens from training (e.g., <|end|> for MXFP4)
-        3. Common stop tokens across models
-        """
-        self._stop_tokens = set()
-        self._message_end_tokens = set()  # Tokens that end messages but not conversations
-
-        # Primary source: eos_token
-        eos_token = getattr(self.tokenizer, 'eos_token', None)
-        if eos_token:
-            self._stop_tokens.add(eos_token)
-
-        # Also check pad_token if it's different from eos_token
-        pad_token = getattr(self.tokenizer, 'pad_token', None)
-        if pad_token and pad_token != eos_token:
-            self._stop_tokens.add(pad_token)
-
-        # Check additional_special_tokens
-        if hasattr(self.tokenizer, 'additional_special_tokens'):
-            for token in self.tokenizer.additional_special_tokens:
-                if token and isinstance(token, str):
-                    # Only add tokens that look like stop/end tokens
-                    if any(keyword in token.lower() for keyword in ['end', 'stop', 'eot']):
-                        self._stop_tokens.add(token)
-        
-        # MLX-LM 0.27.0+: Extract tokens from added_tokens_decoder (comprehensive source)
-        if hasattr(self.tokenizer, 'added_tokens_decoder'):
-            for _token_id, token_info in self.tokenizer.added_tokens_decoder.items():
-                if isinstance(token_info, dict) and 'content' in token_info:
-                    token_content = token_info['content']
-                    if token_content and isinstance(token_content, str):
-                        token_lower = token_content.lower()
-                        
-                        # NOTE: <|end|> is NOT a stop token for MXFP4 models!
-                        # It's a separator between reasoning and final answer
-                        if token_content == '<|end|>':
-                            self._message_end_tokens.add(token_content)
-                            # Do NOT add as stop token - let model continue to final answer
-                        
-                        # Look for tokens that could be end/stop tokens
-                        # Expanded patterns for MLX-LM 0.27.0 token varieties
-                        # EXCLUDE <|end|> for MXFP4 models as it's a reasoning separator
-                        end_patterns = ['stop', 'eot', 'return', 'finish', 'done', 'im_end']
-                        if any(pattern in token_lower for pattern in end_patterns):
-                            # Decide if it's a message-end or conversation-end token
-                            if 'im_end' in token_lower:
-                                self._message_end_tokens.add(token_content)
-                            self._stop_tokens.add(token_content)
-                        # Special handling for 'end' pattern - more selective
-                        elif 'end' in token_lower and token_content != '<|end|>':
-                            # Only add non-<|end|> tokens with 'end' in them
-                            self._stop_tokens.add(token_content)
-                        
-                        # Special case: control tokens in |..| format
-                        elif token_content.startswith('<|') and token_content.endswith('|>'):
-                            # Be inclusive with control tokens that might stop generation
-                            if any(pattern in token_lower for pattern in ['end', 'return', 'stop', 'finish']):
-                                self._stop_tokens.add(token_content)
-
-        # Model-specific handling based on known patterns
-        # Use reasoning_utils for reasoning model detection and patterns
-        from .reasoning_utils import ReasoningExtractor
-        
-        if hasattr(self.tokenizer, 'name_or_path'):
-            name_or_path = str(getattr(self.tokenizer, 'name_or_path', '')).lower()
-            model_type = ReasoningExtractor.detect_model_type(name_or_path)
-            
-            if model_type:
-                # This is a reasoning model
-                self._is_reasoning_model = True
-                
-                # Get patterns from reasoning_utils
-                if model_type in ReasoningExtractor.PATTERNS:
-                    markers = ReasoningExtractor.PATTERNS[model_type]['markers']
-                    self._reasoning_start = markers.get('reasoning_start')
-                    self._reasoning_end = markers.get('reasoning_end')
-                    self._final_start = markers.get('final_marker')
-                
-                # For reasoning models, remove reasoning_end from stop tokens
-                if self._reasoning_end:
-                    self._stop_tokens.discard(self._reasoning_end)
-                
-                # Add proper stop token for this model type
-                if model_type == 'gpt-oss':
-                    if '<|return|>' not in self._stop_tokens:
-                        self._stop_tokens.add('<|return|>')
-            else:
-                self._is_reasoning_model = False
-        else:
-            self._is_reasoning_model = False
-
-        # Add common stop tokens that might not be in special tokens
-        common_stop_tokens = {'</s>', '<|endoftext|>', '<|im_end|>', '<|eot_id|>'}
-        
-        # Add chat-specific stop tokens to prevent model self-conversations
-        # Based on our _format_conversation() format: "Human:" and "Assistant:"
-        # Also include "You:" as models might use UI-visible format
-        # Include single-letter variations (H:, A:, Y:) that some models use
-        chat_stop_tokens = {
-            '\nHuman:', '\nAssistant:', '\nYou:',
-            '\n\nHuman:', '\n\nAssistant:', '\n\nYou:',
-            '\nH:', '\nA:', '\nY:',  # Single-letter variations
-            '\n\nH:', '\n\nA:', '\n\nY:'
-        }
-
-        # Add common stop tokens only if they decode to themselves (i.e., they're single tokens)
-        for token in common_stop_tokens:
-            try:
-                # Try to encode and decode to verify it's a real single token
-                ids = self.tokenizer.encode(token, add_special_tokens=False)
-                if ids and len(ids) == 1:  # Single token ID means it's a special token
-                    decoded = self.tokenizer.decode(ids)
-                    if decoded == token:
-                        self._stop_tokens.add(token)
-            except:
-                pass
-        
-        # Store chat stop tokens separately - only used in interactive chat mode
-        # This prevents stopping mid-story when user asks for dialogues
-        self._chat_stop_tokens = list(chat_stop_tokens)
-
-        # Remove any None values
-        self._stop_tokens.discard(None)
-        self._message_end_tokens.discard(None)
-
-        # Convert to list for easier use
-        self._stop_tokens = list(self._stop_tokens)
-        self._message_end_tokens = list(self._message_end_tokens)
-
-        if self.verbose:
-            if self._stop_tokens:
-                print(f"Stop tokens: {self._stop_tokens}")
-            if self._message_end_tokens:
-                print(f"Message end tokens: {self._message_end_tokens}")
-
-    def cleanup(self):
-        """Clean up model resources and clear GPU memory.
-        
-        This method is safe to call multiple times and handles partial state cleanup.
-        """
-        if self.verbose and self._model_loaded:
-            memory_before = mx.get_active_memory() / 1024**3
-            print(f"Cleaning up model (memory before: {memory_before:.1f}GB)...")
-
-        # Always clean up, even if model wasn't fully loaded
-        self.model = None
-        self.tokenizer = None
-        self._stop_tokens = None
-        self._message_end_tokens = None
-        self._chat_stop_tokens = None
-        self._context_length = None
-        self._is_reasoning_model = False
-        self._reasoning_start = None
-        self._reasoning_end = None
-        self._final_start = None
-        self._model_loaded = False
-
-        # Force garbage collection and clear MLX cache
-        import gc
-        gc.collect()
-        try:
-            mx.clear_cache()
-        except Exception:
-            pass  # Continue cleanup even if cache clear fails
-
-        if self.verbose:
-            memory_after = mx.get_active_memory() / 1024**3
-            if 'memory_before' in locals():
-                memory_freed = memory_before - memory_after
-                print(f"Cleanup complete (memory after: {memory_after:.1f}GB, freed: {memory_freed:.1f}GB)")
-            else:
-                print(f"Cleanup complete (memory after: {memory_after:.1f}GB)")
-
-    def get_effective_max_tokens(self, requested_tokens: Optional[int], interactive: bool = False) -> int:
-        """Get effective max tokens based on model context and usage mode.
-        
-        Args:
-            requested_tokens: The requested max tokens (None if user didn't specify --max-tokens)
-            interactive: True if this is interactive mode (gets full context length)
-            
-        Returns:
-            Effective max tokens to use
-        """
-        if not self._context_length:
-            # Fallback when context length is unknown
-            fallback = 4096 if interactive else 2048
-            if self.verbose:
-                if requested_tokens is None:
-                    print(f"[WARNING] Model context length unknown, using fallback: {fallback} tokens")
-                else:
-                    print(f"[WARNING] Model context length unknown, using user specified: {requested_tokens} tokens")
-            return requested_tokens if requested_tokens is not None else fallback
-            
-        if interactive:
-            if requested_tokens is None:
-                # User didn't specify --max-tokens: use full model context
-                return self._context_length
-            else:
-                # User specified --max-tokens explicitly: respect their choice but cap at context
-                return min(requested_tokens, self._context_length)
-        else:
-            # Server/batch mode uses half context length for DoS protection
-            server_limit = self._context_length // 2
-            return min(requested_tokens or server_limit, server_limit)
-
-    def generate_streaming(
-        self,
-        prompt: str,
-        max_tokens: int = 500,
-        temperature: float = 0.7,
-        top_p: float = 0.9,
-        repetition_penalty: float = 1.1,
-        repetition_context_size: int = 20,
-        use_chat_template: bool = True,
-        use_chat_stop_tokens: bool = False,
-        interactive: bool = False,
-        hide_reasoning: bool = False,
-    ) -> Iterator[str]:
-        """Generate text with streaming output.
-        
-        Args:
-            prompt: Input prompt
-            max_tokens: Maximum tokens to generate
-            temperature: Sampling temperature
-            top_p: Top-p sampling parameter
-            repetition_penalty: Penalty for repeated tokens
-            repetition_context_size: Context size for repetition penalty
-            use_chat_template: Apply tokenizer's chat template if available
-            use_chat_stop_tokens: Include chat turn markers as stop tokens (for interactive mode)
-            interactive: True if this is interactive mode (affects token limits)
-            
-        Yields:
-            Generated tokens as they are produced
-        """
-        if not self.model or not self.tokenizer:
-            raise RuntimeError("Model not loaded. Call load_model() first.")
-            
-        # Initialize reasoning parser if this is a reasoning model
-        reasoning_parser = None
-        if self._is_reasoning_model:
-            model_type = ReasoningExtractor.detect_model_type(
-                getattr(self.tokenizer, 'name_or_path', '') or ''
-            )
-            reasoning_parser = StreamingReasoningParser(model_type, hide_reasoning=hide_reasoning)
-
-        # Apply context-aware token limits
-        effective_max_tokens = self.get_effective_max_tokens(max_tokens, interactive)
-
-        # Apply chat template if available and requested
-        if use_chat_template and hasattr(self.tokenizer, 'chat_template') and self.tokenizer.chat_template:
-            messages = [{"role": "user", "content": prompt}]
-            formatted_prompt = self.tokenizer.apply_chat_template(
-                messages,
-                tokenize=False,
-                add_generation_prompt=True
-            )
-        else:
-            formatted_prompt = prompt
-
-        # Tokenize the prompt
-        prompt_tokens = self.tokenizer.encode(formatted_prompt)
-        prompt_array = mx.array(prompt_tokens)
-
-        # Track generation metrics
-        start_time = time.time()
-        tokens_generated = 0
-
-        # Create sampler with our parameters
-        sampler = make_sampler(temp=temperature, top_p=top_p)
-
-        # Create repetition penalty processor if needed
-        logits_processors = []
-        if repetition_penalty > 1.0:
-            logits_processors.append(
-                make_repetition_penalty(repetition_penalty, repetition_context_size)
-            )
-
-        # Generate tokens one by one for streaming
-        generator = generate_step(
-            prompt=prompt_array,
-            model=self.model,
-            max_tokens=effective_max_tokens,
-            sampler=sampler,
-            logits_processors=logits_processors if logits_processors else None,
-        )
-
-        # Collect tokens and yield text
-        generated_tokens = []
-        previous_decoded = ""
-        accumulated_response = ""  # Track full response for stop token detection
-
-        # Keep a sliding window of recent tokens for context
-        context_window = 10  # Decode last N tokens for proper spacing
-
-        for token, _ in generator:
-            # Token might be an array or an int
-            token_id = token.item() if hasattr(token, 'item') else token
-            generated_tokens.append(token_id)
-
-            # Use a sliding window approach for efficiency
-            start_idx = max(0, len(generated_tokens) - context_window)
-            window_tokens = generated_tokens[start_idx:]
-
-            # Decode the window
-            window_text = self.tokenizer.decode(window_tokens)
-
-            # Figure out what's new
-            if start_idx == 0:
-                # We're still within the context window
-                if window_text.startswith(previous_decoded):
-                    new_text = window_text[len(previous_decoded):]
-                else:
-                    new_text = self.tokenizer.decode([token_id])
-                previous_decoded = window_text
-            else:
-                # We're beyond the context window, just decode the last token with context
-                # This is approximate but should preserve spaces
-                new_text = self.tokenizer.decode(window_tokens)
-                if len(window_tokens) > 1:
-                    prefix = self.tokenizer.decode(window_tokens[:-1])
-                    if new_text.startswith(prefix):
-                        new_text = new_text[len(prefix):]
-                    else:
-                        new_text = self.tokenizer.decode([token_id])
-
-            if new_text:
-                # Update accumulated response for stop token checking
-                accumulated_response += new_text
-                
-                # Filter out stop tokens with priority: native first, then chat fallback
-                # Check native stop tokens FIRST in accumulated response (highest priority)
-                native_stop_tokens = self._stop_tokens if self._stop_tokens else []
-                for stop_token in native_stop_tokens:
-                    if stop_token in accumulated_response:
-                        # Find the stop token position and yield everything before it
-                        stop_pos = accumulated_response.find(stop_token)
-                        # Calculate what text came before the stop token
-                        text_before_stop = accumulated_response[:stop_pos]
-                        # Calculate how much of that is new (not previously yielded)
-                        previously_yielded_length = len(accumulated_response) - len(new_text)
-                        if len(text_before_stop) > previously_yielded_length:
-                            # Yield only the new part before stop token
-                            new_part_before_stop = text_before_stop[previously_yielded_length:]
-                            if new_part_before_stop:
-                                if reasoning_parser:
-                                    # Process through reasoning parser for formatting
-                                    for formatted_token in reasoning_parser.process_token(new_part_before_stop):
-                                        yield formatted_token
-                                else:
-                                    yield new_part_before_stop
-                        return  # Stop generation without yielding stop token
-                
-                # Only check chat stop tokens if no native stop token found (fallback)
-                if use_chat_stop_tokens and self._chat_stop_tokens:
-                    for stop_token in self._chat_stop_tokens:
-                        if stop_token in accumulated_response:
-                            # Find the stop token position and yield everything before it
-                            stop_pos = accumulated_response.find(stop_token)
-                            # Calculate what text came before the stop token
-                            text_before_stop = accumulated_response[:stop_pos]
-                            # Calculate how much of that is new (not previously yielded)
-                            previously_yielded_length = len(accumulated_response) - len(new_text)
-                            if len(text_before_stop) > previously_yielded_length:
-                                # Yield only the new part before stop token
-                                new_part_before_stop = text_before_stop[previously_yielded_length:]
-                                if new_part_before_stop:
-                                    if reasoning_parser:
-                                        # Process through reasoning parser for formatting
-                                        for formatted_token in reasoning_parser.process_token(new_part_before_stop):
-                                            yield formatted_token
-                                    else:
-                                        yield new_part_before_stop
-                            return  # Stop generation without yielding stop token
-
-                # No stop token found, process the new text
-                if reasoning_parser:
-                    # Process through reasoning parser for formatting
-                    for formatted_token in reasoning_parser.process_token(new_text):
-                        yield formatted_token
-                else:
-                    # Normal streaming for non-reasoning models
-                    yield new_text
-                tokens_generated += 1
-
-            # Check for EOS token - don't yield it
-            if token_id == self.tokenizer.eos_token_id:
-                break
-
-        # Finalize reasoning parser if used
-        if reasoning_parser:
-            yield from reasoning_parser.finalize()
-        
-        # Print generation statistics if verbose
-        if self.verbose:
-            generation_time = time.time() - start_time
-            tokens_per_second = tokens_generated / generation_time if generation_time > 0 else 0
-            print(f"\n\nGenerated {tokens_generated} tokens in {generation_time:.1f}s ({tokens_per_second:.1f} tokens/s)")
-
-    def generate_batch(
-        self,
-        prompt: str,
-        max_tokens: int = 500,
-        temperature: float = 0.7,
-        top_p: float = 0.9,
-        repetition_penalty: float = 1.1,
-        repetition_context_size: int = 20,
-        use_chat_template: bool = True,
-        interactive: bool = False,
-    ) -> str:
-        """Generate text in batch mode (non-streaming).
-        
-        Args:
-            prompt: Input prompt
-            max_tokens: Maximum tokens to generate
-            temperature: Sampling temperature
-            top_p: Top-p sampling parameter
-            repetition_penalty: Penalty for repeated tokens
-            repetition_context_size: Context size for repetition penalty
-            use_chat_template: Apply tokenizer's chat template if available
-            interactive: True if this is interactive mode (affects token limits)
-            
-        Returns:
-            Generated text
-        """
-        if not self.model or not self.tokenizer:
-            raise RuntimeError("Model not loaded. Call load_model() first.")
-
-        # Apply context-aware token limits
-        effective_max_tokens = self.get_effective_max_tokens(max_tokens, interactive)
-
-        # Apply chat template if available and requested
-        if use_chat_template and hasattr(self.tokenizer, 'chat_template') and self.tokenizer.chat_template:
-            messages = [{"role": "user", "content": prompt}]
-            formatted_prompt = self.tokenizer.apply_chat_template(
-                messages,
-                tokenize=False,
-                add_generation_prompt=True
-            )
-        else:
-            formatted_prompt = prompt
-
-        start_time = time.time()
-
-        # Tokenize the prompt
-        prompt_tokens = self.tokenizer.encode(formatted_prompt)
-        prompt_array = mx.array(prompt_tokens)
-
-        # Create sampler with our parameters
-        sampler = make_sampler(temp=temperature, top_p=top_p)
-
-        # Create repetition penalty processor if needed
-        logits_processors = []
-        if repetition_penalty > 1.0:
-            logits_processors.append(
-                make_repetition_penalty(repetition_penalty, repetition_context_size)
-            )
-
-        # Generate all tokens at once
-        generated_tokens = []
-        all_tokens = list(prompt_tokens)  # Keep prompt for proper decoding
-
-        generator = generate_step(
-            prompt=prompt_array,
-            model=self.model,
-            max_tokens=effective_max_tokens,
-            sampler=sampler,
-            logits_processors=logits_processors if logits_processors else None,
-        )
-
-        for token, _ in generator:
-            # Token might be an array or an int
-            token_id = token.item() if hasattr(token, 'item') else token
-            generated_tokens.append(token_id)
-            all_tokens.append(token_id)
-
-            # Check for EOS token - don't yield it
-            if token_id == self.tokenizer.eos_token_id:
-                break
-
-        # Decode all tokens together for proper spacing
-        full_response = self.tokenizer.decode(all_tokens)
-
-        # Remove the prompt part
-        if full_response.startswith(formatted_prompt):
-            response = full_response[len(formatted_prompt):]
-        else:
-            # Fallback: just decode generated tokens
-            response = self.tokenizer.decode(generated_tokens)
-
-        # Apply end-token filtering (same logic as streaming mode for Issue #20)
-        response = self._filter_end_tokens_from_response(response, use_chat_stop_tokens=False)
-        
-        # Format reasoning models output
-        response = self._format_reasoning_response(response)
-
-        generation_time = time.time() - start_time
-
-        # Count tokens for statistics
-        if self.verbose:
-            tokens_generated = len(generated_tokens)
-            tokens_per_second = tokens_generated / generation_time if generation_time > 0 else 0
-            print(f"\nGenerated {tokens_generated} tokens in {generation_time:.1f}s ({tokens_per_second:.1f} tokens/s)")
-
-        return response
-
-    def interactive_chat(
-        self,
-        system_prompt: Optional[str] = None,
-        max_tokens: int = 500,
-        temperature: float = 0.7,
-        top_p: float = 0.9,
-        repetition_penalty: float = 1.1,
-        use_chat_template: bool = True,
-    ):
-        """Run an interactive chat session.
-        
-        Args:
-            system_prompt: Optional system prompt to prepend
-            max_tokens: Maximum tokens per response
-            temperature: Sampling temperature
-            top_p: Top-p sampling parameter
-            repetition_penalty: Penalty for repeated tokens
-            use_chat_template: Use tokenizer's chat template if available
-        """
-        print("Starting interactive chat. Type 'exit' or 'quit' to end.\n")
-
-        conversation_history = []
-        if system_prompt:
-            conversation_history.append({"role": "system", "content": system_prompt})
-
-        while True:
-            try:
-                # Get user input
-                user_input = input("You: ").strip()
-
-                if user_input.lower() in ['exit', 'quit', 'q']:
-                    print("\nGoodbye!")
-                    break
-
-                if not user_input:
-                    continue
-
-                # Add user message to history
-                conversation_history.append({"role": "user", "content": user_input})
-
-                # Format conversation for the model using chat template if available
-                prompt = self._format_conversation(conversation_history, use_chat_template=use_chat_template)
-
-                # Generate response with streaming
-                print("\nAssistant: ", end="", flush=True)
-
-                response_tokens = []
-                for token in self.generate_streaming(
-                    prompt=prompt,
-                    max_tokens=max_tokens,
-                    temperature=temperature,
-                    top_p=top_p,
-                    repetition_penalty=repetition_penalty,
-                    use_chat_template=False,  # Already applied in _format_conversation
-                    use_chat_stop_tokens=True,  # Enable chat stop tokens in interactive mode
-                    interactive=True,  # Enable full context length for interactive mode
-                ):
-                    # Stream all tokens directly (already formatted by generate_streaming)
-                    print(token, end="", flush=True)
-                    response_tokens.append(token)
-
-                # Add assistant response to history
-                assistant_response = "".join(response_tokens).strip()
-                conversation_history.append({"role": "assistant", "content": assistant_response})
-
-                print()  # New line after response
-
-            except KeyboardInterrupt:
-                print("\n\nChat interrupted. Goodbye!")
-                break
-            except Exception as e:
-                print(f"\n[ERROR] {e}")
-                continue
-
-    def _format_conversation(self, messages: list, use_chat_template: bool = True) -> str:
-        """Format conversation history into a prompt.
-        
-        Uses the tokenizer's chat template if available, otherwise falls back
-        to the legacy Human:/Assistant: format for compatibility.
-        
-        Args:
-            messages: List of message dictionaries with 'role' and 'content'
-            use_chat_template: Whether to use chat template if available
-            
-        Returns:
-            Formatted conversation string
-        """
-        # Try to use native chat template if available
-        if use_chat_template and hasattr(self.tokenizer, 'chat_template') and self.tokenizer.chat_template:
-            try:
-                # Apply the tokenizer's chat template
-                formatted_prompt = self.tokenizer.apply_chat_template(
-                    messages,
-                    tokenize=False,
-                    add_generation_prompt=True
-                )
-                return formatted_prompt
-            except Exception as e:
-                # If chat template fails, fall back to legacy format
-                if self.verbose:
-                    print(f"[WARNING] Chat template failed, using legacy format: {e}")
-        
-        # Legacy format fallback for compatibility
-        return self._legacy_format_conversation(messages)
-    
-    def _legacy_format_conversation(self, messages: list) -> str:
-        """Legacy conversation formatting for backward compatibility.
-        
-        This format was used in earlier versions and remains as a fallback
-        for models without chat templates.
-        """
-        formatted = []
-
-        for message in messages:
-            role = message["role"]
-            content = message["content"]
-
-            if role == "system":
-                formatted.append(f"System: {content}")
-            elif role == "user":
-                formatted.append(f"Human: {content}")
-            elif role == "assistant":
-                formatted.append(f"Assistant: {content}")
-
-        # Add prompt for next assistant response
-        formatted.append("Assistant:")
-
-        return "\n\n".join(formatted)
-
-    def get_memory_usage(self) -> Dict[str, float]:
-        """Get current memory usage statistics.
-        
-        Returns:
-            Dictionary with memory statistics in GB
-        """
-        try:
-            current_memory = mx.get_active_memory() / 1024**3
-            peak_memory = mx.get_peak_memory() / 1024**3
-        except Exception:
-            # Return zeros if memory stats unavailable
-            current_memory = 0.0
-            peak_memory = 0.0
-
-        return {
-            "current_gb": current_memory,
-            "peak_gb": peak_memory,
-            "model_gb": current_memory - self._memory_baseline if self._memory_baseline else 0,
-        }
-
-    def _format_reasoning_response(self, response: str) -> str:
-        """Format response from reasoning models for better readability.
-        
-        For MXFP4 models that generate reasoning followed by final answer,
-        format it nicely for display.
-        """
-        if not self._is_reasoning_model:
-            return response
-            
-        # Check if response contains reasoning markers
-        if self._reasoning_start in response and self._final_start in response:
-            # Extract reasoning and final parts
-            try:
-                # Split on the reasoning start
-                before_reasoning, after_start = response.split(self._reasoning_start, 1)
-                
-                # Find the reasoning content (until <|end|>)
-                if self._reasoning_end in after_start:
-                    reasoning_content, after_reasoning = after_start.split(self._reasoning_end, 1)
-                    
-                    # Find the final answer
-                    if self._final_start in after_reasoning:
-                        # Extract everything after final marker
-                        final_parts = after_reasoning.split(self._final_start, 1)
-                        if len(final_parts) > 1:
-                            # Remove the <|channel|>final<|message|> marker
-                            final_answer = final_parts[1].replace('<|channel|>final<|message|>', '', 1)
-                            
-                            # Format with clear markers for parsing but minimal visual impact
-                            formatted = []
-                            formatted.append("\n**[Reasoning]**\n")
-                            formatted.append(reasoning_content.strip())
-                            formatted.append("\n\n---\n\n**[Answer]**\n")
-                            formatted.append(final_answer.strip())
-                            
-                            return '\n'.join(formatted)
-            except Exception:
-                # If parsing fails, return original
-                pass
-        
-        # Fallback: just clean up the control tokens
-        cleaned = response
-        for marker in ['<|channel|>analysis<|message|>', '<|end|>', '<|start|>assistant',
-                      '<|channel|>final<|message|>', '<|return|>']:
-            cleaned = cleaned.replace(marker, '')
-        
-        return cleaned.strip()
-    
-    def _filter_end_tokens_from_response(self, response: str, use_chat_stop_tokens: bool = False) -> str:
-        """Filter end tokens from a complete response (batch mode).
-        
-        This method applies the same filtering logic as the streaming mode
-        to ensure consistent behavior between streaming and non-streaming.
-        
-        Args:
-            response: The complete generated response
-            use_chat_stop_tokens: Whether to apply chat stop tokens
-            
-        Returns:
-            Response with end tokens filtered out
-        """
-        # Apply native stop token filtering FIRST (highest priority)
-        native_stop_tokens = self._stop_tokens if self._stop_tokens else []
-        for stop_token in native_stop_tokens:
-            if stop_token in response:
-                # Find the stop token position and return everything before it
-                stop_pos = response.find(stop_token)
-                filtered_response = response[:stop_pos].rstrip()
-                if self.verbose:
-                    print(f"[DEBUG] Filtered stop token '{stop_token}' at position {stop_pos}")
-                return filtered_response
-        
-        # Only check chat stop tokens if no native stop token found (fallback)
-        if use_chat_stop_tokens and self._chat_stop_tokens:
-            for stop_token in self._chat_stop_tokens:
-                if stop_token in response:
-                    # Find the stop token position and return everything before it
-                    stop_pos = response.find(stop_token)
-                    return response[:stop_pos]
-        
-        # No stop tokens found, return original response
-        return response
-
-
-def get_gpu_status() -> Dict[str, float]:
-    """Independent GPU status check - usable from anywhere.
-    
-    Returns:
-        Dictionary with GPU memory statistics in GB
-    """
-    return {
-        "active_memory_gb": mx.get_active_memory() / 1024**3,
-        "peak_memory_gb": mx.get_peak_memory() / 1024**3,
-    }
-
-
-def check_memory_available(required_gb: float) -> bool:
-    """Pre-flight check before model loading.
-    
-    Args:
-        required_gb: Required memory in GB
-        
-    Returns:
-        True if memory is likely available (conservative estimate)
-    """
-    current_memory = mx.get_active_memory() / 1024**3
-
-    # Conservative estimate: assume system has at least 8GB unified memory
-    # and we should leave some headroom (2GB) for system processes
-    estimated_total = 8.0  # This could be improved by detecting actual system memory
-    available = estimated_total - current_memory - 2.0  # 2GB headroom
-
-    return available >= required_gb
-
-
-def run_model_enhanced(
-    model_path: str,
-    prompt: Optional[str] = None,
-    interactive: bool = False,
-    max_tokens: int = 500,
-    temperature: float = 0.7,
-    top_p: float = 0.9,
-    repetition_penalty: float = 1.1,
-    stream: bool = True,
-    use_chat_template: bool = True,
-    hide_reasoning: bool = False,
-    verbose: bool = False,
-) -> Optional[str]:
-    """Enhanced run function with direct MLX integration.
-    
-    Uses context manager pattern for automatic resource cleanup.
-    
-    Args:
-        model_path: Path to the MLX model
-        prompt: Input prompt (if None, enters interactive mode)
-        interactive: Force interactive mode
-        max_tokens: Maximum tokens to generate
-        temperature: Sampling temperature
-        top_p: Top-p sampling parameter
-        repetition_penalty: Penalty for repeated tokens
-        stream: Whether to stream output
-        
-    Returns:
-        Generated text (in non-interactive mode)
-    """
-    try:
-        with MLXRunner(model_path, verbose=verbose) as runner:
-            # Interactive mode
-            if interactive or prompt is None:
-                runner.interactive_chat(
-                    max_tokens=max_tokens,
-                    temperature=temperature,
-                    top_p=top_p,
-                    repetition_penalty=repetition_penalty,
-                    use_chat_template=use_chat_template,
-                )
-                return None
-
-            # Single prompt mode
-            if verbose:
-                print(f"\nPrompt: {prompt}\n")
-                print("Response: ", end="", flush=True)
-
-            if stream:
-                # Streaming generation
-                response_tokens = []
-                try:
-                    for token in runner.generate_streaming(
-                        prompt=prompt,
-                        max_tokens=max_tokens,
-                        temperature=temperature,
-                        top_p=top_p,
-                        repetition_penalty=repetition_penalty,
-                        use_chat_template=use_chat_template,
-                        hide_reasoning=hide_reasoning,
-                    ):
-                        # Stream all tokens directly (already formatted by generate_streaming)
-                        print(token, end="", flush=True)
-                        response_tokens.append(token)
-                except KeyboardInterrupt:
-                    print("\n[INFO] Generation interrupted by user.")
-                response = "".join(response_tokens)
-            else:
-                # Batch generation
-                try:
-                    response = runner.generate_batch(
-                        prompt=prompt,
-                        max_tokens=max_tokens,
-                        temperature=temperature,
-                        top_p=top_p,
-                        repetition_penalty=repetition_penalty,
-                        use_chat_template=use_chat_template,
-                    )
-                except KeyboardInterrupt:
-                    print("\n[INFO] Generation interrupted by user.")
-                    response = ""
-                print(response)
-
-            # Show memory usage if verbose
-            if verbose:
-                memory_stats = runner.get_memory_usage()
-                print(f"\n\nMemory: {memory_stats['model_gb']:.1f}GB model, {memory_stats['current_gb']:.1f}GB total")
-
-            return response
-
-        # Note: cleanup happens automatically due to context manager
-
-    except Exception as e:
-        print(f"\n[ERROR] {e}")
-        return None
diff --git a/mlx_knife/model_card.py b/mlx_knife/model_card.py
deleted file mode 100644
index e43a9eb..0000000
--- a/mlx_knife/model_card.py
+++ /dev/null
@@ -1,164 +0,0 @@
-from __future__ import annotations
-
-# ruff: noqa: UP045
-
-"""
-Lightweight helpers to read model metadata hints from cached Hugging Face models.
-
-No external dependencies; YAML front matter is hand-parsed leniently.
-
-Priority rules (Issue #31):
-- Tokenizer config: if tokenizer_config.json has chat_template -> Type = chat
-- README.md front matter (YAML):
-  - tags contains "mlx" OR library_name == "mlx" -> Framework = MLX
-  - pipeline_tag == text-generation OR tags contain chat/instruct -> Type = chat
-  - pipeline_tag == sentence-similarity OR tags contain embedding -> Type = embedding
-- Fallback for framework/type remains in cache_utils
-"""
-
-import json
-from pathlib import Path
-from typing import Any, Dict, List, Optional, Tuple
-
-
-def _latest_snapshot_dir(model_base_dir: Path) -> Optional[Path]:
-    """Return latest snapshot directory for a cached HF model base dir."""
-    try:
-        snaps = (model_base_dir / "snapshots")
-        if not snaps.exists():
-            return None
-        candidates = [d for d in snaps.iterdir() if d.is_dir()]
-        if not candidates:
-            return None
-        return max(candidates, key=lambda p: p.stat().st_mtime)
-    except Exception:
-        return None
-
-
-def _lenient_yaml_front_matter(text: str) -> Dict[str, Any]:
-    """Very small YAML front matter parser for the fields we need.
-
-    Supports forms:
-    ---
-    tags: [mlx, chat]
-    pipeline_tag: text-generation
-    library_name: mlx
-    ---
-
-    And list style:
-    tags:
-      - mlx
-      - chat
-    """
-    start = text.find("\n---\n")
-    # Accept files starting directly with '---' too
-    if text.startswith('---'):
-        start = 0
-    elif start >= 0:
-        start = start + 1  # move to line start
-    else:
-        # Try at very beginning without newline
-        start = 0 if text[:3] == '---' else -1
-    if start != 0:
-        return {}
-
-    # Find closing '---' after start
-    end = text.find('\n---', 3)
-    if end == -1:
-        return {}
-    header = text[3:end] if text.startswith('---') else text[start + 3:end]
-
-    # Normalize lines
-    lines = [ln.strip() for ln in header.splitlines() if ln.strip()]
-
-    data: Dict[str, Any] = {}
-    current_key: Optional[str] = None
-    list_acc: List[str] = []
-
-    def flush_list():
-        nonlocal list_acc, current_key
-        if current_key is not None and list_acc:
-            data[current_key] = list_acc[:]
-        list_acc = []
-
-    for ln in lines:
-        if ln.startswith('- '):
-            # list item under current_key
-            val = ln[2:].strip().strip('"\'')
-            if current_key is not None:
-                list_acc.append(val)
-            continue
-        # key: value or key: [a, b]
-        if ':' in ln:
-            # Close any previous list
-            flush_list()
-            key, val = ln.split(':', 1)
-            key = key.strip()
-            val = val.strip()
-            current_key = key
-            if not val:
-                # expect multi-line list next
-                data.setdefault(key, [])
-                continue
-            # Inline list [a, b]
-            if val.startswith('[') and val.endswith(']'):
-                inner = val[1:-1].strip()
-                items = [] if not inner else [it.strip().strip('"\'') for it in inner.split(',')]
-                data[key] = [x for x in items if x]
-                continue
-            # Scalar
-            data[key] = val.strip('"\'')
-            continue
-        # Non key-value, ignore
-    # Flush last list
-    flush_list()
-    return data
-
-
-def read_readme_front_matter(model_base_dir: Path) -> Tuple[Optional[List[str]], Optional[str], Optional[str]]:
-    """Read README.md front matter and extract tags, pipeline_tag, library_name.
-
-    Returns (tags, pipeline_tag, library_name) with lowercase normalization where applicable.
-    Any read/parse error results in (None, None, None).
-    """
-    try:
-        snap = _latest_snapshot_dir(model_base_dir)
-        if not snap:
-            return None, None, None
-        readme = snap / 'README.md'
-        if not readme.exists():
-            return None, None, None
-        text = readme.read_text(encoding='utf-8', errors='ignore')
-        fm = _lenient_yaml_front_matter(text)
-        if not fm:
-            return None, None, None
-        tags = fm.get('tags')
-        if isinstance(tags, list):
-            tags = [str(t).strip().lower() for t in tags if str(t).strip()]
-        else:
-            tags = None
-        pipeline = fm.get('pipeline_tag')
-        pipeline = str(pipeline).strip().lower() if pipeline else None
-        lib = fm.get('library_name')
-        lib = str(lib).strip().lower() if lib else None
-        return tags, pipeline, lib
-    except Exception:
-        return None, None, None
-
-
-def tokenizer_has_chat_template(model_base_dir: Path) -> bool:
-    """Check tokenizer_config.json for a non-empty 'chat_template' field in latest snapshot."""
-    try:
-        snap = _latest_snapshot_dir(model_base_dir)
-        if not snap:
-            return False
-        tk = snap / 'tokenizer_config.json'
-        if not tk.exists():
-            return False
-        with open(tk, encoding='utf-8') as f:
-            data = json.load(f)
-        tmpl = data.get('chat_template')
-        return bool(tmpl and isinstance(tmpl, str) and tmpl.strip())
-    except Exception:
-        return False
-
diff --git a/mlx_knife/server.py b/mlx_knife/server.py
deleted file mode 100644
index a031e10..0000000
--- a/mlx_knife/server.py
+++ /dev/null
@@ -1,581 +0,0 @@
-# mlx_knife/server.py
-"""
-OpenAI-compatible API server for MLX models.
-Provides REST endpoints for text generation with MLX backend.
-"""
-
-import json
-import time
-import uuid
-from collections.abc import AsyncGenerator
-from contextlib import asynccontextmanager
-from typing import Any, Dict, List, Optional, Union
-
-import uvicorn
-from fastapi import FastAPI, HTTPException
-from fastapi.middleware.cors import CORSMiddleware
-from fastapi.responses import StreamingResponse
-from pydantic import BaseModel, Field
-
-from .cache_utils import (
-    detect_framework,
-    detect_model_type,
-    get_model_path,
-    is_model_healthy,
-)
-from .mlx_runner import MLXRunner
-
-# Global model cache and configuration
-_model_cache: Dict[str, MLXRunner] = {}
-_current_model_path: Optional[str] = None
-_default_max_tokens: Optional[int] = None  # Use dynamic model-aware limits by default
-
-
-class CompletionRequest(BaseModel):
-    model: str
-    prompt: Union[str, List[str]]
-    max_tokens: Optional[int] = None
-    temperature: Optional[float] = 0.7
-    top_p: Optional[float] = 0.9
-    stream: Optional[bool] = False
-    stop: Optional[Union[str, List[str]]] = None
-    repetition_penalty: Optional[float] = 1.1
-
-
-class ChatMessage(BaseModel):
-    role: str = Field(..., pattern="^(system|user|assistant)$")
-    content: str
-
-
-class ChatCompletionRequest(BaseModel):
-    model: str
-    messages: List[ChatMessage]
-    max_tokens: Optional[int] = None
-    temperature: Optional[float] = 0.7
-    top_p: Optional[float] = 0.9
-    stream: Optional[bool] = False
-    stop: Optional[Union[str, List[str]]] = None
-    repetition_penalty: Optional[float] = 1.1
-
-
-class CompletionResponse(BaseModel):
-    id: str
-    object: str = "text_completion"
-    created: int
-    model: str
-    choices: List[Dict[str, Any]]
-    usage: Dict[str, int]
-
-
-class ChatCompletionResponse(BaseModel):
-    id: str
-    object: str = "chat.completion"
-    created: int
-    model: str
-    choices: List[Dict[str, Any]]
-    usage: Dict[str, int]
-
-
-class ModelInfo(BaseModel):
-    id: str
-    object: str = "model"
-    owned_by: str = "mlx-knife"
-    permission: List = []
-    context_length: Optional[int] = None
-
-
-
-def get_or_load_model(model_spec: str, verbose: bool = False) -> MLXRunner:
-    """Get model from cache or load it if not cached."""
-    global _model_cache, _current_model_path
-
-    # Use the existing model path resolution from cache_utils
-    from .cache_utils import get_model_path
-
-    try:
-        model_path, model_name, commit_hash = get_model_path(model_spec)
-        if not model_path.exists():
-            raise HTTPException(status_code=404, detail=f"Model {model_spec} not found in cache")
-    except Exception as e:
-        raise HTTPException(status_code=404, detail=f"Model {model_spec} not found: {str(e)}")
-
-    # Check if it's an MLX model
-    framework = detect_framework(model_path.parent.parent, model_name)
-    if framework != "MLX":
-        raise HTTPException(status_code=400, detail=f"Model {model_name} is not a valid MLX model (Framework: {framework})")
-
-    model_path_str = str(model_path)
-
-    # Check if we need to load a different model
-    if _current_model_path != model_path_str:
-        # Proactively clean up any previously loaded runner to release memory
-        if _model_cache:
-            try:
-                for _old_runner in list(_model_cache.values()):
-                    try:
-                        _old_runner.cleanup()
-                    except Exception:
-                        pass
-            finally:
-                _model_cache.clear()
-
-        # Load new model
-        if verbose:
-            print(f"Loading model: {model_name}")
-
-        runner = MLXRunner(model_path_str, verbose=verbose)
-        runner.load_model()
-
-        _model_cache[model_path_str] = runner
-        _current_model_path = model_path_str
-
-    return _model_cache[model_path_str]
-
-
-async def generate_completion_stream(
-    runner: MLXRunner,
-    prompt: str,
-    request: CompletionRequest
-) -> AsyncGenerator[str, None]:
-    """Generate streaming completion response."""
-    completion_id = f"cmpl-{uuid.uuid4()}"
-    created = int(time.time())
-
-    # Yield initial response
-    initial_response = {
-        "id": completion_id,
-        "object": "text_completion",
-        "created": created,
-        "model": request.model,
-        "choices": [
-            {
-                "index": 0,
-                "text": "",
-                "logprobs": None,
-                "finish_reason": None
-            }
-        ]
-    }
-
-    yield f"data: {json.dumps(initial_response)}\n\n"
-
-    # Stream tokens
-    try:
-        token_count = 0
-        for token in runner.generate_streaming(
-            prompt=prompt,
-            max_tokens=runner.get_effective_max_tokens(request.max_tokens or _default_max_tokens, interactive=False),
-            temperature=request.temperature,
-            top_p=request.top_p,
-            repetition_penalty=request.repetition_penalty,
-            use_chat_template=False  # Raw completion mode
-        ):
-            token_count += 1
-
-            chunk_response = {
-                "id": completion_id,
-                "object": "text_completion",
-                "created": created,
-                "model": request.model,
-                "choices": [
-                    {
-                        "index": 0,
-                        "text": token,
-                        "logprobs": None,
-                        "finish_reason": None
-                    }
-                ]
-            }
-
-            yield f"data: {json.dumps(chunk_response)}\n\n"
-
-            # Check for stop sequences
-            if request.stop:
-                stop_sequences = request.stop if isinstance(request.stop, list) else [request.stop]
-                if any(stop in token for stop in stop_sequences):
-                    break
-
-    except Exception as e:
-        error_response = {
-            "id": completion_id,
-            "object": "text_completion",
-            "created": created,
-            "model": request.model,
-            "choices": [
-                {
-                    "index": 0,
-                    "text": "",
-                    "logprobs": None,
-                    "finish_reason": "error"
-                }
-            ],
-            "error": str(e)
-        }
-        yield f"data: {json.dumps(error_response)}\n\n"
-
-    # Final response
-    final_response = {
-        "id": completion_id,
-        "object": "text_completion",
-        "created": created,
-        "model": request.model,
-        "choices": [
-            {
-                "index": 0,
-                "text": "",
-                "logprobs": None,
-                "finish_reason": "stop"
-            }
-        ]
-    }
-
-    yield f"data: {json.dumps(final_response)}\n\n"
-    yield "data: [DONE]\n\n"
-
-
-async def generate_chat_stream(
-    runner: MLXRunner,
-    messages: List[ChatMessage],
-    request: ChatCompletionRequest
-) -> AsyncGenerator[str, None]:
-    """Generate streaming chat completion response."""
-    completion_id = f"chatcmpl-{uuid.uuid4()}"
-    created = int(time.time())
-
-    # Convert messages to dict format for runner
-    message_dicts = format_chat_messages_for_runner(messages)
-    
-    # Let the runner format with chat templates
-    prompt = runner._format_conversation(message_dicts, use_chat_template=True)
-
-    # Yield initial response
-    initial_response = {
-        "id": completion_id,
-        "object": "chat.completion.chunk",
-        "created": created,
-        "model": request.model,
-        "choices": [
-            {
-                "index": 0,
-                "delta": {"role": "assistant", "content": ""},
-                "finish_reason": None
-            }
-        ]
-    }
-
-    yield f"data: {json.dumps(initial_response)}\n\n"
-
-    # Stream tokens
-    try:
-        for token in runner.generate_streaming(
-            prompt=prompt,
-            max_tokens=runner.get_effective_max_tokens(request.max_tokens or _default_max_tokens, interactive=False),
-            temperature=request.temperature,
-            top_p=request.top_p,
-            repetition_penalty=request.repetition_penalty,
-            use_chat_template=False,  # Already applied in _format_conversation
-            use_chat_stop_tokens=False  # Server mode shouldn't stop on chat markers
-        ):
-            chunk_response = {
-                "id": completion_id,
-                "object": "chat.completion.chunk",
-                "created": created,
-                "model": request.model,
-                "choices": [
-                    {
-                        "index": 0,
-                        "delta": {"content": token},
-                        "finish_reason": None
-                    }
-                ]
-            }
-
-            yield f"data: {json.dumps(chunk_response)}\n\n"
-
-            # Check for stop sequences
-            if request.stop:
-                stop_sequences = request.stop if isinstance(request.stop, list) else [request.stop]
-                if any(stop in token for stop in stop_sequences):
-                    break
-
-    except Exception as e:
-        error_response = {
-            "id": completion_id,
-            "object": "chat.completion.chunk",
-            "created": created,
-            "model": request.model,
-            "choices": [
-                {
-                    "index": 0,
-                    "delta": {},
-                    "finish_reason": "error"
-                }
-            ],
-            "error": str(e)
-        }
-        yield f"data: {json.dumps(error_response)}\n\n"
-
-    # Final response
-    final_response = {
-        "id": completion_id,
-        "object": "chat.completion.chunk",
-        "created": created,
-        "model": request.model,
-        "choices": [
-            {
-                "index": 0,
-                "delta": {},
-                "finish_reason": "stop"
-            }
-        ]
-    }
-
-    yield f"data: {json.dumps(final_response)}\n\n"
-    yield "data: [DONE]\n\n"
-
-
-def format_chat_messages_for_runner(messages: List[ChatMessage]) -> List[Dict[str, str]]:
-    """Convert chat messages to format expected by MLXRunner.
-    
-    Returns messages in dict format for the runner to apply chat templates.
-    """
-    return [{"role": msg.role, "content": msg.content} for msg in messages]
-
-
-def count_tokens(text: str) -> int:
-    """Rough token count estimation."""
-    return int(len(text.split()) * 1.3)  # Approximation, convert to int
-
-
-@asynccontextmanager
-async def lifespan(app: FastAPI):
-    """Manage application lifespan."""
-    print("MLX Knife Server starting up...")
-    yield
-    print("MLX Knife Server shutting down...")
-    # Clean up model cache
-    global _model_cache
-    try:
-        for _runner in list(_model_cache.values()):
-            try:
-                _runner.cleanup()
-            except Exception:
-                pass
-    finally:
-        _model_cache.clear()
-
-
-# Create FastAPI app
-from . import __version__
-
-app = FastAPI(
-    title="MLX Knife API",
-    description="OpenAI-compatible API for MLX models",
-    version=__version__,
-    lifespan=lifespan
-)
-
-# Add CORS middleware for browser access
-app.add_middleware(
-    CORSMiddleware,
-    allow_origins=["*"],  # Allow all origins for local development
-    allow_credentials=True,
-    allow_methods=["*"],
-    allow_headers=["*"],
-)
-
-
-@app.get("/health")
-async def health_check():
-    """Health check endpoint (OpenAI compatible)."""
-    return {"status": "healthy", "service": "mlx-knife-server"}
-
-
-
-
-@app.get("/v1/models")
-async def list_models():
-    """List available models (conservative, unchanged by Issue #31)."""
-    from .cache_utils import MODEL_CACHE, cache_dir_to_hf
-
-    model_list = []
-    models = [d for d in MODEL_CACHE.iterdir() if d.name.startswith("models--")]
-
-    for model_dir in models:
-        model_name = cache_dir_to_hf(model_dir.name)
-        framework = detect_framework(model_dir, model_name)
-
-        if framework == "MLX" and is_model_healthy(model_name):
-            # Only expose chat-capable models for the chat/completions API
-            try:
-                mtype = detect_model_type(model_dir, model_name)
-            except Exception:
-                mtype = "base"
-            if mtype != "chat":
-                continue
-            # Get model context length (best effort)
-            context_length = None
-            try:
-                model_path_tuple = get_model_path(model_name)
-                if model_path_tuple and model_path_tuple[0]:
-                    from .mlx_runner import get_model_context_length
-                    context_length = get_model_context_length(str(model_path_tuple[0]))
-            except Exception:
-                pass
-
-            model_list.append(ModelInfo(
-                id=model_name,
-                object="model",
-                owned_by="mlx-knife",
-                context_length=context_length
-            ))
-
-    return {"object": "list", "data": model_list}
-
-
-@app.post("/v1/completions")
-async def create_completion(request: CompletionRequest):
-    """Create a text completion."""
-    try:
-        runner = get_or_load_model(request.model)
-
-        # Handle array of prompts
-        if isinstance(request.prompt, list):
-            if len(request.prompt) > 1:
-                raise HTTPException(status_code=400, detail="Multiple prompts not supported yet")
-            prompt = request.prompt[0]
-        else:
-            prompt = request.prompt
-
-        if request.stream:
-            # Streaming response
-            return StreamingResponse(
-                generate_completion_stream(runner, prompt, request),
-                media_type="text/plain",
-                headers={"Cache-Control": "no-cache"}
-            )
-        else:
-            # Non-streaming response
-            completion_id = f"cmpl-{uuid.uuid4()}"
-            created = int(time.time())
-
-            generated_text = runner.generate_batch(
-                prompt=prompt,
-                max_tokens=runner.get_effective_max_tokens(request.max_tokens or _default_max_tokens, interactive=False),
-                temperature=request.temperature,
-                top_p=request.top_p,
-                repetition_penalty=request.repetition_penalty,
-                use_chat_template=False
-            )
-
-            prompt_tokens = count_tokens(prompt)
-            completion_tokens = count_tokens(generated_text)
-
-            return CompletionResponse(
-                id=completion_id,
-                created=created,
-                model=request.model,
-                choices=[
-                    {
-                        "index": 0,
-                        "text": generated_text,
-                        "logprobs": None,
-                        "finish_reason": "stop"
-                    }
-                ],
-                usage={
-                    "prompt_tokens": prompt_tokens,
-                    "completion_tokens": completion_tokens,
-                    "total_tokens": prompt_tokens + completion_tokens
-                }
-            )
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))
-
-
-@app.post("/v1/chat/completions")
-async def create_chat_completion(request: ChatCompletionRequest):
-    """Create a chat completion."""
-    try:
-        runner = get_or_load_model(request.model)
-
-        if request.stream:
-            # Streaming response
-            return StreamingResponse(
-                generate_chat_stream(runner, request.messages, request),
-                media_type="text/plain",
-                headers={"Cache-Control": "no-cache"}
-            )
-        else:
-            # Non-streaming response
-            completion_id = f"chatcmpl-{uuid.uuid4()}"
-            created = int(time.time())
-
-            # Convert messages to dict format for runner
-            message_dicts = format_chat_messages_for_runner(request.messages)
-            
-            # Let the runner format with chat templates
-            prompt = runner._format_conversation(message_dicts, use_chat_template=True)
-
-            generated_text = runner.generate_batch(
-                prompt=prompt,
-                max_tokens=runner.get_effective_max_tokens(request.max_tokens or _default_max_tokens, interactive=False),
-                temperature=request.temperature,
-                top_p=request.top_p,
-                repetition_penalty=request.repetition_penalty,
-                use_chat_template=False  # Already applied in _format_conversation
-            )
-
-            # Token counting
-            total_prompt = "\n\n".join([msg.content for msg in request.messages])
-            prompt_tokens = count_tokens(total_prompt)
-            completion_tokens = count_tokens(generated_text)
-
-            return ChatCompletionResponse(
-                id=completion_id,
-                created=created,
-                model=request.model,
-                choices=[
-                    {
-                        "index": 0,
-                        "message": {
-                            "role": "assistant",
-                            "content": generated_text
-                        },
-                        "finish_reason": "stop"
-                    }
-                ],
-                usage={
-                    "prompt_tokens": prompt_tokens,
-                    "completion_tokens": completion_tokens,
-                    "total_tokens": prompt_tokens + completion_tokens
-                }
-            )
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))
-
-
-def run_server(
-    host: str = "127.0.0.1",
-    port: int = 8000,
-    max_tokens: int = 2000,
-    reload: bool = False,
-    log_level: str = "info"
-):
-    """Run the MLX Knife server."""
-    global _default_max_tokens
-    _default_max_tokens = max_tokens
-
-    print(f"Starting MLX Knife Server on http://{host}:{port}")
-    print(f"API docs available at http://{host}:{port}/docs")
-    print(f"Default max tokens: {'model-aware dynamic limits' if max_tokens is None else max_tokens}")
-
-    uvicorn.run(
-        "mlx_knife.server:app",
-        host=host,
-        port=port,
-        reload=reload,
-        log_level=log_level
-    )
diff --git a/mlx_knife/throttled_download_worker.py b/mlx_knife/throttled_download_worker.py
deleted file mode 100644
index 50b5b6e..0000000
--- a/mlx_knife/throttled_download_worker.py
+++ /dev/null
@@ -1,162 +0,0 @@
-import json
-import os
-import signal
-import sys
-import time
-from typing import Any
-
-# Global tracking for accurate download rate
-_download_stats = {
-    'bytes_downloaded': 0,
-    'start_time': None,
-    'last_update': None,
-    'actual_download_time': 0.0  # Time spent actually downloading (without delays)
-}
-
-
-def signal_handler(signum: int, frame: Any) -> None:
-    print("\n[WARNING] Download cancelled by user.")
-    sys.exit(0)
-
-signal.signal(signal.SIGINT, signal_handler)
-signal.signal(signal.SIGTERM, signal_handler)
-
-os.environ["HF_HUB_DOWNLOAD_THREADS"] = "1"
-os.environ["HF_HUB_DOWNLOAD_CHUNK_SIZE"] = "524288"  # 512KB chunks (half size)
-os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "false"
-
-try:
-    import requests
-    from huggingface_hub import snapshot_download
-except ImportError:
-    print("[ERROR] huggingface_hub or requests not installed in worker environment!")
-    sys.exit(2)
-
-# Throttle all HTTP(S) requests with adaptive delays
-original_get = requests.get
-original_post = requests.post
-
-def get_adaptive_delay(url: str, response: Any) -> float:
-    """Calculate delay based on file type and size"""
-    if not url:
-        return 1.0
-    
-    # Check if this is a large model file download
-    if any(ext in url.lower() for ext in ['.safetensors', '.bin', '.pth']):
-        # For large model files, use more aggressive throttling
-        content_length = response.headers.get('content-length')
-        if content_length:
-            size_mb = int(content_length) / (1024 * 1024)
-            if size_mb > 100:  # Files larger than 100MB
-                return 3.0     # 3 second delay between chunks
-            elif size_mb > 10: # Files larger than 10MB
-                return 2.0     # 2 second delay
-        return 2.0  # Default for model files
-    
-    # Regular files (config.json, tokenizer files, etc.)
-    return 0.5
-
-def throttled_get(*args: Any, **kwargs: Any) -> Any:
-    download_start = time.time()
-    response = original_get(*args, **kwargs)
-    download_end = time.time()
-    
-    # Track actual download time (without delays)
-    actual_download_time = download_end - download_start
-    _download_stats['actual_download_time'] += actual_download_time
-    
-    # Track bytes if we can determine them
-    url = args[0] if args else kwargs.get('url', '')
-    if hasattr(response, 'headers') and 'content-length' in response.headers:
-        content_length = int(response.headers['content-length'])
-        _download_stats['bytes_downloaded'] += content_length
-        
-        # Initialize timing if first download
-        if _download_stats['start_time'] is None:
-            _download_stats['start_time'] = download_start
-        
-        # Print accurate rate every ~5MB or every 10 seconds
-        now = time.time()
-        if (_download_stats['last_update'] is None or
-            now - _download_stats['last_update'] > 10 or
-            _download_stats['bytes_downloaded'] % (5 * 1024 * 1024) < content_length):
-            
-            if _download_stats['actual_download_time'] > 0:
-                real_rate_mbps = (_download_stats['bytes_downloaded'] / _download_stats['actual_download_time']) / (1024 * 1024)
-                total_mb = _download_stats['bytes_downloaded'] / (1024 * 1024)
-                print(f"[THROTTLE] Downloaded {total_mb:.1f}MB at real rate: {real_rate_mbps:.1f}MB/s (excluding delays)")
-            _download_stats['last_update'] = now
-    
-    delay = get_adaptive_delay(url, response)
-    time.sleep(delay)
-    return response
-
-def throttled_post(*args: Any, **kwargs: Any) -> Any:
-    response = original_post(*args, **kwargs)
-    time.sleep(0.5)
-    return response
-
-requests.get = throttled_get
-requests.post = throttled_post
-
-def main() -> None:
-    if len(sys.argv) != 2:
-        print("Usage: python throttled_download_worker.py <kwargs_file.json>")
-        sys.exit(1)
-
-    kwargs_file = sys.argv[1]
-    try:
-        with open(kwargs_file) as f:
-            kwargs_dict = json.load(f)
-    except Exception as e:
-        print(f"[ERROR] Could not read worker kwargs: {e}")
-        sys.exit(1)
-
-    try:
-        snapshot_download(**kwargs_dict)
-    except requests.exceptions.HTTPError as e:
-        status = getattr(e.response, "status_code", None)
-        url = getattr(e.response, "url", None)
-        if status == 401:
-            print(f"[ERROR] Unauthorized (401): Check your HuggingFace token or login.\nURL: {url}")
-            sys.exit(10)
-        elif status == 403:
-            print(f"[ERROR] Forbidden (403): Access denied.\nURL: {url}")
-            sys.exit(11)
-        elif status == 404:
-            print(f"[ERROR] Not Found (404): Resource does not exist.\nURL: {url}")
-            sys.exit(12)
-        else:
-            print(f"[ERROR] HTTP Error: {e}")
-            sys.exit(2)
-    except requests.exceptions.ConnectionError:
-        print("[ERROR] Network connection error. Please check your internet connection and try again.")
-        sys.exit(20)
-    except PermissionError as e:
-        print(f"[ERROR] Permission denied: {e.filename if hasattr(e, 'filename') else 'check file permissions'}")
-        print("   Ensure you have write access to the cache directory.")
-        sys.exit(13)
-    except OSError as e:
-        import errno
-        if e.errno == errno.ENOSPC:
-            print("[ERROR] No space left on device. Please free up disk space and try again.")
-            sys.exit(14)
-        elif e.errno == errno.EACCES:
-            print(f"[ERROR] Access denied: {e.filename if hasattr(e, 'filename') else 'check permissions'}")
-            sys.exit(13)
-        else:
-            print(f"[ERROR] OS Error during download: {e}")
-            sys.exit(15)
-    except Exception as e:
-        print(f"[ERROR] Unexpected error during download: {type(e).__name__}: {e}")
-        sys.exit(2)
-    finally:
-        try:
-            os.unlink(kwargs_file)
-        except Exception:
-            pass
-
-    sys.exit(0)
-
-if __name__ == "__main__":
-    main()
diff --git a/mlxk-demo.gif b/mlxk-demo.gif
index 1fe53fc..8ffb937 100644
Binary files a/mlxk-demo.gif and b/mlxk-demo.gif differ
diff --git a/mlxk-demo.tape b/mlxk-demo.tape
index a87aa5f..4c4e186 100644
--- a/mlxk-demo.tape
+++ b/mlxk-demo.tape
@@ -1,9 +1,9 @@
-# MLX Knife Demo – Mistral 7B 4‑bit
+# MLX Knife 2.0 Demo – Enhanced Human Output
 Output mlxk-demo.gif
 Set FontFamily "Menlo"
-Set FontSize 16
-Set Width 1000
-Set Height 400
+Set FontSize 13
+Set Width 800
+Set Height 500
 Set Padding 12
 Set Margin 0
 Set Theme OneHalfDark
@@ -12,34 +12,39 @@ Set PlaybackSpeed 1.0
 Set TypingSpeed 50ms
 
 # Intro
-Type "echo 'MLX Knife – quick demo'"
+Type "echo 'MLX Knife 2.0 – quick demo'"
 Enter
 Sleep 1200ms
 
-# 1) Health-Listing
-Type "mlxk list --health"
+# 1) Health-Listing (improved 2.0 output)
+Type "mlxk2 list --health"
 Enter
-Sleep 1400ms
+Sleep 1600ms
 
-# 2) start run
-Type "mlxk run Mistral-7B"
+# 2) Start interactive run (2.0 run command)
+Type "mlxk2 run gpt-oss-20b-MXFP4-Q8"
 Enter
 Sleep 2500ms
 
-# 3) enter prompt (short & brief)
+# 3) Enter prompt (show streaming)
 Type "Explain in three sentences how beam search works in LLMs."
 Enter
-Sleep 3200ms
+Sleep 3500ms
 
-# 4) leave chat
+# 4) Leave chat
 Type "exit"
 Enter
-Sleep 800ms
+Sleep 1000ms
 
-# 5) show model details
-Type "mlxk show Mistral-7B-Instruct-v0.2-4bit"
+# 5) Show model details (enhanced formatting)
+Type "mlxk2 show gpt-oss-20b-MXFP4-Q8"
 Enter
-Sleep 1200ms
+Sleep 1400ms
+
+# 6) Show JSON output capability
+Type "mlxk2 show  gpt-oss-20b-MXFP4-Q8 --json"
+Enter
+Sleep 1500ms
 
 # Ende
 Sleep 2000ms
\ No newline at end of file
diff --git a/mlxk2/NOTICE b/mlxk2/NOTICE
new file mode 100644
index 0000000..61d3ff0
--- /dev/null
+++ b/mlxk2/NOTICE
@@ -0,0 +1,5 @@
+MLX-Knife 2.0 (mlxk2)
+Copyright 2025 The BROKE team
+
+This product includes software developed by The BROKE team.
+Licensed under the Apache License, Version 2.0.
diff --git a/mlxk2/__init__.py b/mlxk2/__init__.py
new file mode 100644
index 0000000..76fd534
--- /dev/null
+++ b/mlxk2/__init__.py
@@ -0,0 +1,10 @@
+"""MLX-Knife - HuggingFace model management for MLX."""
+
+# Suppress urllib3 LibreSSL warning on macOS system Python 3.9
+# (must run before any imports that may indirectly import urllib3)
+import warnings
+
+# Issue parity with 1.1.0 (Issue #22)
+warnings.filterwarnings('ignore', message='urllib3 v2 only supports OpenSSL 1.1.1+')
+
+__version__ = "2.0.0"
diff --git a/mlxk2/cli.py b/mlxk2/cli.py
new file mode 100644
index 0000000..339f1e6
--- /dev/null
+++ b/mlxk2/cli.py
@@ -0,0 +1,376 @@
+#!/usr/bin/env python3
+"""MLX-Knife CLI - HuggingFace model management for MLX."""
+
+import argparse
+import json
+import os
+import sys
+from typing import Dict, Any
+
+from . import __version__
+from .operations.list import list_models
+from .operations.health import health_check_operation
+from .operations.pull import pull_operation
+from .operations.rm import rm_operation
+from .operations.push import push_operation
+from .operations.show import show_model_operation
+from .operations.run import run_model_enhanced
+from .spec import JSON_API_SPEC_VERSION
+from .output.human import (
+    render_list,
+    render_health,
+    render_show,
+    render_pull,
+    render_clone,
+    render_rm,
+)
+
+
+def format_json_output(data: Dict[str, Any]) -> str:
+    """Format output as JSON."""
+    return json.dumps(data, indent=2)
+
+
+def handle_error(error_type: str, message: str) -> Dict[str, Any]:
+    """Format error as JSON response."""
+    return {
+        "status": "error",
+        "command": None,
+        "data": None,
+        "error": {
+            "type": error_type,
+            "message": message
+        }
+    }
+
+
+class MLXKArgumentParser(argparse.ArgumentParser):
+    """ArgumentParser that prints JSON errors when --json is present.
+
+    This ensures invocations like `mlxk2 push --json --private` (missing args)
+    emit a JSON error instead of argparse usage text.
+    """
+
+    def error(self, message):  # type: ignore[override]
+        want_json = "--json" in sys.argv
+        if want_json:
+            err = handle_error("CommandError", message)
+            print(format_json_output(err))
+            self.exit(2)
+        super().error(message)
+
+
+def main():
+    """Main CLI entry point."""
+    parser = MLXKArgumentParser(
+        prog="mlxk2",
+        description="MLX-Knife - HuggingFace model management for MLX"
+    )
+    
+    # Add version argument (supports --json)
+    parser.add_argument("--version", action="store_true", help="Show version information and exit")
+    parser.add_argument("--json", action="store_true", help="Output in JSON format (with --version or per command)")
+    
+    subparsers = parser.add_subparsers(dest="command", help="Available commands", parser_class=MLXKArgumentParser)
+    
+    # List command
+    list_parser = subparsers.add_parser("list", help="List all cached models")
+    list_parser.add_argument("pattern", nargs="?", help="Filter models by pattern (optional)")
+    # Human-output modifiers (JSON output remains unchanged)
+    list_parser.add_argument("--all", action="store_true", dest="show_all", help="Show all details (human output)")
+    list_parser.add_argument("--health", action="store_true", dest="show_health", help="Include health column (human output)")
+    list_parser.add_argument("--verbose", action="store_true", help="Verbose details (human output)")
+    list_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+    
+    # Health command
+    health_parser = subparsers.add_parser("health", help="Check model health")
+    health_parser.add_argument("model", nargs="?", help="Model pattern to check (optional)")
+    health_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+    
+    # Show command
+    show_parser = subparsers.add_parser("show", help="Show detailed model information")
+    show_parser.add_argument("model", help="Model name to show")
+    show_parser.add_argument("--files", action="store_true", help="Include file listing")
+    show_parser.add_argument("--config", action="store_true", help="Include config.json content")
+    show_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+    
+    # Pull command
+    pull_parser = subparsers.add_parser("pull", help="Download a model")
+    pull_parser.add_argument("model", help="Model name to download")
+    pull_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+
+    # Clone command (alpha) - only show if alpha features enabled
+    if os.getenv("MLXK2_ENABLE_ALPHA_FEATURES"):
+        clone_parser = subparsers.add_parser("clone", help="ALPHA: Clone a model to a local workspace")
+        clone_parser.add_argument("model", help="Model name to clone (org/repo[@revision])")
+        clone_parser.add_argument("target_dir", help="Target directory for workspace")
+        clone_parser.add_argument("--branch", help="Specific branch/revision to clone")
+        clone_parser.add_argument("--no-health-check", action="store_true", help="Skip health validation before copy")
+        clone_parser.add_argument("--quiet", action="store_true", help="Suppress progress output")
+        clone_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+    
+    # Remove command
+    rm_parser = subparsers.add_parser("rm", help="Delete a model")
+    rm_parser.add_argument("model", help="Model name to delete")
+    rm_parser.add_argument("-f", "--force", action="store_true", help="Delete without confirmation")
+    rm_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+
+    # Run command
+    run_parser = subparsers.add_parser("run", help="Run model with prompt")
+    run_parser.add_argument("model", help="Model name to run")
+    run_parser.add_argument("prompt", nargs="?", help="Input prompt (optional - triggers interactive mode if omitted)")
+    run_parser.add_argument("--max-tokens", type=int, help="Maximum tokens to generate")
+    run_parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature (default: 0.7)")
+    run_parser.add_argument("--top-p", type=float, default=0.9, help="Top-p sampling parameter (default: 0.9)")
+    run_parser.add_argument("--repetition-penalty", type=float, default=1.1, help="Repetition penalty (default: 1.1)")
+    run_parser.add_argument("--no-stream", action="store_true", help="Disable streaming output")
+    run_parser.add_argument("--no-chat-template", action="store_true", help="Disable chat template")
+    run_parser.add_argument("--verbose", action="store_true", help="Show detailed output")
+    run_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+    # Future features (beta.2)
+    run_parser.add_argument("--system", help="System prompt (future feature)")
+    run_parser.add_argument("--hide-reasoning", action="store_true", help="Hide reasoning output (future feature)")
+
+    # Serve command (primary, ollama-compatible)
+    serve_parser = subparsers.add_parser("serve", help="Start OpenAI-compatible API server")
+    serve_parser.add_argument("--model", help="Specific model to pre-load (optional)")
+    serve_parser.add_argument("--port", type=int, default=8000, help="Port to bind server to (default: 8000)")
+    serve_parser.add_argument("--host", default="127.0.0.1", help="Host address to bind to (default: 127.0.0.1)")
+    serve_parser.add_argument("--max-tokens", type=int, help="Default maximum tokens for generation")
+    serve_parser.add_argument("--reload", action="store_true", help="Enable auto-reload for development")
+    serve_parser.add_argument("--log-level", default="info", help="Logging level (debug/info/warning/error, default: info)")
+    serve_parser.add_argument("--log-json", action="store_true", help="Output logs in JSON format (for log aggregation)")
+    serve_parser.add_argument("--verbose", action="store_true", help="Show detailed output")
+    serve_parser.add_argument("--json", action="store_true", help="Output startup info in JSON format")
+
+    # Server command (alias for backward compatibility with 1.x)
+    _ = subparsers.add_parser(
+        "server",
+        help="Start OpenAI-compatible API server (alias for serve)",
+        parents=[serve_parser],
+        add_help=False,
+    )
+
+    # Push command (alpha) - only show if alpha features enabled
+    if os.getenv("MLXK2_ENABLE_ALPHA_FEATURES"):
+        push_parser = subparsers.add_parser("push", help="ALPHA: Upload a local folder to Hugging Face")
+        push_parser.add_argument("local_dir", help="Local folder to upload")
+        push_parser.add_argument("repo_id", help="Target repo as org/model")
+        push_parser.add_argument("--create", action="store_true", help="Create repository/branch if missing")
+        # Alpha.1 safety: require --private to avoid accidental public uploads
+        push_parser.add_argument(
+            "--private",
+            action="store_true",
+            required=True,
+            help="REQUIRED (alpha.1): Proceed only when targeting a private repo",
+        )
+        push_parser.add_argument("--branch", default="main", help="Target branch (default: main)")
+        push_parser.add_argument("--commit", dest="commit_message", default="mlx-knife push", help="Commit message")
+        push_parser.add_argument("--verbose", action="store_true", help="Verbose details (human output)")
+        push_parser.add_argument("--check-only", action="store_true", help="Analyze workspace content; do not upload")
+        push_parser.add_argument("--dry-run", action="store_true", help="Compute changes against remote; do not upload")
+        push_parser.add_argument("--json", action="store_true", help="Output in JSON format")
+    
+    args = parser.parse_args()
+    
+    try:
+        # Handle top-level version first
+        if args.version:
+            if args.json:
+                result = {
+                    "status": "success",
+                    "command": "version",
+                    "data": {
+                        "cli_version": __version__,
+                        "json_api_spec_version": JSON_API_SPEC_VERSION,
+                    },
+                    "error": None,
+                }
+                print(format_json_output(result))
+            else:
+                print(f"mlxk2 {__version__}")
+            sys.exit(0)
+
+        # Initialize result for all paths
+        result = None
+        
+        # Execute command and render per mode
+        if args.command == "list":
+            result = list_models(pattern=args.pattern)
+            if args.json:
+                print(format_json_output(result))
+            else:
+                show_health = getattr(args, "show_health", False)
+                show_all = getattr(args, "show_all", False)
+                verbose = getattr(args, "verbose", False)
+                print(render_list(result, show_health=show_health, show_all=show_all, verbose=verbose))
+        elif args.command == "health":
+            result = health_check_operation(args.model)
+            if args.json:
+                print(format_json_output(result))
+            else:
+                print(render_health(result))
+        elif args.command == "show":
+            result = show_model_operation(args.model, args.files, args.config)
+            if args.json:
+                print(format_json_output(result))
+            else:
+                print(render_show(result))
+        elif args.command == "pull":
+            result = pull_operation(args.model)
+            if args.json:
+                print(format_json_output(result))
+            else:
+                print(render_pull(result))
+        elif args.command == "clone":
+            # Check if alpha features are enabled (should not reach here if not, but double-check)
+            if not os.getenv("MLXK2_ENABLE_ALPHA_FEATURES"):
+                result = handle_error("CommandError", "Clone command requires MLXK2_ENABLE_ALPHA_FEATURES=1")
+                print(format_json_output(result))
+                sys.exit(1)
+
+            # Handle branch parameter by modifying model spec
+            model_spec = args.model
+            if getattr(args, "branch", None):
+                # If --branch is provided, append it to model spec
+                model_spec = f"{args.model}@{args.branch}"
+
+            from .operations.clone import clone_operation
+            result = clone_operation(
+                model_spec=model_spec,
+                target_dir=args.target_dir,
+                health_check=not getattr(args, "no_health_check", False)
+            )
+            if args.json:
+                print(format_json_output(result))
+            else:
+                print(render_clone(result, quiet=getattr(args, "quiet", False)))
+        elif args.command == "rm":
+            result = rm_operation(args.model, args.force)
+            if args.json:
+                print(format_json_output(result))
+            else:
+                print(render_rm(result))
+        elif args.command == "run":
+            # Handle run command with proper parameter mapping
+            result_text = run_model_enhanced(
+                model_spec=args.model,
+                prompt=args.prompt,  # Can be None for interactive mode
+                stream=not args.no_stream,
+                max_tokens=getattr(args, "max_tokens", None),
+                temperature=args.temperature,
+                top_p=getattr(args, "top_p", 0.9),
+                repetition_penalty=getattr(args, "repetition_penalty", 1.1),
+                use_chat_template=not getattr(args, "no_chat_template", False),
+                json_output=args.json,
+                verbose=getattr(args, "verbose", False),
+                system_prompt=getattr(args, "system", None),
+                hide_reasoning=getattr(args, "hide_reasoning", False)
+            )
+            
+            # For JSON output, wrap result in standard format (only for single-shot mode)
+            if args.json and result_text is not None and args.prompt is not None:
+                result = {
+                    "status": "success",
+                    "command": "run",
+                    "data": {
+                        "model": args.model,
+                        "prompt": args.prompt,
+                        "response": result_text
+                    },
+                    "error": None
+                }
+                print(format_json_output(result))
+            else:
+                # For non-JSON or interactive mode, set success result
+                result = {"status": "success"}
+        elif args.command in ["serve", "server"]:  # Handle both serve and server aliases
+            # Handle serve command
+            if args.json:
+                # JSON startup info
+                server_info = {
+                    "status": "starting",
+                    "command": "serve",
+                    "data": {
+                        "host": args.host,
+                        "port": args.port,
+                        "model": getattr(args, "model", None),
+                        "max_tokens": getattr(args, "max_tokens", None),
+                    },
+                    "error": None
+                }
+                print(format_json_output(server_info))
+            
+            # Set MLXK2_LOG_JSON if --log-json flag is present
+            if getattr(args, "log_json", False):
+                os.environ["MLXK2_LOG_JSON"] = "1"
+
+            # Start server (this will run indefinitely)
+            # Lazy import to avoid hard dependency on FastAPI/uvicorn at import time
+            from .operations.serve import start_server
+            start_server(
+                model=getattr(args, "model", None),
+                port=args.port,
+                host=args.host,
+                max_tokens=getattr(args, "max_tokens", None),
+                reload=getattr(args, "reload", False),
+                log_level=getattr(args, "log_level", "info"),
+                verbose=getattr(args, "verbose", False),
+                supervise=True
+            )
+            
+            # Should never reach here (server runs indefinitely)
+            result = {"status": "success"}
+        elif args.command == "push":
+            # Check if alpha features are enabled (should not reach here if not, but double-check)
+            if not os.getenv("MLXK2_ENABLE_ALPHA_FEATURES"):
+                result = handle_error("CommandError", "Push command requires MLXK2_ENABLE_ALPHA_FEATURES=1")
+                print(format_json_output(result))
+                sys.exit(1)
+            result = push_operation(
+                local_dir=args.local_dir,
+                repo_id=args.repo_id,
+                create=getattr(args, "create", False),
+                private=getattr(args, "private", False),
+                branch=getattr(args, "branch", None),
+                commit_message=getattr(args, "commit_message", None),
+                check_only=getattr(args, "check_only", False),
+                dry_run=getattr(args, "dry_run", False),
+                # Quiet mode: when emitting JSON without --verbose, suppress hub progress/log noise
+                quiet=(getattr(args, "json", False) and not getattr(args, "verbose", False)),
+            )
+            if args.json:
+                print(format_json_output(result))
+            else:
+                from .output.human import render_push
+                print(render_push(result, verbose=getattr(args, "verbose", False)))
+        elif args.command is None:
+            # No command specified - show help or JSON error depending on --json flag
+            if args.json:
+                result = handle_error("CommandError", "No command specified")
+                print(format_json_output(result))
+                sys.exit(1)
+            else:
+                parser.print_help()
+                sys.exit(2)
+        else:
+            # Unknown command - show help or JSON error depending on --json flag
+            if args.json:
+                result = handle_error("CommandError", f"Unknown command: {args.command}")
+                print(format_json_output(result))
+                sys.exit(1)
+            else:
+                parser.print_help()
+                sys.exit(2)
+
+        # Exit with appropriate code (only reached for successful commands)
+        sys.exit(0 if result.get("status") == "success" else 1)
+            
+    except Exception as e:
+        error_result = handle_error("InternalError", str(e))
+        print(format_json_output(error_result))
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/mlxk2/context.py b/mlxk2/context.py
new file mode 100644
index 0000000..b1d3382
--- /dev/null
+++ b/mlxk2/context.py
@@ -0,0 +1,76 @@
+"""
+Request context management for MLX Knife 2.0 (ADR-004).
+
+Provides request_id (UUID4) generation and propagation across requests.
+"""
+
+import uuid
+from contextvars import ContextVar
+from typing import Optional
+
+
+# Context variable for request_id (thread-safe for async)
+_request_id_context: ContextVar[Optional[str]] = ContextVar('request_id', default=None)
+
+
+def generate_request_id() -> str:
+    """Generate a new request ID (UUID4).
+
+    Returns:
+        String UUID4 (e.g., "550e8400-e29b-41d4-a716-446655440000")
+    """
+    return str(uuid.uuid4())
+
+
+def set_request_id(request_id: str) -> None:
+    """Set the current request ID in context.
+
+    Args:
+        request_id: UUID string to set as current request ID
+    """
+    _request_id_context.set(request_id)
+
+
+def get_request_id() -> Optional[str]:
+    """Get the current request ID from context.
+
+    Returns:
+        Current request ID, or None if not set
+    """
+    return _request_id_context.get()
+
+
+def clear_request_id() -> None:
+    """Clear the current request ID from context."""
+    _request_id_context.set(None)
+
+
+class RequestContext:
+    """Context manager for request_id lifecycle.
+
+    Usage:
+        with RequestContext() as request_id:
+            # request_id is available via get_request_id()
+            do_work()
+        # request_id is cleared on exit
+    """
+
+    def __init__(self, request_id: Optional[str] = None):
+        """Initialize context manager.
+
+        Args:
+            request_id: Optional existing request ID, or generate new one
+        """
+        self.request_id = request_id or generate_request_id()
+        self._previous_id: Optional[str] = None
+
+    def __enter__(self) -> str:
+        """Enter context and set request_id."""
+        self._previous_id = get_request_id()
+        set_request_id(self.request_id)
+        return self.request_id
+
+    def __exit__(self, _exc_type, _exc_val, _exc_tb):
+        """Exit context and restore previous request_id."""
+        set_request_id(self._previous_id)
+        return False  # Don't suppress exceptions
diff --git a/mlxk2/core/__init__.py b/mlxk2/core/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/mlxk2/core/cache.py b/mlxk2/core/cache.py
new file mode 100644
index 0000000..3a72534
--- /dev/null
+++ b/mlxk2/core/cache.py
@@ -0,0 +1,90 @@
+"""Cache management for MLX-Knife 2.0."""
+
+import os
+from pathlib import Path
+
+# Cache path constants - copied from mlx_knife/cache_utils.py
+DEFAULT_CACHE_ROOT = Path.home() / ".cache/huggingface"
+
+
+def get_current_cache_root() -> Path:
+    """Get current cache root (respects runtime HF_HOME changes)."""
+    return Path(os.environ.get("HF_HOME", DEFAULT_CACHE_ROOT))
+
+
+def get_current_model_cache() -> Path:
+    """Get current model cache path (respects runtime HF_HOME changes)."""
+    return get_current_cache_root() / "hub"
+
+
+def _is_likely_test_cache(path: Path) -> bool:
+    """Heuristic to detect test caches safely on macOS tmp layouts.
+
+    Rules:
+    - Lives under system temp (e.g., /var/folders/)
+    - Contains our temp prefix marker 'mlxk2_test_'
+    """
+    s = str(path)
+    return "/var/folders/" in s and "mlxk2_test_" in s
+
+
+def _is_likely_user_cache(path: Path) -> bool:
+    """Heuristic to detect a non-test (user) cache.
+
+    We avoid site-specific paths. Treat anything that's NOT a test cache
+    as user cache for safety checks.
+    """
+    return not _is_likely_test_cache(path)
+
+
+def verify_cache_context(expected: str = "test"):
+    """Verify the current model cache matches the expected context.
+
+    - expected="test": assert test-like temp cache
+    - expected="user": assert project user cache convention
+    """
+    current_cache = get_current_model_cache()
+    if expected == "test":
+        if not _is_likely_test_cache(current_cache):
+            raise RuntimeError(f"Expected test cache, but using: {current_cache}")
+    elif expected == "user":
+        if not _is_likely_user_cache(current_cache):
+            raise RuntimeError(f"Expected user cache, but using: {current_cache}")
+    else:
+        raise ValueError(f"Unknown cache context: {expected}")
+
+
+# Legacy globals - DEPRECATED: Use get_current_*() functions for consistency
+CACHE_ROOT = get_current_cache_root()
+MODEL_CACHE = get_current_model_cache()
+
+
+def hf_to_cache_dir(hf_name: str) -> str:
+    """Convert HuggingFace model name to cache directory name.
+    
+    Universal rule: ALL "/" become "--" (mechanical conversion).
+    """
+    if hf_name.startswith("models--"):
+        return hf_name
+    
+    # Replace all "/" with "--" for universal conversion
+    converted = hf_name.replace("/", "--")
+    return f"models--{converted}"
+
+
+def cache_dir_to_hf(cache_name: str) -> str:
+    """Convert cache directory name to HuggingFace model name.
+    
+    Universal rule: ALL "--" become "/" (mechanical conversion).
+    This handles both clean names and corrupted cache entries gracefully.
+    """
+    if cache_name.startswith("models--"):
+        remaining = cache_name[len("models--"):]
+        return remaining.replace("--", "/")
+    return cache_name
+
+
+def get_model_path(hf_name: str) -> Path:
+    """Get the full path to a model in the cache."""
+    cache_dir = hf_to_cache_dir(hf_name)
+    return MODEL_CACHE / cache_dir
diff --git a/mlxk2/core/model_resolution.py b/mlxk2/core/model_resolution.py
new file mode 100644
index 0000000..4d6838f
--- /dev/null
+++ b/mlxk2/core/model_resolution.py
@@ -0,0 +1,123 @@
+"""Model name resolution and expansion for MLX-Knife 2.0."""
+
+from pathlib import Path
+from typing import Tuple, Optional, List
+from .cache import get_current_model_cache, hf_to_cache_dir, cache_dir_to_hf
+
+
+def expand_model_name(model_name: str) -> str:
+    """Expand short model names, preferring mlx-community if it exists."""
+    if "/" in model_name:
+        return model_name
+    
+    # Only try mlx-community if it actually exists
+    mlx_candidate = f"mlx-community/{model_name}"
+    model_cache = get_current_model_cache()
+    mlx_cache_dir = model_cache / hf_to_cache_dir(mlx_candidate)
+    if mlx_cache_dir.exists():
+        return mlx_candidate
+    
+    # Otherwise return as-is (no pattern forcing!)
+    return model_name
+
+
+def parse_model_spec(model_spec: str) -> Tuple[str, Optional[str]]:
+    """Parse model specification with optional @hash syntax.
+    
+    Examples:
+        'Phi-3-mini' → ('mlx-community/Phi-3-mini-4k-instruct-4bit', None)
+        'Qwen3@e96' → ('Qwen/Qwen3-Coder-480B-A35B-Instruct', 'e96')
+    """
+    if "@" in model_spec:
+        model_name, commit_hash = model_spec.rsplit("@", 1)
+        expanded_name = expand_model_name(model_name)
+        return expanded_name, commit_hash
+    
+    expanded_name = expand_model_name(model_spec)
+    return expanded_name, None
+
+
+def find_matching_models(pattern: str) -> List[Tuple[Path, str]]:
+    """Find models that match a partial pattern (case-insensitive)."""
+    model_cache = get_current_model_cache()
+    if not model_cache.exists():
+        return []
+        
+    all_models = [d for d in model_cache.iterdir() if d.name.startswith("models--")]
+    matches = []
+    
+    for model_dir in all_models:
+        hf_name = cache_dir_to_hf(model_dir.name)
+        # Case-insensitive partial matching in full name or short name
+        short_name = hf_name.split('/')[-1] if '/' in hf_name else hf_name
+        
+        if (pattern.lower() in hf_name.lower() or 
+            pattern.lower() in short_name.lower()):
+            matches.append((model_dir, hf_name))
+    
+    return matches
+
+
+def find_model_by_hash(pattern: str, commit_hash: str) -> Optional[Tuple[Path, str, str]]:
+    """Find model by pattern and verify hash exists in snapshots.
+    
+    Returns: (model_dir, hf_name, full_hash) or None
+    """
+    matches = find_matching_models(pattern)
+    
+    for model_dir, hf_name in matches:
+        snapshots_dir = model_dir / "snapshots"
+        if not snapshots_dir.exists():
+            continue
+            
+        # Check for hash match (short hash support)
+        for snapshot_dir in snapshots_dir.iterdir():
+            if snapshot_dir.is_dir() and snapshot_dir.name.startswith(commit_hash):
+                return model_dir, hf_name, snapshot_dir.name
+    
+    return None
+
+
+def resolve_model_for_operation(model_spec: str) -> Tuple[Optional[str], Optional[str], Optional[List[str]]]:
+    """Resolve model specification for operations.
+    
+    Returns:
+        (resolved_name, commit_hash, ambiguous_matches)
+        
+    Examples:
+        'Phi-3-mini' → ('mlx-community/Phi-3-mini-4k-instruct-4bit', None, None)
+        'Qwen3@e96' → ('Qwen/Qwen3-Coder-480B-A35B-Instruct', 'e96', None) 
+        'ambig' → (None, None, ['model1', 'model2'])
+    """
+    model_name, commit_hash = parse_model_spec(model_spec)
+    
+    # For @hash syntax, find by pattern + hash verification
+    if commit_hash:
+        base_pattern = model_spec.split('@')[0]
+        result = find_model_by_hash(base_pattern, commit_hash)
+        if result:
+            model_dir, hf_name, full_hash = result
+            return hf_name, full_hash, None
+        else:
+            return None, commit_hash, []
+    
+    # Try exact match first
+    model_cache = get_current_model_cache()
+    exact_cache_dir = model_cache / hf_to_cache_dir(model_name)
+    if exact_cache_dir.exists():
+        return model_name, None, None
+    
+    # Try fuzzy matching
+    base_pattern = model_spec.split('@')[0] if '@' in model_spec else model_spec
+    matches = find_matching_models(base_pattern)
+    
+    if not matches:
+        return None, None, []
+    elif len(matches) == 1:
+        # Unambiguous fuzzy match
+        model_dir, hf_name = matches[0]
+        return hf_name, commit_hash, None
+    else:
+        # Ambiguous matches
+        match_names = [hf_name for _, hf_name in matches]
+        return None, commit_hash, match_names
\ No newline at end of file
diff --git a/mlx_knife/reasoning_utils.py b/mlxk2/core/reasoning.py
similarity index 99%
rename from mlx_knife/reasoning_utils.py
rename to mlxk2/core/reasoning.py
index 0c3c624..febf837 100644
--- a/mlx_knife/reasoning_utils.py
+++ b/mlxk2/core/reasoning.py
@@ -1,6 +1,8 @@
 """
 Utilities for handling reasoning models and their output.
 
+Ported from 1.x mlx_knife/reasoning_utils.py for 2.0 compatibility.
+
 Different models use different formats for reasoning:
 - MXFP4/GPT-OSS: <|channel|>analysis<|message|>REASONING<|end|>...<|channel|>final<|message|>ANSWER
 - DeepSeek R1: <think>REASONING</think>ANSWER
@@ -406,4 +408,4 @@ class StreamingReasoningParser:
                 yield self.buffer
             elif self.state == "IN_FINAL":
                 # Final answer content
-                yield self.buffer
+                yield self.buffer
\ No newline at end of file
diff --git a/mlxk2/core/runner/__init__.py b/mlxk2/core/runner/__init__.py
new file mode 100644
index 0000000..ee658c8
--- /dev/null
+++ b/mlxk2/core/runner/__init__.py
@@ -0,0 +1,644 @@
+"""
+MLX model runner for 2.0 implementation.
+Ported from 1.x mlx_knife/mlx_runner.py with 2.0 architecture integration.
+
+Refactor: packaged as mlxk2.core.runner with helper modules for
+- token limits, chat formatting, reasoning formatting, and stop tokens.
+Behavior is unchanged; public API and patch points are preserved.
+"""
+
+import time
+import signal
+from collections.abc import Iterator
+from pathlib import Path
+from typing import Optional
+
+from ..cache import get_current_model_cache, hf_to_cache_dir
+from ..model_resolution import resolve_model_for_operation
+from ..reasoning import ReasoningExtractor, StreamingReasoningParser
+from .token_limits import get_model_context_length, calculate_dynamic_max_tokens
+from .chat_format import apply_user_prompt, format_conversation as _format_conversation_helper
+from .reasoning_format import format_reasoning_response as _format_reasoning_helper
+from .stop_tokens import extract_stop_tokens as _extract_stop_tokens_helper
+
+# Defer MLX/MLX-LM imports to runtime to avoid init crashes during test collection
+mx = None  # type: ignore[assignment]
+# Expose patchable names for tests (set by tests or lazily inside methods)
+load = None  # type: ignore[assignment]
+generate_step = None  # type: ignore[assignment]
+make_repetition_penalty = None  # type: ignore[assignment]
+make_sampler = None  # type: ignore[assignment]
+
+
+# get_model_context_length is re-exported from token_limits
+
+
+class MLXRunner:
+    """Core MLX model execution engine for 2.0."""
+
+    def __init__(self, model_name_or_path: str, adapter_path: Optional[str] = None, verbose: bool = False,
+                 install_signal_handlers: bool = True):
+        """Initialize the runner with a model.
+        
+        Args:
+            model_name_or_path: Model specification or path
+            adapter_path: Optional path to LoRA adapter
+            verbose: Show detailed output
+            install_signal_handlers: Whether to install SIGINT handler (disable for server mode)
+        """
+        self.model_spec = model_name_or_path
+        self.adapter_path = adapter_path
+        self.model = None
+        self.tokenizer = None
+        self._memory_baseline = None
+        self._stop_tokens = None
+        self._chat_stop_tokens = None
+        self._context_length = None
+        self._is_reasoning_model = False
+        self._reasoning_start = None
+        self._reasoning_end = None
+        self._final_start = None
+        self.verbose = verbose
+        self._model_loaded = False
+        self._context_entered = False
+        self._interrupted = False
+        self._current_generator = None  # Handle to in-flight generation (for early cancellation)
+        
+        # Lazy-loaded MLX/MLX-LM refs (set in load_model / generation)
+        self._mx = None
+        self._load = None
+        self._generate_step = None
+        self._make_repetition_penalty = None
+        self._make_sampler = None
+        
+        # Set up signal handler for Ctrl-C (only for run/interactive mode)
+        if install_signal_handlers:
+            signal.signal(signal.SIGINT, self._handle_interrupt)
+
+    def __enter__(self):
+        """Context manager entry - loads the model."""
+        if self._context_entered:
+            raise RuntimeError("MLXRunner context manager cannot be entered multiple times")
+        
+        self._context_entered = True
+        try:
+            self.load_model()
+            return self
+        except Exception:
+            self._context_entered = False
+            self.cleanup()
+            raise
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        """Context manager exit - cleans up the model."""
+        self._context_entered = False
+        self.cleanup()
+        return False
+
+    def _handle_interrupt(self, signum, frame):
+        """Handle Ctrl-C interruption during generation."""
+        self._interrupted = True
+
+    def request_interrupt(self) -> None:
+        """Request an interruption from external controller (e.g., server signal).
+
+        This sets the internal interruption flag so that ongoing generation loops
+        will stop promptly at the next safe check point. Intended for server mode
+        where per-runner OS signal handlers are disabled.
+        """
+        self._interrupted = True
+        # Attempt to close any in-flight generator immediately to stop compute
+        gen = getattr(self, "_current_generator", None)
+        if gen is not None:
+            try:
+                close = getattr(gen, "close", None)
+                if callable(close):
+                    close()
+            except Exception:
+                pass
+
+    def load_model(self):
+        """Load the MLX model and tokenizer."""
+        if self._model_loaded:
+            if self.verbose:
+                print("Model already loaded, skipping...")
+            return
+
+        # Lazy import MLX and MLX-LM here
+        try:
+            import mlx.core as _mx  # type: ignore
+        except Exception as e:
+            raise RuntimeError(f"Failed to import MLX core: {e}") from e
+        # Prefer test-patched load if available
+        _load = globals().get('load')
+        if _load is None:
+            try:
+                from mlx_lm import load as _load  # type: ignore
+            except Exception as e:
+                raise RuntimeError(f"Failed to import MLX-LM load(): {e}") from e
+
+        # Resolve model path using 2.0 resolution
+        resolved_name, commit_hash, ambiguous = resolve_model_for_operation(self.model_spec)
+        
+        if ambiguous:
+            raise ValueError(f"Ambiguous model specification '{self.model_spec}'. Could be: {ambiguous}")
+        
+        if not resolved_name:
+            # In tests, resolution may be bypassed; fall back to provided spec
+            resolved_name = str(self.model_spec)
+        
+        model_cache = get_current_model_cache()
+        # Support tests that patch cache to a Mock by avoiding Path ops
+        is_path_like = isinstance(model_cache, (str, Path)) or all(
+            hasattr(model_cache, attr) for attr in ("__truediv__",)
+        )
+
+        if not resolved_name:
+            # Fallback to provided spec (tests may patch load() to accept any path)
+            resolved_name = str(self.model_spec)
+
+        if is_path_like:
+            model_cache_dir = (Path(model_cache) if not isinstance(model_cache, Path) else model_cache) / hf_to_cache_dir(resolved_name)
+            if commit_hash:
+                model_path = model_cache_dir / "snapshots" / commit_hash
+            else:
+                # Try to find a snapshot directory; tolerate missing during tests
+                snapshots_dir = model_cache_dir / "snapshots"
+                if snapshots_dir.exists():
+                    snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+                    model_path = snapshots[0] if snapshots else snapshots_dir / "mock"
+                else:
+                    model_path = snapshots_dir / "mock"
+        else:
+            # Non path-like cache (likely a Mock in unit tests) → pass a synthetic path to load()
+            model_path = Path("/mock") / hf_to_cache_dir(resolved_name) / "snapshots" / (commit_hash or "mock")
+
+        if self.verbose:
+            print(f"Loading model from {model_path}...")
+        start_time = time.time()
+
+        # Capture baseline memory before loading
+        try:
+            _mx.clear_cache()
+        except Exception:
+            pass
+        self._memory_baseline = _mx.get_active_memory() / 1024**3
+
+        try:
+            # Load model and tokenizer
+            self.model, self.tokenizer = _load(
+                str(model_path),
+                adapter_path=self.adapter_path
+            )
+
+            load_time = time.time() - start_time
+            current_memory = _mx.get_active_memory() / 1024**3
+            model_memory = current_memory - self._memory_baseline
+
+            if self.verbose:
+                print(f"Model loaded in {load_time:.1f}s")
+                print(f"Memory: {model_memory:.1f}GB model, {current_memory:.1f}GB total")
+
+            # Extract stop tokens and other properties
+            self._extract_stop_tokens()
+            self._context_length = get_model_context_length(str(model_path))
+            
+            if self.verbose:
+                print(f"Model context length: {self._context_length} tokens")
+                
+            self._model_loaded = True
+            # Store MLX refs for later use
+            self._mx = _mx
+            self._load = _load  # type: ignore
+            
+        except Exception as e:
+            self.model = None
+            self.tokenizer = None
+            self._stop_tokens = None
+            self._model_loaded = False
+            try:
+                _mx.clear_cache()
+            except Exception:
+                pass
+            # Preserve FileNotFoundError (used by tests) and propagate
+            if isinstance(e, FileNotFoundError):
+                raise e
+            raise RuntimeError(f"Failed to load model from {model_path}: {e}") from e
+
+    def _extract_stop_tokens(self):
+        """Extract stop tokens from the tokenizer dynamically (delegated)."""
+        info = _extract_stop_tokens_helper(self.tokenizer, verbose=self.verbose)
+        self._stop_tokens = info.stop_tokens
+        self._chat_stop_tokens = info.chat_stop_tokens
+        self._is_reasoning_model = info.is_reasoning_model
+        self._reasoning_start = info.reasoning_start
+        self._reasoning_end = info.reasoning_end
+        self._final_start = info.final_start
+        if self.verbose and self._stop_tokens:
+            print(f"Stop tokens: {self._stop_tokens}")
+        if self.verbose and self._is_reasoning_model:
+            print("Reasoning model detected - special handling enabled")
+
+    def cleanup(self):
+        """Clean up model resources and clear GPU memory."""
+        mx_core = self._mx
+        if self.verbose and self._model_loaded and mx_core is not None:
+            memory_before = mx_core.get_active_memory() / 1024**3
+            print(f"Cleaning up model (memory before: {memory_before:.1f}GB)...")
+
+        self.model = None
+        self.tokenizer = None
+        self._stop_tokens = None
+        self._chat_stop_tokens = None
+        self._context_length = None
+        self._is_reasoning_model = False
+        self._reasoning_start = None
+        self._reasoning_end = None
+        self._final_start = None
+        self._model_loaded = False
+
+        # Force garbage collection and clear MLX cache
+        import gc
+        gc.collect()
+        try:
+            mx.clear_cache()
+        except Exception:
+            pass
+
+        if self.verbose and mx_core is not None:
+            memory_after = mx_core.get_active_memory() / 1024**3
+            if 'memory_before' in locals():
+                memory_freed = memory_before - memory_after
+                print(f"Cleanup complete (memory after: {memory_after:.1f}GB, freed: {memory_freed:.1f}GB)")
+            else:
+                print(f"Cleanup complete (memory after: {memory_after:.1f}GB)")
+
+    def _calculate_dynamic_max_tokens(self, server_mode: bool = True) -> int:
+        """Calculate dynamic max tokens based on model context and usage mode."""
+        return calculate_dynamic_max_tokens(self._context_length, server_mode=server_mode)
+
+    def generate_streaming(
+        self,
+        prompt: str,
+        max_tokens: Optional[int] = None,
+        temperature: float = 0.7,
+        top_p: float = 0.9,
+        repetition_penalty: float = 1.1,
+        repetition_context_size: int = 20,
+        use_chat_template: bool = True,
+        use_chat_stop_tokens: bool = False,
+        hide_reasoning: bool = False,
+    ) -> Iterator[str]:
+        """Generate text with streaming output.
+        
+        Args:
+            prompt: Input prompt
+            max_tokens: Maximum tokens to generate (None for dynamic)
+            temperature: Sampling temperature
+            top_p: Top-p sampling parameter
+            repetition_penalty: Penalty for repeated tokens
+            repetition_context_size: Context size for repetition penalty
+            use_chat_template: Apply tokenizer's chat template if available
+            use_chat_stop_tokens: Include chat turn markers as stop tokens
+            hide_reasoning: Hide reasoning section for reasoning models
+            
+        Yields:
+            Generated tokens as they are produced
+        """
+        if not self.model or not self.tokenizer:
+            raise RuntimeError("Model not loaded. Call load_model() first.")
+        
+        # Reset any prior interruption at the start of a new generation
+        # so that a previous Ctrl-C does not affect the next run
+        self._interrupted = False
+
+        # Initialize reasoning parser if this is a reasoning model
+        reasoning_parser = None
+        if self._is_reasoning_model:
+            model_type = ReasoningExtractor.detect_model_type(
+                getattr(self.tokenizer, 'name_or_path', '') or ''
+            )
+            reasoning_parser = StreamingReasoningParser(model_type, hide_reasoning=hide_reasoning)
+
+        # Use dynamic max tokens if not specified (run command uses full context)
+        effective_max_tokens = max_tokens if max_tokens is not None else self._calculate_dynamic_max_tokens(server_mode=False)
+
+        # Apply chat template if available and requested
+        formatted_prompt = apply_user_prompt(self.tokenizer, prompt, use_chat_template=use_chat_template)
+
+        # Tokenize the prompt (tolerate mocks)
+        prompt_tokens = self.tokenizer.encode(formatted_prompt)
+        if not isinstance(prompt_tokens, (list, tuple)):
+            prompt_tokens = [0]
+        # Ensure MLX core is available
+        mx_core = self._mx
+        if mx_core is None:
+            try:
+                import mlx.core as mx_core  # type: ignore
+                self._mx = mx_core
+            except Exception as e:
+                raise RuntimeError(f"Failed to import mlx.core for generation: {e}") from e
+        prompt_array = mx_core.array(prompt_tokens)
+
+        # Track generation metrics
+        start_time = time.time()
+        tokens_generated = 0
+
+        # Create sampler and logits processors
+        # Lazy import generation utilities
+        if self._make_sampler is None or self._make_repetition_penalty is None or self._generate_step is None:
+            # Prefer test-patched functions if present
+            _ms = globals().get('make_sampler')
+            _mrp = globals().get('make_repetition_penalty')
+            _gs = globals().get('generate_step')
+            if _ms is None or _mrp is None or _gs is None:
+                try:
+                    from mlx_lm.sample_utils import make_repetition_penalty as _mrp2, make_sampler as _ms2  # type: ignore
+                    from mlx_lm.generate import generate_step as _gs2  # type: ignore
+                    _mrp = _mrp or _mrp2
+                    _ms = _ms or _ms2
+                    _gs = _gs or _gs2
+                except Exception as e:
+                    raise RuntimeError(f"Failed to import MLX-LM generation utils: {e}") from e
+            self._make_repetition_penalty = _mrp
+            self._make_sampler = _ms
+            self._generate_step = _gs
+
+        sampler = self._make_sampler(temp=temperature, top_p=top_p)
+        logits_processors = []
+        if repetition_penalty > 1.0:
+            logits_processors.append(
+                self._make_repetition_penalty(repetition_penalty, repetition_context_size)
+            )
+
+        # Generate tokens one by one for streaming
+        ret = self._generate_step(
+            prompt=prompt_array,
+            model=self.model,
+            max_tokens=effective_max_tokens,
+            sampler=sampler,
+            logits_processors=logits_processors if logits_processors else None,
+        )
+        generator = ret
+        if isinstance(ret, tuple) and len(ret) == 2:
+            # Normalize tuple return into a single-step iterator
+            generator = iter([ret])
+        self._current_generator = generator
+
+        # Collect and yield tokens
+        generated_tokens = []
+        previous_decoded = ""
+        accumulated_response = ""
+        context_window = 10
+
+        for token, _ in generator:
+            # Check for interruption
+            if self._interrupted:
+                # Close underlying generator to stop backend compute quickly
+                try:
+                    if hasattr(generator, "close"):
+                        generator.close()
+                except Exception:
+                    pass
+                yield "\n[Generation interrupted by user]"
+                break
+
+            token_id = token.item() if hasattr(token, 'item') else token
+            generated_tokens.append(token_id)
+
+            # Use sliding window for proper decoding
+            start_idx = max(0, len(generated_tokens) - context_window)
+            window_tokens = generated_tokens[start_idx:]
+            window_text = self.tokenizer.decode(window_tokens)
+
+            # Extract new text
+            if start_idx == 0:
+                # Prefer using the decoded window and diff vs previous text
+                if previous_decoded and window_text.startswith(previous_decoded):
+                    new_text = window_text[len(previous_decoded):]
+                else:
+                    # Fallback: take the window_text directly (robust to minimal mocks)
+                    new_text = window_text
+                previous_decoded = window_text
+            else:
+                new_text = self.tokenizer.decode(window_tokens)
+                if len(window_tokens) > 1:
+                    prefix = self.tokenizer.decode(window_tokens[:-1])
+                    if new_text.startswith(prefix):
+                        new_text = new_text[len(prefix):]
+                    else:
+                        new_text = self.tokenizer.decode([token_id])
+
+            if new_text:
+                accumulated_response += new_text
+                
+                # Check for stop tokens (strings only)
+                stop_tokens_to_check = self._stop_tokens if self._stop_tokens else []
+                stop_tokens_to_check = [t for t in stop_tokens_to_check if isinstance(t, str) and t]
+                if use_chat_stop_tokens:
+                    stop_tokens_to_check.extend(self._chat_stop_tokens)
+                
+                for stop_token in stop_tokens_to_check:
+                    if stop_token in accumulated_response:
+                        stop_pos = accumulated_response.find(stop_token)
+                        text_before_stop = accumulated_response[:stop_pos]
+                        previously_yielded_length = len(accumulated_response) - len(new_text)
+                        if len(text_before_stop) > previously_yielded_length:
+                            new_part_before_stop = text_before_stop[previously_yielded_length:]
+                            if new_part_before_stop:
+                                if reasoning_parser:
+                                    # Process through reasoning parser for formatting
+                                    for formatted_token in reasoning_parser.process_token(new_part_before_stop):
+                                        yield formatted_token
+                                else:
+                                    yield new_part_before_stop
+                        return
+
+                # No stop token found, process the new text
+                if reasoning_parser:
+                    # Process through reasoning parser for formatting
+                    for formatted_token in reasoning_parser.process_token(new_text):
+                        yield formatted_token
+                else:
+                    # Normal streaming for non-reasoning models
+                    yield new_text
+                tokens_generated += 1
+
+            # Check for EOS token (ADR-009: use eos_token_ids Set for multi-EOS models)
+            if token_id in self.tokenizer.eos_token_ids:
+                break
+
+        # Finalize reasoning parser if used
+        if reasoning_parser:
+            yield from reasoning_parser.finalize()
+
+        # Clear current generator handle
+        self._current_generator = None
+
+        if self.verbose:
+            generation_time = time.time() - start_time
+            tokens_per_second = tokens_generated / generation_time if generation_time > 0 else 0
+            print(f"\n\nGenerated {tokens_generated} tokens in {generation_time:.1f}s ({tokens_per_second:.1f} tokens/s)")
+
+    def generate_batch(
+        self,
+        prompt: str,
+        max_tokens: Optional[int] = None,
+        temperature: float = 0.7,
+        top_p: float = 0.9,
+        repetition_penalty: float = 1.1,
+        repetition_context_size: int = 20,
+        use_chat_template: bool = True,
+        use_chat_stop_tokens: bool = False,
+    ) -> str:
+        """Generate text in batch mode (non-streaming).
+        
+        Args:
+            prompt: Input prompt
+            max_tokens: Maximum tokens to generate (None for dynamic)
+            temperature: Sampling temperature
+            top_p: Top-p sampling parameter
+            repetition_penalty: Penalty for repeated tokens
+            repetition_context_size: Context size for repetition penalty
+            use_chat_template: Apply tokenizer's chat template if available
+            use_chat_stop_tokens: Include chat turn markers as stop tokens (e.g., "\nHuman:")
+            
+        Returns:
+            Generated text
+        """
+        if not self.model or not self.tokenizer:
+            raise RuntimeError("Model not loaded. Call load_model() first.")
+        
+        # Reset any prior interruption at the start of a new generation
+        self._interrupted = False
+
+        # Use dynamic max tokens if not specified (run command uses full context)
+        effective_max_tokens = max_tokens if max_tokens is not None else self._calculate_dynamic_max_tokens(server_mode=False)
+
+        # Apply chat template if available and requested
+        formatted_prompt = apply_user_prompt(self.tokenizer, prompt, use_chat_template=use_chat_template)
+
+        start_time = time.time()
+
+        # Tokenize and generate (tolerate mocks)
+        prompt_tokens = self.tokenizer.encode(formatted_prompt)
+        if not isinstance(prompt_tokens, (list, tuple)):
+            prompt_tokens = [0]
+        # Ensure MLX core is available
+        mx_core = self._mx
+        if mx_core is None:
+            try:
+                import mlx.core as mx_core  # type: ignore
+                self._mx = mx_core
+            except Exception as e:
+                raise RuntimeError(f"Failed to import mlx.core for generation: {e}") from e
+        prompt_array = mx_core.array(prompt_tokens)
+
+        if self._make_sampler is None or self._make_repetition_penalty is None or self._generate_step is None:
+            _ms = globals().get('make_sampler')
+            _mrp = globals().get('make_repetition_penalty')
+            _gs = globals().get('generate_step')
+            if _ms is None or _mrp is None or _gs is None:
+                try:
+                    from mlx_lm.sample_utils import make_repetition_penalty as _mrp2, make_sampler as _ms2  # type: ignore
+                    from mlx_lm.generate import generate_step as _gs2  # type: ignore
+                    _mrp = _mrp or _mrp2
+                    _ms = _ms or _ms2
+                    _gs = _gs or _gs2
+                except Exception as e:
+                    raise RuntimeError(f"Failed to import MLX-LM generation utils: {e}") from e
+            self._make_repetition_penalty = _mrp
+            self._make_sampler = _ms
+            self._generate_step = _gs
+        sampler = self._make_sampler(temp=temperature, top_p=top_p)
+        logits_processors = []
+        if repetition_penalty > 1.0:
+            logits_processors.append(
+                self._make_repetition_penalty(repetition_penalty, repetition_context_size)
+            )
+
+        # Generate all tokens
+        generated_tokens = []
+        all_tokens = list(prompt_tokens)
+
+        ret = self._generate_step(
+            prompt=prompt_array,
+            model=self.model,
+            max_tokens=effective_max_tokens,
+            sampler=sampler,
+            logits_processors=logits_processors if logits_processors else None,
+        )
+        generator = ret
+        if isinstance(ret, tuple) and len(ret) == 2:
+            generator = iter([ret])
+        self._current_generator = generator
+
+        for token, _ in generator:
+            if self._interrupted:
+                try:
+                    if hasattr(generator, "close"):
+                        generator.close()
+                except Exception:
+                    pass
+                break
+                
+            token_id = token.item() if hasattr(token, 'item') else token
+            generated_tokens.append(token_id)
+            all_tokens.append(token_id)
+
+            # Check for EOS token (ADR-009: use eos_token_ids Set for multi-EOS models)
+            if token_id in self.tokenizer.eos_token_ids:
+                break
+
+        # Decode full response
+        full_response = self.tokenizer.decode(all_tokens)
+
+        # Remove prompt part (guard types to tolerate mocks)
+        if isinstance(full_response, str) and isinstance(formatted_prompt, str) and full_response.startswith(formatted_prompt):
+            response = full_response[len(formatted_prompt):]
+        else:
+            decoded = self.tokenizer.decode(generated_tokens)
+            response = decoded if isinstance(decoded, str) else str(decoded)
+
+        # Filter stop tokens (strings only)
+        if self._stop_tokens:
+            for stop_token in [t for t in self._stop_tokens if isinstance(t, str) and t]:
+                if stop_token and stop_token in response:
+                    response = response[:response.find(stop_token)]
+                    break
+
+        # Optionally filter chat stop tokens to prevent self-conversations in batch mode
+        if use_chat_stop_tokens and self._chat_stop_tokens:
+            for stop_token in self._chat_stop_tokens:
+                if stop_token and stop_token in response:
+                    response = response[:response.find(stop_token)]
+                    break
+
+        # Format reasoning models output
+        response = self._format_reasoning_response(response)
+
+        generation_time = time.time() - start_time
+
+        if self.verbose:
+            tokens_generated = len(generated_tokens)
+            tokens_per_second = tokens_generated / generation_time if generation_time > 0 else 0
+            print(f"\nGenerated {tokens_generated} tokens in {generation_time:.1f}s ({tokens_per_second:.1f} tokens/s)")
+
+        # Clear current generator handle
+        self._current_generator = None
+
+        return response
+
+    def _format_conversation(self, messages):
+        """Format conversation history into a prompt using chat template."""
+        return _format_conversation_helper(self.tokenizer, messages)
+
+    def _format_reasoning_response(self, response: str) -> str:
+        """Format response from reasoning models for better readability."""
+        return _format_reasoning_helper(
+            response,
+            self._is_reasoning_model,
+            self._reasoning_start,
+            self._reasoning_end,
+            self._final_start,
+        )
diff --git a/mlxk2/core/runner/chat_format.py b/mlxk2/core/runner/chat_format.py
new file mode 100644
index 0000000..9fb94c7
--- /dev/null
+++ b/mlxk2/core/runner/chat_format.py
@@ -0,0 +1,51 @@
+from __future__ import annotations
+
+from typing import Any, Dict, List
+
+
+def apply_user_prompt(tokenizer: Any, prompt: str, use_chat_template: bool = True) -> str:
+    """Format a single user prompt using the tokenizer's chat template if present."""
+    template = getattr(tokenizer, 'chat_template', None)
+    if use_chat_template and isinstance(template, str) and template:
+        messages = [{"role": "user", "content": prompt}]
+        try:
+            return tokenizer.apply_chat_template(
+                messages,
+                tokenize=False,
+                add_generation_prompt=True,
+            )
+        except Exception:
+            # Fall back to raw prompt if chat template application fails
+            pass
+    return prompt
+
+
+def format_conversation(tokenizer: Any, messages: List[Dict[str, str]]) -> str:
+    """Format conversation history into a prompt using chat template if available.
+
+    Falls back to legacy Human/Assistant formatting when no chat template exists.
+    """
+    template = getattr(tokenizer, 'chat_template', None)
+    if isinstance(template, str) and template:
+        try:
+            return tokenizer.apply_chat_template(
+                messages,
+                tokenize=False,
+                add_generation_prompt=True,
+            )
+        except Exception:
+            # Fall back to legacy format if template application fails
+            pass
+
+    formatted_parts = []
+    for msg in messages:
+        role = msg["role"]
+        content = msg["content"]
+        if role == "system":
+            formatted_parts.append(f"System: {content}")
+        elif role == "user":
+            formatted_parts.append(f"Human: {content}")
+        elif role == "assistant":
+            formatted_parts.append(f"Assistant: {content}")
+    return "\n\n".join(formatted_parts) + "\n\nAssistant: "
+
diff --git a/mlxk2/core/runner/reasoning_format.py b/mlxk2/core/runner/reasoning_format.py
new file mode 100644
index 0000000..04f27ce
--- /dev/null
+++ b/mlxk2/core/runner/reasoning_format.py
@@ -0,0 +1,51 @@
+from __future__ import annotations
+
+from typing import Optional
+
+
+def format_reasoning_response(
+    response: str,
+    is_reasoning_model: bool,
+    reasoning_start: Optional[str],
+    reasoning_end: Optional[str],
+    final_start: Optional[str],
+) -> str:
+    """Format response for reasoning-style models.
+
+    Mirrors MLXRunner._format_reasoning_response behavior without changing semantics.
+    """
+    if not is_reasoning_model:
+        return response
+
+    if reasoning_start and final_start and reasoning_start in response and final_start in response:
+        try:
+            before_reasoning, after_start = response.split(reasoning_start, 1)
+            if reasoning_end and reasoning_end in after_start:
+                reasoning_content, after_reasoning = after_start.split(reasoning_end, 1)
+                if final_start in after_reasoning:
+                    final_parts = after_reasoning.split(final_start, 1)
+                    if len(final_parts) > 1:
+                        final_answer = final_parts[1].replace('<|channel|>final<|message|>', '', 1)
+                        formatted = []
+                        formatted.append("\n**[Reasoning]**\n")
+                        formatted.append(reasoning_content.strip())
+                        formatted.append("\n\n---\n\n**[Answer]**\n")
+                        formatted.append(final_answer.strip())
+                        return '\n'.join(formatted)
+        except Exception:
+            pass
+
+    # Fallback cleanup
+    cleaned = response
+    if reasoning_start:
+        cleaned = cleaned.replace(reasoning_start, '')
+    if reasoning_end:
+        cleaned = cleaned.replace(reasoning_end, '')
+    if final_start:
+        cleaned = cleaned.replace(final_start, '')
+
+    for marker in ['<|start|>assistant', '<|return|>']:
+        cleaned = cleaned.replace(marker, '')
+
+    return cleaned.strip()
+
diff --git a/mlxk2/core/runner/stop_tokens.py b/mlxk2/core/runner/stop_tokens.py
new file mode 100644
index 0000000..b1cc57b
--- /dev/null
+++ b/mlxk2/core/runner/stop_tokens.py
@@ -0,0 +1,123 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any, List, Optional, Set
+
+from ..reasoning import ReasoningExtractor
+
+
+@dataclass
+class StopTokenInfo:
+    stop_tokens: List[str]
+    chat_stop_tokens: List[str]
+    is_reasoning_model: bool
+    reasoning_start: Optional[str]
+    reasoning_end: Optional[str]
+    final_start: Optional[str]
+
+
+def extract_stop_tokens(tokenizer: Any, verbose: bool = False) -> StopTokenInfo:
+    """Extract stop tokens and reasoning markers from a tokenizer.
+
+    This mirrors MLXRunner._extract_stop_tokens logic.
+    """
+    stop_tokens: Set[str] = set()
+
+    eos_token = getattr(tokenizer, 'eos_token', None)
+    if eos_token:
+        stop_tokens.add(eos_token)
+
+    pad_token = getattr(tokenizer, 'pad_token', None)
+    if pad_token and pad_token != eos_token:
+        stop_tokens.add(pad_token)
+
+    additional = getattr(tokenizer, 'additional_special_tokens', None)
+    if isinstance(additional, (list, tuple)):
+        for token in additional:
+            if isinstance(token, str) and token:
+                tl = token.lower()
+                if any(keyword in tl for keyword in ['end', 'stop', 'eot']):
+                    stop_tokens.add(token)
+
+    decoder = getattr(tokenizer, 'added_tokens_decoder', None)
+    if isinstance(decoder, dict):
+        for _token_id, token_info in decoder.items():
+            if isinstance(token_info, dict) and 'content' in token_info:
+                token_content = token_info['content']
+                if isinstance(token_content, str) and token_content:
+                    token_lower = token_content.lower()
+                    if token_content == '<|end|>':
+                        add_eos_token = getattr(tokenizer, 'add_eos_token', None)
+                        if callable(add_eos_token):
+                            try:
+                                add_eos_token(token_content)
+                            except Exception:
+                                pass
+                        continue
+                    end_patterns = ['stop', 'eot', 'return', 'finish', 'done', 'im_end']
+                    if any(pattern in token_lower for pattern in end_patterns):
+                        stop_tokens.add(token_content)
+                    elif 'end' in token_lower and token_content != '<|end|>':
+                        stop_tokens.add(token_content)
+
+    # Common stop tokens: add if tokenizer encodes them as a single token and decodes faithfully
+    common_stop_tokens = {'</s>', '<|endoftext|>', '<|im_end|>', '<|eot_id|>'}
+    for token in common_stop_tokens:
+        try:
+            ids = tokenizer.encode(token, add_special_tokens=False)
+            if ids and len(ids) == 1:
+                decoded = tokenizer.decode(ids)
+                if decoded == token:
+                    stop_tokens.add(token)
+        except Exception:
+            pass
+
+    is_reasoning_model = False
+    reasoning_start: Optional[str] = None
+    reasoning_end: Optional[str] = None
+    final_start: Optional[str] = None
+
+    if hasattr(tokenizer, 'name_or_path'):
+        try:
+            name_or_path = str(getattr(tokenizer, 'name_or_path', '')).lower()
+        except Exception:
+            name_or_path = ''
+        model_type = ReasoningExtractor.detect_model_type(name_or_path)
+
+        if model_type:
+            is_reasoning_model = True
+            if model_type in ReasoningExtractor.PATTERNS:
+                markers = ReasoningExtractor.PATTERNS[model_type]['markers']
+                reasoning_start = markers.get('reasoning_start')
+                reasoning_end = markers.get('reasoning_end')
+                final_start = markers.get('final_marker')
+
+            if reasoning_end:
+                stop_tokens.discard(reasoning_end)
+
+            if model_type == 'gpt-oss':
+                stop_tokens.add('<|return|>')
+
+            if verbose:
+                # Keep any print semantics consistent with previous behavior
+                pass
+
+    chat_stop_tokens = [
+        '\nHuman:', '\nAssistant:', '\nYou:',
+        '\n\nHuman:', '\n\nAssistant:', '\n\nYou:',
+        '\nH:', '\nA:', '\nY:',
+        '\n\nH:', '\n\nA:', '\n\nY:',
+    ]
+
+    # Remove None values and normalize to list[str]
+    stop_tokens.discard(None)  # type: ignore[arg-type]
+    stop_tokens_list = [t for t in stop_tokens if isinstance(t, str) and t]
+
+    return StopTokenInfo(
+        stop_tokens=stop_tokens_list,
+        chat_stop_tokens=chat_stop_tokens,
+        is_reasoning_model=is_reasoning_model,
+        reasoning_start=reasoning_start,
+        reasoning_end=reasoning_end,
+        final_start=final_start,
+    )
diff --git a/mlxk2/core/runner/token_limits.py b/mlxk2/core/runner/token_limits.py
new file mode 100644
index 0000000..731539c
--- /dev/null
+++ b/mlxk2/core/runner/token_limits.py
@@ -0,0 +1,45 @@
+from __future__ import annotations
+
+import json
+import os
+from typing import Optional
+
+
+def get_model_context_length(model_path: str) -> int:
+    """Extract max_position_embeddings from model config with safe fallbacks.
+
+    Returns a sensible default (4096) if the config is missing or malformed.
+    """
+    config_path = os.path.join(model_path, "config.json")
+    try:
+        with open(config_path) as f:
+            config = json.load(f)
+
+        context_keys = [
+            "max_position_embeddings",
+            "n_positions",
+            "context_length",
+            "max_sequence_length",
+            "seq_len",
+        ]
+
+        for key in context_keys:
+            if key in config:
+                value = config[key]
+                if isinstance(value, int) and value > 0:
+                    return value
+                if isinstance(value, str) and value.isdigit():
+                    parsed = int(value)
+                    if parsed > 0:
+                        return parsed
+        return 4096
+    except (FileNotFoundError, json.JSONDecodeError, KeyError):
+        return 4096
+
+
+def calculate_dynamic_max_tokens(context_length: Optional[int], server_mode: bool = True) -> int:
+    """Compute an effective generation limit based on context and mode."""
+    if not context_length or context_length <= 0:
+        return 2048
+    return context_length // 2 if server_mode else context_length
+
diff --git a/mlxk2/core/server_base.py b/mlxk2/core/server_base.py
new file mode 100644
index 0000000..a07cdcf
--- /dev/null
+++ b/mlxk2/core/server_base.py
@@ -0,0 +1,909 @@
+"""
+OpenAI-compatible API server for MLX models (2.0 implementation).
+Provides REST endpoints for text generation with MLX backend.
+"""
+
+import json
+import threading
+import time
+import uuid
+from collections.abc import AsyncGenerator
+from contextlib import asynccontextmanager
+from typing import Any, Dict, List, Optional, Union
+
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import StreamingResponse, JSONResponse
+from pydantic import BaseModel, Field
+
+from .cache import get_current_model_cache
+from .runner import MLXRunner
+from .. import __version__
+from ..errors import (
+    ErrorType,
+    MLXKError,
+    error_envelope,
+)
+from ..logging import get_logger, set_log_level
+from ..context import generate_request_id
+
+# Global model cache and configuration
+_model_cache: Dict[str, MLXRunner] = {}
+_current_model_path: Optional[str] = None
+_default_max_tokens: Optional[int] = None  # Use dynamic model-aware limits by default
+_model_lock = threading.Lock()  # Thread-safe model switching
+# Global shutdown flag to interrupt in-flight generations promptly
+_shutdown_event = threading.Event()
+
+# Global logger instance (ADR-004)
+logger = get_logger()
+
+
+class CompletionRequest(BaseModel):
+    model: str
+    prompt: Union[str, List[str]]
+    max_tokens: Optional[int] = None
+    temperature: Optional[float] = 0.7
+    top_p: Optional[float] = 0.9
+    stream: Optional[bool] = False
+    stop: Optional[Union[str, List[str]]] = None
+    repetition_penalty: Optional[float] = 1.1
+
+
+class ChatMessage(BaseModel):
+    role: str = Field(..., pattern="^(system|user|assistant)$")
+    content: str
+
+
+class ChatCompletionRequest(BaseModel):
+    model: str
+    messages: List[ChatMessage]
+    max_tokens: Optional[int] = None
+    temperature: Optional[float] = 0.7
+    top_p: Optional[float] = 0.9
+    stream: Optional[bool] = False
+    stop: Optional[Union[str, List[str]]] = None
+    repetition_penalty: Optional[float] = 1.1
+
+
+class CompletionResponse(BaseModel):
+    id: str
+    object: str = "text_completion"
+    created: int
+    model: str
+    choices: List[Dict[str, Any]]
+    usage: Dict[str, int]
+
+
+class ChatCompletionResponse(BaseModel):
+    id: str
+    object: str = "chat.completion"
+    created: int
+    model: str
+    choices: List[Dict[str, Any]]
+    usage: Dict[str, int]
+
+
+class ModelInfo(BaseModel):
+    id: str
+    object: str = "model"
+    owned_by: str = "mlx-knife"
+    permission: List = []
+    context_length: Optional[int] = None
+
+
+def get_or_load_model(model_spec: str, verbose: bool = False) -> MLXRunner:
+    """Get model from cache or load it if not cached.
+    
+    Thread-safe model switching with proper cleanup on interruption.
+    """
+    global _model_cache, _current_model_path
+
+    # Abort early if shutdown requested
+    if _shutdown_event.is_set():
+        raise HTTPException(status_code=503, detail="Server is shutting down")
+
+    # Thread-safe model switching
+    with _model_lock:
+        if _shutdown_event.is_set():
+            raise HTTPException(status_code=503, detail="Server is shutting down")
+        # Simple approach like run command - let MLXRunner handle everything
+        if _current_model_path != model_spec:
+            logger.info(f"Switching to model: {model_spec}", model=model_spec)
+
+            # Clean up previous model
+            if _model_cache:
+                try:
+                    for _old_runner in list(_model_cache.values()):
+                        try:
+                            _old_runner.cleanup()
+                        except Exception as e:
+                            logger.warning(f"Warning during cleanup: {e}")
+                finally:
+                    _model_cache.clear()
+                    _current_model_path = None
+
+            # Load new model (disable signal handlers for server mode)
+            try:
+                runner = MLXRunner(model_spec, verbose=verbose, install_signal_handlers=False)
+                # If shutdown was requested, abort before expensive load
+                if _shutdown_event.is_set():
+                    raise KeyboardInterrupt()
+                runner.load_model()
+                if _shutdown_event.is_set():
+                    raise KeyboardInterrupt()
+
+                _model_cache[model_spec] = runner
+                _current_model_path = model_spec
+
+                logger.info(f"Model loaded successfully: {model_spec}", model=model_spec)
+
+            except KeyboardInterrupt:
+                # Handle interruption during model loading
+                logger.warning("Model loading interrupted")
+                _model_cache.clear()
+                _current_model_path = None
+                raise HTTPException(status_code=503, detail="Server interrupted during model load")
+            except Exception as e:
+                # Clean up on failed load
+                logger.error(f"Model load failed: {model_spec}", error_key=f"model_load_{model_spec}", detail=str(e))
+                _model_cache.clear()
+                _current_model_path = None
+                raise HTTPException(status_code=404, detail=f"Model '{model_spec}' not found or failed to load: {str(e)}")
+
+        return _model_cache[model_spec]
+
+
+async def generate_completion_stream(
+    runner: MLXRunner,
+    prompt: str,
+    request: CompletionRequest,
+) -> AsyncGenerator[str, None]:
+    """Generate streaming completion response."""
+    completion_id = f"cmpl-{uuid.uuid4()}"
+    created = int(time.time())
+
+    # Yield initial response
+    initial_response = {
+        "id": completion_id,
+        "object": "text_completion",
+        "created": created,
+        "model": request.model,
+        "choices": [
+            {
+                "index": 0,
+                "text": "",
+                "logprobs": None,
+                "finish_reason": None
+            }
+        ]
+    }
+
+    yield f"data: {json.dumps(initial_response)}\n\n"
+
+    # Stream tokens
+    try:
+        token_count = 0
+        for token in runner.generate_streaming(
+            prompt=prompt,
+            max_tokens=get_effective_max_tokens(runner, request.max_tokens, server_mode=True),
+            temperature=request.temperature,
+            top_p=request.top_p,
+            repetition_penalty=request.repetition_penalty,
+            use_chat_template=False  # Raw completion mode
+        ):
+            # Stop promptly if server is shutting down
+            if _shutdown_event.is_set():
+                raise KeyboardInterrupt()
+            token_count += 1
+
+            chunk_response = {
+                "id": completion_id,
+                "object": "text_completion",
+                "created": created,
+                "model": request.model,
+                "choices": [
+                    {
+                        "index": 0,
+                        "text": token,
+                        "logprobs": None,
+                        "finish_reason": None
+                    }
+                ]
+            }
+
+            yield f"data: {json.dumps(chunk_response)}\n\n"
+
+            # Check for stop sequences
+            if request.stop:
+                stop_sequences = request.stop if isinstance(request.stop, list) else [request.stop]
+                if any(stop in token for stop in stop_sequences):
+                    break
+
+    except KeyboardInterrupt:
+        # During shutdown/disconnect avoid extra logs; best-effort cleanup
+        if not _shutdown_event.is_set():
+            try:
+                import mlx.core as mx
+                mx.clear_cache()
+            except Exception:
+                pass
+            # Try to send an interrupt marker if client still connected
+            try:
+                interrupt_response = {
+                    "id": completion_id,
+                    "object": "text_completion",
+                    "created": created,
+                    "model": request.model,
+                    "choices": [
+                        {
+                            "index": 0,
+                            "text": "\n\n[Generation interrupted by user]",
+                            "logprobs": None,
+                            "finish_reason": "stop"
+                        }
+                    ]
+                }
+                yield f"data: {json.dumps(interrupt_response)}\n\n"
+            except Exception:
+                pass
+        return
+        
+    except Exception as e:
+        error_response = {
+            "id": completion_id,
+            "object": "text_completion",
+            "created": created,
+            "model": request.model,
+            "choices": [
+                {
+                    "index": 0,
+                    "text": "",
+                    "logprobs": None,
+                    "finish_reason": "error"
+                }
+            ],
+            "error": str(e)
+        }
+        yield f"data: {json.dumps(error_response)}\n\n"
+
+    # Final response (skip if shutting down)
+    if _shutdown_event.is_set():
+        return
+    final_response = {
+        "id": completion_id,
+        "object": "text_completion",
+        "created": created,
+        "model": request.model,
+        "choices": [
+            {
+                "index": 0,
+                "text": "",
+                "logprobs": None,
+                "finish_reason": "stop"
+            }
+        ]
+    }
+
+    yield f"data: {json.dumps(final_response)}\n\n"
+    yield "data: [DONE]\n\n"
+    
+
+
+async def generate_chat_stream(
+    runner: MLXRunner,
+    messages: List[ChatMessage],
+    request: ChatCompletionRequest,
+) -> AsyncGenerator[str, None]:
+    """Generate streaming chat completion response."""
+    completion_id = f"chatcmpl-{uuid.uuid4()}"
+    created = int(time.time())
+
+    # Convert messages to dict format for runner
+    message_dicts = format_chat_messages_for_runner(messages)
+    
+    # Let the runner format with chat templates
+    prompt = runner._format_conversation(message_dicts)
+
+    # Yield initial response
+    initial_response = {
+        "id": completion_id,
+        "object": "chat.completion.chunk",
+        "created": created,
+        "model": request.model,
+        "choices": [
+            {
+                "index": 0,
+                "delta": {"role": "assistant", "content": ""},
+                "finish_reason": None
+            }
+        ]
+    }
+
+    yield f"data: {json.dumps(initial_response)}\n\n"
+
+    # Stream tokens
+    try:
+        for token in runner.generate_streaming(
+            prompt=prompt,
+            max_tokens=get_effective_max_tokens(runner, request.max_tokens, server_mode=True),
+            temperature=request.temperature,
+            top_p=request.top_p,
+            repetition_penalty=request.repetition_penalty,
+            use_chat_template=False,  # Already applied in _format_conversation
+            use_chat_stop_tokens=True   # Server NEEDS chat stop tokens to prevent self-conversations
+        ):
+            # Stop promptly if server is shutting down
+            if _shutdown_event.is_set():
+                raise KeyboardInterrupt()
+            chunk_response = {
+                "id": completion_id,
+                "object": "chat.completion.chunk",
+                "created": created,
+                "model": request.model,
+                "choices": [
+                    {
+                        "index": 0,
+                        "delta": {"content": token},
+                        "finish_reason": None
+                    }
+                ]
+            }
+
+            yield f"data: {json.dumps(chunk_response)}\n\n"
+
+            # Check for stop sequences
+            if request.stop:
+                stop_sequences = request.stop if isinstance(request.stop, list) else [request.stop]
+                if any(stop in token for stop in stop_sequences):
+                    break
+
+    except KeyboardInterrupt:
+        if not _shutdown_event.is_set():
+            try:
+                import mlx.core as mx
+                mx.clear_cache()
+            except Exception:
+                pass
+            try:
+                interrupt_response = {
+                    "id": completion_id,
+                    "object": "chat.completion.chunk",
+                    "created": created,
+                    "model": request.model,
+                    "choices": [
+                        {
+                            "index": 0,
+                            "delta": {"content": "\n\n[Generation interrupted by user]"},
+                            "finish_reason": "stop"
+                        }
+                    ]
+                }
+                yield f"data: {json.dumps(interrupt_response)}\n\n"
+            except Exception:
+                pass
+        return
+        
+    except Exception as e:
+        # Optional debug logging for chat streaming errors
+        try:
+            import os
+            if os.environ.get("MLXK2_DEBUG"):
+                print(f"[DEBUG] Exception in chat streaming: {type(e).__name__}: {e}")
+        except Exception:
+            pass
+        
+        # Try MLX recovery for any exception that might be interrupt-related
+        if "interrupt" in str(e).lower() or "keyboard" in str(e).lower():
+            try:
+                import os
+                if os.environ.get("MLXK2_DEBUG"):
+                    print("[Server] Detected interrupt-like exception, attempting MLX recovery...")
+            except Exception:
+                pass
+            try:
+                import mlx.core as mx
+                mx.clear_cache()
+                try:
+                    import os
+                    if os.environ.get("MLXK2_DEBUG"):
+                        print("[Server] MLX state recovered after exception")
+                except Exception:
+                    pass
+            except Exception as recovery_error:
+                try:
+                    import os
+                    if os.environ.get("MLXK2_DEBUG"):
+                        print(f"[Server] MLX recovery warning: {recovery_error}")
+                except Exception:
+                    pass
+        
+        error_response = {
+            "id": completion_id,
+            "object": "chat.completion.chunk",
+            "created": created,
+            "model": request.model,
+            "choices": [
+                {
+                    "index": 0,
+                    "delta": {},
+                    "finish_reason": "error"
+                }
+            ],
+            "error": str(e)
+        }
+        yield f"data: {json.dumps(error_response)}\n\n"
+
+    # Final response (skip if shutting down)
+    if _shutdown_event.is_set():
+        return
+    final_response = {
+        "id": completion_id,
+        "object": "chat.completion.chunk",
+        "created": created,
+        "model": request.model,
+        "choices": [
+            {
+                "index": 0,
+                "delta": {},
+                "finish_reason": "stop"
+            }
+        ]
+    }
+
+    yield f"data: {json.dumps(final_response)}\n\n"
+    yield "data: [DONE]\n\n"
+    
+
+
+def format_chat_messages_for_runner(messages: List[ChatMessage]) -> List[Dict[str, str]]:
+    """Convert chat messages to format expected by MLXRunner.
+    
+    Returns messages in dict format for the runner to apply chat templates.
+    """
+    return [{"role": msg.role, "content": msg.content} for msg in messages]
+
+
+def get_effective_max_tokens(runner: MLXRunner, requested_max_tokens: Optional[int], server_mode: bool) -> Optional[int]:
+    """Get effective max tokens with server DoS protection."""
+    if requested_max_tokens is not None:
+        return requested_max_tokens
+    else:
+        # Use runner's dynamic calculation with server_mode flag
+        return runner._calculate_dynamic_max_tokens(server_mode=server_mode)
+
+
+def count_tokens(text: str) -> int:
+    """Rough token count estimation."""
+    return int(len(text.split()) * 1.3)  # Approximation, convert to int
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Manage application lifespan."""
+    # Configure log level early (from environment if subprocess mode)
+    import os
+    env_log_level = os.environ.get("MLXK2_LOG_LEVEL", "info")
+    set_log_level(env_log_level)
+
+    logger.info("MLX Knife Server 2.0 starting up...")
+    yield
+    logger.info("MLX Knife Server 2.0 shutting down...")
+    # Ensure shutdown flag is set so any in-flight generations stop quickly
+    try:
+        _request_global_interrupt()
+    except Exception:
+        pass
+    # Clean up model cache
+    global _model_cache
+    try:
+        for _runner in list(_model_cache.values()):
+            try:
+                _runner.cleanup()
+            except Exception:
+                pass
+    finally:
+        _model_cache.clear()
+        
+        # Force MLX memory cleanup
+        try:
+            import mlx.core as mx
+            mx.clear_cache()
+            logger.info("MLX memory cleared")
+        except Exception:
+            pass
+
+
+# Create FastAPI app
+app = FastAPI(
+    title="MLX Knife API 2.0",
+    description="OpenAI-compatible API for MLX models (2.0 implementation)",
+    version=__version__,
+    lifespan=lifespan
+)
+
+# Add CORS middleware for browser access
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],  # Allow all origins for local development
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+
+# Request ID middleware (ADR-004)
+@app.middleware("http")
+async def add_request_id_middleware(request: Request, call_next):
+    """Add request_id to all requests for correlation."""
+    request_id = generate_request_id()
+    request.state.request_id = request_id
+
+    response = await call_next(request)
+    response.headers["X-Request-ID"] = request_id
+    return response
+
+
+# Custom exception handler for MLXKError (ADR-004)
+@app.exception_handler(HTTPException)
+async def http_exception_handler(request: Request, exc: HTTPException):
+    """Convert HTTPException to error envelope."""
+    request_id = getattr(request.state, "request_id", None)
+
+    # Map HTTP status to error type
+    error_type_map = {
+        403: ErrorType.ACCESS_DENIED,
+        404: ErrorType.MODEL_NOT_FOUND,
+        400: ErrorType.VALIDATION_ERROR,
+        503: ErrorType.SERVER_SHUTDOWN,
+        500: ErrorType.INTERNAL_ERROR,
+    }
+
+    error_type = error_type_map.get(exc.status_code, ErrorType.INTERNAL_ERROR)
+    error = MLXKError(
+        type=error_type,
+        message=exc.detail,
+        retryable=(exc.status_code == 503)
+    )
+
+    envelope = error_envelope(error, request_id=request_id)
+    return JSONResponse(
+        status_code=exc.status_code,
+        content=envelope
+    )
+
+
+@app.get("/health")
+async def health_check():
+    """Health check endpoint (OpenAI compatible)."""
+    return {"status": "healthy", "service": "mlx-knife-server-2.0"}
+
+
+@app.get("/v1/models")
+async def list_models():
+    """List available MLX models in the cache."""
+    from .cache import cache_dir_to_hf
+    from ..operations.common import detect_framework
+    from ..operations.health import is_model_healthy
+    
+    model_list = []
+    model_cache = get_current_model_cache()
+    
+    # Find all model directories
+    models = [d for d in model_cache.iterdir() if d.name.startswith("models--")]
+    
+    for model_dir in models:
+        model_name = cache_dir_to_hf(model_dir.name)
+        
+        try:
+            # Check if it's a healthy MLX model
+            # Get the latest snapshot for detection
+            snapshots_dir = model_dir / "snapshots"
+            selected_path = None
+            if snapshots_dir.exists():
+                snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+                if snapshots:
+                    selected_path = snapshots[0]
+            
+            if detect_framework(model_name, model_dir, selected_path) == "MLX" and is_model_healthy(model_name)[0]:
+                # Get model context length (best effort)
+                context_length = None
+                try:
+                    snapshots_dir = model_dir / "snapshots"
+                    if snapshots_dir.exists():
+                        snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+                        if snapshots:
+                            from .runner import get_model_context_length
+                            context_length = get_model_context_length(str(snapshots[0]))
+                except Exception:
+                    pass
+
+                model_list.append(ModelInfo(
+                    id=model_name,
+                    object="model",
+                    owned_by="mlx-knife-2.0",
+                    context_length=context_length
+                ))
+        except Exception:
+            # Skip models that can't be processed
+            continue
+
+    return {"object": "list", "data": model_list}
+
+
+@app.post("/v1/completions")
+async def create_completion(request: CompletionRequest):
+    """Create a text completion."""
+    try:
+        if _shutdown_event.is_set():
+            raise HTTPException(status_code=503, detail="Server is shutting down")
+        runner = get_or_load_model(request.model)
+
+        # Handle array of prompts
+        if isinstance(request.prompt, list):
+            if len(request.prompt) > 1:
+                raise HTTPException(status_code=400, detail="Multiple prompts not supported yet")
+            prompt = request.prompt[0]
+        else:
+            prompt = request.prompt
+
+        if request.stream:
+            # Streaming response
+            return StreamingResponse(
+                generate_completion_stream(runner, prompt, request),
+                media_type="text/event-stream",
+                headers={"Cache-Control": "no-cache"}
+            )
+        else:
+            # Non-streaming response
+            completion_id = f"cmpl-{uuid.uuid4()}"
+            created = int(time.time())
+
+            generated_text = runner.generate_batch(
+                prompt=prompt,
+                max_tokens=get_effective_max_tokens(runner, request.max_tokens, server_mode=True),
+                temperature=request.temperature,
+                top_p=request.top_p,
+                repetition_penalty=request.repetition_penalty,
+                use_chat_template=False
+            )
+
+            prompt_tokens = count_tokens(prompt)
+            completion_tokens = count_tokens(generated_text)
+
+            return CompletionResponse(
+                id=completion_id,
+                created=created,
+                model=request.model,
+                choices=[
+                    {
+                        "index": 0,
+                        "text": generated_text,
+                        "logprobs": None,
+                        "finish_reason": "stop"
+                    }
+                ],
+                usage={
+                    "prompt_tokens": prompt_tokens,
+                    "completion_tokens": completion_tokens,
+                    "total_tokens": prompt_tokens + completion_tokens
+                }
+            )
+
+    except HTTPException as http_exc:
+        # Preserve intended HTTP status codes from inner helpers
+        raise http_exc
+    except Exception as e:
+        # Map unexpected errors to 500
+        raise HTTPException(status_code=500, detail=str(e))
+
+
+@app.post("/v1/chat/completions")
+async def create_chat_completion(request: ChatCompletionRequest):
+    """Create a chat completion."""
+    try:
+        if _shutdown_event.is_set():
+            raise HTTPException(status_code=503, detail="Server is shutting down")
+        runner = get_or_load_model(request.model)
+
+        if request.stream:
+            # Streaming response
+            return StreamingResponse(
+                generate_chat_stream(runner, request.messages, request),
+                media_type="text/event-stream",
+                headers={"Cache-Control": "no-cache"}
+            )
+        else:
+            # Non-streaming response
+            completion_id = f"chatcmpl-{uuid.uuid4()}"
+            created = int(time.time())
+
+            # Convert messages to dict format for runner
+            message_dicts = format_chat_messages_for_runner(request.messages)
+            
+            # Let the runner format with chat templates
+            prompt = runner._format_conversation(message_dicts)
+
+            generated_text = runner.generate_batch(
+                prompt=prompt,
+                max_tokens=get_effective_max_tokens(runner, request.max_tokens, server_mode=True),
+                temperature=request.temperature,
+                top_p=request.top_p,
+                repetition_penalty=request.repetition_penalty,
+                use_chat_template=False,  # Already applied in _format_conversation
+                use_chat_stop_tokens=True   # Server NEEDS chat stop tokens to prevent self-conversations
+            )
+
+            # Token counting
+            total_prompt = "\n\n".join([msg.content for msg in request.messages])
+            prompt_tokens = count_tokens(total_prompt)
+            completion_tokens = count_tokens(generated_text)
+
+            return ChatCompletionResponse(
+                id=completion_id,
+                created=created,
+                model=request.model,
+                choices=[
+                    {
+                        "index": 0,
+                        "message": {
+                            "role": "assistant",
+                            "content": generated_text
+                        },
+                        "finish_reason": "stop"
+                    }
+                ],
+                usage={
+                    "prompt_tokens": prompt_tokens,
+                    "completion_tokens": completion_tokens,
+                    "total_tokens": prompt_tokens + completion_tokens
+                }
+            )
+
+    except HTTPException as http_exc:
+        # Preserve intended HTTP status codes from inner helpers
+        raise http_exc
+    except Exception as e:
+        # Map unexpected errors to 500
+        raise HTTPException(status_code=500, detail=str(e))
+
+
+def cleanup_server():
+    """Manual cleanup function for emergency situations."""
+    global _model_cache, _current_model_path
+    logger.warning("Forcing server cleanup...")
+
+    # Thread-safe cleanup
+    with _model_lock:
+        try:
+            for _runner in list(_model_cache.values()):
+                try:
+                    _runner.cleanup()
+                except Exception as e:
+                    logger.warning(f"Warning during runner cleanup: {e}")
+        finally:
+            _model_cache.clear()
+            _current_model_path = None
+
+            # Force MLX memory cleanup
+            try:
+                import mlx.core as mx
+                mx.clear_cache()
+                logger.info("MLX memory cleared")
+            except Exception as e:
+                logger.warning(f"Warning during MLX cleanup: {e}")
+
+
+def _request_global_interrupt() -> None:
+    """Request all running generations to stop quickly.
+
+    Used during server shutdown to ensure in-flight streams stop.
+    """
+    _shutdown_event.set()
+    try:
+        with _model_lock:
+            for _runner in list(_model_cache.values()):
+                try:
+                    _runner.request_interrupt()
+                except Exception:
+                    pass
+    except Exception:
+        pass
+
+
+
+
+def run_server(
+    host: str = "127.0.0.1",
+    port: int = 8000,
+    max_tokens: int = 2000,
+    reload: bool = False,
+    log_level: str = "info"
+):
+    """Run the MLX Knife server 2.0."""
+    import os
+
+    # Import uvicorn lazily to keep module import light when server isn't used
+    try:
+        import uvicorn  # type: ignore
+    except Exception as e:
+        raise RuntimeError("uvicorn is required to run the server; install with 'pip install fastapi uvicorn'.") from e
+    global _default_max_tokens
+    _default_max_tokens = max_tokens
+
+    # Check for log level from environment (subprocess mode)
+    env_log_level = os.environ.get("MLXK2_LOG_LEVEL")
+    if env_log_level:
+        log_level = env_log_level
+
+    # Configure logging level for MLXKLogger and root logger (ADR-004)
+    set_log_level(log_level)
+
+    # Rely on Uvicorn's own signal handling; manage shutdown via lifespan
+
+    logger.info(f"Starting MLX Knife Server 2.0 on http://{host}:{port}")
+    logger.info(f"API docs available at http://{host}:{port}/docs")
+    logger.info(f"Default max tokens: {'model-aware dynamic limits' if max_tokens is None else max_tokens}")
+    logger.info("Press Ctrl-C to stop the server")
+
+    # Enable access logs only at debug/info level (reduces noise at warning/error)
+    access_log_enabled = log_level.lower() in ["debug", "info"]
+
+    # Configure Uvicorn log format (JSON if MLXK2_LOG_JSON=1)
+    json_mode = os.environ.get("MLXK2_LOG_JSON", "0") == "1"
+    log_config = None
+    if json_mode:
+        # Use custom log config for JSON formatting
+        log_config = {
+            "version": 1,
+            "disable_existing_loggers": False,
+            "formatters": {
+                "default": {
+                    "()": "mlxk2.logging.JSONFormatter",
+                },
+                "access": {
+                    "()": "mlxk2.logging.JSONFormatter",
+                },
+            },
+            "handlers": {
+                "default": {
+                    "formatter": "default",
+                    "class": "logging.StreamHandler",
+                    "stream": "ext://sys.stderr",
+                },
+                "access": {
+                    "formatter": "access",
+                    "class": "logging.StreamHandler",
+                    "stream": "ext://sys.stderr",
+                },
+            },
+            "loggers": {
+                "uvicorn": {"handlers": ["default"], "level": log_level.upper()},
+                "uvicorn.error": {"level": log_level.upper()},
+                "uvicorn.access": {"handlers": ["access"], "level": "INFO", "propagate": False},
+            },
+        }
+
+    try:
+        uvicorn.run(
+            "mlxk2.core.server_base:app",
+            host=host,
+            port=port,
+            reload=reload,
+            log_level=log_level,
+            log_config=log_config,
+            access_log=access_log_enabled,
+            workers=1,
+            timeout_graceful_shutdown=5,
+            timeout_keep_alive=5,
+            lifespan="on"
+        )
+    except KeyboardInterrupt:
+        logger.info("Server interrupted by user")
+        _request_global_interrupt()
+        cleanup_server()
+    except Exception as e:
+        logger.error(f"Server error: {e}", error_key="server_error")
+        _request_global_interrupt()
+        cleanup_server()
+        raise
diff --git a/mlxk2/errors.py b/mlxk2/errors.py
new file mode 100644
index 0000000..d66558a
--- /dev/null
+++ b/mlxk2/errors.py
@@ -0,0 +1,165 @@
+"""
+Unified error handling for MLX Knife 2.0 (ADR-004).
+
+Provides standardized error envelope, error type taxonomy, and HTTP status mapping.
+"""
+
+from enum import Enum
+from typing import Any, Dict, Optional
+from dataclasses import dataclass
+
+
+class ErrorType(str, Enum):
+    """Standardized error types (ADR-004 taxonomy)."""
+    ACCESS_DENIED = "access_denied"
+    MODEL_NOT_FOUND = "model_not_found"
+    AMBIGUOUS_MATCH = "ambiguous_match"
+    DOWNLOAD_FAILED = "download_failed"
+    VALIDATION_ERROR = "validation_error"
+    PUSH_OPERATION_FAILED = "push_operation_failed"
+    SERVER_SHUTDOWN = "server_shutdown"
+    INTERNAL_ERROR = "internal_error"
+
+
+# HTTP status code mapping (ADR-004 specification)
+ERROR_TYPE_TO_HTTP_STATUS: Dict[ErrorType, int] = {
+    ErrorType.ACCESS_DENIED: 403,
+    ErrorType.MODEL_NOT_FOUND: 404,
+    ErrorType.AMBIGUOUS_MATCH: 400,
+    ErrorType.DOWNLOAD_FAILED: 503,
+    ErrorType.VALIDATION_ERROR: 400,
+    ErrorType.PUSH_OPERATION_FAILED: 500,
+    ErrorType.SERVER_SHUTDOWN: 503,
+    ErrorType.INTERNAL_ERROR: 500,
+}
+
+
+@dataclass
+class MLXKError:
+    """Structured error information (ADR-004 error envelope).
+
+    Attributes:
+        type: Error type from ErrorType enum
+        message: Human-readable error message
+        detail: Optional additional error details (dict or string)
+        retryable: Whether the operation can be retried (None = unknown)
+    """
+    type: ErrorType
+    message: str
+    detail: Optional[Any] = None
+    retryable: Optional[bool] = None
+
+    def to_dict(self) -> Dict[str, Any]:
+        """Convert to dictionary for JSON serialization."""
+        result = {
+            "type": self.type.value,
+            "message": self.message,
+        }
+        if self.detail is not None:
+            result["detail"] = self.detail
+        if self.retryable is not None:
+            result["retryable"] = self.retryable
+        return result
+
+    def to_http_status(self) -> int:
+        """Get HTTP status code for this error type."""
+        return ERROR_TYPE_TO_HTTP_STATUS.get(self.type, 500)
+
+
+def error_envelope(
+    error: MLXKError,
+    request_id: Optional[str] = None,
+    data: Optional[Dict[str, Any]] = None
+) -> Dict[str, Any]:
+    """Create a complete error envelope (ADR-004 specification).
+
+    Args:
+        error: MLXKError instance
+        request_id: Optional request correlation ID (UUID)
+        data: Optional additional response data
+
+    Returns:
+        Error envelope dict: {"status": "error", "error": {...}, "request_id": "...", "data": {...}}
+    """
+    envelope = {
+        "status": "error",
+        "error": error.to_dict()
+    }
+    if request_id:
+        envelope["request_id"] = request_id
+    if data:
+        envelope["data"] = data
+    return envelope
+
+
+def success_envelope(
+    data: Dict[str, Any],
+    request_id: Optional[str] = None
+) -> Dict[str, Any]:
+    """Create a success response envelope (ADR-004 specification).
+
+    Args:
+        data: Response data
+        request_id: Optional request correlation ID (UUID)
+
+    Returns:
+        Success envelope dict: {"status": "success", "data": {...}, "request_id": "..."}
+    """
+    envelope = {
+        "status": "success",
+        "data": data
+    }
+    if request_id:
+        envelope["request_id"] = request_id
+    return envelope
+
+
+# Common error constructors for convenience
+def model_not_found_error(model_name: str, detail: Optional[str] = None) -> MLXKError:
+    """Create a model_not_found error."""
+    return MLXKError(
+        type=ErrorType.MODEL_NOT_FOUND,
+        message=f"Model '{model_name}' not found or failed to load",
+        detail=detail,
+        retryable=False
+    )
+
+
+def validation_error(message: str, detail: Optional[Any] = None) -> MLXKError:
+    """Create a validation_error."""
+    return MLXKError(
+        type=ErrorType.VALIDATION_ERROR,
+        message=message,
+        detail=detail,
+        retryable=False
+    )
+
+
+def server_shutdown_error(message: str = "Server is shutting down") -> MLXKError:
+    """Create a server_shutdown error."""
+    return MLXKError(
+        type=ErrorType.SERVER_SHUTDOWN,
+        message=message,
+        detail=None,
+        retryable=True
+    )
+
+
+def internal_error(message: str, detail: Optional[Any] = None) -> MLXKError:
+    """Create an internal_error."""
+    return MLXKError(
+        type=ErrorType.INTERNAL_ERROR,
+        message=message,
+        detail=detail,
+        retryable=None  # Unknown if retryable
+    )
+
+
+def access_denied_error(message: str, detail: Optional[str] = None) -> MLXKError:
+    """Create an access_denied error."""
+    return MLXKError(
+        type=ErrorType.ACCESS_DENIED,
+        message=message,
+        detail=detail,
+        retryable=False
+    )
diff --git a/mlxk2/logging.py b/mlxk2/logging.py
new file mode 100644
index 0000000..7a7c5c2
--- /dev/null
+++ b/mlxk2/logging.py
@@ -0,0 +1,281 @@
+"""
+Structured logging for MLX Knife 2.0 (ADR-004).
+
+Provides level-based logging with optional JSON output and sensitive data redaction.
+"""
+
+import json
+import logging
+import os
+import re
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, Optional
+from collections import defaultdict
+
+
+# Redaction patterns (ADR-004 specification)
+TOKEN_PATTERN = re.compile(r'(hf_[a-zA-Z0-9]{30,})', re.IGNORECASE)
+# Redact user-specific paths (home directories)
+HOME_DIR = str(Path.home())
+
+
+class MLXKLogger:
+    """Structured logger with JSON support and redaction (ADR-004).
+
+    Features:
+    - Level-based logging (INFO, WARN, ERROR, DEBUG)
+    - Optional JSON output via MLXK2_LOG_JSON=1
+    - Automatic redaction of HF_TOKEN and user paths
+    - Request correlation via request_id
+    - Error flood rate limiting
+    """
+
+    def __init__(self, name: str = "mlxk2"):
+        self.name = name
+        self.json_mode = os.environ.get("MLXK2_LOG_JSON", "0") == "1"
+        self.verbose = False  # Set by CLI --verbose flag
+
+        # Rate limiting for duplicate errors (ADR-004: max 1/5s)
+        self._error_counts: Dict[str, int] = defaultdict(int)
+        self._error_last_time: Dict[str, float] = {}
+        self._rate_limit_window = 5.0  # seconds
+
+        # Setup Python logging backend
+        self.logger = logging.getLogger(name)
+        self.logger.setLevel(logging.DEBUG)  # Capture all, filter at handler level
+
+        # Clear existing handlers
+        self.logger.handlers.clear()
+
+        # Add handler
+        handler = logging.StreamHandler(sys.stderr)
+        handler.setLevel(logging.DEBUG)
+
+        if not self.json_mode:
+            # Plain text format
+            formatter = logging.Formatter('%(message)s')
+        else:
+            # JSON formatter handles structured output
+            formatter = logging.Formatter('%(message)s')
+
+        handler.setFormatter(formatter)
+        self.logger.addHandler(handler)
+
+        # Don't propagate to root logger
+        self.logger.propagate = False
+
+    def _redact(self, message: str) -> str:
+        """Redact sensitive data from message (ADR-004).
+
+        Redacts:
+        - HF tokens (hf_...)
+        - User home directory paths
+        """
+        # Redact HF tokens
+        message = TOKEN_PATTERN.sub('[REDACTED_TOKEN]', message)
+
+        # Redact home directory
+        if HOME_DIR and HOME_DIR in message:
+            message = message.replace(HOME_DIR, '~')
+
+        return message
+
+    def _should_log_error(self, error_key: str) -> bool:
+        """Check if error should be logged (rate limiting).
+
+        Rate limit: max 1 occurrence per 5 seconds for same error.
+        """
+        now = time.time()
+        last_time = self._error_last_time.get(error_key, 0)
+
+        if now - last_time >= self._rate_limit_window:
+            # Reset counter
+            self._error_counts[error_key] = 1
+            self._error_last_time[error_key] = now
+            return True
+        else:
+            # Increment suppressed count
+            self._error_counts[error_key] += 1
+            return False
+
+    def _format_log(
+        self,
+        level: str,
+        message: str,
+        request_id: Optional[str] = None,
+        **extra: Any
+    ) -> str:
+        """Format log message (plain or JSON)."""
+        # Redact sensitive data
+        message = self._redact(message)
+
+        if self.json_mode:
+            log_entry = {
+                "ts": time.time(),
+                "level": level,
+                "msg": message,
+            }
+            if request_id:
+                log_entry["request_id"] = request_id
+
+            # Add extra fields (route, model, duration_ms, etc.)
+            for key, value in extra.items():
+                if value is not None:
+                    log_entry[key] = value
+
+            return json.dumps(log_entry)
+        else:
+            # Plain text format
+            prefix = f"[{level}]" if level != "INFO" else ""
+            if prefix:
+                return f"{prefix} {message}"
+            return message
+
+    def info(self, message: str, request_id: Optional[str] = None, **extra: Any):
+        """Log INFO level message."""
+        formatted = self._format_log("INFO", message, request_id, **extra)
+        self.logger.info(formatted)
+
+    def warning(self, message: str, request_id: Optional[str] = None, **extra: Any):
+        """Log WARN level message."""
+        formatted = self._format_log("WARN", message, request_id, **extra)
+        self.logger.warning(formatted)
+
+    def error(
+        self,
+        message: str,
+        request_id: Optional[str] = None,
+        error_key: Optional[str] = None,
+        **extra: Any
+    ):
+        """Log ERROR level message with rate limiting.
+
+        Args:
+            message: Error message
+            request_id: Request correlation ID
+            error_key: Key for rate limiting (default: message hash)
+            **extra: Additional structured fields
+        """
+        # Rate limiting
+        key = error_key or message
+        if not self._should_log_error(key):
+            return  # Suppressed
+
+        # Add suppressed count if > 1
+        count = self._error_counts.get(key, 1)
+        if count > 1:
+            extra["suppressed_count"] = count - 1
+
+        formatted = self._format_log("ERROR", message, request_id, **extra)
+        self.logger.error(formatted)
+
+    def debug(self, message: str, request_id: Optional[str] = None, **extra: Any):
+        """Log DEBUG level message (only if --verbose)."""
+        if not self.verbose:
+            return
+        formatted = self._format_log("DEBUG", message, request_id, **extra)
+        self.logger.debug(formatted)
+
+    def set_verbose(self, verbose: bool):
+        """Enable/disable verbose (DEBUG) logging."""
+        self.verbose = verbose
+
+
+# Custom JSON formatter for root logger (external libraries)
+class JSONFormatter(logging.Formatter):
+    """JSON formatter for root logger (captures mlx-lm, transformers, etc.)."""
+
+    def __init__(self):
+        super().__init__()
+        self.json_mode = os.environ.get("MLXK2_LOG_JSON", "0") == "1"
+
+    def format(self, record: logging.LogRecord) -> str:
+        if not self.json_mode:
+            # Plain text fallback
+            return super().format(record)
+
+        # Redact sensitive data
+        message = record.getMessage()
+        message = TOKEN_PATTERN.sub('[REDACTED_TOKEN]', message)
+        if HOME_DIR and HOME_DIR in message:
+            message = message.replace(HOME_DIR, '~')
+
+        log_entry = {
+            "ts": time.time(),
+            "level": record.levelname,
+            "msg": message,
+            "logger": record.name,
+        }
+
+        return json.dumps(log_entry)
+
+
+# Global logger instance
+_logger: Optional[MLXKLogger] = None
+_root_logger_configured = False
+
+
+def get_logger() -> MLXKLogger:
+    """Get global MLXKLogger instance."""
+    global _logger
+    if _logger is None:
+        _logger = MLXKLogger()
+        _configure_root_logger()
+    return _logger
+
+
+def _configure_root_logger():
+    """Configure root logger to use JSON format when MLXK2_LOG_JSON=1.
+
+    This captures logs from external libraries (mlx-lm, transformers, etc.)
+    and ensures consistent JSON output.
+    """
+    global _root_logger_configured
+    if _root_logger_configured:
+        return
+
+    json_mode = os.environ.get("MLXK2_LOG_JSON", "0") == "1"
+
+    if json_mode:
+        # Configure root logger for JSON output
+        root_logger = logging.getLogger()
+        root_logger.setLevel(logging.INFO)  # Capture INFO and above from external libs
+
+        # Clear existing handlers to avoid duplicates
+        root_logger.handlers.clear()
+
+        # Add JSON handler
+        handler = logging.StreamHandler(sys.stderr)
+        handler.setFormatter(JSONFormatter())
+        root_logger.addHandler(handler)
+
+    _root_logger_configured = True
+
+
+def set_verbose(verbose: bool):
+    """Set verbose mode globally."""
+    get_logger().set_verbose(verbose)
+
+
+def set_log_level(level: str):
+    """Set log level globally for MLXKLogger and root logger.
+
+    Args:
+        level: Log level string (debug, info, warning, error)
+    """
+    level_upper = level.upper()
+    log_level = getattr(logging, level_upper, logging.INFO)
+
+    # Set MLXKLogger level
+    logger = get_logger()
+    logger.logger.setLevel(log_level)
+    for handler in logger.logger.handlers:
+        handler.setLevel(log_level)
+
+    # Set root logger level (for external libraries)
+    root_logger = logging.getLogger()
+    root_logger.setLevel(log_level)
+    for handler in root_logger.handlers:
+        handler.setLevel(log_level)
diff --git a/mlxk2/operations/__init__.py b/mlxk2/operations/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/mlxk2/operations/clone.py b/mlxk2/operations/clone.py
new file mode 100644
index 0000000..f7b1c79
--- /dev/null
+++ b/mlxk2/operations/clone.py
@@ -0,0 +1,341 @@
+"""Clone operation for MLX Knife 2.0.
+
+Implements ADR-007 Phase 1: Same-Volume APFS Clone strategy.
+
+This implementation:
+1. Validates cache and workspace both on same APFS volume
+2. Creates isolated temp cache on same volume as workspace
+3. Pulls model to temp cache (isolated from user cache)
+4. APFS clones temp cache → workspace (instant, zero space initially)
+5. Deletes temp cache (cleanup)
+
+User cache is NEVER touched - only temp cache is used and cleaned up.
+"""
+
+import logging
+import os
+import random
+import re
+import shutil
+import subprocess
+import time
+from pathlib import Path
+from typing import Optional, Dict, Any
+
+from .pull import pull_to_cache
+from ..core.cache import hf_to_cache_dir, get_current_cache_root
+
+logger = logging.getLogger(__name__)
+
+
+def clone_operation(model_spec: str, target_dir: str, health_check: bool = True) -> Dict[str, Any]:
+    """Clone operation following ADR-007 Phase 1: Same-Volume APFS strategy.
+
+    Args:
+        model_spec: Model specification (org/repo[@revision])
+        target_dir: Target directory for workspace
+        health_check: Whether to run health check before copy (default: True)
+
+    Returns:
+        JSON response following API 0.1.4 schema
+    """
+    result = {
+        "status": "success",
+        "command": "clone",
+        "error": None,
+        "data": {
+            "model": model_spec,
+            "clone_status": "unknown",
+            "message": "",
+            "target_dir": str(Path(target_dir).resolve()),
+            "health_check": health_check
+        }
+    }
+
+    temp_cache = None  # Initialize for cleanup in finally block
+
+    try:
+        # Validate target directory
+        target_path = Path(target_dir).resolve()
+        result["data"]["target_dir"] = str(target_path)
+
+        # Check if target exists and is not empty
+        if target_path.exists():
+            if not target_path.is_dir():
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "InvalidTargetError",
+                    "message": f"Target '{target_dir}' exists but is not a directory"
+                }
+                result["data"]["clone_status"] = "error"
+                return result
+
+            # Check if directory is empty
+            if any(target_path.iterdir()):
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "InvalidTargetError",
+                    "message": f"Target directory '{target_dir}' is not empty"
+                }
+                result["data"]["clone_status"] = "error"
+                return result
+
+        # Phase 1: Validate APFS requirement (ADR-007)
+        try:
+            _validate_apfs_filesystem(target_path.parent)
+        except FilesystemError as e:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "FilesystemError",
+                "message": str(e)
+            }
+            result["data"]["clone_status"] = "filesystem_error"
+            return result
+
+        # Phase 1b: Validate same-volume requirement (ADR-007)
+        try:
+            _validate_same_volume(target_path.parent)
+        except FilesystemError as e:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "FilesystemError",
+                "message": str(e)
+            }
+            result["data"]["clone_status"] = "filesystem_error"
+            return result
+
+        # Phase 2: Create temp cache on same volume as workspace
+        result["data"]["clone_status"] = "preparing"
+        temp_cache = _create_temp_cache_same_volume(target_path)
+
+        try:
+            # Phase 3: Pull to isolated temp cache (no HF_HOME patching needed)
+            result["data"]["clone_status"] = "pulling"
+            pull_result = pull_to_cache(model_spec, temp_cache)
+
+            if pull_result["status"] != "success":
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "PullFailedError",
+                    "message": f"Pull operation failed: {pull_result.get('error', {}).get('message', 'Unknown error')}"
+                }
+                result["data"]["clone_status"] = "pull_failed"
+                return result
+
+            # Extract resolved model name from pull result
+            resolved_model = pull_result["data"]["model"]
+            result["data"]["model"] = resolved_model
+
+            # Phase 4: Resolve temp cache snapshot path
+            temp_snapshot = _resolve_latest_snapshot(temp_cache, resolved_model)
+            if not temp_snapshot or not temp_snapshot.exists():
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "CacheNotFoundError",
+                    "message": f"Temp cache snapshot not found for model '{resolved_model}'"
+                }
+                result["data"]["clone_status"] = "cache_not_found"
+                return result
+
+            # Phase 5: Optional health check on temp cache
+            if health_check:
+                result["data"]["clone_status"] = "health_checking"
+                # Use health_from_cache for proper isolation
+                from .health import health_from_cache
+                healthy, health_message = health_from_cache(model_spec, temp_cache)
+                if not healthy:
+                    result["status"] = "error"
+                    result["error"] = {
+                        "type": "ModelUnhealthyError",
+                        "message": f"Model failed health check: {health_message}"
+                    }
+                    result["data"]["clone_status"] = "health_check_failed"
+                    return result
+
+            # Phase 6: APFS clone temp cache → workspace (instant, CoW)
+            result["data"]["clone_status"] = "cloning"
+            target_path.mkdir(parents=True, exist_ok=True)
+            clone_success = _apfs_clone_directory(temp_snapshot, target_path)
+
+            if not clone_success:
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "CloneFailedError",
+                    "message": "APFS clone operation failed"
+                }
+                result["data"]["clone_status"] = "filesystem_error"
+                return result
+
+            # Success - temp cache auto-cleanup via finally block
+            result["data"]["clone_status"] = "success"
+            result["data"]["message"] = f"Cloned to {target_dir}"
+
+        finally:
+            # Phase 7: Cleanup temp cache (always) - with safety check
+            if temp_cache and temp_cache.exists():
+                _cleanup_temp_cache_safe(temp_cache)
+
+    except Exception as e:
+        result["status"] = "error"
+        result["error"] = {
+            "type": "CloneOperationError",
+            "message": str(e)
+        }
+        result["data"]["clone_status"] = "error"
+
+    return result
+
+
+def _validate_apfs_filesystem(path: Path) -> None:
+    """Validate APFS requirement for clone operations.
+
+    Called lazily - only on first clone operation, not at CLI startup.
+    """
+    if not _is_apfs_filesystem(path):
+        raise FilesystemError(
+            f"APFS required for clone operations. "
+            f"Path: {path}\n"
+            f"Solution: Use APFS volume or external APFS SSD."
+        )
+
+
+def _validate_same_volume(workspace_path: Path) -> None:
+    """Validate that workspace and HF_HOME cache are on same volume (ADR-007 Phase 1)."""
+    cache_root = get_current_cache_root()
+
+    # Get volume mount points for both paths
+    workspace_volume = _get_volume_mount_point(workspace_path)
+    cache_volume = _get_volume_mount_point(cache_root)
+
+    if workspace_volume != cache_volume:
+        raise FilesystemError(
+            f"Phase 1 requires workspace and cache on same volume.\n"
+            f"Workspace volume: {workspace_volume}\n"
+            f"Cache volume (HF_HOME): {cache_volume}\n"
+            f"Solution: Set HF_HOME to same volume as workspace:\n"
+            f"  export HF_HOME={workspace_volume}/huggingface/cache"
+        )
+
+
+def _is_apfs_filesystem(path: Path) -> bool:
+    """Simple APFS check - returns True/False only.
+
+    Used by both clone (validation) and push (conditional warning).
+    """
+    try:
+        # Use mount command to check filesystem type on macOS
+        result = subprocess.run(['mount'], capture_output=True, text=True)
+        abs_path = str(path.resolve())
+
+        # Regex pattern for mount lines: device on mountpoint (fstype, options...)
+        mount_pattern = r'^(.+?) on (.+?) \(([^,]+),'
+
+        for line in result.stdout.strip().split('\n'):
+            match = re.match(mount_pattern, line)
+            if match:
+                device, mountpoint, fstype = match.groups()
+
+                # Check if our path is under this mountpoint
+                if abs_path.startswith(mountpoint + '/') or abs_path == mountpoint:
+                    return fstype == 'apfs'
+
+        return False  # No matching mount found
+    except (subprocess.CalledProcessError, re.error):
+        return False  # Safe fallback
+
+
+def _create_temp_cache_same_volume(target_workspace: Path) -> Path:
+    """Create temp cache on same APFS volume as target for CoW optimization."""
+    # Get target volume mount point via st_dev
+    target_volume = _get_volume_mount_point(target_workspace)
+
+    # Create temp cache on same volume
+    temp_cache = target_volume / f".mlxk2_temp_{os.getpid()}_{random.randint(1000,9999)}"
+    temp_cache.mkdir(parents=True)
+
+    # SAFETY: Create sentinel file to prevent accidental user cache deletion
+    sentinel = temp_cache / ".mlxk2_temp_cache_sentinel"
+    sentinel.write_text(f"mlxk2_temp_cache_created_{int(time.time())}")
+
+    return temp_cache
+
+
+def _get_volume_mount_point(path: Path) -> Path:
+    """Find mount point (volume root) for given path via st_dev changes."""
+    abs_path = path.resolve()
+    current = abs_path
+
+    while current != current.parent:
+        try:
+            parent_stat = current.parent.stat()
+            current_stat = current.stat()
+
+            # Different st_dev = mount boundary
+            if parent_stat.st_dev != current_stat.st_dev:
+                return current
+        except (OSError, PermissionError):
+            pass
+        current = current.parent
+
+    return current  # Filesystem root
+
+
+
+
+def _resolve_latest_snapshot(temp_cache: Path, model_name: str) -> Optional[Path]:
+    """Resolve the latest snapshot directory for a model in temp cache."""
+    try:
+        cache_dir = temp_cache / hf_to_cache_dir(model_name)
+
+        if not cache_dir.exists():
+            return None
+
+        snapshots_dir = cache_dir / "snapshots"
+        if not snapshots_dir.exists():
+            return None
+
+        snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+        if not snapshots:
+            return None
+
+        # Return latest snapshot by modification time
+        latest_snapshot = max(snapshots, key=lambda x: x.stat().st_mtime)
+        return latest_snapshot
+
+    except Exception:
+        return None
+
+
+def _apfs_clone_directory(source: Path, target: Path) -> bool:
+    """Clone directory using APFS copy-on-write via clonefile."""
+    try:
+        for item in source.rglob("*"):
+            if item.is_file():
+                relative_path = item.relative_to(source)
+                target_file = target / relative_path
+                target_file.parent.mkdir(parents=True, exist_ok=True)
+
+                # Use cp -c for clonefile (APFS CoW)
+                subprocess.run(['cp', '-c', str(item), str(target_file)],
+                             check=True, capture_output=True)
+        return True
+
+    except subprocess.CalledProcessError:
+        return False
+
+
+def _cleanup_temp_cache_safe(temp_cache: Path) -> bool:
+    """Safely delete temp cache only if sentinel exists."""
+    # SAFETY: Only delete if sentinel exists
+    sentinel = temp_cache / ".mlxk2_temp_cache_sentinel"
+    if not sentinel.exists():
+        logger.warning(f"Refusing to delete {temp_cache} - no sentinel found")
+        return False
+
+    shutil.rmtree(temp_cache, ignore_errors=True)
+    return True
+
+
+class FilesystemError(Exception):
+    """Raised when filesystem requirements are not met."""
+    pass
\ No newline at end of file
diff --git a/mlxk2/operations/common.py b/mlxk2/operations/common.py
new file mode 100644
index 0000000..572ea6a
--- /dev/null
+++ b/mlxk2/operations/common.py
@@ -0,0 +1,284 @@
+"""Common helpers for model metadata detection (2.0).
+
+Lenient framework/type detection for Issue #31 port:
+- Prefer MLX for mlx-community/* or when README front-matter indicates MLX.
+- Detect chat type via name, config, or tokenizer chat_template hints.
+
+Parsing is intentionally lightweight (no YAML dependency). Front-matter is
+parsed from the first '---' block in README.md when present.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Dict, Optional
+import json as _json
+
+
+@dataclass
+class FrontMatter:
+    tags: list[str]
+    library_name: Optional[str]
+
+
+def read_front_matter(root: Path) -> Optional[FrontMatter]:
+    """Best-effort parse of README.md YAML-like front matter.
+
+    Supports:
+    - Inline list: tags: [mlx, chat]
+    - Block list:
+        tags:
+          - mlx
+          - chat
+    - library_name: mlx
+    Returns None if README.md or front-matter block missing.
+    """
+    try:
+        readme = root / "README.md"
+        if not readme.exists() or not readme.is_file():
+            return None
+        lines = readme.read_text(encoding="utf-8", errors="ignore").splitlines()
+        if not lines or lines[0].strip() != "---":
+            return None
+        # Extract the first front-matter block
+        block: list[str] = []
+        for line in lines[1:]:
+            if line.strip() == "---":
+                break
+            block.append(line.rstrip("\n"))
+        if not block:
+            return None
+
+        tags: list[str] = []
+        library_name: Optional[str] = None
+
+        # Simple state machine for tags block list
+        in_tags_block = False
+        for raw in block:
+            s = raw.strip()
+            if not s:
+                continue
+            # library_name: value
+            if s.lower().startswith("library_name:"):
+                try:
+                    library_name = s.split(":", 1)[1].strip().strip('"\'')
+                except Exception:
+                    pass
+                in_tags_block = False
+                continue
+
+            # tags: [a, b]
+            if s.lower().startswith("tags:") and "[" in s and "]" in s:
+                try:
+                    inside = s.split("[", 1)[1].rsplit("]", 1)[0]
+                    parts = [p.strip().strip('"\'') for p in inside.split(",") if p.strip()]
+                    tags.extend([p for p in parts if p])
+                except Exception:
+                    pass
+                in_tags_block = False
+                continue
+
+            # tags: (start of block list)
+            if s.lower().startswith("tags:"):
+                in_tags_block = True
+                continue
+
+            if in_tags_block:
+                # Expect lines like "- mlx"
+                try:
+                    if s.startswith("-"):
+                        val = s.lstrip("-").strip().strip('"\'')
+                        if val:
+                            tags.append(val)
+                    else:
+                        # Any other non-dash line ends the block
+                        in_tags_block = False
+                except Exception:
+                    pass
+
+        return FrontMatter(tags=tags, library_name=library_name)
+    except Exception:
+        return None
+
+
+def read_tokenizer_hints(root: Path) -> Dict[str, Any]:
+    """Extract lightweight tokenizer hints (e.g., chat_template presence)."""
+    hints: Dict[str, Any] = {"chat_template": None}
+    try:
+        for fname in ("tokenizer_config.json", "tokenizer.json"):
+            fp = root / fname
+            if fp.exists() and fp.is_file():
+                try:
+                    obj = _json.loads(fp.read_text(encoding="utf-8", errors="ignore"))
+                except Exception:
+                    obj = None
+                if isinstance(obj, dict):
+                    ct = obj.get("chat_template")
+                    if isinstance(ct, str) and ct.strip():
+                        hints["chat_template"] = ct
+                        break
+    except Exception:
+        pass
+    return hints
+
+
+def _has_any(path: Path, patterns: tuple[str, ...]) -> bool:
+    try:
+        for pat in patterns:
+            if any(path.glob(pat)):
+                return True
+    except Exception:
+        return False
+    return False
+
+
+def detect_framework(hf_name: str, model_root: Path, selected_path: Optional[Path] = None, fm: Optional[FrontMatter] = None) -> str:
+    """Lenient framework detection.
+
+    MLX if:
+    - org is mlx-community/*, or
+    - README front-matter tags include 'mlx', or
+    - README front-matter library_name == 'mlx'.
+
+    Else GGUF if any *.gguf present under selected_path or snapshots.
+    Else PyTorch if any *.safetensors or pytorch_model.bin present under snapshots.
+    Else Unknown.
+    """
+    try:
+        if "mlx-community/" in hf_name:
+            return "MLX"
+
+        # Front-matter signals
+        if fm is not None:
+            tags = [t.lower() for t in (fm.tags or [])]
+            lib = (fm.library_name or "").lower()
+            if "mlx" in tags or lib == "mlx":
+                return "MLX"
+
+        # Search location preference: selected snapshot, else model root
+        root = selected_path if selected_path is not None else model_root
+
+        if _has_any(root, ("**/*.gguf",)):
+            return "GGUF"
+
+        # Look under snapshots for common formats
+        snapshots_dir = model_root / "snapshots"
+        if _has_any(snapshots_dir, ("**/*.safetensors", "**/pytorch_model.bin")):
+            return "PyTorch"
+    except Exception:
+        pass
+    return "Unknown"
+
+
+def detect_model_type(hf_name: str, config: Optional[Dict[str, Any]], tok_hints: Dict[str, Any]) -> str:
+    name = hf_name.lower()
+    if "embed" in name:
+        return "embedding"
+    if (config or {}).get("model_type") == "chat":
+        return "chat"
+    ct = tok_hints.get("chat_template")
+    if isinstance(ct, str) and ct.strip():
+        return "chat"
+    if "instruct" in name or "chat" in name:
+        return "chat"
+    return "base"
+
+
+def detect_capabilities(model_type: str, hf_name: str, tok_hints: Dict[str, Any], config: Optional[Dict[str, Any]]) -> list[str]:
+    if model_type == "embedding":
+        return ["embeddings"]
+    caps = ["text-generation"]
+    name = hf_name.lower()
+    ct = tok_hints.get("chat_template")
+    if model_type == "chat" or "instruct" in name or "chat" in name or (isinstance(ct, str) and ct.strip()):
+        caps.append("chat")
+    return caps
+
+
+def _iso8601_utc_from_mtime(p: Path) -> str:
+    try:
+        from datetime import datetime
+        return datetime.fromtimestamp(p.stat().st_mtime).strftime("%Y-%m-%dT%H:%M:%SZ")
+    except Exception:
+        return "1970-01-01T00:00:00Z"
+
+
+def _total_size_bytes(path: Path) -> int:
+    try:
+        total = 0
+        for f in path.rglob("*"):
+            if f.is_file():
+                total += f.stat().st_size
+        return total
+    except Exception:
+        return 0
+
+
+def _load_config_json(path: Path) -> Optional[Dict[str, Any]]:
+    try:
+        fp = path / "config.json"
+        if fp.exists():
+            return _json.loads(fp.read_text(encoding="utf-8", errors="ignore"))
+    except Exception:
+        pass
+    return None
+
+
+def build_model_object(hf_name: str, model_root: Path, selected_path: Optional[Path]) -> Dict[str, Any]:
+    """Build the common model object for list/show using unified detection.
+
+    selected_path: points at the chosen snapshot directory when available; otherwise
+    may be the model_root. Commit hash is taken from selected_path.name if it looks
+    like a 40-char hex string, else None.
+    """
+    from ..operations.health import is_model_healthy, check_runtime_compatibility  # local import to avoid cycle
+
+    # Compute commit hash if selected path is a snapshot dir
+    commit_hash: Optional[str] = None
+    if selected_path is not None:
+        name = selected_path.name
+        if len(name) == 40 and all(c in "0123456789abcdef" for c in name.lower()):
+            commit_hash = name
+
+    # Read hints from selected snapshot if possible; fall back to model root
+    probe = selected_path if selected_path is not None else model_root
+    fm = read_front_matter(probe)
+    tok = read_tokenizer_hints(probe)
+    config = _load_config_json(probe)
+
+    framework = detect_framework(hf_name, model_root, selected_path=selected_path, fm=fm)
+    model_type = detect_model_type(hf_name, config, tok)
+    capabilities = detect_capabilities(model_type, hf_name, tok, config)
+
+    # Health: rely on existing operation (name-based)
+    healthy, health_reason = is_model_healthy(hf_name)
+
+    # Runtime compatibility: ALWAYS computed (gate logic applies)
+    # Gate: Only check runtime if file integrity is healthy
+    if healthy:
+        runtime_compatible, runtime_reason = check_runtime_compatibility(probe, framework)
+    else:
+        # File integrity failed → skip runtime check
+        runtime_compatible = False
+        runtime_reason = None  # health_reason takes precedence
+
+    # Reason field: First problem encountered (health → runtime)
+    reason = health_reason if not healthy else runtime_reason
+
+    # Size/Modified computed from selected path (snapshot preferred)
+    base = selected_path if selected_path is not None else model_root
+    model_obj = {
+        "name": hf_name,
+        "hash": commit_hash,
+        "size_bytes": _total_size_bytes(base),
+        "last_modified": _iso8601_utc_from_mtime(base),
+        "framework": framework,
+        "model_type": model_type,
+        "capabilities": capabilities,
+        "health": "healthy" if healthy else "unhealthy",
+        "runtime_compatible": runtime_compatible,
+        "reason": reason,
+        "cached": True,
+    }
+    return model_obj
diff --git a/mlxk2/operations/health.py b/mlxk2/operations/health.py
new file mode 100644
index 0000000..5df4d8f
--- /dev/null
+++ b/mlxk2/operations/health.py
@@ -0,0 +1,433 @@
+import json
+import logging
+from pathlib import Path
+from typing import Tuple, Optional
+from ..core.cache import get_current_model_cache, hf_to_cache_dir, cache_dir_to_hf
+from ..core.model_resolution import resolve_model_for_operation
+
+
+def is_model_healthy(model_spec):
+    """Framework-agnostic health check accepting model names like 1.1.0."""
+    from ..core.model_resolution import resolve_model_for_operation
+    
+    # Resolve model name to get actual cache directory
+    resolved_name, commit_hash, ambiguous_matches = resolve_model_for_operation(model_spec)
+    
+    if ambiguous_matches or not resolved_name:
+        return False, "Could not resolve model spec"
+    
+    # Get the model cache directory (models--namespace--name)
+    model_cache = get_current_model_cache()
+    model_cache_dir = model_cache / hf_to_cache_dir(resolved_name)
+    if not model_cache_dir.exists():
+        return False, "Model not in cache"
+    
+    # Find the appropriate snapshot to check
+    snapshots_dir = model_cache_dir / "snapshots"
+    if not snapshots_dir.exists():
+        return False, "No snapshots directory found"
+    
+    snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+    if not snapshots:
+        return False, "No snapshots found"
+    
+    # Use specific hash if provided, otherwise latest snapshot
+    if commit_hash:
+        model_path = snapshots_dir / commit_hash
+        if not model_path.exists():
+            return False, f"Specific hash {commit_hash} not found"
+    else:
+        model_path = max(snapshots, key=lambda x: x.stat().st_mtime)
+    
+    # Now do the actual health check on the snapshot
+    return _check_snapshot_health(model_path)
+
+
+def _check_snapshot_health(model_path):
+    """Check health of a specific snapshot directory.
+
+    Rules (Issue #27 parity):
+    - If a multi-file safetensors index exists (model.safetensors.index.json),
+      ALL referenced shard files must exist and be non-empty, and none may be LFS pointers.
+      A subset must NOT be marked healthy.
+    - Without an index, require at least one weight file present and non-empty,
+      and ensure none are LFS pointers.
+    """
+    if not model_path.exists():
+        return False, "Model path does not exist"
+    
+    # Check config.json
+    config_path = model_path / "config.json"
+    if not config_path.exists():
+        return False, "config.json missing"
+    
+    try:
+        with open(config_path) as f:
+            config_data = json.load(f)
+        if not isinstance(config_data, dict) or len(config_data) == 0:
+            return False, "config.json is empty or invalid"
+    except (OSError, json.JSONDecodeError):
+        return False, "config.json contains invalid JSON"
+    
+    # Prefer safetensors index; else fall back to PyTorch index
+    sft_index = model_path / "model.safetensors.index.json"
+    pt_index = model_path / "pytorch_model.bin.index.json"
+    has_sft_files = any(model_path.rglob("*.safetensors"))
+    has_bin_files = any(model_path.rglob("*.bin"))
+
+    chosen_index = None
+    if sft_index.exists() and has_sft_files:
+        chosen_index = ("sft", sft_index)
+    elif pt_index.exists() and has_bin_files:
+        chosen_index = ("pt", pt_index)
+
+    if chosen_index is not None:
+        kind, index_file = chosen_index
+        try:
+            with open(index_file) as f:
+                index = json.load(f)
+            weight_map = index.get('weight_map') or {}
+            if not isinstance(weight_map, dict) or not weight_map:
+                return False, "Empty or invalid weight_map in index"
+            referenced_files = sorted(set(weight_map.values()))
+            missing = [rf for rf in referenced_files if not (model_path / rf).exists()]
+            if missing:
+                return False, f"Missing weight shards: {', '.join(missing)}"
+            empty = [rf for rf in referenced_files if (model_path / rf).stat().st_size == 0]
+            if empty:
+                return False, f"Empty weight shards: {', '.join(empty)}"
+            # LFS pointer check on referenced files
+            lfs_bad = []
+            for rf in referenced_files:
+                fp = (model_path / rf)
+                if fp.is_file() and fp.stat().st_size < 200:
+                    try:
+                        with open(fp, 'rb') as f:
+                            header = f.read(100)
+                            if b'version https://git-lfs.github.com/spec/v1' in header:
+                                lfs_bad.append(rf)
+                    except Exception:
+                        pass
+            if lfs_bad:
+                return False, f"LFS pointers instead of files: {', '.join(lfs_bad)}"
+            return True, "Multi-file model complete"
+        except (OSError, json.JSONDecodeError):
+            return False, "Invalid index file"
+
+    # No index: Check weight files (supports common formats)
+    weight_files = (
+        list(model_path.glob("*.safetensors")) +
+        list(model_path.glob("*.bin")) +
+        list(model_path.glob("*.gguf"))
+    )
+    if not weight_files:
+        weight_files = (
+            list(model_path.glob("**/*.safetensors")) +
+            list(model_path.glob("**/*.bin")) +
+            list(model_path.glob("**/*.gguf"))
+        )
+    # Pattern-based completeness (no index): model-XXXXX-of-YYYYY.safetensors
+    # If such shards are present, require full set to be present and non-empty
+    if weight_files:
+        import re
+        shard_regex = re.compile(r"model-(\d{5})-of-(\d{5})\.safetensors$")
+        shards = []
+        for f in weight_files:
+            m = shard_regex.search(f.name)
+            if m:
+                idx = int(m.group(1))
+                total = int(m.group(2))
+                shards.append((idx, total, f))
+        if shards:
+            totals = {t for (_, t, _) in shards}
+            if len(totals) != 1:
+                return False, "Inconsistent shard totals detected"
+            expected_total = next(iter(totals))
+            present_indices = {i for (i, _, _) in shards}
+            missing_indices = [i for i in range(1, expected_total + 1) if i not in present_indices]
+            if missing_indices:
+                return False, f"Missing shards by pattern: {len(present_indices)}/{expected_total} present"
+            empties = [f.name for (_, _, f) in shards if f.stat().st_size == 0]
+            if empties:
+                return False, f"Empty shards: {', '.join(empties)}"
+    if not weight_files:
+        return False, "No model weights found"
+
+    # Partial download markers → unhealthy
+    for fp in model_path.rglob("*"):
+        if fp.is_file():
+            name = fp.name.lower()
+            if name.endswith('.partial') or name.endswith('.tmp') or 'partial' in name:
+                return False, "Partial download marker detected"
+
+    # Ensure files are non-empty
+    if any(f.stat().st_size == 0 for f in weight_files):
+        empties = [f.name for f in weight_files if f.stat().st_size == 0]
+        return False, f"Empty weight files: {', '.join(empties)}"
+
+    # Pattern-based completeness (no index): model-XXXXX-of-YYYYY.safetensors
+    # If such shards are present but no index, mark unhealthy (index required for sharded models)
+    import re
+    shard_regex = re.compile(r"model-(\d{5})-of-(\d{5})\.safetensors$")
+    shards = []
+    for f in weight_files:
+        m = shard_regex.search(f.name)
+        if m:
+            idx = int(m.group(1))
+            total = int(m.group(2))
+            shards.append((idx, total, f))
+    if shards:
+        totals = {t for (_, t, _) in shards}
+        if len(totals) != 1:
+            return False, "Inconsistent shard totals detected"
+        expected_total = next(iter(totals))
+        present_indices = {i for (i, _, _) in shards}
+        missing_indices = [i for i in range(1, expected_total + 1) if i not in present_indices]
+        if missing_indices:
+            return False, f"Missing shards by pattern: {len(present_indices)}/{expected_total} present"
+        # Even if complete by pattern, absence of index is unhealthy (robust policy)
+        return False, "Safetensors index missing for sharded model"
+
+    # LFS pointer scan (recursive simplified)
+    lfs_ok, lfs_msg = check_lfs_corruption(model_path)
+    if not lfs_ok:
+        return False, lfs_msg
+
+    return True, "Model is healthy"
+
+
+def check_lfs_corruption(model_path):
+    """Check for Git LFS pointer files instead of actual model files (recursive)."""
+    corrupted_files = []
+    for file_path in model_path.rglob("*"):
+        if file_path.is_file() and file_path.stat().st_size < 200:
+            try:
+                with open(file_path, 'rb') as f:
+                    header = f.read(100)
+                    if b'version https://git-lfs.github.com/spec/v1' in header:
+                        corrupted_files.append(str(file_path.relative_to(model_path)))
+            except Exception:
+                pass
+    
+    if corrupted_files:
+        return False, f"LFS pointers instead of files: {', '.join(corrupted_files)}"
+    return True, "No LFS corruption detected"
+
+
+def health_from_cache(model_spec, cache_dir):
+    """Health check for a specific model in a specific cache directory.
+
+    This is used by clone operations to check model health in temporary caches
+    without contaminating the user's main cache. Uses the full _check_snapshot_health()
+    logic to ensure identical health validation standards.
+
+    Args:
+        model_spec: Model name/spec to check (e.g., "microsoft/DialoGPT-small")
+        cache_dir: Path to the cache directory containing the model
+
+    Returns:
+        (bool, str): (is_healthy, reason_message)
+    """
+    from pathlib import Path
+    from ..core.cache import hf_to_cache_dir
+
+    cache_path = Path(cache_dir)
+
+    # Convert model spec to cache directory format
+    model_cache_dir = cache_path / hf_to_cache_dir(model_spec)
+    if not model_cache_dir.exists():
+        return False, "Model not in cache"
+
+    # Find the appropriate snapshot to check
+    snapshots_dir = model_cache_dir / "snapshots"
+    if not snapshots_dir.exists():
+        return False, "No snapshots directory found"
+
+    snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+    if not snapshots:
+        return False, "No snapshots found"
+
+    # Use the latest snapshot (by modification time)
+    model_path = max(snapshots, key=lambda x: x.stat().st_mtime)
+
+    # Use the same health check logic as regular health operations
+    return _check_snapshot_health(model_path)
+
+
+def check_runtime_compatibility(model_path: Path, framework: str) -> Tuple[bool, Optional[str]]:
+    """Check if model is executable with mlx-lm.
+
+    Gate logic:
+    1. Framework must be "MLX" (GGUF/PyTorch → incompatible)
+    2. Weight files must use mlx-lm compatible naming (not legacy formats)
+    3. model_type must be supported by current mlx-lm version
+
+    Returns:
+        (is_compatible, reason): reason is None if compatible, error message otherwise
+    """
+    # Gate 1: Framework check
+    if framework != "MLX":
+        return False, f"Incompatible: {framework}"
+
+    # Gate 2: Weight file format check (legacy format detection)
+    # mlx-lm only accepts:
+    # - model.safetensors (single file)
+    # - model-XXXXX-of-YYYYY.safetensors (sharded, with index)
+    # Legacy formats are rejected: weights.*.safetensors, pytorch_model-*.safetensors
+    import re
+
+    # Check for legacy weight file patterns
+    legacy_patterns = [
+        re.compile(r'^weights\.\d+\.safetensors$'),  # weights.00.safetensors
+        re.compile(r'^pytorch_model-\d+\.safetensors$'),  # pytorch_model-00001.safetensors
+    ]
+
+    # Check for valid mlx-lm weight file patterns
+    valid_patterns = [
+        re.compile(r'^model\.safetensors$'),  # Single file
+        re.compile(r'^model-\d{5}-of-\d{5}\.safetensors$'),  # Sharded
+    ]
+
+    weight_files = list(model_path.glob("*.safetensors"))
+    if weight_files:
+        has_valid = any(
+            any(pattern.match(f.name) for pattern in valid_patterns)
+            for f in weight_files
+        )
+        has_legacy = any(
+            any(pattern.match(f.name) for pattern in legacy_patterns)
+            for f in weight_files
+        )
+
+        if has_legacy and not has_valid:
+            # Found only legacy format files, no valid mlx-lm files
+            return False, "Legacy format not supported by mlx-lm"
+
+    # Gate 3: model_type support check via mlx-lm
+    config_path = model_path / "config.json"
+    if not config_path.exists():
+        return False, "config.json missing (required for model_type detection)"
+
+    try:
+        with open(config_path) as f:
+            config = json.load(f)
+        model_type = config.get("model_type")
+        if not model_type:
+            return False, "config.json missing model_type field"
+    except (OSError, json.JSONDecodeError) as e:
+        return False, f"Failed to read config.json: {e}"
+
+    # Check if mlx-lm supports this model_type
+    try:
+        # Suppress mlx-lm's ERROR logs during detection
+        # mlx-lm uses root logger, so we need to suppress both mlx_lm and root
+        mlx_logger = logging.getLogger("mlx_lm")
+        root_logger = logging.getLogger()
+        original_mlx_level = mlx_logger.level
+        original_root_level = root_logger.level
+        mlx_logger.setLevel(logging.CRITICAL)
+        root_logger.setLevel(logging.CRITICAL)
+
+        try:
+            # Try mlx-lm >= 0.28.0 API first (mlx_lm.models.base._get_classes)
+            try:
+                from mlx_lm.models.base import _get_classes
+                model_class, _ = _get_classes(config=config, model_config=config)
+            except ImportError:
+                # Fall back to mlx-lm 0.27.x API (mlx_lm.utils._get_classes)
+                from mlx_lm.utils import _get_classes
+                model_class, _ = _get_classes(config)
+
+            if model_class is None:
+                return False, f"model_type '{model_type}' not supported by mlx-lm"
+
+            return True, None
+        finally:
+            mlx_logger.setLevel(original_mlx_level)
+            root_logger.setLevel(original_root_level)
+
+    except Exception as e:
+        # Pass through the actual error for debugging
+        return False, str(e) if str(e) else "Runtime check failed"
+
+
+def health_check_operation(model_pattern=None):
+    """Health check operation for JSON API with model resolution support."""
+    result = {
+        "status": "success",
+        "command": "health",
+        "error": None,
+        "data": {
+            "healthy": [],
+            "unhealthy": [],
+            "summary": {
+                "total": 0,
+                "healthy_count": 0,
+                "unhealthy_count": 0
+            }
+        }
+    }
+    
+    try:
+        model_cache = get_current_model_cache()
+        if not model_cache.exists():
+            result["data"]["summary"]["total"] = 0
+            return result
+        
+        # Use model resolution if specific pattern provided
+        if model_pattern:
+            resolved_name, commit_hash, ambiguous_matches = resolve_model_for_operation(model_pattern)
+            
+            if ambiguous_matches:
+                # Multiple matches - let user choose
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "ambiguous_match",
+                    "message": f"Multiple models match '{model_pattern}'",
+                    "matches": ambiguous_matches
+                }
+                return result
+            elif not resolved_name:
+                # No matches found
+                result["data"]["summary"]["total"] = 0
+                return result
+            else:
+                # Single match found - check just this model
+                model_cache_dir = model_cache / hf_to_cache_dir(resolved_name)
+                if model_cache_dir.exists():
+                    models_to_check = [model_cache_dir]
+                else:
+                    models_to_check = []
+        else:
+            # No pattern - check all models
+            models_to_check = [d for d in model_cache.iterdir() if d.name.startswith("models--")]
+        
+        result["data"]["summary"]["total"] = len(models_to_check)
+        
+        for model_dir in sorted(models_to_check, key=lambda x: x.name):
+            hf_name = cache_dir_to_hf(model_dir.name)
+            
+            # Use the new flexible health check
+            healthy, reason = is_model_healthy(hf_name)
+            
+            model_info = {
+                "name": hf_name,
+                "status": "healthy" if healthy else "unhealthy", 
+                "reason": reason
+            }
+            
+            if healthy:
+                result["data"]["healthy"].append(model_info)
+                result["data"]["summary"]["healthy_count"] += 1
+            else:
+                result["data"]["unhealthy"].append(model_info)
+                result["data"]["summary"]["unhealthy_count"] += 1
+    
+    except Exception as e:
+        result["status"] = "error"
+        result["error"] = {
+            "type": "health_check_failed",
+            "message": str(e)
+        }
+    
+    return result
diff --git a/mlxk2/operations/list.py b/mlxk2/operations/list.py
new file mode 100644
index 0000000..23df761
--- /dev/null
+++ b/mlxk2/operations/list.py
@@ -0,0 +1,72 @@
+"""List models operation for MLX-Knife 2.0."""
+
+from typing import Dict, Any, Optional, Tuple
+
+from ..core.cache import get_current_model_cache, cache_dir_to_hf
+from .common import build_model_object
+
+
+def _latest_snapshot(model_path) -> Tuple[Optional[str], Optional[object]]:
+    """Return (hash, path) for the latest snapshot if any, else (None, None)."""
+    snapshots_dir = model_path / "snapshots"
+    if not snapshots_dir.exists():
+        return None, None
+    snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir() and len(d.name) == 40]
+    if not snapshots:
+        return None, None
+    latest = max(snapshots, key=lambda x: x.stat().st_mtime)
+    return latest.name, latest
+
+
+def list_models(pattern: str = None) -> Dict[str, Any]:
+    """List all models in cache with JSON output.
+    
+    Args:
+        pattern: Optional pattern to filter models (case-insensitive substring match)
+    """
+    models = []
+    model_cache = get_current_model_cache()
+    
+    if not model_cache.exists():
+        return {
+            "status": "success",
+            "command": "list",
+            "data": {
+                "models": models,
+                "count": 0
+            },
+            "error": None
+        }
+    
+    # Find all model directories
+    for model_dir in model_cache.iterdir():
+        if not model_dir.is_dir() or not model_dir.name.startswith("models--"):
+            continue
+            
+        hf_name = cache_dir_to_hf(model_dir.name)
+        # Hide test sentinel directories from listings
+        if "TEST-CACHE-SENTINEL" in hf_name:
+            continue
+        
+        # Apply pattern filter if specified
+        if pattern and pattern.strip():
+            if pattern.lower() not in hf_name.lower():
+                continue
+
+        # Select snapshot (prefer latest) and build model object
+        _hash, snap_path = _latest_snapshot(model_dir)
+        model_obj = build_model_object(hf_name, model_dir, snap_path if snap_path is not None else model_dir)
+        models.append(model_obj)
+    
+    # Sort by name for consistent output
+    models.sort(key=lambda x: x["name"])
+    
+    return {
+        "status": "success",
+        "command": "list",
+        "data": {
+            "models": models,
+            "count": len(models)
+        },
+        "error": None
+    }
diff --git a/mlxk2/operations/pull.py b/mlxk2/operations/pull.py
new file mode 100644
index 0000000..906b198
--- /dev/null
+++ b/mlxk2/operations/pull.py
@@ -0,0 +1,331 @@
+from ..core.cache import MODEL_CACHE, hf_to_cache_dir
+from ..core.model_resolution import resolve_model_for_operation
+from .health import is_model_healthy
+import os
+
+
+# Pull uses exact user input - HuggingFace resolves model names
+
+def preflight_repo_access(model_name, hf_api=None):
+    """Check repository access before download to prevent cache pollution.
+
+    Issue #30: Fail fast for gated/private or non-existent repos without starting any download.
+
+    Args:
+        model_name: Repository name to check
+        hf_api: Optional injected `HfApi` instance (testability)
+
+    Returns:
+        (success: bool, error_message: str or None)
+    """
+    try:
+        # Lazy imports with robust error shims across hub versions
+        import huggingface_hub as _hub
+        from huggingface_hub import HfApi
+        try:
+            from requests.exceptions import HTTPError, Timeout  # type: ignore
+        except Exception:  # requests may not be present in minimal envs
+            HTTPError = Timeout = None  # type: ignore
+
+        hub_errors = getattr(_hub, "errors", None)
+
+        api = hf_api or HfApi()
+
+        # Prefer modern token name in messages, but accept legacy var when present
+        token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN")
+
+        try:
+            # Lightweight metadata request (no file download)
+            api.model_info(model_name, token=token)
+            return True, None
+
+        except Exception as e:  # Map known cases first, then fallbacks
+            # 1) Map huggingface_hub specific errors if available
+            if hub_errors is not None:
+                GatedRepoError = getattr(hub_errors, "GatedRepoError", None)
+                RepositoryNotFoundError = getattr(hub_errors, "RepositoryNotFoundError", None)
+                HfHubHTTPError = getattr(hub_errors, "HfHubHTTPError", None)
+                HfHubError = getattr(hub_errors, "HfHubError", None)
+
+                if GatedRepoError and isinstance(e, GatedRepoError):
+                    return False, (
+                        f"Access denied: gated/private model '{model_name}'. "
+                        f"Accept terms and set HF_TOKEN."
+                    )
+                if RepositoryNotFoundError and isinstance(e, RepositoryNotFoundError):
+                    # Security feature: HG often returns access denied semantics for missing
+                    return False, f"Access denied or not found for '{model_name}'."
+                # Generic hub HTTP error with status code
+                if (HfHubHTTPError and isinstance(e, HfHubHTTPError)) or (HfHubError and isinstance(e, HfHubError)):
+                    resp = getattr(e, "response", None)
+                    code = getattr(resp, "status_code", None)
+                    if code in (401, 403):
+                        return False, f"Access denied to model '{model_name}'. Set HF_TOKEN."
+                    if code:
+                        # Non-auth HTTP issues during preflight: degrade gracefully to download stage
+                        return True, f"Preflight HTTP {code}; continuing to download stage."
+                    # Fallback without code → degrade gracefully
+                    return True, "Preflight error without HTTP code; continuing."
+
+            # 2) requests timeouts / HTTP errors (when surfaced directly)
+            if Timeout and isinstance(e, Timeout):  # type: ignore[arg-type]
+                # Network timeout during preflight: degrade to download stage
+                return True, f"Preflight timeout for '{model_name}'; continuing to download stage."
+            if HTTPError and isinstance(e, HTTPError):  # type: ignore[arg-type]
+                code = getattr(getattr(e, "response", None), "status_code", None)
+                if code in (401, 403):
+                    return False, f"Access denied to model '{model_name}'. Set HF_TOKEN."
+                if code:
+                    return True, f"Preflight HTTP {code}; continuing to download stage."
+                return True, "Preflight HTTP error; continuing."
+
+            # 3) Generic fallback based on message hints
+            msg = str(e).lower()
+            # Hard fail on clear access-denied/gated patterns
+            if any(h in msg for h in ("forbidden", "unauthorized", "denied", "gated", "private")):
+                return False, f"Access denied or gated/private for '{model_name}'."
+            if "not found" in msg:
+                return False, f"Access denied or not found for '{model_name}'."
+
+            # Unknown errors → degrade gracefully to allow downstream error surface
+            return True, f"Preflight error: {str(e)}; continuing to download stage."
+
+    except ImportError:
+        # No preflight available → fail safe, include expected keywords
+        return False, "Access denied or not found (preflight unavailable; install huggingface-hub)."
+
+    except Exception as e:
+        # Unknown errors → fail safe, include expected keywords
+        return False, f"Access denied or gated/private (preflight failed: {str(e)}). Set HF_TOKEN if needed."
+
+
+def pull_model_with_huggingface_hub(model_name, cache_dir=None):
+    """Use huggingface-hub to pull a model to specified cache directory."""
+    try:
+        # Just-in-time suppression for macOS Python 3.9 LibreSSL warning
+        import warnings as _warnings
+        _warnings.filterwarnings('ignore', message='urllib3 v2 only supports OpenSSL 1.1.1+')
+        # Use direct Python API instead of CLI
+        from huggingface_hub import snapshot_download
+
+        # Download model to specified cache or default
+        kwargs = {
+            "repo_id": model_name,
+            "local_files_only": False,
+            "resume_download": True
+        }
+        if cache_dir:
+            kwargs["cache_dir"] = str(cache_dir)
+
+        local_dir = snapshot_download(**kwargs)
+
+        return True, f"Downloaded to {local_dir}"
+
+    except ImportError:
+        return False, "huggingface-hub not installed (pip install huggingface-hub)"
+    except Exception as e:
+        return False, f"Download failed: {str(e)}"
+
+
+def pull_operation(model_spec):
+    """Pull (download) operation for JSON API."""
+    result = {
+        "status": "success",
+        "command": "pull",
+        "error": None,
+        "data": {
+            "model": None,
+            "download_status": "unknown",
+            "message": "",
+            "expanded_name": None
+        }
+    }
+    
+    try:
+        # Early validation before any network/library usage
+        if not model_spec or not str(model_spec).strip():
+            result["status"] = "error"
+            result["error"] = {
+                "type": "ValidationError",
+                "message": "Invalid model name: empty",
+            }
+            result["data"]["download_status"] = "error"
+            return result
+
+        base_spec = str(model_spec).split("@", 1)[0]
+        # HF repo id soft rules (MVP): length, bad slashes; allow single-segment as fuzzy/alias
+        if len(base_spec) > 96 or base_spec.startswith("/") or base_spec.endswith("/") or "//" in base_spec:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "ValidationError",
+                "message": "Invalid model name: must be <= 96 chars and not contain leading/trailing or double slashes",
+            }
+            result["data"]["download_status"] = "error"
+            return result
+
+        # Use model resolution for fuzzy matching and expansion
+        resolved_name, commit_hash, ambiguous_matches = resolve_model_for_operation(model_spec)
+        
+        if ambiguous_matches:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "ambiguous_match",
+                "message": f"Multiple models match '{model_spec}'",
+                "matches": ambiguous_matches
+            }
+            return result
+        elif not resolved_name:
+            # No existing model found - use original spec for download as-is
+            if "@" in model_spec:
+                model_name, commit_hash = model_spec.rsplit("@", 1)
+                result["data"]["commit_hash"] = commit_hash
+            else:
+                model_name = model_spec
+                commit_hash = None
+            resolved_name = model_name  # Use exact name - let HuggingFace resolve it
+        
+        result["data"]["model"] = resolved_name
+        result["data"]["expanded_name"] = resolved_name if resolved_name != model_spec.split('@')[0] else None
+        if commit_hash:
+            result["data"]["commit_hash"] = commit_hash
+        
+        # Check if already exists and is healthy
+        cache_dir = MODEL_CACHE / hf_to_cache_dir(resolved_name)
+        if cache_dir.exists():
+            healthy, _ = is_model_healthy(resolved_name)
+            if healthy:
+                result["data"]["download_status"] = "already_exists"
+                result["data"]["message"] = f"Model {resolved_name} already exists in cache"
+                return result
+            else:
+                # Model exists but unhealthy - suggest rm workflow
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "model_corrupted",
+                    "message": f"Model exists but is corrupted. Use 'rm {model_spec}' first, then pull again."
+                }
+                result["data"]["download_status"] = "corrupted"
+                return result
+        
+        # Preflight check for repository access (Issue #30)
+        result["data"]["download_status"] = "checking_access"
+        preflight_success, preflight_error = preflight_repo_access(resolved_name)
+        
+        if not preflight_success:
+            result["status"] = "error"
+            result["data"]["download_status"] = "access_denied"
+            result["error"] = {
+                "type": "access_denied",
+                "message": preflight_error
+            }
+            return result
+        elif preflight_error:
+            # Warning case - log but continue
+            result["data"]["preflight_warning"] = preflight_error
+        
+        # Attempt download
+        result["data"]["download_status"] = "downloading"
+        success, message = pull_model_with_huggingface_hub(resolved_name)
+        
+        if success:
+            result["data"]["download_status"] = "success"
+            result["data"]["message"] = message
+        else:
+            result["status"] = "error"
+            result["data"]["download_status"] = "failed"
+            result["error"] = {
+                "type": "download_failed",
+                "message": message
+            }
+    
+    except Exception as e:
+        result["status"] = "error"
+        result["error"] = {
+            "type": "pull_operation_failed", 
+            "message": str(e)
+        }
+        result["data"]["download_status"] = "error"
+
+    return result
+
+
+def pull_to_cache(model_spec, cache_dir):
+    """Pull model to specific cache directory - used by clone operation."""
+    result = {
+        "status": "success",
+        "command": "pull",
+        "error": None,
+        "data": {
+            "model": None,
+            "download_status": "unknown",
+            "message": "",
+            "expanded_name": None
+        }
+    }
+
+    try:
+        # Basic validation
+        if not model_spec or not str(model_spec).strip():
+            result["status"] = "error"
+            result["error"] = {
+                "type": "ValidationError",
+                "message": "Invalid model name: empty",
+            }
+            result["data"]["download_status"] = "error"
+            return result
+
+        base_spec = str(model_spec).split("@", 1)[0]
+        if len(base_spec) > 96 or base_spec.startswith("/") or base_spec.endswith("/") or "//" in base_spec:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "ValidationError",
+                "message": "Invalid model name: must be <= 96 chars and not contain leading/trailing or double slashes",
+            }
+            result["data"]["download_status"] = "error"
+            return result
+
+        # For clone operations, use model spec as-is (no fuzzy resolution)
+        model_name = model_spec
+        result["data"]["model"] = model_name
+        result["data"]["expanded_name"] = model_name
+
+        # Preflight check for repository access (Issue #30)
+        result["data"]["download_status"] = "checking_access"
+        preflight_success, preflight_error = preflight_repo_access(model_name)
+
+        if not preflight_success:
+            result["status"] = "error"
+            result["data"]["download_status"] = "access_denied"
+            result["error"] = {
+                "type": "access_denied",
+                "message": preflight_error
+            }
+            return result
+        elif preflight_error:
+            # Warning case - log but continue
+            result["data"]["preflight_warning"] = preflight_error
+
+        # Download to specified cache directory
+        result["data"]["download_status"] = "downloading"
+        success, message = pull_model_with_huggingface_hub(model_name, cache_dir)
+
+        if success:
+            result["data"]["download_status"] = "success"
+            result["data"]["message"] = message
+        else:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "DownloadError",
+                "message": message
+            }
+            result["data"]["download_status"] = "error"
+
+    except Exception as e:
+        result["status"] = "error"
+        result["error"] = {
+            "type": "OperationError",
+            "message": f"Unexpected error during pull: {str(e)}"
+        }
+        result["data"]["download_status"] = "error"
+
+    return result
diff --git a/mlxk2/operations/push.py b/mlxk2/operations/push.py
new file mode 100644
index 0000000..92c334c
--- /dev/null
+++ b/mlxk2/operations/push.py
@@ -0,0 +1,684 @@
+"""Experimental push operation for MLX-Knife 2.0 (M0: upload only).
+
+This is a minimal, JSON-first implementation that uploads a local folder
+to a Hugging Face model repository using huggingface_hub.upload_folder.
+
+Scope (M0):
+- No validation, no filters, no manifests.
+- Requires HF_TOKEN environment variable.
+- Default branch is main (configurable via CLI).
+"""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+from typing import Dict, Any, List, Optional
+import json as _json
+
+# Import APFS check from clone operation and cache utilities
+from mlxk2.operations.clone import _is_apfs_filesystem
+from mlxk2.core.cache import get_current_cache_root
+
+
+DEFAULT_PUSH_BRANCH = "main"
+
+
+def push_operation(
+    local_dir: str,
+    repo_id: str,
+    create: bool = False,
+    private: bool = False,
+    branch: str = DEFAULT_PUSH_BRANCH,
+    commit_message: str | None = None,
+    check_only: bool = False,
+    quiet: bool = False,
+    dry_run: bool = False,
+) -> Dict[str, Any]:
+    """Perform a minimal push (upload) to Hugging Face Hub.
+
+    Returns a JSON-serializable result dict following the 2.0 pattern.
+    """
+    result: Dict[str, Any] = {
+        "status": "success",
+        "command": "push",
+        "error": None,
+        "data": {
+            "repo_id": repo_id,
+            "branch": branch or DEFAULT_PUSH_BRANCH,
+            "commit_sha": None,
+            "commit_url": None,
+            "repo_url": f"https://huggingface.co/{repo_id}",
+            # Number of actually uploaded/changed files (when available).
+            "uploaded_files_count": None,
+            # Local count of files scanned in the folder (approximation, optional).
+            "local_files_count": None,
+            # Indicates whether the Hub performed a no-op (no changes to commit).
+            "no_changes": None,
+            # Whether the repository was created in this operation.
+            "created_repo": False,
+            # Optional short message for humans (kept in JSON too for clarity).
+            "message": None,
+            "experimental": True,
+            "disclaimer": (
+                "Experimental feature (M0: upload only). No validation/filters; "
+                "review results on the Hub."
+            ),
+        },
+    }
+
+    try:
+        # 1) Token (skip for check-only)
+        hf_token = os.environ.get("HF_TOKEN")
+        if not check_only and not hf_token:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "auth_error",
+                "message": "HF_TOKEN not set",
+            }
+            return result
+
+        # 2) Local folder
+        p = Path(local_dir)
+        if not p.exists() or not p.is_dir():
+            result["status"] = "error"
+            result["error"] = {
+                "type": "workspace_not_found",
+                "message": f"Workspace not found or not a directory: {local_dir}",
+            }
+            return result
+
+        # Optional approximate count (local view)
+        try:
+            approx_count = sum(1 for _ in p.rglob("*") if _.is_file())
+            result["data"]["local_files_count"] = approx_count
+        except Exception:
+            pass
+
+        # 2a) Build ignore patterns early (used by dry-run and upload)
+        ignore_patterns = [
+            "**/.git/**",
+            "**/.git",
+            "**/.DS_Store",
+            ".DS_Store",
+            "**/.hfignore",
+            ".hfignore",
+            "**/.gitignore",
+            ".gitignore",
+            "**/__pycache__/**",
+            "**/.venv/**",
+            "**/venv/**",
+            "**/*.pyc",
+        ]
+        hfignore = p / ".hfignore"
+        if hfignore.exists():
+            try:
+                extra_patterns = []
+                for line in hfignore.read_text().splitlines():
+                    s = line.strip()
+                    if not s or s.startswith("#"):
+                        continue
+                    extra_patterns.append(s)
+                if extra_patterns:
+                    seen = set()
+                    merged = []
+                    for pat in ignore_patterns + extra_patterns:
+                        if pat not in seen:
+                            merged.append(pat)
+                            seen.add(pat)
+                    ignore_patterns = merged
+            except Exception:
+                # Ignore read/parse errors silently in M0
+                pass
+
+        # 2b) Check-only: analyze workspace and return without contacting HF
+        if check_only:
+            diag = _analyze_workspace(p)
+            result["data"]["no_changes"] = None
+            result["data"]["message"] = "Check-only: no upload performed."
+            result["data"]["workspace_health"] = diag
+            return result
+
+        # 3) Import hub pieces lazily and perform repo checks / upload
+        # Suppress macOS Python 3.9 LibreSSL warning like pull operation
+        import warnings as _warnings
+
+        _warnings.filterwarnings(
+            "ignore", message="urllib3 v2 only supports OpenSSL 1.1.1+"
+        )
+
+        try:
+            from huggingface_hub import HfApi, upload_folder
+            from huggingface_hub.errors import (
+                HfHubHTTPError,
+                RepositoryNotFoundError,
+                RevisionNotFoundError,
+            )
+        except Exception:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "dependency_missing",
+                "message": "huggingface-hub not installed (pip install huggingface-hub)",
+            }
+            return result
+
+        api = HfApi(token=hf_token)
+
+        # 4) Ensure repo exists (model type). Do not auto-create branch here.
+        created_repo = False
+        try:
+            # If branch does not exist, this raises RevisionNotFoundError.
+            api.repo_info(repo_id=repo_id, repo_type="model", revision=branch)
+        except RepositoryNotFoundError:
+            if dry_run:
+                # For dry-run, do not create; compute that all files would be added
+                local_files = _collect_local_files(p, ignore_patterns)
+                result["data"].update({
+                    "dry_run": True,
+                    "no_changes": False if local_files else True,
+                    "uploaded_files_count": 0,
+                    "change_summary": {"added": len(local_files), "modified": 0, "deleted": 0},
+                    "dry_run_summary": {"added": len(local_files), "modified": 0, "deleted": 0},
+                    "message": "Dry-run: repository does not exist; would create and add all files.",
+                    "would_create_repo": True,
+                    "would_create_branch": True,
+                })
+                return result
+            if not create:
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "repo_not_found",
+                    "message": f"Repository not found: {repo_id} (use --create)",
+                }
+                return result
+            # Try create repository (exist_ok=True covers races)
+            api.create_repo(
+                repo_id=repo_id, repo_type="model", private=private, exist_ok=True
+            )
+            # After create, no guarantee branch exists; upload_folder below will target revision
+            created_repo = True
+            # Ensure target branch exists if not default
+            try:
+                if branch and branch != DEFAULT_PUSH_BRANCH:
+                    api.create_branch(repo_id=repo_id, repo_type="model", branch=branch)
+            except HfHubHTTPError as e:
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "branch_create_failed",
+                    "message": str(e),
+                }
+                return result
+        except RevisionNotFoundError:
+            # Repo exists but branch doesn't.
+            if dry_run:
+                local_files = _collect_local_files(p, ignore_patterns)
+                result["data"].update({
+                    "dry_run": True,
+                    "no_changes": False if local_files else True,
+                    "uploaded_files_count": 0,
+                    "change_summary": {"added": len(local_files), "modified": 0, "deleted": 0},
+                    "dry_run_summary": {"added": len(local_files), "modified": 0, "deleted": 0},
+                    "message": "Dry-run: branch does not exist; would create branch and add all files.",
+                    "would_create_repo": False,
+                    "would_create_branch": True,
+                })
+                return result
+            # If user asked to create, proactively create the branch to avoid 404 on preupload;
+            # otherwise, tolerate and let upload_folder attempt (offline tests expect this).
+            if create:
+                try:
+                    api.create_branch(repo_id=repo_id, repo_type="model", branch=branch)
+                except HfHubHTTPError:
+                    # Do not fail early; fall through and let upload attempt once
+                    pass
+
+        # 4b) If dry-run and repo/branch exist: compute diff vs remote and return
+        if dry_run:
+            try:
+                remote_files = set(api.list_repo_files(repo_id=repo_id, repo_type="model", revision=branch or DEFAULT_PUSH_BRANCH) or [])
+            except Exception:
+                remote_files = set()
+            local_files = set(_collect_local_files(p, ignore_patterns))
+            added = sorted(list(local_files - remote_files))
+            deleted = sorted(list(remote_files - local_files))
+            # Modified cannot be reliably computed without fetching metadata
+            modified = None
+            no_changes = (len(added) == 0 and len(deleted) == 0)
+            result["data"].update({
+                "dry_run": True,
+                "no_changes": True if no_changes else False,
+                "uploaded_files_count": 0,
+                "change_summary": {"added": len(added), "modified": 0, "deleted": len(deleted)},
+                "dry_run_summary": {"added": len(added), "modified": modified, "deleted": len(deleted)},
+                "message": ("Dry-run: no changes" if no_changes else f"Dry-run: +{len(added)} ~? -{len(deleted)}"),
+                "would_create_repo": False,
+                "would_create_branch": False,
+                "added_files": added[:20] if added else [],
+                "deleted_files": deleted[:20] if deleted else [],
+            })
+            return result
+
+        # 5) Upload folder
+        commit_msg = commit_message or "mlx-knife push"
+        # ignore_patterns prepared earlier
+
+        # Capture hub logs to enrich JSON (e.g., no-op messages) and optionally silence console noise in JSON mode
+        hf_logs = None
+        try:
+            import logging as _logging
+            import contextlib as _contextlib
+            _hf_logger = _logging.getLogger("huggingface_hub")
+
+            class _BufHandler(_logging.Handler):
+                def __init__(self):
+                    super().__init__()
+                    self.buf = []
+                def emit(self, record):
+                    try:
+                        msg = self.format(record)
+                    except Exception:
+                        msg = str(record.getMessage()) if hasattr(record, "getMessage") else str(record)
+                    self.buf.append(msg)
+
+            _handler = _BufHandler()
+            _handler.setLevel(_logging.INFO)
+            _old_level = _hf_logger.level
+            _old_handlers = list(_hf_logger.handlers)
+            _old_propagate = _hf_logger.propagate
+
+            # In quiet mode (JSON without --verbose), avoid emitting hub logs/progress to the console
+            # 1) disable progress bars via env (respected by huggingface_hub/tqdm)
+            _prev_pbar_env = os.environ.get("HF_HUB_DISABLE_PROGRESS_BARS")
+            if quiet:
+                os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
+
+            try:
+                _hf_logger.setLevel(_logging.INFO)
+                _hf_logger.addHandler(_handler)
+                if quiet:
+                    _hf_logger.propagate = False
+                    _hf_logger.handlers = [_handler]  # keep only our buffer in quiet mode
+
+                # Silence tqdm progress bars to stderr as an extra safety in quiet mode
+                def _do_upload():
+                    return upload_folder(
+                        repo_id=repo_id,
+                        repo_type="model",
+                        folder_path=str(p),
+                        revision=branch or DEFAULT_PUSH_BRANCH,
+                        commit_message=commit_msg,
+                        token=hf_token,
+                        ignore_patterns=ignore_patterns,
+                    )
+
+                if quiet:
+                    with open(os.devnull, "w") as _devnull:
+                        with _contextlib.redirect_stderr(_devnull):
+                            info = _do_upload()
+                else:
+                    info = _do_upload()
+                hf_logs = getattr(_handler, "buf", None)
+            finally:
+                # Restore logger state
+                try:
+                    _hf_logger.removeHandler(_handler)
+                except Exception:
+                    pass
+                try:
+                    _hf_logger.setLevel(_old_level)
+                    _hf_logger.propagate = _old_propagate
+                    _hf_logger.handlers = _old_handlers
+                except Exception:
+                    pass
+                # Restore env var
+                try:
+                    if quiet:
+                        if _prev_pbar_env is None:
+                            del os.environ["HF_HUB_DISABLE_PROGRESS_BARS"]
+                        else:
+                            os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = _prev_pbar_env
+                except Exception:
+                    pass
+        except HfHubHTTPError as he:
+            # In some hub versions, uploading to a non-existent branch raises here.
+            # If --create was given, try to create the branch and retry once.
+            msg = str(he)
+            if create and ("Revision Not Found" in msg or "Invalid rev id" in msg):
+                try:
+                    api.create_branch(repo_id=repo_id, repo_type="model", branch=branch)
+                    # Retry upload once
+                    try:
+                        info = upload_folder(
+                            repo_id=repo_id,
+                            repo_type="model",
+                            folder_path=str(p),
+                            revision=branch or DEFAULT_PUSH_BRANCH,
+                            commit_message=commit_msg,
+                            token=hf_token,
+                            ignore_patterns=ignore_patterns,
+                        )
+                        hf_logs = hf_logs or []
+                    except HfHubHTTPError as he2:
+                        result["status"] = "error"
+                        result["error"] = {"type": "upload_failed", "message": str(he2)}
+                        return result
+                except HfHubHTTPError as ce:
+                    result["status"] = "error"
+                    result["error"] = {"type": "branch_create_failed", "message": str(ce)}
+                    return result
+            else:
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "upload_failed",
+                    "message": str(he),
+                }
+                return result
+        except Exception as e:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "upload_failed",
+                "message": str(e),
+            }
+            return result
+
+        # 6) Success — extract details from CommitInfo (robust across hub versions)
+        commit_id = None
+        commit_url = None
+        uploaded_count = None
+        no_changes = None
+
+        try:
+            commit_id = getattr(info, "commit_id", None) or getattr(info, "oid", None)
+            commit_url = getattr(info, "commit_url", None) or getattr(info, "html_url", None)
+
+            # Try to compute number of committed files and a change summary
+            change_summary = {"added": 0, "modified": 0, "deleted": 0}
+            files_seq = getattr(info, "files", None) or getattr(info, "operations", None)
+            if files_seq is not None:
+                for f in files_seq:
+                    # Infer operation in a version-agnostic way
+                    op = None
+                    # object attribute style
+                    if hasattr(f, "operation"):
+                        op = getattr(f, "operation")
+                    elif hasattr(f, "op"):
+                        op = getattr(f, "op")
+                    # mapping/dict style
+                    elif isinstance(f, dict):
+                        op = f.get("operation") or f.get("op") or f.get("type")
+                    # class name fallback
+                    if op is None:
+                        cls = f.__class__.__name__ if hasattr(f, "__class__") else ""
+                        op = cls
+
+                    op_s = str(op).lower()
+                    if "add" in op_s or "+" in op_s:
+                        change_summary["added"] += 1
+                    elif "del" in op_s or "remove" in op_s or "-" in op_s:
+                        change_summary["deleted"] += 1
+                    elif "update" in op_s or "modify" in op_s or "mod" in op_s:
+                        change_summary["modified"] += 1
+                    else:
+                        # treat unknown as modified
+                        change_summary["modified"] += 1
+
+                uploaded_count = sum(change_summary.values())
+                result["data"]["change_summary"] = change_summary
+
+            # Determine no-op (no changes)
+            if commit_id in (None, ""):
+                no_changes = True
+            else:
+                # Some hub versions may still create a commit even with no file changes; treat zero operations as no-op
+                no_changes = (uploaded_count == 0) if uploaded_count is not None else False
+        except Exception:
+            # Be conservative if introspection fails
+            pass
+
+        # If hub logs indicate empty commit was skipped, prefer that signal
+        try:
+            if any(
+                isinstance(m, str) and (
+                    "Skipping to prevent empty commit" in m or "No files have been modified" in m
+                )
+                for m in (hf_logs or [])
+            ):
+                no_changes = True
+                commit_id = None
+                commit_url = None
+                uploaded_count = 0
+        except Exception:
+            pass
+
+        # Populate result fields
+        result["data"]["commit_sha"] = commit_id
+        result["data"]["commit_url"] = commit_url
+        result["data"]["uploaded_files_count"] = uploaded_count if uploaded_count is not None else (0 if no_changes else None)
+        result["data"]["no_changes"] = bool(no_changes) if no_changes is not None else (commit_id is None)
+        result["data"]["created_repo"] = created_repo
+
+        if hf_logs:
+            result["data"]["hf_logs"] = hf_logs
+
+        # Human-friendly message retained in JSON
+        if result["data"]["no_changes"]:
+            # Prefer hub-provided message if available
+            hub_msg = None
+            try:
+                hub_msg = next(
+                    (m for m in reversed(hf_logs or []) if isinstance(m, str) and ("Skipping" in m or "No files" in m)),
+                    None,
+                )
+            except Exception:
+                hub_msg = None
+            result["data"]["message"] = hub_msg or "No files changed; skipped empty commit."
+        elif uploaded_count is not None:
+            cs = result["data"].get("change_summary") or {"added": 0, "modified": 0, "deleted": 0}
+            result["data"]["message"] = f"Committed {uploaded_count} files (+{cs['added']} ~{cs['modified']} -{cs['deleted']})."
+        else:
+            result["data"]["message"] = "Commit created."
+
+        # ADR-007 Response Matrix: Add APFS hint to push success message (Alpha only)
+        try:
+            cache_root = get_current_cache_root()
+            if not _is_apfs_filesystem(cache_root):
+                result["data"]["message"] += " Clone operations require APFS filesystem."
+        except Exception as e:
+            # Safe fallback - don't fail push if APFS check fails
+            # Debug: Log the exception to understand what's failing
+            import logging
+            logger = logging.getLogger(__name__)
+            logger.debug(f"APFS warning check failed: {e}")
+            pass
+
+        return result
+
+    except Exception as e:
+        result["status"] = "error"
+        result["error"] = {"type": "push_operation_failed", "message": str(e)}
+        return result
+
+
+def _is_lfs_pointer(path: Path) -> bool:
+    try:
+        if path.stat().st_size > 200:
+            return False
+        head = path.read_text(errors="ignore")[:200]
+        return "version https://git-lfs.github.com/spec/v1" in head
+    except Exception:
+        return False
+
+
+def _analyze_workspace(root: Path) -> Dict[str, Any]:
+    """Rudimentary, content-oriented health check for a local workspace.
+
+    Returns a JSON-serializable dict with summary and issues.
+    """
+    files: List[Path] = [p for p in root.rglob("*") if p.is_file()]
+    total_bytes = 0
+    for f in files:
+        try:
+            total_bytes += f.stat().st_size
+        except Exception:
+            pass
+
+    # config.json
+    config_path: Optional[Path] = None
+    cfg_exists = False
+    cfg_valid = False
+    for candidate in (root / "config.json",):
+        if candidate.exists() and candidate.is_file():
+            config_path = candidate
+            cfg_exists = True
+            try:
+                data = _json.loads(candidate.read_text(encoding="utf-8"))
+                cfg_valid = isinstance(data, dict) and len(data) > 0
+            except Exception:
+                cfg_valid = False
+            break
+
+    # weights detection
+    weights: List[Path] = []
+    ggufs = list(root.rglob("*.gguf"))
+    safes = list(root.rglob("*.safetensors"))
+    bins = list(root.rglob("pytorch_model.bin"))
+    # Exclude the index file from safetensors weights list
+    safes = [s for s in safes if not s.name.endswith(".safetensors.index.json")]
+    weights = ggufs + safes + bins
+
+    # index-aware check
+    index_files = list(root.rglob("*.safetensors.index.json"))
+    index_info: Dict[str, Any] = {"has_index": bool(index_files), "missing": []}
+    if index_files:
+        try:
+            idx_obj = _json.loads(index_files[0].read_text(encoding="utf-8"))
+            # HF index has weight_map: {param_name: filename}
+            weight_map = idx_obj.get("weight_map", {}) if isinstance(idx_obj, dict) else {}
+            referenced = set(weight_map.values()) if isinstance(weight_map, dict) else set()
+            for fname in sorted(referenced):
+                p = root / fname
+                if not p.exists() or p.stat().st_size == 0 or _is_lfs_pointer(p):
+                    index_info["missing"].append(fname)
+        except Exception:
+            index_info["parse_error"] = True
+
+    # pattern-based shards (model-xxxxx-of-yyyyy.safetensors)
+    import re as _re
+
+    shard_re = _re.compile(r"model-(\d{5})-of-(\d{5})\.safetensors$")
+    pattern_files = []
+    for s in safes:
+        if shard_re.search(s.name):
+            pattern_files.append(s)
+    pattern_ok = None
+    if pattern_files:
+        try:
+            xs = [s.name for s in pattern_files]
+            ys = sorted(xs)
+            last = shard_re.search(ys[-1])
+            if last:
+                total = int(last.group(2))
+                present = set()
+                for nm in ys:
+                    m = shard_re.search(nm)
+                    if m:
+                        present.add(int(m.group(1)))
+                pattern_ok = (len(present) == total)
+        except Exception:
+            pattern_ok = False
+
+    # anomalies
+    anomalies: List[Dict[str, Any]] = []
+    if not cfg_exists:
+        anomalies.append({"severity": "error", "code": "config_missing", "message": "config.json not found"})
+    elif not cfg_valid:
+        anomalies.append({"severity": "error", "code": "config_invalid_json", "message": "config.json invalid or empty"})
+
+    # weight presence and sanity
+    if not weights:
+        anomalies.append({"severity": "error", "code": "no_weights_found", "message": "No weights (*.gguf/*.safetensors/pytorch_model.bin)"})
+    else:
+        # LFS or zero-size detection
+        for w in weights:
+            try:
+                if w.stat().st_size == 0:
+                    anomalies.append({"severity": "error", "code": "empty_weight_file", "message": f"Empty file: {w.name}", "path": str(w.relative_to(root))})
+                elif _is_lfs_pointer(w):
+                    anomalies.append({"severity": "error", "code": "lfs_pointer_detected", "message": f"LFS pointer: {w.name}", "path": str(w.relative_to(root))})
+            except Exception:
+                pass
+
+    # index completeness if present
+    if index_info.get("has_index"):
+        if index_info.get("parse_error"):
+            anomalies.append({"severity": "error", "code": "index_parse_error", "message": "model.safetensors.index.json parse error"})
+        missing = index_info.get("missing") or []
+        if missing:
+            anomalies.append({"severity": "error", "code": "index_missing_shard", "message": f"Missing/invalid shards: {len(missing)}", "missing": missing})
+
+    # partial/tmp markers
+    for f in files:
+        nm = f.name.lower()
+        if ".partial" in nm or nm.endswith(".tmp") or "partial" in nm:
+            anomalies.append({"severity": "warn", "code": "partial_marker", "message": f"Partial/tmp marker: {f.name}", "path": str(f.relative_to(root))})
+
+    # Determine health: strictly require config valid and some non-empty non-LFS weights
+    has_good_weight = True if weights else False
+    if weights:
+        has_good_weight = any(
+            (w.stat().st_size > 0 and not _is_lfs_pointer(w))
+            for w in weights
+        )
+    healthy = bool(cfg_valid and has_good_weight and not any(a["severity"] == "error" for a in anomalies if a["code"] not in {"config_missing", "config_invalid_json", "no_weights_found", "empty_weight_file", "lfs_pointer_detected", "index_parse_error", "index_missing_shard"}))
+    # In practice, healthy becomes False if any error-level anomalies present or config/weights invalid.
+    if any(a["severity"] == "error" for a in anomalies):
+        healthy = False
+
+    return {
+        "files_count": len(files),
+        "total_bytes": total_bytes,
+        "config": {"exists": cfg_exists, "valid_json": cfg_valid, "path": str(config_path) if config_path else None},
+        "weights": {
+            "count": len(weights),
+            "formats": sorted(list({w.suffix.lstrip('.') if w.suffix else 'bin' for w in weights})),
+            "index": index_info,
+            "pattern_complete": pattern_ok,
+        },
+        "anomalies": anomalies,
+        "healthy": healthy,
+    }
+
+
+def _collect_local_files(root: Path, ignore_patterns: list[str]) -> list[str]:
+    """Return a list of relative POSIX paths for files under root, honoring ignore patterns.
+
+    This is a best-effort approximation of upload_folder's ignore behavior, using
+    glob-like matching. It is sufficient for dry-run summaries.
+    """
+    from pathlib import PurePosixPath
+    import fnmatch
+
+    def ignored(rel: str) -> bool:
+        p = PurePosixPath(rel)
+        base = p.name
+        for pat in ignore_patterns:
+            try:
+                # Normalize simple relative names (match basenames too)
+                if pat == base or pat == rel:
+                    return True
+                # Try both PurePath.match and fnmatch as a fallback
+                if p.match(pat) or fnmatch.fnmatch(rel, pat):
+                    return True
+            except Exception:
+                # Be permissive on pattern errors
+                if fnmatch.fnmatch(rel, pat):
+                    return True
+        return False
+
+    files: list[str] = []
+    for fp in root.rglob("*"):
+        if fp.is_file():
+            rel = fp.relative_to(root).as_posix()
+            if not ignored(rel):
+                files.append(rel)
+    return files
diff --git a/mlxk2/operations/rm.py b/mlxk2/operations/rm.py
new file mode 100644
index 0000000..cb6d527
--- /dev/null
+++ b/mlxk2/operations/rm.py
@@ -0,0 +1,215 @@
+import shutil
+import os
+from ..core.cache import get_current_model_cache, hf_to_cache_dir, cache_dir_to_hf, verify_cache_context
+from ..core.model_resolution import resolve_model_for_operation
+
+
+def find_matching_models(pattern):
+    """Find models that match a partial pattern."""
+    model_cache = get_current_model_cache()
+    all_models = [d for d in model_cache.iterdir() if d.name.startswith("models--")]
+    matches = []
+    
+    for model_dir in all_models:
+        hf_name = cache_dir_to_hf(model_dir.name)
+        if pattern.lower() in hf_name.lower():
+            matches.append((model_dir, hf_name))
+    
+    return matches
+
+
+def resolve_model_for_deletion(model_spec):
+    """Resolve model spec to exact model for deletion, with fuzzy matching."""
+    if "@" in model_spec:
+        model_name, commit_hash = model_spec.rsplit("@", 1)
+    else:
+        model_name = model_spec
+        commit_hash = None
+    
+    # Try exact match first  
+    model_cache = get_current_model_cache()
+    base_cache_dir = model_cache / hf_to_cache_dir(model_name)
+    if base_cache_dir.exists():
+        return base_cache_dir, model_name, commit_hash, False
+    
+    # Try fuzzy matching
+    matches = find_matching_models(model_name)
+    
+    if not matches:
+        return None, None, None, False
+    elif len(matches) == 1:
+        # Unambiguous match
+        found_model_dir, found_hf_name = matches[0]
+        return found_model_dir, found_hf_name, commit_hash, True
+    else:
+        # Ambiguous - return matches for user choice
+        return None, None, None, matches
+
+
+def check_model_locks(model_name):
+    """Check if model has active lock files."""
+    model_cache = get_current_model_cache()
+    locks_dir = model_cache / ".locks"
+    model_locks = []
+    
+    if not locks_dir.exists():
+        return []
+    
+    # Look for lock files related to this model
+    for lock_file in locks_dir.glob("**/*.lock"):
+        if hf_to_cache_dir(model_name) in str(lock_file):
+            model_locks.append(str(lock_file.relative_to(model_cache)))
+    
+    return model_locks
+
+
+def cleanup_model_locks(model_name):
+    """Clean up HuggingFace lock files for a deleted model."""
+    model_cache = get_current_model_cache()
+    locks_dir = model_cache / ".locks" / hf_to_cache_dir(model_name)
+    
+    if not locks_dir.exists():
+        return 0
+    
+    try:
+        lock_files = list(locks_dir.iterdir())
+        if lock_files:
+            shutil.rmtree(locks_dir)
+            return len(lock_files)
+    except Exception:
+        pass
+    
+    return 0
+
+
+def rm_operation(model_spec, force=False):
+    """Remove (delete) operation for JSON API."""
+    result = {
+        "status": "success",
+        "command": "rm",
+        "error": None,
+        "data": {
+            "model": None,
+            "action": "unknown",
+            "message": "",
+            "requires_confirmation": False,
+            "matches": [],
+            "lock_files_cleaned": 0
+        }
+    }
+    
+    try:
+        model_cache = get_current_model_cache()
+        if not model_cache.exists():
+            result["status"] = "error"
+            result["error"] = {
+                "type": "cache_not_found",
+                "message": "Model cache directory does not exist"
+            }
+            return result
+        
+        resolved_name, commit_hash, ambiguous_matches = resolve_model_for_operation(model_spec)
+        
+        if ambiguous_matches:
+            result["status"] = "error"
+            result["data"]["action"] = "ambiguous"
+            result["data"]["matches"] = ambiguous_matches
+            result["error"] = {
+                "type": "ambiguous_match",
+                "message": f"Multiple models match '{model_spec}'"
+            }
+            return result
+        elif not resolved_name:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "model_not_found", 
+                "message": f"No models found matching '{model_spec}'"
+            }
+            return result
+        
+        resolved_model_dir = model_cache / hf_to_cache_dir(resolved_name)
+        is_fuzzy_match = resolved_name != model_spec.split('@')[0]
+        
+        result["data"]["model"] = resolved_name
+        
+        # Check for active locks - requires --force (replaces interactive prompt)
+        active_locks = check_model_locks(resolved_name)
+        if active_locks and not force:
+            result["status"] = "error"
+            result["data"]["locks_detected"] = True
+            result["data"]["lock_files"] = active_locks
+            result["error"] = {
+                "type": "locks_present",
+                "message": "Model has active locks. Use --force to override."
+            }
+            return result
+        
+        # Check if this requires confirmation (fuzzy match)
+        if is_fuzzy_match and not force:
+            result["data"]["requires_confirmation"] = True
+            result["data"]["action"] = "requires_confirmation"  
+            result["data"]["message"] = f"Would delete '{resolved_name}' (matched from '{model_spec}')"
+            return result
+        
+        # Handle specific hash deletion
+        if commit_hash:
+            snapshots_dir = resolved_model_dir / "snapshots"
+            if not snapshots_dir.exists():
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "snapshots_not_found",
+                    "message": f"No snapshots directory found for {resolved_name}"
+                }
+                return result
+            
+            hash_dir = snapshots_dir / commit_hash
+            if not hash_dir.exists():
+                # List available hashes
+                available_hashes = [s.name[:8] for s in snapshots_dir.iterdir() if s.is_dir()]
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "hash_not_found",
+                    "message": f"Hash {commit_hash} not found",
+                    "available_hashes": available_hashes
+                }
+                return result
+            
+            result["data"]["action"] = "delete_hash"
+            result["data"]["commit_hash"] = commit_hash
+        else:
+            result["data"]["action"] = "delete_model"
+        
+        # Perform deletion (with optional strict test safety)
+        if force or not result["data"]["requires_confirmation"]:
+            # Optional safety: when running tests, enforce test cache context
+            if os.environ.get("MLXK2_STRICT_TEST_DELETE") == "1":
+                verify_cache_context("test")
+            # MLX-Knife 2.0 Fix: Always delete entire model directory
+            # This prevents the Issue #23 double-execution problem
+            shutil.rmtree(resolved_model_dir)
+            
+            # Clean up lock files
+            lock_count = cleanup_model_locks(resolved_name)
+            result["data"]["lock_files_cleaned"] = lock_count
+            
+            if commit_hash:
+                result["data"]["message"] = f"Deleted {resolved_name}@{commit_hash}"
+            else:
+                result["data"]["message"] = f"Deleted entire model {resolved_name}"
+            
+            result["data"]["action"] = "deleted"
+    
+    except PermissionError as e:
+        result["status"] = "error"
+        result["error"] = {
+            "type": "permission_denied",
+            "message": f"Permission denied: Cannot delete {e.filename}"
+        }
+    except Exception as e:
+        result["status"] = "error"
+        result["error"] = {
+            "type": "deletion_failed",
+            "message": str(e)
+        }
+    
+    return result
diff --git a/mlxk2/operations/run.py b/mlxk2/operations/run.py
new file mode 100644
index 0000000..b14fa6c
--- /dev/null
+++ b/mlxk2/operations/run.py
@@ -0,0 +1,334 @@
+"""
+Run operation for 2.0 implementation.
+Ported from 1.x with 2.0 architecture integration.
+"""
+
+from typing import Optional
+
+from ..core.runner import MLXRunner
+from ..core.cache import get_current_model_cache, hf_to_cache_dir
+from ..core.model_resolution import resolve_model_for_operation
+from ..operations.health import check_runtime_compatibility
+from ..operations.common import detect_framework, read_front_matter
+
+
+def run_model(
+    model_spec: str,
+    prompt: Optional[str] = None,
+    stream: bool = True,
+    max_tokens: Optional[int] = None,
+    temperature: float = 0.7,
+    top_p: float = 0.9,
+    repetition_penalty: float = 1.1,
+    use_chat_template: bool = True,
+    json_output: bool = False,
+    verbose: bool = False
+) -> Optional[str]:
+    """Execute model with prompt - supports both single-shot and interactive modes.
+
+    Args:
+        model_spec: Model specification or path
+        prompt: Input prompt (None = interactive mode)
+        stream: Enable streaming output (default True)
+        max_tokens: Maximum tokens to generate (None for dynamic)
+        temperature: Sampling temperature
+        top_p: Top-p sampling parameter
+        repetition_penalty: Penalty for repeated tokens
+        use_chat_template: Apply tokenizer's chat template if available
+        json_output: Return JSON format instead of printing
+        verbose: Show detailed output
+
+    Returns:
+        Generated text if json_output=True, None otherwise
+    """
+    # Pre-flight check: Verify runtime compatibility before attempting to load
+    # This is a "best effort" check - if the model is in cache, verify it's compatible
+    # If not in cache or check fails, let the runner handle it (for tests and edge cases)
+    try:
+        resolved_name, commit_hash, ambiguous = resolve_model_for_operation(model_spec)
+
+        if ambiguous:
+            error_msg = f"Ambiguous model specification '{model_spec}'. Could be: {ambiguous}"
+            if json_output:
+                return f"Error: {error_msg}"
+            else:
+                print(f"Error: {error_msg}")
+                return None
+
+        # Only perform compatibility check if model is actually in cache
+        if resolved_name:
+            model_cache = get_current_model_cache()
+            model_cache_dir = model_cache / hf_to_cache_dir(resolved_name)
+
+            if model_cache_dir.exists():
+                snapshots_dir = model_cache_dir / "snapshots"
+                if snapshots_dir.exists():
+                    # Resolve snapshot path (commit-pinned or latest)
+                    model_path = None
+                    if commit_hash:
+                        model_path = snapshots_dir / commit_hash
+                    else:
+                        snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+                        if snapshots:
+                            model_path = max(snapshots, key=lambda x: x.stat().st_mtime)
+
+                    # Check runtime compatibility for both pinned and unpinned models
+                    if model_path and model_path.exists():
+                        # Read README front-matter for framework hints (e.g., private MLX models)
+                        fm = read_front_matter(model_path)
+                        framework = detect_framework(resolved_name, model_cache_dir, selected_path=model_path, fm=fm)
+                        compatible, reason = check_runtime_compatibility(model_path, framework)
+
+                        if not compatible:
+                            error_msg = f"Model '{resolved_name}' is not compatible: {reason}"
+                            if json_output:
+                                return f"Error: {error_msg}"
+                            else:
+                                print(f"Error: {error_msg}")
+                                return None
+
+    except Exception:
+        # Pre-flight check failed - let the runner handle it
+        # This preserves backward compatibility with tests and edge cases
+        pass
+
+    # Runtime compatibility verified, proceed with model loading
+    try:
+        with MLXRunner(model_spec, verbose=verbose) as runner:
+            # Interactive mode: no prompt provided
+            if prompt is None:
+                if json_output:
+                    print("Error: Interactive mode not compatible with JSON output")
+                    return None
+                return interactive_chat(
+                    runner, 
+                    stream=stream, 
+                    max_tokens=max_tokens,
+                    temperature=temperature,
+                    top_p=top_p,
+                    repetition_penalty=repetition_penalty,
+                    use_chat_template=use_chat_template,
+                    prepare_next_prompt=False
+                )
+            else:
+                # Single-shot mode: prompt provided  
+                return single_shot_generation(
+                    runner, 
+                    prompt, 
+                    stream=stream,
+                    max_tokens=max_tokens,
+                    temperature=temperature,
+                    top_p=top_p,
+                    repetition_penalty=repetition_penalty,
+                    use_chat_template=use_chat_template,
+                    json_output=json_output
+                )
+                    
+    except Exception as e:
+        if json_output:
+            return f"Error: {e}"
+        else:
+            print(f"Error: {e}")
+            return None
+
+
+def interactive_chat(
+    runner,
+    stream: bool = True,
+    max_tokens: Optional[int] = None,
+    temperature: float = 0.7,
+    top_p: float = 0.9,
+    repetition_penalty: float = 1.1,
+    use_chat_template: bool = True,
+    prepare_next_prompt: bool = False,
+):
+    """Interactive conversation mode with history tracking."""
+    print("Starting interactive chat. Type 'exit' or 'quit' to end.\n")
+    
+    conversation_history = []
+    
+    while True:
+        try:
+            user_input = input("You: ").strip()
+            
+            if user_input.lower() in ['exit', 'quit', 'q']:
+                print("\nGoodbye!")
+                break
+                
+            if not user_input:
+                continue
+                
+            # Add user message to conversation history
+            conversation_history.append({"role": "user", "content": user_input})
+            
+            # Format conversation using chat template
+            # Pass a shallow copy to avoid later mutations affecting captured args in tests
+            formatted_prompt = runner._format_conversation(conversation_history.copy())
+            
+            # Generate response
+            print("\nAssistant: ", end="", flush=True)
+            
+            if stream:
+                # Streaming mode
+                response_tokens = []
+                # Build standard params but be robust to mocks that don't accept them
+                params = dict(
+                    max_tokens=max_tokens,
+                    temperature=temperature,
+                    top_p=top_p,
+                    repetition_penalty=repetition_penalty,
+                    use_chat_template=False,
+                    use_chat_stop_tokens=True,
+                )
+                try:
+                    iterator = runner.generate_streaming(formatted_prompt, **params)
+                except TypeError:
+                    try:
+                        iterator = runner.generate_streaming(formatted_prompt)
+                    except TypeError:
+                        iterator = runner.generate_streaming()
+                for token in iterator:
+                    print(token, end="", flush=True)
+                    response_tokens.append(token)
+                response = "".join(response_tokens).strip()
+            else:
+                # Batch mode
+                params = dict(
+                    max_tokens=max_tokens,
+                    temperature=temperature,
+                    top_p=top_p,
+                    repetition_penalty=repetition_penalty,
+                    use_chat_template=False,
+                    use_chat_stop_tokens=True,
+                )
+                try:
+                    response = runner.generate_batch(formatted_prompt, **params)
+                except TypeError:
+                    try:
+                        response = runner.generate_batch(formatted_prompt)
+                    except TypeError:
+                        response = runner.generate_batch()
+                print(response)
+            
+            # Add assistant response to history
+            conversation_history.append({"role": "assistant", "content": response})
+            print()  # Newline after response
+            
+            # Optionally expose assistant message to template users without duplicating user entries
+            if prepare_next_prompt:
+                try:
+                    _ = runner._format_conversation([{"role": "assistant", "content": response}])
+                except Exception:
+                    pass
+            
+        except KeyboardInterrupt:
+            print("\n\nChat interrupted. Goodbye!")
+            break
+        except Exception as e:
+            print(f"\n[ERROR] {e}")
+            continue
+
+
+def single_shot_generation(
+    runner,
+    prompt: str,
+    stream: bool = True,
+    max_tokens: Optional[int] = None,
+    temperature: float = 0.7,
+    top_p: float = 0.9,
+    repetition_penalty: float = 1.1,
+    use_chat_template: bool = True,
+    json_output: bool = False
+) -> Optional[str]:
+    """Single prompt generation."""
+    if stream and not json_output:
+        # Streaming mode - print tokens as they arrive
+        generated_text = ""
+        for token in runner.generate_streaming(
+            prompt=prompt,
+            max_tokens=max_tokens,
+            temperature=temperature,
+            top_p=top_p,
+            repetition_penalty=repetition_penalty,
+            use_chat_template=use_chat_template,
+        ):
+            print(token, end="", flush=True)
+            generated_text += token
+        
+        if not json_output:
+            print()  # Final newline
+        
+        return generated_text if json_output else None
+    else:
+        # Batch mode - generate complete response
+        result = runner.generate_batch(
+            prompt=prompt,
+            max_tokens=max_tokens,
+            temperature=temperature,
+            top_p=top_p,
+            repetition_penalty=repetition_penalty,
+            use_chat_template=use_chat_template,
+        )
+        
+        if json_output:
+            return result
+        else:
+            print(result)
+            return None
+
+
+def run_model_enhanced(
+    model_spec: str,
+    prompt: str,
+    stream: bool = True,
+    max_tokens: Optional[int] = None,
+    temperature: float = 0.7,
+    top_p: float = 0.9,
+    repetition_penalty: float = 1.1,
+    repetition_context_size: int = 20,
+    use_chat_template: bool = True,
+    json_output: bool = False,
+    verbose: bool = False,
+    system_prompt: Optional[str] = None,
+    hide_reasoning: bool = False
+) -> Optional[str]:
+    """Enhanced run with additional parameters for future features.
+    
+    This function signature matches what will be needed for 2.0.0-beta.2
+    when system prompts and reasoning features are added.
+    
+    Args:
+        model_spec: Model specification or path
+        prompt: Input prompt
+        stream: Enable streaming output
+        max_tokens: Maximum tokens to generate
+        temperature: Sampling temperature
+        top_p: Top-p sampling parameter
+        repetition_penalty: Penalty for repeated tokens
+        repetition_context_size: Context size for repetition penalty
+        use_chat_template: Apply tokenizer's chat template
+        json_output: Return JSON format
+        verbose: Show detailed output
+        system_prompt: System prompt (future feature)
+        hide_reasoning: Hide reasoning output (future feature)
+        
+    Returns:
+        Generated text if json_output=True, None otherwise
+    """
+    # For now, forward to basic run_model
+    # TODO: Add system_prompt and hide_reasoning support in beta.2
+    if system_prompt:
+        print("Warning: System prompts not yet implemented in beta.1")
+    
+    return run_model(
+        model_spec=model_spec,
+        prompt=prompt,
+        stream=stream,
+        max_tokens=max_tokens,
+        temperature=temperature,
+        top_p=top_p,
+        repetition_penalty=repetition_penalty,
+        use_chat_template=use_chat_template,
+        json_output=json_output,
+        verbose=verbose
+    )
diff --git a/mlxk2/operations/serve.py b/mlxk2/operations/serve.py
new file mode 100644
index 0000000..1737cac
--- /dev/null
+++ b/mlxk2/operations/serve.py
@@ -0,0 +1,131 @@
+"""
+Server operation for 2.0 implementation.
+"""
+
+import os
+import signal
+import subprocess
+import sys
+import time
+from typing import Optional
+
+from ..core.server_base import run_server
+
+
+def _run_supervised_uvicorn(host: str, port: int, log_level: str, reload: bool = False) -> int:
+    """Run uvicorn as a supervised subprocess and handle Ctrl-C in parent.
+
+    Returns the subprocess' exit code.
+    """
+    cmd = [
+        sys.executable,
+        "-m",
+        "uvicorn",
+        "mlxk2.core.server_base:app",
+        "--host",
+        host,
+        "--port",
+        str(port),
+        "--log-level",
+        log_level,
+        "--workers",
+        "1",
+        "--timeout-keep-alive",
+        "5",
+        "--timeout-graceful-shutdown",
+        "5",
+        "--lifespan",
+        "on",
+    ]
+    if reload:
+        cmd.append("--reload")
+
+    # Start in a new session so we can signal the whole process group
+    proc = subprocess.Popen(
+        cmd,
+        start_new_session=True,
+    )
+
+    try:
+        return proc.wait()
+    except KeyboardInterrupt:
+        # Suppress further SIGINT while we clean up
+        previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
+        try:
+            # First Ctrl-C: ask child to stop gracefully
+            try:
+                os.killpg(proc.pid, signal.SIGTERM)
+            except Exception:
+                pass
+            # Wait briefly, then force kill if still alive
+            deadline = time.time() + 5.0
+            while time.time() < deadline:
+                ret = proc.poll()
+                if ret is not None:
+                    return ret
+                try:
+                    time.sleep(0.1)
+                except KeyboardInterrupt:
+                    # Second Ctrl-C: escalate to SIGKILL immediately
+                    break
+            try:
+                os.killpg(proc.pid, signal.SIGKILL)
+            except Exception:
+                pass
+            # Wait for child without being interrupted
+            while True:
+                ret = proc.poll()
+                if ret is not None:
+                    return ret
+                time.sleep(0.05)
+        finally:
+            # Restore previous handler
+            try:
+                signal.signal(signal.SIGINT, previous)
+            except Exception:
+                pass
+
+
+def start_server(
+    model: Optional[str] = None,
+    port: int = 8000,
+    host: str = "127.0.0.1",
+    max_tokens: Optional[int] = None,
+    reload: bool = False,
+    log_level: str = "info",
+    verbose: bool = False,
+    supervise: bool = True,
+) -> None:
+    """Start OpenAI-compatible API server for MLX models.
+
+    Args:
+        model: Specific model to load on startup (optional)
+        port: Port to bind the server to
+        host: Host address to bind to
+        max_tokens: Default maximum tokens for generation
+        reload: Enable auto-reload for development
+        log_level: Logging level
+        verbose: Show detailed output
+        supervise: Run uvicorn in a supervised subprocess for instant Ctrl-C
+    """
+    if verbose:
+        print("Starting MLX Knife Server 2.0...")
+        if model:
+            print(f"Pre-loading model: {model}")
+        print(f"Server will bind to: http://{host}:{port}")
+
+    if supervise:
+        # Pass log_level via environment to subprocess (ADR-004)
+        os.environ["MLXK2_LOG_LEVEL"] = log_level
+        # Delegate to subprocess-managed uvicorn
+        _ = _run_supervised_uvicorn(host=host, port=port, log_level=log_level, reload=reload)
+        return
+
+    # Default: run uvicorn in-process
+    run_server(
+        host=host,
+        port=port,
+        max_tokens=max_tokens,
+        reload=reload,
+        log_level=log_level,
+    )
diff --git a/mlxk2/operations/show.py b/mlxk2/operations/show.py
new file mode 100644
index 0000000..66a91e2
--- /dev/null
+++ b/mlxk2/operations/show.py
@@ -0,0 +1,205 @@
+"""Show model operation for MLX-Knife 2.0."""
+
+import json
+from typing import Dict, Any
+
+from ..core.cache import get_current_model_cache, hf_to_cache_dir
+from ..core.model_resolution import resolve_model_for_operation
+from .common import build_model_object
+
+
+def get_file_type(file_name):
+    """Determine file type based on file name."""
+    if file_name == "config.json":
+        return "config"
+    elif file_name.endswith((".safetensors", ".bin", ".gguf")):
+        return "weights"
+    elif "tokenizer" in file_name.lower():
+        return "tokenizer"
+    elif file_name.endswith(".json"):
+        return "config"
+    elif file_name == "README.md":
+        return "readme"
+    else:
+        return "other"
+
+
+def get_model_files(model_path):
+    """Get list of files in model directory with type classification."""
+    files = []
+    
+    if not model_path.exists():
+        return files
+        
+    for file_path in sorted(model_path.rglob("*")):
+        if file_path.is_file():
+            size_bytes = file_path.stat().st_size
+            if size_bytes >= 1_000_000_000:
+                size_str = f"{size_bytes / 1_000_000_000:.1f}GB"
+            elif size_bytes >= 1_000_000:
+                size_str = f"{size_bytes / 1_000_000:.1f}MB"
+            elif size_bytes >= 1_000:
+                size_str = f"{size_bytes / 1_000:.1f}KB"
+            else:
+                size_str = f"{size_bytes}B"
+                
+            files.append({
+                "name": file_path.name,
+                "size": size_str,
+                "type": get_file_type(file_path.name)
+            })
+    
+    return files
+
+
+def extract_model_metadata(model_path):
+    """Extract metadata from config.json if available."""
+    config_path = model_path / "config.json"
+    if not config_path.exists():
+        return None
+        
+    try:
+        with open(config_path) as f:
+            config = json.load(f)
+            
+        # Extract common metadata fields
+        metadata = {}
+        
+        # Model architecture
+        if "model_type" in config:
+            metadata["model_type"] = config["model_type"]
+        if "architectures" in config and config["architectures"]:
+            metadata["architecture"] = config["architectures"][0]
+            
+        # Quantization info
+        if "quantization_config" in config:
+            quant = config["quantization_config"]
+            if "bits" in quant:
+                metadata["quantization"] = f"{quant['bits']}bit"
+        
+        # Size parameters
+        if "max_position_embeddings" in config:
+            metadata["context_length"] = config["max_position_embeddings"]
+        if "vocab_size" in config:
+            metadata["vocab_size"] = config["vocab_size"]
+        if "hidden_size" in config:
+            metadata["hidden_size"] = config["hidden_size"]
+        if "num_attention_heads" in config:
+            metadata["num_attention_heads"] = config["num_attention_heads"]
+        if "num_hidden_layers" in config:
+            metadata["num_hidden_layers"] = config["num_hidden_layers"]
+            
+        return metadata if metadata else None
+        
+    except (OSError, json.JSONDecodeError):
+        return None
+
+
+def get_config_content(model_path):
+    """Get config.json content as parsed JSON."""
+    config_path = model_path / "config.json"
+    if not config_path.exists():
+        return None
+        
+    try:
+        with open(config_path) as f:
+            return json.load(f)
+    except (OSError, json.JSONDecodeError):
+        return None
+
+
+def _is_40_hex(s: str) -> bool:
+    return len(s) == 40 and all(c in "0123456789abcdef" for c in s.lower())
+
+
+def show_model_operation(model_pattern: str, include_files: bool = False, include_config: bool = False) -> Dict[str, Any]:
+    """Show detailed model information."""
+    result = {
+        "status": "success",
+        "command": "show",
+        "data": None,
+        "error": None
+    }
+    
+    try:
+        # Resolve model name and hash
+        resolved_name, commit_hash, ambiguous_matches = resolve_model_for_operation(model_pattern)
+        
+        if ambiguous_matches:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "ambiguous_match",
+                "message": f"Multiple models match '{model_pattern}'",
+                "matches": ambiguous_matches
+            }
+            return result
+            
+        if not resolved_name:
+            result["status"] = "error"
+            result["error"] = {
+                "type": "model_not_found",
+                "message": f"No model found matching '{model_pattern}'"
+            }
+            return result
+            
+        # Get model directory
+        model_cache_dir = get_current_model_cache() / hf_to_cache_dir(resolved_name)
+        if not model_cache_dir.exists():
+            result["status"] = "error"
+            result["error"] = {
+                "type": "model_not_cached",
+                "message": f"Model '{resolved_name}' not found in cache"
+            }
+            return result
+            
+        # Find the correct snapshot
+        snapshots_dir = model_cache_dir / "snapshots"
+        model_path = None
+        
+        if commit_hash and snapshots_dir.exists():
+            # Specific hash requested
+            hash_path = snapshots_dir / commit_hash
+            if hash_path.exists():
+                model_path = hash_path
+            else:
+                result["status"] = "error"
+                result["error"] = {
+                    "type": "hash_not_found",
+                    "message": f"Hash '{commit_hash}' not found for model '{resolved_name}'"
+                }
+                return result
+        elif snapshots_dir.exists():
+            # Use latest snapshot
+            snapshots = [d for d in snapshots_dir.iterdir() if d.is_dir()]
+            if snapshots:
+                model_path = max(snapshots, key=lambda x: x.stat().st_mtime)
+                commit_hash = model_path.name
+        
+        if not model_path:
+            model_path = model_cache_dir
+            
+        # Build unified model object
+        model_obj = build_model_object(resolved_name, model_cache_dir, model_path)
+
+        # Build response data
+        data = {"model": model_obj}
+        
+        if include_files:
+            data["files"] = get_model_files(model_path)
+            data["metadata"] = None
+        elif include_config:
+            data["config"] = get_config_content(model_path)
+            data["metadata"] = None
+        else:
+            data["metadata"] = extract_model_metadata(model_path)
+            
+        result["data"] = data
+        
+    except Exception as e:
+        result["status"] = "error"
+        result["error"] = {
+            "type": "show_operation_failed",
+            "message": str(e)
+        }
+        
+    return result
diff --git a/mlxk2/output/__init__.py b/mlxk2/output/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/mlxk2/output/human.py b/mlxk2/output/human.py
new file mode 100644
index 0000000..b0d33c4
--- /dev/null
+++ b/mlxk2/output/human.py
@@ -0,0 +1,417 @@
+from datetime import datetime, timezone
+from typing import Any, Dict, List, Optional
+
+
+def humanize_size(num_bytes: Optional[int]) -> str:
+    if not isinstance(num_bytes, int):
+        return "-"
+    n = float(num_bytes)
+    for unit in ["B", "KB", "MB", "GB", "TB"]:
+        if n < 1000:
+            return f"{n:.0f}{unit}" if unit == "B" else f"{n:.1f}{unit}"
+        n /= 1000.0
+    return f"{n:.1f}PB"
+
+
+def fmt_hash7(h: Optional[str]) -> str:
+    if not h:
+        return "-"
+    return h[:7]
+
+
+def fmt_time(iso_utc_z: Optional[str]) -> str:
+    if not iso_utc_z:
+        return "-"
+    try:
+        # Expected like 2025-08-30T12:34:56Z (UTC)
+        dt = datetime.strptime(iso_utc_z, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
+        now = datetime.now(timezone.utc)
+        delta = now - dt
+        seconds = int(delta.total_seconds())
+
+        if seconds < 45:
+            return "just now"
+        if seconds < 90:
+            return "1m ago"
+        minutes = round(seconds / 60)
+        if minutes < 45:
+            return f"{minutes}m ago"
+        if minutes < 90:
+            return "1h ago"
+        hours = round(minutes / 60)
+        if hours < 24:
+            return f"{hours}h ago"
+        if hours < 36:
+            return "1d ago"
+        days = round(hours / 24)
+        if days < 30:
+            return f"{days}d ago"
+        # For older entries, fall back to a compact date
+        return dt.strftime("%Y-%m-%d")
+    except Exception:
+        return iso_utc_z
+
+
+def _table(rows: List[List[str]], headers: List[str], max_col_width: Optional[int] = None) -> str:
+    """
+    Build a table with optional column width limit for last column.
+
+    Args:
+        rows: Table rows
+        headers: Column headers
+        max_col_width: If set, limits last column to this width (wraps text to new lines)
+    """
+    widths = [len(h) for h in headers]
+    for r in rows:
+        for i, cell in enumerate(r):
+            if i < len(widths):
+                widths[i] = max(widths[i], len(cell))
+            else:
+                widths.append(len(cell))
+
+    # Apply max width limit to last column if specified
+    if max_col_width and len(widths) > 0:
+        widths[-1] = min(widths[-1], max_col_width)
+
+    def fmt_row(cols: List[str]) -> str:
+        return " | ".join(col.ljust(widths[i]) for i, col in enumerate(cols))
+
+    def wrap_cell(text: str, width: int) -> List[str]:
+        """Wrap text to width, breaking at word boundaries."""
+        if len(text) <= width:
+            return [text]
+        words = text.split()
+        lines = []
+        current = []
+        current_len = 0
+        for word in words:
+            word_len = len(word)
+            if current and current_len + 1 + word_len > width:
+                lines.append(" ".join(current))
+                current = [word]
+                current_len = word_len
+            else:
+                current.append(word)
+                current_len += (1 if current_len > 0 else 0) + word_len
+        if current:
+            lines.append(" ".join(current))
+        return lines
+
+    lines = []
+    lines.append(fmt_row(headers))
+    lines.append("-+-".join("-" * w for w in widths))
+
+    for r in rows:
+        # Check if last column needs wrapping
+        if max_col_width and len(r) > 0 and len(r[-1]) > max_col_width:
+            wrapped_lines = wrap_cell(r[-1], max_col_width)
+            # First line with all columns
+            first_row = r[:-1] + [wrapped_lines[0]]
+            lines.append(fmt_row(first_row))
+            # Additional lines with empty cells except last column
+            for wrapped_line in wrapped_lines[1:]:
+                continuation_row = [""] * (len(r) - 1) + [wrapped_line]
+                lines.append(fmt_row(continuation_row))
+        else:
+            lines.append(fmt_row(r))
+
+    return "\n".join(lines)
+
+
+def render_list(data: Dict[str, Any], show_health: bool, show_all: bool, verbose: bool) -> str:
+    models: List[Dict[str, Any]] = data.get("data", {}).get("models", [])
+    compact = (not show_all) and (not verbose)
+    if compact:
+        headers = ["Name", "Hash", "Size", "Modified", "Type"]
+    else:
+        headers = ["Name", "Hash", "Size", "Modified", "Framework", "Type"]
+    if show_health:
+        if verbose:
+            # Verbose mode: split health into Integrity + Runtime + Reason columns
+            headers.extend(["Integrity", "Runtime", "Reason"])
+        else:
+            # Compact mode: single Health column
+            headers.append("Health")
+
+    # Human filter:
+    # - --all: show everything
+    # - default: show only MLX chat models (safer for run/server selection)
+    # - --verbose (without --all): show all MLX models (chat + base)
+    filtered: List[Dict[str, Any]] = []
+    for m in models:
+        fw = str(m.get("framework", "")).upper()
+        typ = str(m.get("model_type", "")).lower()
+        if show_all:
+            filtered.append(m)
+        else:
+            if fw != "MLX":
+                continue
+            if verbose:
+                # In verbose mode, show all MLX models
+                filtered.append(m)
+            else:
+                # Default compact mode: only MLX chat
+                if typ == "chat":
+                    filtered.append(m)
+
+    rows: List[List[str]] = []
+    for m in filtered:
+        name = str(m.get("name", "-"))
+        if not verbose and name.startswith("mlx-community/"):
+            # Compact name without the default org prefix
+            name = name.split("/", 1)[1]
+        if compact:
+            row = [
+                name,
+                fmt_hash7(m.get("hash")),
+                humanize_size(m.get("size_bytes")),
+                fmt_time(m.get("last_modified")),
+                str(m.get("model_type", "-")),
+            ]
+        else:
+            row = [
+                name,
+                fmt_hash7(m.get("hash")),
+                humanize_size(m.get("size_bytes")),
+                fmt_time(m.get("last_modified")),
+                str(m.get("framework", "-")),
+                str(m.get("model_type", "-")),
+            ]
+        if show_health:
+            if verbose:
+                # Verbose mode: Integrity | Runtime | Reason columns
+                health = m.get("health", "unknown")
+                runtime_compatible = m.get("runtime_compatible")
+                reason = m.get("reason", "")
+
+                # Integrity column
+                integrity = "healthy" if health == "healthy" else "unhealthy" if health == "unhealthy" else "-"
+
+                # Runtime column (only meaningful if integrity is healthy)
+                if health == "healthy" and runtime_compatible is not None:
+                    runtime = "yes" if runtime_compatible else "no"
+                else:
+                    runtime = "-"
+
+                # Reason column (truncate to 60 chars)
+                reason_str = str(reason) if reason else "-"
+                if len(reason_str) > 60:
+                    reason_str = reason_str[:57] + "..."
+
+                row.extend([integrity, runtime, reason_str])
+            else:
+                # Compact mode: single Health column (healthy/healthy*/unhealthy)
+                health = m.get("health", "unknown")
+                runtime_compatible = m.get("runtime_compatible")
+
+                if health == "healthy":
+                    if runtime_compatible is True:
+                        health_str = "healthy"
+                    elif runtime_compatible is False:
+                        health_str = "healthy*"
+                    else:
+                        # No runtime check performed
+                        health_str = "healthy"
+                elif health == "unhealthy":
+                    health_str = "unhealthy"
+                else:
+                    health_str = "-"
+
+                row.append(health_str)
+        rows.append(row)
+
+    # Note: show_all/verbose are reserved for future detail; table remains deterministic
+    # Apply 26 char limit to Reason column in verbose mode
+    max_col_width = 26 if (show_health and verbose) else None
+    return _table(rows, headers, max_col_width=max_col_width)
+
+
+def render_health(data: Dict[str, Any]) -> str:
+    d = data.get("data", {})
+    summary = d.get("summary", {})
+    total = summary.get("total", 0)
+    healthy_count = summary.get("healthy_count", 0)
+    unhealthy_count = summary.get("unhealthy_count", 0)
+
+    lines = [f"Summary: total {total}, healthy {healthy_count}, unhealthy {unhealthy_count}"]
+    for entry in d.get("healthy", []):
+        lines.append(f"healthy   {entry.get('name','-')} — {entry.get('reason','')}".rstrip())
+    for entry in d.get("unhealthy", []):
+        lines.append(f"unhealthy {entry.get('name','-')} — {entry.get('reason','')}".rstrip())
+    return "\n".join(lines)
+
+
+def render_show(data: Dict[str, Any]) -> str:
+    d = data.get("data", {})
+    model = d.get("model", {})
+    name = model.get("name", "-")
+    h7 = fmt_hash7(model.get("hash"))
+    header = f"Model: {name}{('@'+h7) if h7 != '-' else ''}"
+
+    # Build health status string
+    health = model.get('health', '-')
+    runtime_compatible = model.get('runtime_compatible')
+    if health == 'healthy' and runtime_compatible is True:
+        health_str = 'healthy'
+    elif health == 'healthy' and runtime_compatible is False:
+        health_str = 'healthy (files OK, runtime incompatible)'
+    else:
+        health_str = health
+
+    details = [
+        f"Framework: {model.get('framework','-')}",
+        f"Type: {model.get('model_type','-')}",
+        f"Size: {humanize_size(model.get('size_bytes'))}",
+        f"Modified: {fmt_time(model.get('last_modified'))}",
+        f"Health: {health_str}",
+    ]
+
+    # Add reason if present
+    reason = model.get('reason')
+    if reason:
+        details.append(f"Reason: {reason}")
+
+    # Optional sections
+    out: List[str] = [header, *details]
+    if "files" in d and isinstance(d["files"], list):
+        out.append("")
+        out.append("Files:")
+        for f in d["files"]:
+            out.append(f"  - {f.get('name','?')} ({f.get('type','other')}, {f.get('size','?')})")
+    elif "config" in d and isinstance(d["config"], dict):
+        out.append("")
+        out.append("Config:")
+        for k, v in d["config"].items():
+            out.append(f"  {k}: {v}")
+    elif d.get("metadata"):
+        out.append("")
+        out.append("Metadata:")
+        for k, v in d["metadata"].items():
+            out.append(f"  {k}: {v}")
+    return "\n".join(out)
+
+
+def render_pull(data: Dict[str, Any]) -> str:
+    d = data.get("data", {})
+    status = data.get("status", "error")
+    model = d.get("model", "-")
+    msg = d.get("message", "")
+    if status == "success":
+        return f"pull: {model} — {msg}".rstrip()
+    err = data.get("error", {})
+    return f"pull: {model} — {err.get('message', msg)}".rstrip()
+
+
+def render_rm(data: Dict[str, Any]) -> str:
+    d = data.get("data", {})
+    status = data.get("status", "error")
+    model = d.get("model", "-")
+    action = d.get("action", "-")
+    msg = d.get("message", "")
+    if status == "success":
+        return f"rm: {model} — {action}: {msg}".rstrip()
+    err = data.get("error", {})
+    return f"rm: {model} — {err.get('message', msg)}".rstrip()
+
+
+def render_clone(data: Dict[str, Any], quiet: bool = False) -> str:
+    """Render clone operation result for human output."""
+    d = data.get("data", {})
+    status = data.get("status", "error")
+    model = d.get("model", "-")
+    target_dir = d.get("target_dir", "-")
+    msg = d.get("message", "")
+    clone_status = d.get("clone_status", "unknown")
+
+    if status == "success":
+        if quiet:
+            return f"clone: {model} → {target_dir}"
+
+        # Show additional info for successful clone
+        cache_cleanup = d.get("cache_cleanup", False)
+        health_check = d.get("health_check", True)
+
+        status_parts = []
+        if health_check:
+            status_parts.append("✓ health")
+        if cache_cleanup:
+            status_parts.append("✓ cleanup")
+
+        status_info = f" ({', '.join(status_parts)})" if status_parts else ""
+        return f"clone: {model} → {target_dir}{status_info} — {msg}".rstrip()
+
+    # Error case
+    err = data.get("error", {})
+    error_msg = err.get("message", msg)
+
+    # Show the specific phase where it failed
+    if clone_status in ["pull_failed", "health_check_failed", "copy_failed", "cache_not_found"]:
+        phase = clone_status.replace("_", " ")
+        return f"clone: {model} → {target_dir} — {phase}: {error_msg}".rstrip()
+
+    return f"clone: {model} → {target_dir} — {error_msg}".rstrip()
+
+
+def render_push(data: Dict[str, Any], verbose: bool = False) -> str:
+    d = data.get("data", {})
+    status = data.get("status", "error")
+    repo = d.get("repo_id", "-")
+    branch = d.get("branch", "-")
+    cs = d.get("commit_sha")
+    h7 = cs[:7] if isinstance(cs, str) and len(cs) >= 7 else "-"
+    prefix = "push (experimental):"
+    # Dry-run handling
+    if d.get("dry_run"):
+        if d.get("no_changes") is True:
+            return f"{prefix} {repo}@{branch} — dry-run: no changes".rstrip()
+        summ = d.get("dry_run_summary") or d.get("change_summary") or {}
+        added = summ.get("added")
+        modified = summ.get("modified")
+        deleted = summ.get("deleted")
+        mod_part = str(modified) if isinstance(modified, int) else "?"
+        line = f"{prefix} {repo}@{branch} — dry-run: +{added or 0} ~{mod_part} -{deleted or 0}"
+        if verbose and (d.get("would_create_repo") or d.get("would_create_branch")):
+            hints = []
+            if d.get("would_create_repo"):
+                hints.append("create repo")
+            if d.get("would_create_branch"):
+                hints.append("create branch")
+            if hints:
+                line = f"{line} ({', '.join(hints)})"
+        return line.rstrip()
+    if status == "success":
+        if d.get("no_changes"):
+            msg = d.get("message")
+            base = f"{prefix} {repo}@{branch} — no changes"
+            if verbose and isinstance(msg, str) and msg and "no changes" not in msg.lower():
+                return f"{base} ({msg})".rstrip()
+            return base.rstrip()
+        # If we have a commit, show it and include a compact summary when available
+        if isinstance(cs, str) and cs:
+            summary = d.get("change_summary") or {}
+            added = summary.get("added")
+            modified = summary.get("modified")
+            deleted = summary.get("deleted")
+            if all(isinstance(x, int) for x in (added, modified, deleted)):
+                line = f"{prefix} {repo}@{branch} — commit {h7} (+{added} ~{modified} -{deleted})"
+            else:
+                line = f"{prefix} {repo}@{branch} — commit {h7}"
+
+            # Workaround: Show important warnings from message (e.g., APFS warning)
+            msg = d.get("message", "")
+            if isinstance(msg, str) and "Clone operations require APFS filesystem" in msg:
+                line = f"{line} (Clone operations require APFS filesystem)"
+
+            if verbose:
+                url = d.get("commit_url")
+                if isinstance(url, str) and url:
+                    line = f"{line} <{url}>"
+            return line.rstrip()
+        # Fallback
+        msg = d.get("message")
+        if isinstance(msg, str) and msg:
+            return f"{prefix} {repo}@{branch} — {msg}".rstrip()
+        return f"{prefix} {repo}@{branch} — done".rstrip()
+    err = data.get("error", {})
+    msg = err.get("message", "")
+    return f"{prefix} {repo}@{branch} — {msg}".rstrip()
diff --git a/mlxk2/spec.py b/mlxk2/spec.py
new file mode 100644
index 0000000..42f6eb2
--- /dev/null
+++ b/mlxk2/spec.py
@@ -0,0 +1,7 @@
+"""Spec constants for MLX-Knife 2.0 JSON API.
+
+Single source of truth for the JSON API specification version used by the
+current code and tests. Keep this in sync with docs/json-api-specification.md.
+"""
+
+JSON_API_SPEC_VERSION = "0.1.5"
diff --git a/pyproject.toml b/pyproject.toml
index 267d8c4..872b4cb 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,10 +5,10 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "mlx-knife"
 dynamic = ["version"]
-description = "ollama-style CLI for MLX models on Apple Silicon"
+description = "HuggingFace model management for MLX on Apple Silicon"
 readme = "README.md"
 requires-python = ">=3.9"
-license = {text = "MIT"}
+license = {text = "Apache-2.0"}
 authors = [
     {name = "The BROKE team", email = "broke@gmx.eu"},
 ]
@@ -24,135 +24,48 @@ classifiers = [
     "Programming Language :: Python :: 3.13",
     "Operating System :: MacOS",
     "Environment :: Console",
+    "License :: OSI Approved :: Apache Software License",
 ]
 dependencies = [
     "huggingface-hub>=0.34.0",
     "requests>=2.32.0",
+    "mlx-lm>=0.28.3",
     "mlx>=0.29.0",
-    "mlx-lm>=0.27.0",
     "fastapi>=0.116.0",
     "uvicorn>=0.35.0",
     "pydantic>=2.11.0",
+    "httpx>=0.27.0",
 ]
 
-[project.optional-dependencies]
-test = [
-    "pytest>=7.4.0",
-    "pytest-asyncio>=0.21.0",
-    "pytest-timeout>=2.1.0",
-    "psutil>=5.9.0",
-    "pytest-mock>=3.11.0",
-    "pytest-cov>=4.1.0"
-]
-dev = [
-    "ruff>=0.1.0",
-    "mypy>=1.7.0",
-    "types-requests>=2.31.0"
-]
+[project.scripts]
+mlxk = "mlxk2.cli:main"
+mlxk-json = "mlxk2.cli:main"
+mlxk2 = "mlxk2.cli:main"
 
 [project.urls]
 Homepage = "https://github.com/mzau/mlx-knife"
+Repository = "https://github.com/mzau/mlx-knife"
 Issues = "https://github.com/mzau/mlx-knife/issues"
 
-[project.scripts]
-mlxk = "mlx_knife.cli:main"
-mlx-knife = "mlx_knife.cli:main"
-mlx_knife = "mlx_knife.cli:main"
-
-[tool.setuptools]
-packages = ["mlx_knife"]
+[tool.setuptools.packages.find]
+include = ["mlxk2*"]
+exclude = ["tests*", "tests_2.0*"]
 
 [tool.setuptools.dynamic]
-version = {attr = "mlx_knife.__version__"}
+version = {attr = "mlxk2.__version__"}
 
-[tool.pytest.ini_options]
-testpaths = ["tests"]
-python_files = "test_*.py"
-python_classes = "Test*"
-python_functions = "test_*"
-addopts = [
-    "-v",
-    "--tb=short", 
-    "--strict-markers",
-    "--disable-warnings",
-    "--durations=10",
-    "-m not server"
+[project.optional-dependencies]
+test = [
+    "pytest>=7",
+    "jsonschema>=4.20",
 ]
-markers = [
-    "integration: integration tests (slower)",
-    "unit: unit tests (faster)",
-    "slow: slow running tests",
-    "requires_model: tests that need actual MLX models",
-    "network: tests that require network access",
-    "server: tests that require MLX Knife server with loaded models (manual setup required)",
-    "timeout: tests with timeout requirements",
-    "framework_validation: tests that require diverse model frameworks"
-]
-timeout = 300
-timeout_method = "thread"
-timeout_func_only = true
-norecursedirs = [".git", ".tox", "dist", "build", "*.egg", "venv", "__pycache__"]
-minversion = "6.0"
-
-[tool.ruff]
-target-version = "py39"
-line-length = 88
-extend-exclude = [
-    ".git",
-    "__pycache__",
-    "venv*",
-    ".venv",
-    "build",
-    "dist"
+dev = [
+    "ruff>=0.1.0",
+    "mypy>=1.5.0",
 ]
 
-[tool.ruff.lint]
-select = [
-    "E",   # pycodestyle errors
-    "W",   # pycodestyle warnings
-    "F",   # pyflakes
-    "I",   # isort
-    "B",   # flake8-bugbear
-    "C4",  # flake8-comprehensions
-    "UP",  # pyupgrade
+[tool.setuptools]
+license-files = [
+  "LICENSE",
+  "mlxk2/NOTICE",
 ]
-ignore = [
-    "E501",  # line too long (handled by formatter)
-    "B008",  # do not perform function calls in argument defaults
-    # Python 3.9 compatibility policy - keep legacy typing for maximum compatibility
-    "UP006", # Use list instead of List (keep typing.List for Python 3.9 compat)
-    "UP035", # typing.Dict is deprecated (keep typing.Dict for Python 3.9 compat)
-    # Temporary ignores for release - TODO: fix these in future versions
-    "E402",  # Module level import not at top of file
-    "E722",  # Do not use bare except
-    "W293",  # Blank line contains whitespace
-    "C414",  # Unnecessary list() call
-    "B904",  # Exception handling (raise from)
-]
-
-[tool.ruff.lint.per-file-ignores]
-"tests/*" = ["B011"]  # assert False in tests is ok
-
-[tool.mypy]
-python_version = "3.9"
-warn_return_any = true
-warn_unused_configs = true
-disallow_untyped_defs = true
-disallow_incomplete_defs = true
-check_untyped_defs = true
-disallow_untyped_decorators = true
-no_implicit_optional = true
-warn_redundant_casts = true
-warn_unused_ignores = true
-warn_no_return = true
-warn_unreachable = true
-strict_equality = true
-show_error_codes = true
-
-[[tool.mypy.overrides]]
-module = [
-    "mlx.*",
-    "mlx_lm.*",
-    "huggingface_hub.*"
-]
-ignore_missing_imports = true
diff --git a/pytest.ini b/pytest.ini
new file mode 100644
index 0000000..5cc4c1f
--- /dev/null
+++ b/pytest.ini
@@ -0,0 +1,17 @@
+[pytest]
+testpaths = tests_2.0
+python_files = test_*.py
+python_classes = Test*
+python_functions = test_*
+markers =
+    spec: JSON API contract tests (current spec only)
+    wet: Opt-in live tests against Hugging Face (require env)
+    live_push: Alias for wet; push live tests (require env)
+    live_list: Alias for wet; list human live tests (require env)
+    live_clone: Alias for wet; clone live tests (require env, ADR-007 Phase 1)
+    live_run: Opt-in run command tests with real models (require user cache model)
+    live_stop_tokens: Opt-in stop token tests with real models (Issue #32, ADR-009)
+    issue27: Real-model health policy tests (opt-in; read-only user cache)
+    slow: Tests that take >1 minute to run
+filterwarnings =
+    ignore::urllib3.exceptions.NotOpenSSLWarning
diff --git a/requirements.txt b/requirements.txt
index b10dd0c..831897a 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,7 +3,7 @@
 
 huggingface-hub>=0.34.0
 requests>=2.32.0
-mlx-lm>=0.27.0  # For running MLX models with streaming support
+mlx-lm>=0.28.3
 mlx>=0.29.0     # Core MLX library
 
 # API Server dependencies (for 'mlxk server' command)
@@ -11,4 +11,7 @@ fastapi>=0.116.0
 uvicorn>=0.35.0
 pydantic>=2.11.0
 
-# Note: Python 3.9+ supported, tested on Apple Silicon M1/M2/M3
\ No newline at end of file
+# Test dependencies (for FastAPI TestClient)
+httpx>=0.27.0
+
+# Note: Python 3.9+ supported, tested on Apple Silicon M1/M2/M3
diff --git a/scripts/check-spec-bump.sh b/scripts/check-spec-bump.sh
new file mode 100755
index 0000000..a0e1632
--- /dev/null
+++ b/scripts/check-spec-bump.sh
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Simple local/CI guard: if the spec docs or schema changed, require bump in mlxk2/spec.py
+# Bypass: include [no-spec-bump] or [skip-spec-bump] in the latest commit message, or set SPEC_BUMP_BYPASS=1
+
+BASE_REF=${1:-}
+
+if [[ -z "${BASE_REF}" ]]; then
+  # Try to find a reasonable base (main branch); fall back to first commit
+  if git show-ref --verify --quiet refs/heads/main; then
+    BASE_REF="main"
+  elif git show-ref --verify --quiet refs/remotes/origin/main; then
+    BASE_REF="origin/main"
+  else
+    BASE_REF=$(git rev-list --max-parents=0 HEAD)
+  fi
+fi
+
+changed_files=$(git diff --name-only "${BASE_REF}"...HEAD)
+
+spec_changed=false
+spec_files=("docs/json-api-specification.md" "docs/json-api-schema.json")
+for f in ${spec_files[@]}; do
+  if echo "${changed_files}" | grep -q "^${f}$"; then
+    spec_changed=true
+  fi
+done
+
+if [[ "${spec_changed}" != "true" ]]; then
+  echo "Spec files unchanged relative to ${BASE_REF}. OK."
+  exit 0
+fi
+
+if [[ "${SPEC_BUMP_BYPASS:-}" == "1" ]]; then
+  echo "Bypass via SPEC_BUMP_BYPASS=1. Skipping spec bump check."
+  exit 0
+fi
+
+last_msg=$(git log -1 --pretty=%B)
+if echo "${last_msg}" | grep -Eqi "\[(no-spec-bump|skip-spec-bump)\]"; then
+  echo "Bypass via commit message token [no-spec-bump]/[skip-spec-bump]."
+  exit 0
+fi
+
+if ! echo "${changed_files}" | grep -q "^mlxk2/spec.py$"; then
+  echo "ERROR: Spec docs or schema changed without version bump in mlxk2/spec.py" >&2
+  echo " - Changed spec files: $(echo "${changed_files}" | grep -E '^(docs/json-api-specification.md|docs/json-api-schema.json)$' | tr '\n' ' ')" >&2
+  echo " - Please update JSON_API_SPEC_VERSION in mlxk2/spec.py and adjust tests accordingly." >&2
+  echo " - To bypass for editorial changes, add [no-spec-bump] to the commit message or set SPEC_BUMP_BYPASS=1." >&2
+  exit 1
+fi
+
+echo "Spec change detected and version bump present in mlxk2/spec.py. OK."
+
diff --git a/scripts/list-index-models.sh b/scripts/list-index-models.sh
new file mode 100755
index 0000000..1515351
--- /dev/null
+++ b/scripts/list-index-models.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+
+# List Hugging Face models in the user cache that have an index file
+# (model.safetensors.index.json or pytorch_model.bin.index.json).
+#
+# Usage:
+#   bash scripts/list-index-models.sh [HF_CACHE_ROOT]
+#
+# Resolution order for HF cache root:
+#   1) first CLI arg
+#   2) $MLXK2_USER_HF_HOME
+#   3) $HF_HOME
+
+set -euo pipefail
+
+BASE="${1:-${MLXK2_USER_HF_HOME:-${HF_HOME:-}}}"
+if [[ -z "${BASE}" ]]; then
+  echo "Usage: $0 [HF_CACHE_ROOT]" >&2
+  echo "Hint: export MLXK2_USER_HF_HOME=/path/to/huggingface/cache" >&2
+  exit 1
+fi
+
+HUB_DIR="${BASE%/}/hub"
+if [[ ! -d "${HUB_DIR}" ]]; then
+  echo "Error: '${HUB_DIR}' not found. Expected HF cache layout at: ${BASE}" >&2
+  exit 2
+fi
+
+# Find index files and turn cache directories back into repo ids (org/model)
+# models--org--model[/optional/segments]/snapshots/<hash>/...
+RESULTS=$(find "${HUB_DIR}" -type f \( -name 'model.safetensors.index.json' -o -name 'pytorch_model.bin.index.json' \) 2>/dev/null \
+  | sed -E 's#.*/hub/models--(.*)/snapshots/.*#\1#; s#--#/#g' \
+  | sort -u || true)
+
+if [[ -z "${RESULTS}" ]]; then
+  echo "No index-bearing models found under: ${HUB_DIR}" >&2
+  exit 0
+fi
+
+echo "Index-bearing models in cache (${HUB_DIR}):"
+echo "${RESULTS}"
+
diff --git a/scripts/push-test-workspace.sh b/scripts/push-test-workspace.sh
new file mode 100755
index 0000000..6ebc785
--- /dev/null
+++ b/scripts/push-test-workspace.sh
@@ -0,0 +1,50 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Simple helper to push a local test workspace to Hugging Face.
+# Usage: scripts/push-test-workspace.sh <org/model> [branch] [commit_message]
+
+REPO_ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+WS_DIR="${REPO_ROOT_DIR}/mymodel_test_workspace"
+
+REPO_ID=${1:-}
+BRANCH=${2:-main}
+COMMIT_MSG=${3:-"mlx-knife push (test workspace)"}
+
+if [[ -z "${REPO_ID}" ]]; then
+  echo "Usage: $0 <org/model> [branch] [commit_message]" >&2
+  exit 2
+fi
+
+if [[ -z "${HF_TOKEN:-}" ]]; then
+  echo "HF_TOKEN is not set; export a write-enabled token" >&2
+  exit 2
+fi
+
+# Prepare workspace (ignored by Git via .gitignore)
+mkdir -p "${WS_DIR}"
+if [[ ! -f "${WS_DIR}/README.md" ]]; then
+  cat >"${WS_DIR}/README.md" <<'EOF'
+# Test Workspace for mlxk2 push
+
+This folder is intentionally lightweight and git-ignored.
+It is safe to push to a personal HF test repo for validation.
+EOF
+fi
+
+# Reasonable default exclude rules (merged with hard excludes in code)
+cat >"${WS_DIR}/.hfignore" <<'EOF'
+.DS_Store
+__pycache__/
+*.tmp
+*.log
+*.zip
+*.tar
+*.tar.gz
+.venv/
+venv/
+EOF
+
+echo "Pushing ${WS_DIR} -> ${REPO_ID}@${BRANCH}"
+mlxk2 push "${WS_DIR}" "${REPO_ID}" --create --branch "${BRANCH}" --commit "${COMMIT_MSG}"
+
diff --git a/scripts/test-hooks.sh b/scripts/test-hooks.sh
new file mode 100755
index 0000000..d8df452
--- /dev/null
+++ b/scripts/test-hooks.sh
@@ -0,0 +1,59 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Non-invasive test of local hooks in a temporary worktree.
+
+if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
+  echo "Not inside a Git repository." >&2
+  exit 1
+fi
+
+ROOT=$(git rev-parse --show-toplevel)
+echo "Repo: $ROOT"
+
+echo "[1/3] Testing pre-commit in temp worktree..."
+WT=$(mktemp -d)
+cleanup() {
+  git worktree remove --force "$WT" >/dev/null 2>&1 || true
+}
+trap cleanup EXIT
+
+git worktree add -f "$WT" HEAD >/dev/null
+git -C "$WT" config user.email "local@test"
+git -C "$WT" config user.name "Local Test"
+
+(
+  cd "$WT"
+  echo "test" > AGENTS.md
+  git add -f AGENTS.md
+  if git commit -m "should be blocked by pre-commit" >/dev/null 2>&1; then
+    echo "ERROR: pre-commit did NOT block committing AGENTS.md" >&2
+    exit 2
+  else
+    echo "OK: pre-commit blocked AGENTS.md commit"
+  fi
+  git restore --staged AGENTS.md >/dev/null 2>&1 || true
+  rm -f AGENTS.md
+)
+
+echo "[2/3] Testing pre-push blocking..."
+HOOKS=$(git rev-parse --git-path hooks)
+BR=$(git rev-parse --abbrev-ref HEAD)
+
+if printf "refs/heads/%s 0 refs/heads/%s 0\n" "$BR" "$BR" | "$HOOKS/pre-push" >/dev/null 2>&1; then
+  echo "ERROR: pre-push did NOT block current branch" >&2
+  exit 3
+else
+  echo "OK: pre-push blocked current branch"
+fi
+
+echo "[3/3] Testing pre-push override..."
+if ALLOW_PUSH=1 printf "refs/heads/%s 0 refs/heads/%s 0\n" "$BR" "$BR" | "$HOOKS/pre-push" >/dev/null 2>&1; then
+  echo "OK: pre-push override allowed"
+else
+  echo "ERROR: pre-push override failed" >&2
+  exit 4
+fi
+
+echo "All hook tests passed."
+
diff --git a/test-multi-python.sh b/test-multi-python.sh
index 0aae22a..1cbfe38 100755
--- a/test-multi-python.sh
+++ b/test-multi-python.sh
@@ -2,7 +2,7 @@
 # Note: removed set -e to allow script to continue through all Python versions
 # Individual error handling is done explicitly in each test section
 
-echo "🧪 MLX Knife Multi-Python Version Testing"
+echo "🧪 MLX Knife 2.0 (mlxk2) Multi-Python Version Testing"
 echo "=========================================="
 echo "Prerequisites: Python versions should be available as:"
 echo "  - python3 (3.9+ - system default)"
@@ -42,8 +42,6 @@ test_python_version() {
     
     # Create virtual environment
     local venv_name="test_env_${version_name//./_}"
-    # Trap termination to ensure cleanup (Ctrl-C or external kill)
-    trap 'echo -e "\n⛔ Received termination signal. Cleaning up $venv_name..."; deactivate 2>/dev/null || true; pkill -P $$ 2>/dev/null || true; rm -rf "$venv_name"; echo "Exiting due to signal."; exit 1' INT TERM
     echo "🔧 Creating virtual environment: $venv_name"
     
     if [ -d "$venv_name" ]; then
@@ -54,90 +52,57 @@ test_python_version() {
     source "$venv_name/bin/activate"
     
     # Upgrade pip and install MLX Knife
-    echo "📦 Installing MLX Knife..."
-    pip install --upgrade pip setuptools wheel > /dev/null 2>&1
+    echo "📦 Installing MLX Knife (2.0) ..."
+    local install_log="install_${version_name//./_}.log"
+    pip install --upgrade pip setuptools wheel > "$install_log" 2>&1
     
-    if pip install -e ".[dev,test]" > /dev/null 2>&1; then
+    if pip install -e ".[test]" >> "$install_log" 2>&1; then
         echo -e "${GREEN}✅ Installation successful${NC}"
+        echo "🧰 Ensuring tooling (ruff, mypy)..."
+        pip install -q "ruff>=0.1.0" "mypy>=1.5.0" >> "$install_log" 2>&1 || true
         
         # Run smoke test
-        echo "🧪 Running import test (this may take up to 2 minutes for MLX)..."
-        if python -c "import mlx_knife.cli; print('Import successful')"; then
+        echo "🧪 Running import test (mlxk2)..."
+        if python -c "import mlxk2, mlxk2.cli; print('Import successful')"; then
             echo -e "${GREEN}✅ Import test passed${NC}"
             
             # Try basic CLI command
-            echo "🧪 Testing CLI help..."
-            if python -m mlx_knife.cli --help > /dev/null 2>&1; then
-                echo -e "${GREEN}✅ CLI test passed${NC}"
+            echo "🧪 Testing CLI version (JSON)..."
+            if python -m mlxk2.cli --version --json > /dev/null 2>&1; then
+                echo -e "${GREEN}✅ CLI test (version) passed${NC}"
                 
                 # Run complete test suite
-                echo "🧪 Running FULL test suite (this takes 5-10 minutes)..."
+                echo "🧪 Running 2.0 test suite..."
                 local test_log="test_results_${version_name//./_}.log"
-                # Disable process guard for multi-env run to avoid cross-session signal handling
-                MLXK_TEST_DISABLE_PROCESS_GUARD=1 MLXK_TEST_DISABLE_CATCH_TERM=1 MLXK_TEST_DETACH_PGRP=0 python -m pytest tests/ -v --tb=short --timeout-method=thread > "$test_log" 2>&1
-                local pytest_rc=$?
-                local passed_count=$(grep -c "PASSED" "$test_log" 2>/dev/null)
-                local failed_count=$(grep -c "FAILED" "$test_log" 2>/dev/null)
-                passed_count=${passed_count:-0}
-                failed_count=${failed_count:-0}
-                local test_count=$((passed_count + failed_count))
-
-                # Treat stray signal exits (e.g., 143=SIGTERM, 137=SIGKILL) as success if log shows all passed
-                if [ $pytest_rc -ne 0 ] && [ "$failed_count" -eq 0 ] && [ "$passed_count" -gt 0 ] && grep -q "passed" "$test_log"; then
-                    echo -e "${YELLOW}ℹ️  PyTest exit code $pytest_rc but log shows all tests passed — accepting as success${NC}"
-                    pytest_rc=0
-                fi
-
-                if [ $pytest_rc -eq 0 ]; then
+                if python -m pytest tests_2.0/ -v --tb=short > "$test_log" 2>&1; then
+                    local passed_count=$(grep -c "PASSED" "$test_log" 2>/dev/null)
+                    local failed_count=$(grep -c "FAILED" "$test_log" 2>/dev/null)
+                    passed_count=${passed_count:-0}
+                    failed_count=${failed_count:-0}
+                    local test_count=$((passed_count + failed_count))
+                    
                     if [ "$failed_count" -eq 0 ] && [ "$passed_count" -gt 0 ]; then
                         echo -e "${GREEN}✅ Full test suite passed ($passed_count/$test_count tests)${NC}"
                         
                         # Code quality checks
-                        echo "🧪 Running code quality checks..."
+                        echo "🧪 Running code quality checks (mlxk2)..."
                         
-                        # Check if ruff is properly installed
-                        if python -c "import ruff" > /dev/null 2>&1; then
-                            local ruff_log="ruff_${version_name//./_}.log"
-                            echo "🧪 Running ruff check (logging to $ruff_log)..."
-                            if python -m ruff check mlx_knife/ > "$ruff_log" 2>&1; then
-                                echo -e "${GREEN}✅ ruff linting passed${NC}"
-                                
-                                # Note: mypy might have many warnings, so we allow it to "fail" but still continue
-                                python -m mypy mlx_knife/ --ignore-missing-imports > mypy_${version_name//./_}.log 2>&1
-                                local mypy_errors=$(grep -c "error:" mypy_${version_name//./_}.log 2>/dev/null || echo "0")
-                                echo -e "${YELLOW}ℹ️  mypy check complete ($mypy_errors errors found)${NC}"
-                                
-                                RESULTS+=("${version_name}:FULL_SUCCESS:${passed_count}tests")
-                            else
-                                local ruff_error_count=$(grep -c "Found .* error" "$ruff_log" 2>/dev/null || echo "unknown")
-                                echo -e "${RED}❌ ruff linting failed ($ruff_error_count errors)${NC}"
-                                echo "   See $ruff_log for details"
-                                RESULTS+=("${version_name}:RUFF_FAILED")
-                            fi
+                        local ruff_log="ruff_${version_name//./_}.log"
+                        echo "🧪 Running ruff check on mlxk2 (logging to $ruff_log)..."
+                        if python -m ruff check mlxk2/ > "$ruff_log" 2>&1; then
+                            echo -e "${GREEN}✅ ruff linting passed${NC}"
+                            
+                            # Note: mypy might have many warnings, so we allow it to "fail" but still continue
+                            python -m mypy mlxk2/ --ignore-missing-imports > mypy_${version_name//./_}.log 2>&1
+                            local mypy_errors=$(grep -c "error:" mypy_${version_name//./_}.log 2>/dev/null || echo "0")
+                            echo -e "${YELLOW}ℹ️  mypy check complete ($mypy_errors errors found)${NC}"
+                            
+                            RESULTS+=("${version_name}:FULL_SUCCESS:${passed_count}tests")
                         else
-                            echo -e "${RED}❌ ruff not properly installed, trying to install...${NC}"
-                            if pip install ruff>=0.1.0 > /dev/null 2>&1; then
-                                echo "🔧 ruff installed, retrying check..."
-                                local ruff_log="ruff_${version_name//./_}.log"
-                                if python -m ruff check mlx_knife/ > "$ruff_log" 2>&1; then
-                                    echo -e "${GREEN}✅ ruff linting passed${NC}"
-                                    
-                                    # Note: mypy might have many warnings, so we allow it to "fail" but still continue
-                                    python -m mypy mlx_knife/ --ignore-missing-imports > mypy_${version_name//./_}.log 2>&1
-                                    local mypy_errors=$(grep -c "error:" mypy_${version_name//./_}.log 2>/dev/null || echo "0")
-                                    echo -e "${YELLOW}ℹ️  mypy check complete ($mypy_errors errors found)${NC}"
-                                    
-                                    RESULTS+=("${version_name}:FULL_SUCCESS:${passed_count}tests")
-                                else
-                                    local ruff_error_count=$(grep -c "Found .* error" "$ruff_log" 2>/dev/null || echo "unknown")
-                                    echo -e "${RED}❌ ruff linting failed after installation ($ruff_error_count errors)${NC}"
-                                    echo "   See $ruff_log for details"
-                                    RESULTS+=("${version_name}:RUFF_FAILED")
-                                fi
-                            else
-                                echo -e "${RED}❌ Could not install ruff${NC}"
-                                RESULTS+=("${version_name}:RUFF_INSTALL_FAILED")
-                            fi
+                            local ruff_error_count=$(grep -c "Found .* error" "$ruff_log" 2>/dev/null || echo "unknown")
+                            echo -e "${RED}❌ ruff linting failed ($ruff_error_count errors)${NC}"
+                            echo "   See $ruff_log for details"
+                            RESULTS+=("${version_name}:RUFF_FAILED")
                         fi
                     else
                         echo -e "${RED}❌ Test suite failed ($passed_count passed, $failed_count failed)${NC}"
@@ -145,9 +110,7 @@ test_python_version() {
                         RESULTS+=("${version_name}:TESTS_FAILED:${failed_count}failures")
                     fi
                 else
-                    echo -e "${RED}❌ Test suite timed out or crashed (exit=$pytest_rc)${NC}"
-                    echo "   Tail of log ($test_log):"
-                    tail -n 60 "$test_log" 2>/dev/null || true
+                    echo -e "${RED}❌ Test suite timed out or crashed${NC}"
                     RESULTS+=("${version_name}:TESTS_TIMEOUT")
                 fi
             else
@@ -160,13 +123,13 @@ test_python_version() {
         fi
     else
         echo -e "${RED}❌ Installation failed${NC}"
+        echo "   See $install_log for details"
         RESULTS+=("${version_name}:INSTALL_FAILED")
     fi
     
     # Cleanup
     deactivate 2>/dev/null || true
     rm -rf "$venv_name"
-    trap - INT TERM
 }
 
 # Run tests for all Python versions
@@ -175,40 +138,38 @@ for i in "${!PYTHON_COMMANDS[@]}"; do
 done
 
 # Summary
-echo
-echo "SUMMARY"
+echo -e "\n${YELLOW}📊 SUMMARY${NC}"
 echo "==========="
 
 for result in "${RESULTS[@]}"; do
     IFS=':' read -r version status details <<< "$result"
     case $status in
         "FULL_SUCCESS")
-            echo "OK Python ${version}: FULLY VERIFIED - ${details}"
+            echo -e "${GREEN}✅ Python $version: FULLY VERIFIED ($details)${NC}"
             ;;
         "NOT_FOUND")
-            echo "WARN Python ${version}: NOT INSTALLED"
+            echo -e "${YELLOW}⚠️  Python $version: NOT INSTALLED${NC}"
             ;;
         "TESTS_FAILED")
-            echo "FAIL Python ${version}: TESTS FAILED - ${details}"
+            echo -e "${RED}❌ Python $version: TESTS FAILED ($details)${NC}"
             ;;
         "RUFF_FAILED")
-            echo "FAIL Python ${version}: CODE QUALITY FAILED"
+            echo -e "${RED}❌ Python $version: CODE QUALITY FAILED${NC}"
             ;;
         "RUFF_INSTALL_FAILED")
-            echo "FAIL Python ${version}: RUFF INSTALLATION FAILED"
+            echo -e "${RED}❌ Python $version: RUFF INSTALLATION FAILED${NC}"
             ;;
         "TESTS_TIMEOUT")
-            echo "FAIL Python ${version}: TESTS TIMED OUT"
+            echo -e "${RED}❌ Python $version: TESTS TIMED OUT${NC}"
             ;;
         *)
-            echo "FAIL Python ${version}: ${status}"
+            echo -e "${RED}❌ Python $version: $status${NC}"
             ;;
     esac
 done
 
 # Recommendations
-echo
-echo "RECOMMENDATIONS"
+echo -e "\n${YELLOW}💡 RECOMMENDATIONS${NC}"
 echo "=================="
 
 fully_verified_count=0
@@ -233,58 +194,53 @@ for result in "${RESULTS[@]}"; do
     esac
 done
 
-echo "VERIFICATION RESULTS:"
-printf "   Fully Verified: %s\n" "$fully_verified_count"
-printf "   Failed/Issues: %s\n" "$failed_count"
-printf "   Not Available: %s\n" "$not_found_count"
+echo -e "${YELLOW}📊 VERIFICATION RESULTS:${NC}"
+echo "   Fully Verified: $fully_verified_count"
+echo "   Failed/Issues: $failed_count" 
+echo "   Not Available: $not_found_count"
 
 if [ $fully_verified_count -eq 0 ]; then
-    echo
-    echo "CRITICAL: No Python versions fully verified!"
-    echo "   - Cannot release without verified compatibility"
-    echo "   - Fix blocking issues before any release"
+    echo -e "\n${RED}🚨 CRITICAL: No Python versions fully verified!${NC}"
+    echo "   → Cannot release without verified compatibility"
+    echo "   → Fix blocking issues before any release"
 elif [ $failed_count -eq 0 ] && [ $fully_verified_count -ge 2 ]; then
-    echo
-    echo "PRODUCTION READY: All tested versions fully verified!"
-    echo "   - Safe to release with confidence"
-    echo "   - All versions pass: installation, tests, code quality"
-    echo "   - Verified versions: ${fully_verified_versions[*]}"
+    echo -e "\n${GREEN}🎉 PRODUCTION READY: All tested versions fully verified!${NC}"
+    echo "   → Safe to release with confidence"
+    echo "   → All versions pass: installation, tests, code quality"
+    echo "   → Verified versions: ${fully_verified_versions[*]}"
 elif [ $fully_verified_count -ge 2 ]; then
-    echo
-    echo "PARTIAL SUCCESS: ${fully_verified_count} verified, ${failed_count} with issues"
-    echo "   - Can release with verified versions: ${fully_verified_versions[*]}"
-    echo "   - Document known issues with other versions"
-    echo "   - Consider fixing compatibility or updating requirements"
+    echo -e "\n${YELLOW}⚖️  PARTIAL SUCCESS: $fully_verified_count verified, $failed_count with issues${NC}"
+    echo "   → Can release with verified versions: ${fully_verified_versions[*]}"
+    echo "   → Document known issues with other versions"
+    echo "   → Consider fixing compatibility or updating requirements"
 else
-    echo
-    echo "INSUFFICIENT VERIFICATION: Only ${fully_verified_count} versions verified"
-    echo "   - Need at least 2 fully verified versions for release"
-    echo "   - Fix compatibility issues or verify more versions"
+    echo -e "\n${RED}⚠️  INSUFFICIENT VERIFICATION: Only $fully_verified_count version(s) verified${NC}"
+    echo "   → Need at least 2 fully verified versions for release"
+    echo "   → Fix compatibility issues or verify more versions"
 fi
 
-echo
-echo "NEXT STEPS"
+echo -e "\n${YELLOW}📝 NEXT STEPS${NC}"
 echo "============="
 
 if [ $fully_verified_count -ge 2 ] && [ $failed_count -eq 0 ]; then
-    echo "READY TO RELEASE:"
-    echo "  1. Update README.md with verified Python versions"
-    echo "  2. Update pyproject.toml requires-python based on results"
-    echo "  3. Document verified versions: ${fully_verified_versions[*]}"
-    echo "  4. Safe to tag and release MLX Knife 1.1.1-b2"
+    echo "✅ READY TO RELEASE:"
+    echo "   1. Update README.md with verified Python versions"
+    echo "   2. Update pyproject.toml requires-python based on results"
+    echo "   3. Document verified versions: ${fully_verified_versions[*]}"
+    echo "   4. Safe to tag and release (alpha.1)"
     exit_code=0
 else
-    echo "WORK NEEDED:"
-    echo "  1. Review detailed logs: test_results_*.log, mypy_*.log"
-    echo "  2. Fix compatibility issues for failed versions"
-    echo "  3. Re-run this script until all targeted versions pass"
-    echo "  4. Update documentation to reflect actual compatibility"
-    echo "  5. Consider reducing version scope if fixes are complex"
+    echo "🔧 WORK NEEDED:"
+    echo "   1. Review detailed logs: test_results_*.log, mypy_*.log"
+    echo "   2. Fix compatibility issues for failed versions"
+    echo "   3. Re-run this script until all targeted versions pass"
+    echo "   4. Update documentation to reflect actual compatibility"
+    echo "   5. Consider reducing version scope if fixes are complex"
     exit_code=1
 fi
 
 echo ""
-echo "Generated Files:"
+echo -e "${YELLOW}📁 Generated Files:${NC}"
 echo "   - test_results_<version>.log: Detailed pytest results"  
 echo "   - mypy_<version>.log: Type checking results"
 echo "   - Use these logs to debug specific compatibility issues"
diff --git a/tests/__init__.py b/tests/__init__.py
deleted file mode 100644
index be9b2a8..0000000
--- a/tests/__init__.py
+++ /dev/null
@@ -1,2 +0,0 @@
-"""Test package initializer (enables support module imports)."""
-
diff --git a/tests/conftest.py b/tests/conftest.py
deleted file mode 100644
index a9fb41a..0000000
--- a/tests/conftest.py
+++ /dev/null
@@ -1,286 +0,0 @@
-"""
-Pytest configuration and shared fixtures for MLX Knife tests.
-"""
-import os
-import tempfile
-import shutil
-import pytest
-import subprocess
-import signal
-import sys
-import time
-from pathlib import Path
-from typing import Generator, List
-import psutil
-import signal
-
-
-def _safe_detach_process_group():
-    """Detach pytest runner into its own process group to avoid stray group kills.
-
-    Some cleanup routines or external tools may send signals to a whole process
-    group. By making the test runner the leader of a fresh group, we reduce the
-    chance that a misdirected killpg() affects the runner. Disable with
-    MLXK_TEST_DISABLE_DETACH_PGRP=1 if undesired.
-    """
-    if os.environ.get("MLXK_TEST_DISABLE_DETACH_PGRP") == "1":
-        return
-    try:
-        os.setpgrp()  # equivalent to setpgid(0,0)
-    except Exception:
-        pass
-
-
-# Detach early at import time only if explicitly requested
-if os.environ.get("MLXK_TEST_DETACH_PGRP") == "1":
-    _safe_detach_process_group()
-
-
-@pytest.fixture(autouse=True, scope="session")
-def _optional_zombie_sweep():
-    """Optional best-effort sweep for stale servers at session start.
-
-    Controlled via MLXK_TEST_KILL_ZOMBIES_AT_START=1.
-    No signal handlers installed here to avoid interfering with non-server runs.
-    """
-    if os.environ.get("MLXK_TEST_KILL_ZOMBIES_AT_START") == "1":
-        try:
-            for p in psutil.process_iter(['pid', 'name', 'cmdline']):
-                try:
-                    cmd = ' '.join(p.info.get('cmdline') or [])
-                    if ('mlxk' in cmd and 'server' in cmd) or ('mlx_knife.server:app' in cmd):
-                        p.terminate()
-                        try:
-                            p.wait(timeout=5)
-                        except psutil.TimeoutExpired:
-                            p.kill()
-                except (psutil.NoSuchProcess, psutil.AccessDenied):
-                    continue
-        except Exception:
-            pass
-    yield
-
-@pytest.fixture
-def temp_cache_dir() -> Generator[Path, None, None]:
-    """Create a temporary cache directory for isolated testing."""
-    with tempfile.TemporaryDirectory() as temp_dir:
-        cache_path = Path(temp_dir) / "test_cache"
-        cache_path.mkdir()
-        
-        # Create hub subdirectory (required by HF_HOME/hub fix)
-        hub_path = cache_path / "hub"
-        hub_path.mkdir()
-        
-        # Set HF_HOME to our temp directory
-        old_hf_home = os.environ.get("HF_HOME")
-        os.environ["HF_HOME"] = str(cache_path)
-        
-        try:
-            yield cache_path
-        finally:
-            # Restore original HF_HOME
-            if old_hf_home:
-                os.environ["HF_HOME"] = old_hf_home
-            elif "HF_HOME" in os.environ:
-                del os.environ["HF_HOME"]
-
-
-@pytest.fixture(scope="class")
-def class_temp_cache_dir() -> Generator[Path, None, None]:
-    """Create a temporary cache directory for class-level testing (setup_class/teardown_class)."""
-    with tempfile.TemporaryDirectory() as temp_dir:
-        cache_path = Path(temp_dir) / "test_cache"
-        cache_path.mkdir()
-        
-        # Create hub subdirectory (required by HF_HOME/hub fix)
-        hub_path = cache_path / "hub"
-        hub_path.mkdir()
-        
-        # Set HF_HOME to our temp directory
-        old_hf_home = os.environ.get("HF_HOME")
-        os.environ["HF_HOME"] = str(cache_path)
-        
-        try:
-            yield cache_path
-        finally:
-            # Restore original HF_HOME
-            if old_hf_home:
-                os.environ["HF_HOME"] = old_hf_home
-            elif "HF_HOME" in os.environ:
-                del os.environ["HF_HOME"]
-
-
-@pytest.fixture
-def patch_model_cache():
-    """Utility fixture to temporarily patch MODEL_CACHE to isolated directory."""
-    from contextlib import contextmanager
-    
-    @contextmanager
-    def _patch_cache(cache_path: Path):
-        from mlx_knife import cache_utils
-        original_cache = cache_utils.MODEL_CACHE
-        cache_utils.MODEL_CACHE = cache_path
-        try:
-            yield cache_path
-        finally:
-            cache_utils.MODEL_CACHE = original_cache
-    
-    return _patch_cache
-
-
-@pytest.fixture
-def mlx_knife_process():
-    """Factory fixture to create and manage mlx_knife subprocess."""
-    processes: List[subprocess.Popen] = []
-    
-    def _create_process(args: List[str], **kwargs) -> subprocess.Popen:
-        """Create a new mlx_knife process and track it."""
-        full_args = ["python", "-m", "mlx_knife.cli"] + args
-        proc = subprocess.Popen(
-            full_args,
-            stdout=subprocess.PIPE,
-            stderr=subprocess.PIPE,
-            text=True,
-            **kwargs
-        )
-        # Register with process guard for robust cleanup on Ctrl-C
-        try:
-            from tests.support import process_guard as pg
-            pg.register_popen(proc, label="mlxk-cli")
-        except Exception:
-            pass
-        processes.append(proc)
-        return proc
-    
-    yield _create_process
-    
-    # Cleanup: Kill all created processes
-    for proc in processes:
-        if proc.poll() is None:  # Process still running
-            try:
-                proc.terminate()
-                proc.wait(timeout=5)
-            except subprocess.TimeoutExpired:
-                proc.kill()
-                proc.wait()
-            finally:
-                try:
-                    from tests.support import process_guard as pg
-                    pg.unregister(proc.pid)
-                except Exception:
-                    pass
-
-
-@pytest.fixture
-def process_monitor():
-    """Monitor processes for zombie detection."""
-    def _get_process_tree(pid: int) -> List[psutil.Process]:
-        """Get all child processes of a given PID."""
-        try:
-            parent = psutil.Process(pid)
-            return parent.children(recursive=True)
-        except psutil.NoSuchProcess:
-            return []
-    
-    def _wait_for_process_cleanup(pid: int, timeout: float = 5.0) -> bool:
-        """Wait for all child processes to terminate."""
-        start_time = time.time()
-        while time.time() - start_time < timeout:
-            children = _get_process_tree(pid)
-            if not children:
-                return True
-            time.sleep(0.1)
-        return False
-    
-    return {
-        "get_process_tree": _get_process_tree,
-        "wait_for_cleanup": _wait_for_process_cleanup
-    }
-
-
-@pytest.fixture
-def mock_model_cache(temp_cache_dir):
-    """Create mock model cache structures for testing."""
-    def _create_mock_model(
-        model_name: str,
-        healthy: bool = True,
-        corruption_type: str = None
-    ) -> Path:
-        """Create a mock model in the cache directory."""
-        # Convert model name to cache directory format
-        cache_name = model_name.replace("/", "--")
-        # Create models in hub subdirectory (HF_HOME/hub fix)
-        hub_dir = temp_cache_dir / "hub"
-        model_dir = hub_dir / f"models--{cache_name}" / "snapshots" / "main"
-        model_dir.mkdir(parents=True, exist_ok=True)
-        
-        if healthy and not corruption_type:
-            # Create healthy model files
-            (model_dir / "config.json").write_text('{"model_type": "test"}')
-            (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-            (model_dir / "model.safetensors").write_bytes(b"fake_model_data" * 100)
-        elif corruption_type:
-            _create_corrupted_model(model_dir, corruption_type)
-        
-        return model_dir
-    
-    def _create_corrupted_model(model_dir: Path, corruption_type: str):
-        """Create various types of corrupted models."""
-        if corruption_type == "missing_snapshot":
-            # Remove snapshots directory
-            shutil.rmtree(model_dir.parent.parent)
-        elif corruption_type == "missing_config":
-            (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-            (model_dir / "model.safetensors").write_bytes(b"fake_model_data")
-            # config.json is missing
-        elif corruption_type == "lfs_pointer":
-            (model_dir / "config.json").write_text('{"model_type": "test"}')
-            (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-            # Create LFS pointer file instead of actual data
-            (model_dir / "model.safetensors").write_text(
-                "version https://git-lfs.github.com/spec/v1\n"
-                "oid sha256:abc123\n"
-                "size 1000000\n"
-            )
-        elif corruption_type == "truncated_safetensors":
-            (model_dir / "config.json").write_text('{"model_type": "test"}')
-            (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-            # Create truncated/corrupted safetensors
-            (model_dir / "model.safetensors").write_bytes(b"corrupted")
-        elif corruption_type == "missing_tokenizer":
-            (model_dir / "config.json").write_text('{"model_type": "test"}')
-            (model_dir / "model.safetensors").write_bytes(b"fake_model_data")
-            # tokenizer.json is missing
-    
-    return _create_mock_model
-
-
-@pytest.fixture(autouse=True, scope="session")
-def _catch_term_and_exit_cleanly():
-    """Catch SIGTERM and exit cleanly to avoid 'Terminated: 15/9' noise.
-
-    We install a simple handler that exits with code 0 when SIGTERM is received.
-    This avoids the shell printing a termination message after a fully passed
-    test run. If you want to disable this behavior for debugging, set
-    MLXK_TEST_DISABLE_CATCH_TERM=1.
-    """
-    if os.environ.get("MLXK_TEST_DISABLE_CATCH_TERM") == "1":
-        yield
-        return
-
-    def _term_handler(signum, frame):
-        try:
-            print("\n[INFO] Received SIGTERM, exiting cleanly.")
-        except Exception:
-            pass
-        try:
-            sys.exit(0)
-        except SystemExit:
-            os._exit(0)
-
-    # Install handler for the lifetime of the session (no restore on teardown)
-    try:
-        signal.signal(signal.SIGTERM, _term_handler)
-    except Exception:
-        pass
-    yield
diff --git a/tests/integration/test_core_functionality.py b/tests/integration/test_core_functionality.py
deleted file mode 100644
index e2d6142..0000000
--- a/tests/integration/test_core_functionality.py
+++ /dev/null
@@ -1,319 +0,0 @@
-"""
-High Priority Tests: Core Functionality
-
-Tests ensure primary features work correctly:
-- Model execution (run command, streaming, token decoding, stop tokens)
-- Basic operations (list, show, pull, rm)
-- Chat template application
-"""
-import pytest
-import subprocess
-import json
-import time
-from pathlib import Path
-from unittest.mock import patch, MagicMock
-
-
-@pytest.mark.timeout(30)
-class TestBasicOperations:
-    """Test core CLI operations."""
-    
-    def test_list_command_empty_cache(self, mlx_knife_process, temp_cache_dir):
-        """List command should handle empty cache gracefully."""
-        proc = mlx_knife_process(["list"])
-        stdout, stderr = proc.communicate(timeout=10)
-        
-        # Should complete successfully
-        assert proc.returncode == 0, f"List failed on empty cache: {stderr}"
-        
-        # Should produce some output (even if empty list)
-        assert len(stdout) >= 0
-        # Common outputs for empty cache: "No models found" or empty list
-        
-    def test_list_command_with_models(self, mlx_knife_process, mock_model_cache):
-        """List command should display available models."""
-        # Create some mock models
-        mock_model_cache("test-model-1", healthy=True)
-        mock_model_cache("test-model-2", healthy=True)
-        
-        proc = mlx_knife_process(["list"])
-        stdout, stderr = proc.communicate(timeout=10)
-        
-        assert proc.returncode == 0, f"List failed: {stderr}"
-        assert len(stdout) > 0, "List produced no output with models present"
-        
-        # Should contain reference to models (exact format depends on implementation)
-        output_lower = stdout.lower()
-        assert "test" in output_lower or "model" in output_lower or len(stdout.split('\n')) > 1
-
-    def test_show_command_existing_model(self, mlx_knife_process, mock_model_cache):
-        """Show command should display model details."""
-        model_dir = mock_model_cache("test-model", healthy=True)
-        
-        # Try different possible model name formats
-        model_names_to_try = ["test-model", "test/model", "models--test-model"]
-        
-        success = False
-        for model_name in model_names_to_try:
-            proc = mlx_knife_process(["show", model_name])
-            stdout, stderr = proc.communicate(timeout=10)
-            
-            if proc.returncode == 0 and len(stdout) > 0:
-                success = True
-                break
-        
-        # At least one format should work, or command should handle gracefully
-        # The key is that it doesn't crash or hang
-        assert success or all(
-            proc.returncode is not None for proc in [
-                mlx_knife_process(["show", name]) 
-                for name in model_names_to_try
-            ]
-        ), "Show command hung or crashed"
-
-    def test_show_command_nonexistent_model(self, mlx_knife_process, temp_cache_dir):
-        """Show command should handle nonexistent models gracefully."""
-        proc = mlx_knife_process(["show", "nonexistent-model"])
-        stdout, stderr = proc.communicate(timeout=10)
-        
-        # Should complete (likely with error code)
-        assert proc.returncode is not None, "Show command hung"
-        
-        # Should produce some error message
-        output = stdout + stderr
-        assert len(output) > 0, "No error message for nonexistent model"
-
-    def test_rm_command_safety(self, mlx_knife_process, temp_cache_dir):
-        """Remove command should handle nonexistent models safely."""
-        proc = mlx_knife_process(["rm", "nonexistent-model"])
-        stdout, stderr = proc.communicate(timeout=10)
-        
-        # Should complete (may succeed or fail gracefully)
-        assert proc.returncode is not None, "Remove command hung"
-        
-        # Should not crash
-        # Exact behavior depends on implementation
-
-    def test_rm_command_corrupted_empty_snapshots(self, mlx_knife_process, temp_cache_dir):
-        """Remove command should handle corrupted models with empty snapshots directory."""
-        from mlx_knife.cache_utils import hf_to_cache_dir
-        
-        # Create a corrupted model structure (directory exists but snapshots is empty)
-        test_model = "test-org/corrupted-empty-model"
-        # Create in hub subdirectory (new cache structure)
-        hub_dir = temp_cache_dir / "hub"
-        cache_dir = hub_dir / hf_to_cache_dir(test_model)
-        cache_dir.mkdir(parents=True, exist_ok=True)
-        (cache_dir / "snapshots").mkdir(exist_ok=True)
-        (cache_dir / "blobs").mkdir(exist_ok=True)
-        (cache_dir / "refs").mkdir(exist_ok=True)
-        
-        try:
-            # This should NOT fail silently - should either provide error message or handle deletion
-            # Use --force to avoid hanging on input prompts in test environment
-            proc = mlx_knife_process(["rm", test_model, "--force"])
-            stdout, stderr = proc.communicate(timeout=10)
-            
-            # Should complete (not hang)
-            assert proc.returncode is not None, "Remove command hung on corrupted model"
-            
-            # Should produce SOME output (not silent failure)
-            output = (stdout + stderr).strip()
-            assert len(output) > 0, "Remove command failed silently on corrupted model - no output produced"
-            
-            # The behavior should be explicit: either error message or deletion prompt/confirmation
-            output_lower = output.lower()
-            has_error = "error" in output_lower or "not found" in output_lower
-            has_prompt = "delete" in output_lower or "remove" in output_lower
-            
-            assert has_error or has_prompt, f"Remove command should provide clear feedback, got: {output}"
-            
-        finally:
-            # Cleanup - remove the test corrupted model structure
-            import shutil
-            if cache_dir.exists():
-                shutil.rmtree(cache_dir)
-
-
-@pytest.mark.timeout(60)
-class TestModelExecution:
-    """Test model loading and execution functionality."""
-    
-    def test_run_command_basic_prompt(self, mlx_knife_process):
-        """Test basic model execution with prompt using real MLX model."""
-        # Uses Phi-3-mini-4k-instruct-4bit (assumes already pulled and healthy)
-        test_model = "Phi-3-mini-4k-instruct-4bit"
-        test_prompt = "Say hello."
-        
-        proc = mlx_knife_process(["run", test_model, test_prompt, "--max-tokens", "20"])
-        stdout, stderr = proc.communicate(timeout=60)
-        
-        # Test MLX Knife functionality, not model quality
-        assert proc.returncode == 0, f"MLX Knife execution failed: {stderr}"
-        assert len(stdout.strip()) > 0, "MLX Knife produced no output - model loading/generation failed"
-        assert len(stdout.strip()) < 1000, f"MLX Knife did not respect max-tokens limit: {len(stdout)} chars"
-        
-        # Basic sanity check: output should be reasonable text (not binary garbage)
-        # Allow common whitespace characters (newlines, tabs, spaces)
-        clean_output = stdout.replace('\n', '').replace('\t', '').replace('\r', '')
-        assert clean_output.isprintable(), f"MLX Knife produced non-printable output: {repr(stdout)}"
-
-    def test_run_command_invalid_model(self, mlx_knife_process, temp_cache_dir):
-        """Run command should handle invalid models gracefully."""
-        proc = mlx_knife_process(["run", "nonexistent-model", "test prompt"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        # Should fail gracefully, not hang
-        assert proc.returncode is not None, "Run command hung on invalid model"
-        assert proc.returncode != 0, "Run should fail on nonexistent model"
-        
-        # Should produce error message
-        output = stdout + stderr
-        assert len(output) > 0, "No error message for invalid model"
-
-    def test_streaming_token_generation(self, mlx_knife_process):
-        """Test streaming token output with real MLX model."""
-        test_model = "Phi-3-mini-4k-instruct-4bit"
-        test_prompt = "Write the word 'test' three times."
-        
-        proc = mlx_knife_process(["run", test_model, test_prompt, "--max-tokens", "30"])
-        stdout, stderr = proc.communicate(timeout=45)
-        
-        # Test MLX Knife streaming functionality, not model accuracy
-        assert proc.returncode == 0, f"MLX Knife streaming failed: {stderr}"
-        assert len(stdout.strip()) > 0, "MLX Knife streaming produced no output"
-        assert len(stdout.strip()) < 2000, f"MLX Knife streaming did not respect token limits: {len(stdout)} chars"
-        
-        # Verify streaming worked by checking output is reasonable text
-        # Allow common whitespace characters (newlines, tabs, spaces)
-        clean_output = stdout.replace('\n', '').replace('\t', '').replace('\r', '')
-        assert clean_output.isprintable(), f"MLX Knife streaming produced non-printable output: {repr(stdout)}"
-
-
-
-@pytest.mark.timeout(120)
-class TestPullOperation:
-    """Test model downloading functionality."""
-    
-    def test_pull_command_invalid_model(self, mlx_knife_process, temp_cache_dir):
-        """Pull command should handle invalid model names gracefully."""
-        proc = mlx_knife_process(["pull", "definitely-not-a-real-model-12345"])
-        stdout, stderr = proc.communicate(timeout=30)
-        
-        # Should fail, not hang
-        assert proc.returncode is not None, "Pull command hung"
-        assert proc.returncode != 0, "Pull should fail on invalid model"
-        
-        # Should produce error message
-        output = stdout + stderr
-        assert len(output) > 0, "No error message for invalid model"
-
-    def test_pull_command_network_timeout_handling(self, mlx_knife_process, temp_cache_dir, patch_model_cache):
-        """Pull command should handle network issues gracefully - uses isolated cache."""
-        # Use Phi-3-mini for realistic timeout testing, but in ISOLATED cache
-        with patch_model_cache(temp_cache_dir / "hub"):
-            proc = mlx_knife_process(["pull", "mlx-community/Phi-3-mini-4k-instruct-4bit", "--no-progress"])
-            
-            # Give it limited time to start, then interrupt
-            time.sleep(5)
-            
-            if proc.poll() is None:  # Still running
-                proc.send_signal(subprocess.signal.SIGINT)
-                try:
-                    stdout, stderr = proc.communicate(timeout=15)
-                except subprocess.TimeoutExpired:
-                    proc.kill()
-                    stdout, stderr = proc.communicate()
-            else:
-                stdout, stderr = proc.communicate()
-            
-            # Key test: should not hang indefinitely
-            assert proc.returncode is not None, "Pull command did not terminate"
-            
-            # Should handle interruption gracefully
-            output = stdout + stderr
-            assert len(output) >= 0  # Some output expected
-            
-            print("✓ Timeout test completed - any broken Phi-3-mini in isolated cache will be auto-cleaned")
-
-
-@pytest.mark.timeout(30)  
-class TestCommandLineInterface:
-    """Test CLI argument parsing and help functionality."""
-    
-    def test_help_command(self, mlx_knife_process):
-        """Help command should display usage information."""
-        proc = mlx_knife_process(["--help"])
-        stdout, stderr = proc.communicate(timeout=10)
-        
-        # Should succeed
-        assert proc.returncode == 0, f"Help command failed: {stderr}"
-        
-        # Should produce help output
-        assert len(stdout) > 0, "Help produced no output"
-        
-        # Should contain basic command information
-        help_text = stdout.lower()
-        assert any(cmd in help_text for cmd in ["list", "pull", "run", "health"]), \
-            "Help missing core commands"
-
-    def test_version_command(self, mlx_knife_process):
-        """Version command should display version information."""
-        # Try common version flags
-        version_flags = ["--version", "-v"]
-        
-        success = False
-        for flag in version_flags:
-            try:
-                proc = mlx_knife_process([flag])
-                stdout, stderr = proc.communicate(timeout=10)
-                
-                if proc.returncode == 0 and len(stdout) > 0:
-                    success = True
-                    # Should contain version number
-                    assert any(char.isdigit() for char in stdout), \
-                        "Version output contains no digits"
-                    break
-            except:
-                continue
-        
-        # At least one version flag should work, or command should handle gracefully
-        if not success:
-            # Test that invalid flags are handled
-            proc = mlx_knife_process(["--invalid-flag"])
-            stdout, stderr = proc.communicate(timeout=10)
-            assert proc.returncode is not None, "Invalid flag handling hung"
-
-    def test_invalid_command_handling(self, mlx_knife_process):
-        """Invalid commands should be handled gracefully."""
-        proc = mlx_knife_process(["invalid-command-xyz"])
-        stdout, stderr = proc.communicate(timeout=10)
-        
-        # Should fail but not hang
-        assert proc.returncode is not None, "Invalid command hung"
-        assert proc.returncode != 0, "Invalid command should not succeed"
-        
-        # Should produce error message
-        output = stdout + stderr
-        assert len(output) > 0, "No error message for invalid command"
-
-    def test_missing_arguments_handling(self, mlx_knife_process):
-        """Commands missing required arguments should fail gracefully."""
-        # Test commands that require arguments
-        commands_needing_args = [
-            ["run"],  # needs model and prompt
-            ["show"],  # needs model name
-            ["pull"],  # needs model name
-        ]
-        
-        for cmd in commands_needing_args:
-            proc = mlx_knife_process(cmd)
-            stdout, stderr = proc.communicate(timeout=10)
-            
-            # Should fail gracefully
-            assert proc.returncode is not None, f"Command {cmd} hung"
-            assert proc.returncode != 0, f"Command {cmd} should fail without required args"
-            
-            # Should produce helpful error
-            output = stdout + stderr
-            assert len(output) > 0, f"No error message for {cmd} without args"
\ No newline at end of file
diff --git a/tests/integration/test_end_token_issue.py b/tests/integration/test_end_token_issue.py
deleted file mode 100644
index 6c4665b..0000000
--- a/tests/integration/test_end_token_issue.py
+++ /dev/null
@@ -1,758 +0,0 @@
-"""
-Test for End-Token Issue: Streaming vs Non-Streaming Consistency
-
-This test ensures that End-Tokens are handled consistently across different
-models and streaming modes using actual token metrics instead of word estimates.
-"""
-
-import logging
-import signal
-import subprocess
-import time
-from typing import Dict, List, Tuple, Any
-import json
-import math
-import subprocess
-from functools import lru_cache
-
-import os
-import psutil
-import pytest
-import requests
-try:
-    from tests.support import process_guard as pg  # pytest path
-except Exception:
-    try:
-        import sys
-        from pathlib import Path
-        sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-        from support import process_guard as pg  # type: ignore
-    except Exception:
-        class _PG:
-            @staticmethod
-            def register_popen(*args, **kwargs):
-                pass
-            @staticmethod
-            def unregister(*args, **kwargs):
-                pass
-            @staticmethod
-            def install_signal_handlers():
-                pass
-            @staticmethod
-            def kill_all(*args, **kwargs):
-                pass
-        pg = _PG()
-
-logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
-logger = logging.getLogger(__name__)
-
-# RAM requirements by quantization level
-# 4-bit (GB)
-MODEL_RAM_REQUIREMENTS_4BIT = {
-    "0.5B": 1,   "1B": 2,    "3B": 4,    "4B": 5,
-    "7B": 8,     "8x7B": 16, "24B": 20,  "30B": 24,
-    "70B": 40,   "480B": 180,
-}
-
-# FP16/BF16 (GB) — conservative
-MODEL_RAM_REQUIREMENTS_FP16 = {
-    "0.5B": 2,   "1B": 4,    "3B": 8,    "4B": 12,
-    "7B": 16,    "8x7B": 180, "24B": 48,  "30B": 60,
-    "70B": 140,  "480B": 960,
-}
-
-# Model-specific End-Tokens to check for (comprehensive list)
-MODEL_END_TOKENS = {
-    "llama": ["</s>", "<|end_of_text|>", "<|eot_id|>"],  # Llama-2/3.x tokens
-    "mistral": ["</s>", "<|endoftext|>"],  # Mistral variants
-    "qwen": ["<|im_end|>", "<|endoftext|>", "<|end|>", "</s>"],  # Qwen variants  
-    "phi": ["<|endoftext|>", "<|end|>", "</s>"],  # Phi-3 variants
-    "mixtral": ["</s>", "<|endoftext|>"],  # Mixtral (Mistral-based)
-    "gpt-oss": ["<|return|>"],  # GPT-OSS reasoning models: <|end|> is NOT a stop token, only <|return|>
-    "default": [  # Comprehensive catch-all list
-        "</s>", "<|im_end|>", "<|endoftext|>", "<|end_of_text|>", 
-        "<|eot_id|>", "<|end|>", "<end>", "</end>", "<eos>", "</eos>",
-        "<|assistant|>", "<|user|>", "<|system|>"
-    ]
-}
-
-SERVER_BASE_URL = "http://localhost:8000"
-SERVER_PORT = 8000
-
-
-def extract_model_size(model_name: str) -> str:
-    """Extract model size from model name (prefers MoE tokens like 8x7B)."""
-    import re
-
-    # Prefer MoE first, then standard sizes; include special cases
-    size_patterns = [
-        r'(\d+(?:\.\d+)?(?:x\d+)?B)',  # 30B, 0.5B, 3.2B, 8x7B
-        r'Phi-3-mini',                    # Special case → 4B
-        r'Qwen2\.5-(\d+(?:\.\d+)?)B', # Qwen2.5-0.5B → 0.5B
-    ]
-
-    for pattern in size_patterns:
-        match = re.search(pattern, model_name)
-        if match:
-            if 'Phi-3-mini' in model_name:
-                return '4B'
-            elif 'Qwen2.5' in model_name:
-                return f"{match.group(1)}B"
-            else:
-                return match.group(1)
-
-    return 'unknown'
-
-
-def is_quantized_4bit_from_text(text: str) -> bool:
-    t = text.lower()
-    markers = ["4bit", "4-bit", "q4", "int4", "gguf q4", "q4_k", "q4_"]
-    return any(m in t for m in markers)
-
-
-@lru_cache(maxsize=128)
-def get_model_info_via_show(model_name: str) -> dict:
-    """Use `mlxk show` to fetch size and quantization details for a model."""
-    try:
-        res = subprocess.run(["mlxk", "show", model_name], capture_output=True, text=True, timeout=15)
-        if res.returncode != 0:
-            return {}
-        size_gb = None
-        quant_info = None
-        for raw in res.stdout.splitlines():
-            line = raw.strip()
-            if line.startswith("Size:"):
-                size_text = line.split("Size:", 1)[1].strip()
-                val = parse_size_to_gb(size_text)
-                if val is not None:
-                    size_gb = val
-            elif line.startswith("Quantization:"):
-                quant_info = line.split("Quantization:", 1)[1].strip()
-        return {"size_gb": size_gb, "quantization": quant_info}
-    except Exception:
-        return {}
-
-
-def is_quantized_4bit(model_name: str) -> bool:
-    name = model_name.lower()
-    if any(m in name for m in ("4bit", "q4", "int4")):
-        return True
-    info = get_model_info_via_show(model_name)
-    if info and info.get("quantization"):
-        return is_quantized_4bit_from_text(info["quantization"])
-    return False
-
-
-def estimate_required_ram_gb(model_name: str, size_str: str) -> int:
-    """Estimate RAM using show-based disk size and quantization-aware maps."""
-    info = get_model_info_via_show(model_name)
-    q4 = is_quantized_4bit(model_name)
-
-    # Quantization-specific disk→RAM factor
-    try:
-        if q4:
-            factor = float(os.getenv("MLXK_TEST_FACTOR_4BIT", os.getenv("MLXK_TEST_DISK_TO_RAM_FACTOR", "0.6")))
-        else:
-            factor = float(os.getenv("MLXK_TEST_FACTOR_FP16", os.getenv("MLXK_TEST_DISK_TO_RAM_FACTOR", "0.6")))
-        factor = max(0.1, min(2.0, factor))
-    except Exception:
-        factor = 0.6
-
-    disk_ram_est = None
-    if info and info.get("size_gb") is not None:
-        disk_ram_est = max(1, math.ceil(info["size_gb"] * factor))
-
-    map_est = None
-    if size_str and size_str != 'unknown':
-        if q4:
-            map_est = MODEL_RAM_REQUIREMENTS_4BIT.get(size_str)
-        else:
-            map_est = MODEL_RAM_REQUIREMENTS_FP16.get(size_str)
-
-    if disk_ram_est is not None and map_est is not None:
-        return max(disk_ram_est, map_est)
-    if disk_ram_est is not None:
-        return disk_ram_est
-    if map_est is not None:
-        return map_est
-    return 999
-
-
-def parse_size_to_gb(size_str: str) -> float:
-    try:
-        parts = size_str.strip().split()
-        if len(parts) < 2:
-            return None
-        value = float(parts[0])
-        unit = parts[1].upper()
-        if unit.startswith('KB'):
-            return value / (1024 ** 2)
-        if unit.startswith('MB'):
-            return value / 1024
-        if unit.startswith('GB'):
-            return value
-        if unit.startswith('TB'):
-            return value * 1024
-        return None
-    except Exception:
-        return None
-
-
-@lru_cache(maxsize=128)
-def get_model_disk_size_gb(model_name: str) -> float:
-    info = get_model_info_via_show(model_name)
-    return info.get("size_gb") if info else None
-
-
-def get_model_family(model_name: str) -> str:
-    """Determine model family for End-Token selection."""
-    model_lower = model_name.lower()
-    
-    if 'llama' in model_lower:
-        return 'llama'
-    elif 'mistral' in model_lower and 'mixtral' not in model_lower:
-        return 'mistral'
-    elif 'qwen' in model_lower:
-        return 'qwen'
-    elif 'phi' in model_lower:
-        return 'phi'
-    elif 'mixtral' in model_lower:
-        return 'mixtral'
-    elif 'gpt-oss' in model_lower:
-        return 'gpt-oss'
-    else:
-        return 'default'
-
-
-def get_available_ram_gb() -> int:
-    """Get available system RAM in GB (with optional safety margin)."""
-    total = psutil.virtual_memory().total // (1024**3)
-    available = psutil.virtual_memory().available // (1024**3)
-    try:
-        safety_factor = float(os.getenv("MLXK_TEST_RAM_SAFETY", "1.0"))
-        safety_factor = max(0.1, min(1.0, safety_factor))
-    except Exception:
-        safety_factor = 1.0
-    safe_usable = min(int(available * safety_factor), total - 4)
-    return max(1, safe_usable)
-
-
-def find_existing_mlxk_servers() -> List[psutil.Process]:
-    """Find running MLX Knife server processes (best-effort)."""
-    servers = []
-    for p in psutil.process_iter(['pid', 'name', 'cmdline']):
-        try:
-            cmd = ' '.join(p.info.get('cmdline') or [])
-            if 'mlx_knife.server:app' in cmd or ('mlxk' in cmd and 'server' in cmd):
-                servers.append(p)
-        except (psutil.NoSuchProcess, psutil.AccessDenied):
-            continue
-    return servers
-
-
-def cleanup_zombie_servers(port: int):
-    """Clean up any zombie MLX Knife servers on the specified port."""
-    logger.info(f"🧹 Checking for existing servers on port {port}")
-    # Try to find listeners on the port
-    try:
-        conns = psutil.net_connections(kind='inet')
-    except (psutil.AccessDenied, PermissionError):
-        conns = []
-    for c in conns:
-        if c.laddr.port == port and c.status == psutil.CONN_LISTEN and c.pid:
-            try:
-                proc = psutil.Process(c.pid)
-                cmd = ' '.join(proc.cmdline())
-                if 'mlxk' in cmd and 'server' in cmd:
-                    logger.warning(f"⚠️  Killing leftover MLX Knife server PID {proc.pid}")
-                    proc.terminate()
-                    try:
-                        proc.wait(timeout=5)
-                    except psutil.TimeoutExpired:
-                        proc.kill()
-                else:
-                    logger.warning(f"Port {port} occupied by non-MLX process {proc.pid}: {cmd}")
-            except (psutil.NoSuchProcess, psutil.AccessDenied):
-                pass
-    # Also kill any stray MLX Knife servers regardless of port
-    for proc in find_existing_mlxk_servers():
-        try:
-            logger.warning(f"⚠️  Cleaning zombie MLX Knife server PID {proc.pid}")
-            proc.terminate()
-            try:
-                proc.wait(timeout=5)
-            except psutil.TimeoutExpired:
-                proc.kill()
-        except psutil.NoSuchProcess:
-            pass
-
-
-class MLXKnifeServerManager:
-    """Context manager for MLX Knife server lifecycle with zombie cleanup."""
-    
-    def __init__(self):
-        self.process = None
-        
-    def __enter__(self):
-        self.start_server()
-        return self
-        
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        self.stop_server()
-        
-    def start_server(self):
-        """Start MLX Knife server."""
-        # Ensure signal handlers are installed for robust cleanup (server-only)
-        try:
-            pg.install_signal_handlers()
-        except Exception:
-            pass
-        # Ensure no stale server blocks the port
-        cleanup_zombie_servers(SERVER_PORT)
-        logger.info("Starting MLX Knife server...")
-        self.process = subprocess.Popen(
-            ["mlxk", "server", "--host", "127.0.0.1", "--port", str(SERVER_PORT)],
-            stdout=subprocess.PIPE,
-            stderr=subprocess.PIPE,
-            text=True,
-            preexec_fn=os.setsid if hasattr(os, "setsid") else None
-        )
-        # Track for robust cleanup on Ctrl-C
-        pg.register_popen(self.process, label="mlxk-server")
-        
-        # Wait for server to be ready
-        for attempt in range(30):
-            try:
-                response = requests.get(f"{SERVER_BASE_URL}/health", timeout=2)
-                if response.status_code == 200:
-                    logger.info("Server is ready")
-                    return
-            except:
-                pass
-            time.sleep(1)
-        
-        raise RuntimeError("Server failed to start within 30 seconds")
-        
-    def stop_server(self):
-        """Stop MLX Knife server with proper cleanup."""
-        if self.process:
-            logger.info("Stopping server...")
-            # Graceful shutdown attempt
-            try:
-                self.process.terminate()
-            except Exception:
-                pass
-            try:
-                self.process.wait(timeout=10)
-                logger.info("Server stopped gracefully")
-            except subprocess.TimeoutExpired:
-                logger.warning("Server did not stop gracefully, force killing...")
-                # Kill process group first for immediate stop
-                try:
-                    if hasattr(os, "killpg"):
-                        os.killpg(os.getpgid(self.process.pid), signal.SIGKILL)
-                except Exception:
-                    pass
-                try:
-                    self.process.kill()
-                    self.process.wait(timeout=3)
-                except Exception:
-                    pass
-                logger.info("Server force killed")
-            try:
-                pg.unregister(self.process.pid)
-            except Exception:
-                pass
-            
-            # Wait a bit for port cleanup
-            time.sleep(2)
-            
-            # Verify port is actually free
-            for attempt in range(5):
-                try:
-                    response = requests.get(f"{SERVER_BASE_URL}/health", timeout=1)
-                    if attempt == 4:
-                        logger.warning("Port may still be occupied after server shutdown")
-                    time.sleep(1)
-                except requests.exceptions.RequestException:
-                    # Good - server is really down
-                    logger.info("Port confirmed free")
-                    break
-
-
-def get_available_models() -> List[str]:
-    """Get list of available models from server."""
-    try:
-        response = requests.get(f"{SERVER_BASE_URL}/v1/models", timeout=10)
-        if response.status_code == 200:
-            data = response.json()
-            return [model["id"] for model in data.get("data", [])]
-    except Exception as e:
-        logger.warning(f"Failed to get models: {e}")
-    return []
-
-
-def get_safe_models_for_system() -> List[Tuple[str, str, int]]:
-    """Get models that can safely run on current system."""
-    models = get_available_models()
-    available_ram = get_available_ram_gb()
-    safe_models = []
-    
-    for model in models:
-        size_str = extract_model_size(model)
-        ram_needed = estimate_required_ram_gb(model, size_str)
-
-        if ram_needed <= available_ram:
-            safe_models.append((model, size_str, ram_needed))
-
-    return safe_models
-
-
-def get_model_context_length(model_name: str) -> int:
-    """Get model's context length from server."""
-    try:
-        response = requests.get(f"{SERVER_BASE_URL}/v1/models", timeout=10)
-        if response.status_code == 200:
-            data = response.json()
-            for model in data.get("data", []):
-                if model["id"] == model_name:
-                    return model.get("context_length", 4096)
-    except Exception:
-        pass
-    return 4096  # Default fallback
-
-
-def get_model_aware_token_targets(model_name: str, model_size: str) -> Dict[str, int]:
-    """Get realistic token targets based on actual model capabilities."""
-    context_length = get_model_context_length(model_name)
-    
-    # Calculate reasonable target based on model size + context
-    if model_size in ["1B", "3B"]:
-        target_tokens = min(512, context_length // 8)
-    elif model_size in ["4B", "7B"]:
-        target_tokens = min(1024, context_length // 6)
-    elif model_size in ["24B", "30B", "70B"]:
-        target_tokens = min(2048, context_length // 4)
-    else:
-        target_tokens = min(800, context_length // 6)
-    
-    # Model-specific adjustments for known behaviors
-    model_lower = model_name.lower()
-    if 'phi-3' in model_lower:
-        # Phi-3 models tend to be very concise, adjust expectations
-        target_tokens = min(target_tokens, 200)
-    
-    return {
-        "target_tokens": target_tokens,
-        "min_tokens": target_tokens // 3,  # Allow 33% variance
-        "context_length": context_length
-    }
-
-
-def create_adaptive_trilogy_prompt(model_size: str, target_tokens: int) -> str:
-    """Create trilogy prompt adapted to model capabilities."""
-    
-    base_plot = '''Here is the outline for fantasy trilogy "EMBERS OF THE FORGOTTEN":
-
-**MAIN CHARACTERS:**
-1. Kaelen Veyra - The Exiled Flame Herald (32, war poet, controls Soulfire)
-2. Sylra D'Tharn - The Shadow Warrior (28, assassin, uses Emotionweave)
-3. Lord Morvath - The Unforgotten King (45, tragic villain with Grief-Crown)
-
-**TRILOGY STRUCTURE:**
-- Book I: "Embers of the Forgotten" - The flame that remembers
-- Book II: "The Lovers' Crucible" - The fire that doesn't burn
-- Book III: "The Fire That Binds" - The flame that connects
-
-**THEMES:** Love as power not weakness, memory as healing, emotions as connection'''
-
-    if model_size in ["1B", "3B"]:
-        task = f'''**YOUR TASK:** Write a 500-word opening scene of Book I featuring Kaelen's exile.
-- Focus on Kaelen's emotional state after Lirien's death
-- Use poetic, mythic language
-- Target approximately {target_tokens} tokens
-- End with him seeing Veyra (Valley of Faces) in the distance'''
-    
-    elif model_size in ["4B", "7B"]:
-        task = f'''**YOUR TASK:** Write the opening chapter of Book I: "The Poet Who Burned" 
-- Focus on Kaelen's exile from Celestine after Lirien's execution
-- Include his emotional journey and Soulfire powers
-- Use poetic, mythic language with deep inner rhythm
-- Target approximately {target_tokens} tokens (1000-1500 words)
-- End with his arrival at Veyra (Valley of Faces)'''
-    
-    else:  # 24B, 30B, 70B
-        task = f'''**YOUR TASK:** Write the complete first chapter of Book I: "The Poet Who Burned"
-- Focus on Kaelen's exile from Celestine after his beloved Lirien's execution  
-- Include his arrival at Veyra (Valley of Faces) with 30 lost masks
-- Show his Soulfire powers and deep emotional development
-- Use poetic, mythic language with deep inner rhythm
-- Target approximately {target_tokens} tokens (2000+ words)
-- Include dialogue and rich character development
-- End with the mysterious mask whispering: "You were here - a thousand years ago"'''
-
-    return f"{base_plot}\n\n{task}\n\nWrite the complete chapter now."
-
-
-def make_chat_request(model_name: str, prompt: str, stream: bool = False, timeout: int = 120) -> str:
-    """Make chat completion request to server."""
-    payload = {
-        "model": model_name,
-        "messages": [{"role": "user", "content": prompt}],
-        "stream": stream,
-        "temperature": 0.7
-    }
-    
-    response = requests.post(
-        f"{SERVER_BASE_URL}/v1/chat/completions",
-        json=payload,
-        timeout=timeout,
-        stream=stream
-    )
-    
-    if not response.ok:
-        raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")
-    
-    if stream:
-        # Handle streaming response
-        content = ""
-        for line in response.iter_lines(decode_unicode=True):
-            if line.startswith("data: "):
-                data_str = line[6:]
-                if data_str.strip() == "[DONE]":
-                    break
-                try:
-                    data = json.loads(data_str)
-                    delta = data.get("choices", [{}])[0].get("delta", {}).get("content", "")
-                    content += delta
-                except json.JSONDecodeError:
-                    continue
-        return content
-    else:
-        # Handle non-streaming response
-        data = response.json()
-        return data.get("choices", [{}])[0].get("message", {}).get("content", "")
-
-
-def contains_end_tokens(text: str, model_name: str) -> List[str]:
-    """Check if text contains any End-Tokens for the given model."""
-    model_family = get_model_family(model_name)
-    end_tokens = MODEL_END_TOKENS.get(model_family, MODEL_END_TOKENS["default"])
-    
-    found_tokens = []
-    for token in end_tokens:
-        if token in text:
-            found_tokens.append(token)
-    
-    return found_tokens
-
-
-def estimate_token_count(text: str) -> int:
-    """Rough token count estimation (4 chars per token average)."""
-    return len(text) // 4
-
-
-def get_safe_models_lazy():
-    """Lazy evaluation for parametrize to avoid import-time server calls."""
-    try:
-        return get_safe_models_for_system()
-    except:
-        return [("test-model", "1B", 1)]
-
-
-def pytest_generate_tests(metafunc):
-    """Dynamic test parametrization to avoid import-time server calls."""
-    if "model_name" in metafunc.fixturenames:
-        try:
-            with MLXKnifeServerManager() as server:
-                models = get_safe_models_for_system()
-                metafunc.parametrize("model_name,size_str,ram_needed", models)
-        except Exception as e:
-            pytest.skip(f"Cannot set up server for testing: {e}")
-
-
-@pytest.mark.server
-@pytest.mark.timeout(300)  # 5 minute timeout for large models
-def test_non_streaming_end_tokens(model_name, size_str, ram_needed):
-    """
-    Test Issue #20: Non-streaming mode should show End-Tokens (EXPECTED TO FAIL).
-    
-    This test validates that non-streaming responses contain visible End-Tokens,
-    proving the server-side filtering bug in generate_batch().
-    
-    Expected result: FAIL (End-Tokens visible) - this confirms Issue #20.
-    """
-    logger.info(f"🔍 Testing NON-STREAMING End-Tokens with {model_name} ({size_str}, {ram_needed}GB RAM)")
-    
-    with MLXKnifeServerManager() as server:
-        # Get model-specific token targets
-        token_specs = get_model_aware_token_targets(model_name, size_str)
-        logger.info(f"Token targets: {token_specs}")
-        
-        # Create adaptive prompt (no max_tokens - let model use natural stopping)
-        prompt = create_adaptive_trilogy_prompt(size_str, token_specs["target_tokens"])
-        
-        logger.info("🚫 Testing NON-STREAMING mode (should show End-Tokens)...")
-        
-        response_content = make_chat_request(model_name, prompt, stream=False, timeout=300)
-        
-        # Basic validation
-        assert response_content.strip(), "Non-streaming returned empty response"
-        
-        # Token count validation
-        estimated_tokens = estimate_token_count(response_content)
-        logger.info(f"Non-streaming response: ~{estimated_tokens} tokens")
-        logger.info(f"Response ends with: '{response_content[-100:]}'" if len(response_content) > 100 else f"Full response end: '{response_content}'")
-        
-        # Should generate reasonable amount
-        min_expected = token_specs["min_tokens"]
-        assert estimated_tokens >= min_expected, \
-            f"Non-streaming generated too few tokens: {estimated_tokens} < {min_expected}"
-        
-        # Issue #20 Check: Non-streaming SHOULD contain End-Tokens (this is the bug)
-        found_end_tokens = contains_end_tokens(response_content, model_name)
-        
-        if found_end_tokens:
-            logger.error(f"❌ CONFIRMED Issue #20: Non-streaming contains End-Tokens: {found_end_tokens}")
-            logger.error(f"Raw response end: {repr(response_content[-50:])}")
-            # This SHOULD fail - it confirms Issue #20
-            assert False, f"Issue #20 CONFIRMED: Non-streaming shows End-Tokens {found_end_tokens}"
-        else:
-            logger.warning(f"⚠️  UNEXPECTED: Non-streaming clean (no End-Tokens found)")
-            logger.info(f"✅ Non-streaming mode unexpectedly passed (no Issue #20 detected)")
-
-
-@pytest.mark.server  
-@pytest.mark.timeout(300)  # 5 minute timeout for large models
-def test_streaming_end_tokens(model_name, size_str, ram_needed):
-    """
-    Test Issue #20: Streaming mode should filter End-Tokens (EXPECTED TO PASS).
-    
-    This test validates that streaming responses properly filter End-Tokens,
-    proving the streaming pipeline works correctly.
-    
-    Expected result: PASS (End-Tokens filtered) - this shows streaming works correctly.
-    """
-    logger.info(f"🔍 Testing STREAMING End-Tokens with {model_name} ({size_str}, {ram_needed}GB RAM)")
-    
-    with MLXKnifeServerManager() as server:
-        # Get model-specific token targets  
-        token_specs = get_model_aware_token_targets(model_name, size_str)
-        logger.info(f"Token targets: {token_specs}")
-        
-        # Create adaptive prompt (no max_tokens - let model use natural stopping)
-        prompt = create_adaptive_trilogy_prompt(size_str, token_specs["target_tokens"])
-        
-        logger.info("✅ Testing STREAMING mode (should filter End-Tokens)...")
-        
-        response_content = make_chat_request(model_name, prompt, stream=True, timeout=300)
-        
-        # Basic validation
-        assert response_content.strip(), "Streaming returned empty response"
-        
-        # Token count validation
-        estimated_tokens = estimate_token_count(response_content)
-        logger.info(f"Streaming response: ~{estimated_tokens} tokens")
-        logger.info(f"Response ends with: '{response_content[-100:]}'" if len(response_content) > 100 else f"Full response end: '{response_content}'")
-        
-        # Should generate reasonable amount
-        min_expected = token_specs["min_tokens"]
-        assert estimated_tokens >= min_expected, \
-            f"Streaming generated too few tokens: {estimated_tokens} < {min_expected}"
-        
-        # Issue #20 Check: Streaming should NOT contain End-Tokens (correct behavior)
-        found_end_tokens = contains_end_tokens(response_content, model_name)
-        
-        if found_end_tokens:
-            logger.error(f"❌ UNEXPECTED: Streaming contains End-Tokens: {found_end_tokens}")
-            logger.error(f"Raw response end: {repr(response_content[-50:])}")
-            assert False, f"Streaming unexpectedly shows End-Tokens {found_end_tokens}"
-        else:
-            logger.info(f"✅ Streaming mode correctly filtered End-Tokens")
-
-
-@pytest.mark.server
-@pytest.mark.timeout(600)  # Longer timeout for comparison test
-def test_end_token_consistency_comparison(model_name, size_str, ram_needed):
-    """
-    Test Issue #20: Direct comparison of streaming vs non-streaming End-Token handling.
-    
-    This test runs both modes and compares their End-Token behavior to document
-    the exact differences for Issue #20 analysis.
-    
-    Expected pattern:
-    - Non-streaming: Contains End-Tokens (Issue #20 bug) 
-    - Streaming: Clean responses (correct behavior)
-    """
-    logger.info(f"🔍 COMPARISON TEST: {model_name} ({size_str}, {ram_needed}GB RAM)")
-    logger.info("="*80)
-    
-    with MLXKnifeServerManager() as server:
-        # Get model-specific token targets
-        token_specs = get_model_aware_token_targets(model_name, size_str)
-        
-        # Create adaptive prompt (no max_tokens)
-        prompt = create_adaptive_trilogy_prompt(size_str, token_specs["target_tokens"])
-        
-        responses = {}
-        end_token_results = {}
-        
-        # Test both modes
-        for stream_mode in [False, True]:
-            mode_name = "streaming" if stream_mode else "non-streaming"
-            logger.info(f"\n📡 Testing {mode_name.upper()} mode...")
-            
-            response_content = make_chat_request(model_name, prompt, stream=stream_mode, timeout=300)
-            responses[stream_mode] = response_content
-            
-            # Check End-Tokens
-            found_end_tokens = contains_end_tokens(response_content, model_name)
-            end_token_results[stream_mode] = found_end_tokens
-            
-            estimated_tokens = estimate_token_count(response_content)
-            logger.info(f"{mode_name} response: ~{estimated_tokens} tokens")
-            logger.info(f"{mode_name} ends with: '{response_content[-80:]}'" if len(response_content) > 80 else f"Full: '{response_content}'")
-            
-            if found_end_tokens:
-                logger.error(f"❌ {mode_name} contains End-Tokens: {found_end_tokens}")
-            else:
-                logger.info(f"✅ {mode_name} clean (no End-Tokens)")
-        
-        # Issue #20 Pattern Analysis
-        logger.info(f"\n📊 ISSUE #20 ANALYSIS for {model_name}:")
-        logger.info("="*80)
-        
-        non_stream_tokens = end_token_results[False]
-        stream_tokens = end_token_results[True]
-        
-        logger.info(f"Non-streaming End-Tokens: {non_stream_tokens if non_stream_tokens else 'None'}")
-        logger.info(f"Streaming End-Tokens:     {stream_tokens if stream_tokens else 'None'}")
-        
-        # Issue #20 pattern detection
-        if non_stream_tokens and not stream_tokens:
-            logger.error(f"🎯 ISSUE #20 CONFIRMED!")
-            logger.error(f"   - Non-streaming shows End-Tokens: {non_stream_tokens}")
-            logger.error(f"   - Streaming filters correctly: Clean")
-            issue_20_detected = True
-        elif not non_stream_tokens and not stream_tokens:
-            logger.warning(f"⚠️  Both modes clean - Issue #20 not detected")
-            issue_20_detected = False
-        elif non_stream_tokens and stream_tokens:
-            logger.error(f"🚨 Both modes show End-Tokens - different issue?")
-            issue_20_detected = False
-        else:
-            logger.warning(f"🤔 Unexpected pattern - investigate further")
-            issue_20_detected = False
-        
-        # This test is purely documentary - it doesn't fail, just reports findings
-        logger.info(f"\n📝 Issue #20 Status: {'CONFIRMED' if issue_20_detected else 'NOT DETECTED'}")
-        logger.info("="*80)
-
-
-if __name__ == "__main__":
-    # Quick test run
-    with MLXKnifeServerManager() as server:
-        models = get_safe_models_for_system()
-        print(f"Found {len(models)} safe models for testing:")
-        for model, size, ram in models:
-            print(f"  {model} ({size}, {ram}GB)")
diff --git a/tests/integration/test_health_checks.py b/tests/integration/test_health_checks.py
deleted file mode 100644
index f64f682..0000000
--- a/tests/integration/test_health_checks.py
+++ /dev/null
@@ -1,240 +0,0 @@
-"""
-High Priority Tests: Health Check Robustness
-
-Tests ensure reliable "postmortem" analysis of model integrity:
-- Corruption detection (partial downloads, missing files, LFS pointers, etc.)
-- Deterministic results (consistent healthy/broken status)
-- No false positives or negatives
-"""
-import pytest
-import subprocess
-import json
-import shutil
-from pathlib import Path
-from typing import Dict, Any
-
-
-@pytest.mark.timeout(30)
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestHealthCheckRobustness:
-    """Test health check reliability for various corruption scenarios."""
-    
-    def test_healthy_model_detection(self, mlx_knife_process, mock_model_cache):
-        """Verify healthy models are correctly identified."""
-        # Create a healthy model
-        model_dir = mock_model_cache("test-model", healthy=True)
-        
-        # Run health check
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        return_code = proc.returncode
-        
-        # Should complete successfully
-        assert return_code == 0, f"Health check failed: {stderr}"
-        
-        # Should report healthy status (if any models exist)
-        # Note: The actual output format depends on implementation
-        assert "broken" not in stdout.lower() or "0 broken" in stdout.lower()
-
-    def test_missing_snapshot_detection(self, mlx_knife_process, mock_model_cache):
-        """Health check must detect missing snapshots directory."""
-        # Create model with missing snapshots
-        model_dir = mock_model_cache("test-model", healthy=False, corruption_type="missing_snapshot")
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        # Should complete (may return error code if broken models found)
-        assert proc.returncode is not None
-        
-        # Should detect the corruption - either report broken models or handle gracefully
-        # The key is that it shouldn't crash or hang
-        assert len(stdout) > 0 or len(stderr) > 0, "Health check produced no output"
-
-    def test_lfs_pointer_detection(self, mlx_knife_process, mock_model_cache):
-        """Health check must detect LFS pointer files instead of actual weights."""
-        model_dir = mock_model_cache("test-model", healthy=False, corruption_type="lfs_pointer")
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        # Should handle LFS pointers appropriately
-        assert proc.returncode is not None
-        
-        # Should either detect as broken or handle gracefully
-        output = stdout + stderr
-        assert len(output) > 0, "Health check produced no output for LFS pointer"
-
-    def test_missing_config_detection(self, mlx_knife_process, mock_model_cache):
-        """Health check must detect missing config.json."""
-        model_dir = mock_model_cache("test-model", healthy=False, corruption_type="missing_config")
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        assert proc.returncode is not None
-        
-        # Should detect missing config
-        output = stdout + stderr
-        assert len(output) > 0
-
-    def test_missing_tokenizer_detection(self, mlx_knife_process, mock_model_cache):
-        """Health check must detect missing tokenizer.json."""
-        model_dir = mock_model_cache("test-model", healthy=False, corruption_type="missing_tokenizer")
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        assert proc.returncode is not None
-        output = stdout + stderr
-        assert len(output) > 0
-
-    def test_truncated_safetensors_detection(self, mlx_knife_process, mock_model_cache):
-        """Health check must detect corrupted/truncated safetensors files."""
-        model_dir = mock_model_cache("test-model", healthy=False, corruption_type="truncated_safetensors")
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        assert proc.returncode is not None
-        output = stdout + stderr
-        assert len(output) > 0
-
-    def test_deterministic_results(self, mlx_knife_process, mock_model_cache):
-        """Health check results must be consistent across multiple runs."""
-        # Create a healthy model
-        model_dir = mock_model_cache("test-model", healthy=True)
-        
-        results = []
-        for i in range(3):
-            proc = mlx_knife_process(["health"])
-            stdout, stderr = proc.communicate(timeout=15)
-            results.append({
-                "return_code": proc.returncode,
-                "stdout": stdout.strip(),
-                "stderr": stderr.strip()
-            })
-        
-        # All runs should have the same return code
-        return_codes = [r["return_code"] for r in results]
-        assert all(rc == return_codes[0] for rc in return_codes), f"Inconsistent return codes: {return_codes}"
-        
-        # Output should be consistent (allowing for timestamps or minor variations)
-        stdout_outputs = [r["stdout"] for r in results]
-        # Basic consistency check - all should have similar length and key content
-        if stdout_outputs[0]:
-            for stdout in stdout_outputs[1:]:
-                # Allow some variation but outputs should be similar
-                assert abs(len(stdout) - len(stdout_outputs[0])) < 100, "Highly variable output lengths"
-
-    def test_no_false_positives(self, mlx_knife_process, mock_model_cache):
-        """Healthy model must never be reported as broken."""
-        # Create multiple healthy models
-        for i in range(3):
-            mock_model_cache(f"healthy-model-{i}", healthy=True)
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        # Should succeed
-        assert proc.returncode == 0, f"Health check failed on healthy models: {stderr}"
-        
-        # Should not report broken models (or report 0 broken)
-        if "broken" in stdout.lower():
-            assert "0 broken" in stdout.lower(), f"False positive: {stdout}"
-
-    def test_no_false_negatives_batch(self, mlx_knife_process, mock_model_cache):
-        """Broken models must be detected reliably."""
-        # Create various corrupted models
-        corruption_types = [
-            "missing_config",
-            "missing_tokenizer", 
-            "lfs_pointer",
-            "truncated_safetensors"
-        ]
-        
-        for i, corruption in enumerate(corruption_types):
-            mock_model_cache(f"broken-model-{i}", healthy=False, corruption_type=corruption)
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        # Should complete (may have non-zero exit if broken models found)
-        assert proc.returncode is not None
-        
-        # Should produce output indicating broken models or handle them gracefully
-        output = stdout + stderr
-        assert len(output) > 0, "No output for batch of broken models"
-
-    def test_mixed_healthy_broken_models(self, mlx_knife_process, mock_model_cache):
-        """Health check must correctly categorize mixed model states."""
-        # Create mix of healthy and broken models
-        mock_model_cache("healthy-1", healthy=True)
-        mock_model_cache("broken-1", healthy=False, corruption_type="missing_config")
-        mock_model_cache("healthy-2", healthy=True)
-        mock_model_cache("broken-2", healthy=False, corruption_type="lfs_pointer")
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=15)
-        
-        assert proc.returncode is not None
-        output = stdout + stderr
-        assert len(output) > 0, "No output for mixed model states"
-        
-        # Should handle mixed states appropriately
-        # The exact format depends on implementation, but should not crash
-
-
-@pytest.mark.timeout(15)
-class TestHealthCheckPerformance:
-    """Test health check performance and reliability."""
-    
-    def test_health_check_timeout_handling(self, mlx_knife_process, temp_cache_dir):
-        """Health check should complete within reasonable time."""
-        # Create several models to check
-        for i in range(5):
-            cache_name = f"models--test--model-{i}"
-            model_dir = temp_cache_dir / cache_name / "snapshots" / "main"
-            model_dir.mkdir(parents=True, exist_ok=True)
-            
-            (model_dir / "config.json").write_text('{"model_type": "test"}')
-            (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-            (model_dir / "model.safetensors").write_bytes(b"fake_model_data" * 1000)
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=30)  # Should complete within 30s
-        
-        assert proc.returncode is not None, "Health check hung"
-
-    def test_health_check_empty_cache(self, mlx_knife_process, temp_cache_dir):
-        """Health check should handle empty cache gracefully."""
-        # temp_cache_dir is empty
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=10)
-        
-        # Should complete successfully with empty cache
-        assert proc.returncode == 0, f"Failed on empty cache: {stderr}"
-        assert len(stdout) >= 0  # Some output is expected (even if just "no models")
-
-    def test_health_check_large_cache(self, mlx_knife_process, temp_cache_dir):
-        """Health check should handle larger cache sizes."""
-        # Create many model directories (simulating large cache)
-        for i in range(20):
-            cache_name = f"models--test--model-{i:02d}"
-            model_dir = temp_cache_dir / cache_name / "snapshots" / "main"
-            model_dir.mkdir(parents=True, exist_ok=True)
-            
-            # Create minimal valid model files
-            (model_dir / "config.json").write_text(f'{{"model_type": "test", "id": {i}}}')
-            (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-            (model_dir / "model.safetensors").write_bytes(b"fake_data" * 50)
-        
-        proc = mlx_knife_process(["health"])
-        stdout, stderr = proc.communicate(timeout=45)  # Allow more time for large cache
-        
-        assert proc.returncode is not None, "Health check hung on large cache"
-        
-        # Should produce reasonable output
-        output = stdout + stderr
-        assert len(output) > 0, "No output for large cache"
\ No newline at end of file
diff --git a/tests/integration/test_issue_14.py b/tests/integration/test_issue_14.py
deleted file mode 100644
index a56ff45..0000000
--- a/tests/integration/test_issue_14.py
+++ /dev/null
@@ -1,634 +0,0 @@
-"""
-Test for Issue #14: Interactive Chat Self-Conversation Bug
-
-This test ensures that models don't continue conversations autonomously
-by generating "You:", "Human:", "Assistant:" markers after their response.
-
-This test is self-contained and manages its own MLX Knife server instance.
-"""
-
-import logging
-import os
-import re
-import math
-import subprocess
-from functools import lru_cache
-import signal
-import subprocess
-import time
-from typing import List, Tuple
-
-import psutil
-import pytest
-import requests
-try:
-    from tests.support import process_guard as pg  # pytest-run path
-except Exception:
-    try:
-        # Direct-script fallback: add tests/ to sys.path and import support
-        import sys
-        from pathlib import Path
-        sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-        from support import process_guard as pg  # type: ignore
-    except Exception:
-        # No-op fallback
-        class _PG:
-            @staticmethod
-            def register_popen(*args, **kwargs):
-                pass
-
-            @staticmethod
-            def unregister(*args, **kwargs):
-                pass
-
-            @staticmethod
-            def install_signal_handlers():
-                pass
-
-            @staticmethod
-            def kill_all(*args, **kwargs):
-                pass
-
-        pg = _PG()
-
-logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
-logger = logging.getLogger(__name__)
-
-"""Model RAM estimation helpers.
-
-We need to avoid loading extremely large FP16 models during server tests.
-Previously, we applied 4-bit RAM heuristics to all models by parsing only the
-size string (e.g., "8x7B"). This incorrectly marked non-quantized models like
-"Mixtral-8x7B-Instruct-v0.1" as fitting in 16GB, leading to massive swap usage.
-
-Fix: detect 4-bit quantization in the model name and use separate maps for
-4-bit vs FP16 estimates. Non-quantized Mixtral-8x7B is treated as ~180GB to
-ensure it is skipped on typical machines.
-"""
-
-# 4-bit quantized models (GB)
-MODEL_RAM_REQUIREMENTS_4BIT = {
-    "0.5B": 1,   "1B": 2,    "3B": 4,    "4B": 5,
-    "7B": 8,     "8x7B": 16, "24B": 20,  "30B": 24,
-    "70B": 40,   "480B": 180,
-}
-
-# Approximate FP16/BF16 models (GB) — conservative, intentionally high
-MODEL_RAM_REQUIREMENTS_FP16 = {
-    "0.5B": 2,   "1B": 4,    "3B": 8,    "4B": 12,
-    "7B": 16,    "8x7B": 180, "24B": 48,  "30B": 60,
-    "70B": 140,  "480B": 960,
-}
-
-# Self-conversation patterns to detect Issue #14
-SELF_CONVERSATION_PATTERNS = [
-    r'\nYou:',
-    r'\nHuman:',
-    r'\nAssistant:',
-    r'\nUser:',
-    r'\n\nYou:',
-    r'\n\nHuman:',
-    r'\n\nAssistant:',
-    r'\n\nUser:',
-]
-
-SERVER_BASE_URL = "http://localhost:8000"
-SERVER_PORT = 8000
-
-
-def extract_model_size(model_name: str) -> str:
-    """Extract model size from model name."""
-    # Match patterns like "30B", "8x7B", "480B", "0.5B", "3.2B", "Phi-3-mini" etc.
-    size_patterns = [
-        r'(\d+(?:\.\d+)?(?:x\d+)?B)',  # 30B, 0.5B, 3.2B, 8x7B, 480B
-        r'Phi-3-mini',  # Special case: Phi-3-mini = ~4B
-        r'Qwen2\.5-(\d+(?:\.\d+)?)B', # Qwen2.5-0.5B
-    ]
-    
-    for pattern in size_patterns:
-        match = re.search(pattern, model_name)
-        if match:
-            if 'Phi-3-mini' in model_name:
-                return '4B'  # Phi-3-mini is ~4B parameters
-            elif 'Qwen2.5' in model_name:
-                return f"{match.group(1)}B"  # Extract from Qwen2.5-0.5B
-            else:
-                return match.group(1)
-    
-    return "unknown"
-
-
-def is_quantized_4bit_from_text(text: str) -> bool:
-    t = text.lower()
-    markers = ["4bit", "4-bit", "q4", "int4", "gguf q4", "q4_k", "k_m q4", "q4_"]
-    return any(m in t for m in markers)
-
-
-def is_quantized_4bit(model_name: str) -> bool:
-    """Detect 4-bit quantization using name and `mlxk show` output if available."""
-    # Quick name-based check first
-    name = model_name.lower()
-    if any(m in name for m in ("4bit", "q4", "int4")):
-        return True
-    # Try to refine using show output
-    info = get_model_info_via_show(model_name)
-    if info and info.get("quantization"):
-        return is_quantized_4bit_from_text(info["quantization"])
-    return False
-
-
-def estimate_required_ram_gb(model_name: str, size_str: str) -> int:
-    """Estimate RAM using a combination of show-based disk size and size maps.
-
-    Strategy:
-    - Prefer `mlxk show` disk size and convert to RAM via quantization-specific factor.
-    - If a size token is known, also compute map-based estimate and take the max for safety.
-    - If no disk info and no size token, return a high sentinel to skip.
-    """
-    info = get_model_info_via_show(model_name)
-    q4 = is_quantized_4bit(model_name)
-
-    # Quantization-specific disk→RAM factor
-    try:
-        if q4:
-            factor = float(os.getenv("MLXK_TEST_FACTOR_4BIT", os.getenv("MLXK_TEST_DISK_TO_RAM_FACTOR", "0.6")))
-        else:
-            factor = float(os.getenv("MLXK_TEST_FACTOR_FP16", os.getenv("MLXK_TEST_DISK_TO_RAM_FACTOR", "0.6")))
-        factor = max(0.1, min(2.0, factor))
-    except Exception:
-        factor = 0.6
-
-    disk_ram_est = None
-    if info and info.get("size_gb") is not None:
-        disk_ram_est = max(1, math.ceil(info["size_gb"] * factor))
-    else:
-        # Fallback to list-based size if show failed
-        disk_gb = get_model_disk_size_gb(model_name)
-        if disk_gb is not None:
-            disk_ram_est = max(1, math.ceil(disk_gb * factor))
-
-    map_est = None
-    if size_str != "unknown":
-        if q4:
-            map_est = MODEL_RAM_REQUIREMENTS_4BIT.get(size_str)
-        else:
-            map_est = MODEL_RAM_REQUIREMENTS_FP16.get(size_str)
-
-    # Combine estimates conservatively
-    if disk_ram_est is not None and map_est is not None:
-        return max(disk_ram_est, map_est)
-    if disk_ram_est is not None:
-        return disk_ram_est
-    if map_est is not None:
-        return map_est
-    return 999
-
-
-def parse_size_to_gb(size_str: str) -> float:
-    """Parse a human size like '579.2 MB' or '8.5 GB' to GB as float."""
-    try:
-        parts = size_str.strip().split()
-        if len(parts) < 2:
-            return None
-        value = float(parts[0])
-        unit = parts[1].upper()
-        if unit.startswith('KB'):
-            return value / (1024 ** 2)
-        if unit.startswith('MB'):
-            return value / 1024
-        if unit.startswith('GB'):
-            return value
-        if unit.startswith('TB'):
-            return value * 1024
-        return None
-    except Exception:
-        return None
-
-
-@lru_cache(maxsize=128)
-def get_model_info_via_show(model_name: str) -> dict:
-    """Use `mlxk show <model>` to obtain size and quantization info.
-
-    Returns a dict like {"size_gb": float|None, "quantization": str|None}.
-    """
-    try:
-        res = subprocess.run(["mlxk", "show", model_name], capture_output=True, text=True, timeout=15)
-        if res.returncode != 0:
-            return {}
-        size_gb = None
-        quant_info = None
-        for raw in res.stdout.splitlines():
-            line = raw.strip()
-            if line.startswith("Size:"):
-                # Format: Size: 579.2 MB
-                size_text = line.split("Size:", 1)[1].strip()
-                val = parse_size_to_gb(size_text)
-                if val is not None:
-                    size_gb = val
-            elif line.startswith("Quantization:"):
-                quant_info = line.split("Quantization:", 1)[1].strip()
-        return {"size_gb": size_gb, "quantization": quant_info}
-    except Exception:
-        return {}
-
-
-def get_model_disk_size_gb(model_name: str) -> float:
-    info = get_model_info_via_show(model_name)
-    return info.get("size_gb") if info else None
-
-
-def get_available_models() -> List[str]:
-    """Get list of available models from MLX Knife server."""
-    try:
-        response = requests.get(f"{SERVER_BASE_URL}/v1/models", timeout=10)
-        response.raise_for_status()
-        data = response.json()
-        return [model["id"] for model in data["data"]]
-    except Exception as e:
-        pytest.skip(f"Cannot connect to MLX Knife server: {e}")
-
-
-def get_safe_models_for_system() -> List[Tuple[str, str, int]]:
-    """Get models that fit safely in available system RAM."""
-    total_ram_gb = psutil.virtual_memory().total // (1024**3)
-    available_ram_gb = psutil.virtual_memory().available // (1024**3)
-
-    # Safety margin: configurable via MLXK_TEST_RAM_SAFETY (default 0.8)
-    try:
-        safety_factor = float(os.getenv("MLXK_TEST_RAM_SAFETY", "0.8"))
-        safety_factor = max(0.1, min(1.0, safety_factor))
-    except Exception:
-        safety_factor = 0.8
-
-    # Keep 4GB headroom as hard minimum
-    max_usable_gb = min(available_ram_gb * safety_factor, total_ram_gb - 4)
-    
-    logger.info(f"System RAM: {total_ram_gb}GB total, {available_ram_gb}GB available")
-    logger.info(f"Safe limit for model testing: {max_usable_gb:.1f}GB")
-    
-    safe_models = []
-    all_models = get_available_models()
-    
-    for model in all_models:
-        size_str = extract_model_size(model)
-        required_ram = estimate_required_ram_gb(model, size_str)
-
-        if required_ram <= max_usable_gb:
-            safe_models.append((model, size_str, required_ram))
-            logger.info(f"✅ {model} ({size_str}) - fits in {required_ram}GB")
-        else:
-            logger.warning(f"⏭️  Skipping {model} ({size_str}) - needs {required_ram}GB, have {max_usable_gb:.1f}GB")
-    
-    if not safe_models:
-        pytest.skip("No models fit in available system RAM")
-    
-    return safe_models
-
-
-def has_self_conversation_markers(text: str) -> bool:
-    """Check if text contains self-conversation markers indicating Issue #14."""
-    for pattern in SELF_CONVERSATION_PATTERNS:
-        if re.search(pattern, text):
-            return True
-    return False
-
-
-def chat_completion_request(model_name: str, prompt: str, max_tokens: int = 150) -> str:
-    """Send chat completion request to MLX Knife server."""
-    payload = {
-        "model": model_name,
-        "messages": [{"role": "user", "content": prompt}],
-        "max_tokens": max_tokens,
-        "stream": False
-    }
-    
-    try:
-        response = requests.post(
-            f"{SERVER_BASE_URL}/v1/chat/completions",
-            json=payload,
-            timeout=60
-        )
-        response.raise_for_status()
-        data = response.json()
-        return data["choices"][0]["message"]["content"]
-    except Exception as e:
-        pytest.fail(f"Chat completion failed for {model_name}: {e}")
-
-
-@pytest.mark.server
-def test_issue_14_self_conversation_regression_original(mlx_server, model_name: str, size_str: str, ram_needed: int):
-    """
-    Test Issue #14: Ensure models don't continue conversations autonomously.
-    
-    This test verifies that models stop cleanly after their response without
-    generating additional conversation turns like "You:", "Human:", etc.
-    """
-    logger.info(f"🦫 Testing Issue #14 with {model_name} ({size_str}, {ram_needed}GB)")
-    
-    # Use constrained prompt to encourage natural stopping
-    test_prompt = "Write a short story about a friendly dragon in exactly 50 words."
-    
-    start_time = time.time()
-    response = chat_completion_request(model_name, test_prompt, max_tokens=100)
-    duration = time.time() - start_time
-    
-    logger.info(f"⏱️  Response time: {duration:.2f}s")
-    logger.info(f"📝 Response preview: {response[:100]}...")
-    
-    # Check for Issue #14: self-conversation markers
-    if has_self_conversation_markers(response):
-        # Log the problematic response for debugging
-        logger.error(f"❌ Self-conversation detected in {model_name}:")
-        logger.error(f"Full response: {repr(response)}")
-        pytest.fail(f"Issue #14 regression: {model_name} shows self-conversation markers")
-    
-    logger.info(f"✅ {model_name}: No self-conversation detected - Issue #14 fix working!")
-
-
-def find_existing_mlxk_servers() -> List[psutil.Process]:
-    """Find any existing MLX Knife server processes."""
-    servers = []
-    for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
-        try:
-            if proc.info['cmdline'] and any('mlxk' in arg and 'server' in arg for arg in proc.info['cmdline']):
-                servers.append(proc)
-        except (psutil.NoSuchProcess, psutil.AccessDenied):
-            continue
-    return servers
-
-
-def cleanup_zombie_servers(port: int):
-    """Clean up any zombie MLX Knife servers on the specified port."""
-    logger.info(f"🧹 Checking for existing servers on port {port}")
-    
-    # Check for processes using the port - handle macOS permission issues
-    try:
-        connections = psutil.net_connections(kind='inet')
-    except (psutil.AccessDenied, PermissionError) as e:
-        logger.warning(f"⚠️  Cannot scan network connections (permission denied): {e}")
-        logger.info("🔧 Falling back to process-based cleanup only")
-        connections = []
-    
-    for conn in connections:
-        if conn.laddr.port == port and conn.status == psutil.CONN_LISTEN:
-            try:
-                proc = psutil.Process(conn.pid)
-                logger.warning(f"⚠️  Found process {proc.pid} listening on port {port}: {proc.cmdline()}")
-                
-                if 'mlxk' in ' '.join(proc.cmdline()) and 'server' in ' '.join(proc.cmdline()):
-                    logger.info(f"🛑 Terminating existing MLX Knife server {proc.pid}")
-                    proc.terminate()
-                    try:
-                        proc.wait(timeout=5)
-                        logger.info(f"✅ Server {proc.pid} terminated gracefully")
-                    except psutil.TimeoutExpired:
-                        logger.warning(f"⚡ Force killing server {proc.pid}")
-                        proc.kill()
-                        proc.wait()
-                else:
-                    logger.error(f"❌ Port {port} is occupied by non-MLX process {proc.pid}")
-                    raise RuntimeError(f"Port {port} is busy with: {proc.cmdline()}")
-                    
-            except (psutil.NoSuchProcess, psutil.AccessDenied):
-                continue
-    
-    # Also check for any MLX Knife server processes (even if not on our port)
-    existing_servers = find_existing_mlxk_servers()
-    for server in existing_servers:
-        logger.warning(f"⚠️  Found zombie MLX Knife server: {server.pid}")
-        try:
-            server.terminate()
-            server.wait(timeout=5)
-            logger.info(f"✅ Cleaned up zombie server {server.pid}")
-        except (psutil.TimeoutExpired, psutil.NoSuchProcess):
-            try:
-                server.kill()
-                logger.info(f"⚡ Force killed zombie server {server.pid}")
-            except psutil.NoSuchProcess:
-                pass
-
-
-class MLXKnifeServerManager:
-    """Context manager for MLX Knife server lifecycle with zombie cleanup."""
-    
-    def __init__(self, port: int = 8000):
-        self.port = port
-        self.process = None
-        self.base_url = f"http://localhost:{port}"
-    
-    def start_server(self) -> bool:
-        """Start MLX Knife server and wait for it to be ready."""
-        try:
-            # Ensure signal handlers are installed for robust cleanup (server-only)
-            try:
-                pg.install_signal_handlers()
-            except Exception:
-                pass
-            # First, clean up any zombies or port conflicts
-            cleanup_zombie_servers(self.port)
-            
-            # Check if server is already running (after cleanup)
-            if self.is_server_running():
-                logger.info("🟢 MLX Knife server already running")
-                return True
-            
-            logger.info(f"🚀 Starting MLX Knife server on port {self.port}")
-            
-            # Start server process - use sys.executable to ensure same Python env
-            import sys
-            self.process = subprocess.Popen(
-                [sys.executable, "-m", "mlx_knife.cli", "server", "--port", str(self.port)],
-                stdout=subprocess.PIPE,
-                stderr=subprocess.PIPE,
-                text=True,
-                preexec_fn=os.setsid if hasattr(os, "setsid") else None
-            )
-            # Track for robust cleanup on Ctrl-C or failures
-            pg.register_popen(self.process, label="mlxk-server")
-            
-            logger.info(f"📋 Started process PID: {self.process.pid}")
-            
-            # Give it a moment to fail fast if there's an immediate error
-            time.sleep(1)
-            if self.process.poll() is not None:
-                stdout, stderr = self.process.communicate()
-                logger.error(f"❌ Server failed immediately:")
-                logger.error(f"stdout: {stdout}")
-                logger.error(f"stderr: {stderr}")
-                return False
-            
-            # Wait for server to be ready (max 30 seconds)
-            for _ in range(60):  # 30 seconds, 0.5s intervals
-                if self.is_server_running():
-                    logger.info("✅ MLX Knife server is ready")
-                    return True
-                time.sleep(0.5)
-            
-            # Timeout - get final output
-            stdout, stderr = "", ""
-            if self.process:
-                try:
-                    if self.process.poll() is None:
-                        stdout, stderr = self.process.communicate(timeout=2)
-                    else:
-                        stdout, stderr = self.process.communicate()
-                except subprocess.TimeoutExpired:
-                    stdout, stderr = "timeout", "timeout"
-            
-            logger.error("❌ Server failed to start within timeout")
-            logger.error(f"Final stdout: {stdout}")
-            logger.error(f"Final stderr: {stderr}")
-            self.stop_server()
-            return False
-            
-        except Exception as e:
-            import traceback
-            logger.error(f"❌ Failed to start server: {e}")
-            logger.error(f"Full traceback: {traceback.format_exc()}")
-            self.stop_server()
-            return False
-    
-    def stop_server(self):
-        """Stop MLX Knife server if running."""
-        if self.process:
-            logger.info("🛑 Stopping MLX Knife server")
-            try:
-                self.process.terminate()
-            except Exception:
-                pass
-            try:
-                self.process.wait(timeout=10)
-            except subprocess.TimeoutExpired:
-                logger.warning("⚠️  Server didn't stop gracefully, killing...")
-                # Try process group kill first
-                try:
-                    if hasattr(os, "killpg"):
-                        os.killpg(os.getpgid(self.process.pid), signal.SIGKILL)
-                except Exception:
-                    pass
-                try:
-                    self.process.kill()
-                    self.process.wait(timeout=3)
-                except Exception:
-                    pass
-            try:
-                pg.unregister(self.process.pid)
-            except Exception:
-                pass
-            self.process = None
-    
-    def is_server_running(self) -> bool:
-        """Check if server is running and healthy."""
-        try:
-            response = requests.get(f"{self.base_url}/health", timeout=2)
-            return response.status_code == 200
-        except:
-            return False
-    
-    def __enter__(self):
-        if not self.start_server():
-            pytest.skip("Failed to start MLX Knife server")
-        return self
-    
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        self.stop_server()
-
-
-@pytest.fixture(scope="module")
-def mlx_server():
-    """Pytest fixture to manage MLX Knife server for all tests in module."""
-    with MLXKnifeServerManager(SERVER_PORT) as server:
-        yield server
-
-
-@pytest.mark.server
-def test_server_health(mlx_server):
-    """Verify MLX Knife server is running and healthy."""
-    assert mlx_server.is_server_running(), "MLX Knife server is not healthy"
-    logger.info("🟢 MLX Knife server is healthy")
-
-
-@pytest.mark.server
-def test_issue_14_self_conversation_regression(mlx_server, model_name: str, size_str: str, ram_needed: int):
-    """
-    Test Issue #14: Ensure models don't continue conversations autonomously.
-    
-    This test verifies that models stop cleanly after their response without
-    generating additional conversation turns like "You:", "Human:", etc.
-    """
-    logger.info(f"🦫 Testing Issue #14 with {model_name} ({size_str}, {ram_needed}GB)")
-    
-    # Use constrained prompt to encourage natural stopping
-    test_prompt = "Write a short story about a friendly dragon in exactly 50 words."
-    
-    start_time = time.time()
-    response = chat_completion_request(model_name, test_prompt, max_tokens=100)
-    duration = time.time() - start_time
-    
-    logger.info(f"⏱️  Response time: {duration:.2f}s")
-    logger.info(f"📝 Response preview: {response[:100]}...")
-    
-    # Check for Issue #14: self-conversation markers
-    if has_self_conversation_markers(response):
-        # Log the problematic response for debugging
-        logger.error(f"❌ Self-conversation detected in {model_name}:")
-        logger.error(f"Full response: {repr(response)}")
-        pytest.fail(f"Issue #14 regression: {model_name} shows self-conversation markers")
-    
-    logger.info(f"✅ {model_name}: No self-conversation detected - Issue #14 fix working!")
-
-
-def get_safe_models_lazy():
-    """Lazy evaluation for parametrize to avoid import-time server calls."""
-    try:
-        return get_safe_models_for_system()
-    except:
-        # Fallback for when server isn't running yet
-        return [("test-model", "1B", 1)]
-
-
-# Dynamic test generation at runtime instead of import time
-def pytest_generate_tests(metafunc):
-    """Dynamic test parametrization to avoid import-time server calls."""
-    if "model_name" in metafunc.fixturenames:
-        # Only get models when actually running tests, not during import
-        try:
-            with MLXKnifeServerManager() as server:
-                models = get_safe_models_for_system()
-                metafunc.parametrize("model_name,size_str,ram_needed", models)
-        except Exception as e:
-            pytest.skip(f"Cannot set up server for testing: {e}")
-
-
-if __name__ == "__main__":
-    # Quick smoke test - start server first
-    print("🦫 MLX Knife Issue #14 Test - Smoke Test")
-    print("=" * 50)
-    
-    # Test server start directly without context manager
-    manager = MLXKnifeServerManager()
-    success = manager.start_server()
-    
-    print(f"🏁 Server start result: {success}")
-    
-    if success:
-        try:
-            models = get_safe_models_for_system()
-            print(f"\n📊 Safe models for this system: {len(models)}")
-            
-            total_ram = psutil.virtual_memory().total // (1024**3)
-            available_ram = psutil.virtual_memory().available // (1024**3)
-            print(f"💾 System RAM: {total_ram}GB total, {available_ram}GB available")
-            print()
-            
-            for model, size, ram in models:
-                print(f"  🎯 {model}")
-                print(f"     └─ Size: {size}, RAM needed: {ram}GB")
-            
-            print(f"\n🚀 Ready to run: pytest tests/integration/test_issue_14.py -v")
-        
-        finally:
-            manager.stop_server()
-    
-    else:
-        print("💡 Check the logs above for server start failure details")
diff --git a/tests/integration/test_issue_15_16.py b/tests/integration/test_issue_15_16.py
deleted file mode 100644
index aaa2c8c..0000000
--- a/tests/integration/test_issue_15_16.py
+++ /dev/null
@@ -1,550 +0,0 @@
-"""
-Test for Issues #15 & #16: Dynamic Model-Aware Token Limits
-
-Issue #15: Token-Limit vs Stop-Token Race Condition
-- Models cut off by artificial token limits before natural stopping
-- Solution: Context-aware token policies based on model capabilities
-
-Issue #16: Interactive vs Server Token Limit Policies  
-- Interactive mode should allow unlimited tokens for natural completion
-- Server mode needs DoS protection with reasonable limits
-- Solution: Different token policies per usage context
-
-This test is self-contained and manages its own MLX Knife server instance.
-"""
-
-import json
-import logging
-import re
-import signal
-import subprocess
-import tempfile
-import time
-from pathlib import Path
-from typing import Dict, List, Tuple
-
-import os
-import math
-from functools import lru_cache
-import psutil
-import pytest
-import requests
-try:
-    from tests.support import process_guard as pg  # pytest path
-except Exception:
-    try:
-        import sys
-        from pathlib import Path
-        sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-        from support import process_guard as pg  # type: ignore
-    except Exception:
-        class _PG:
-            @staticmethod
-            def register_popen(*args, **kwargs):
-                pass
-            @staticmethod
-            def unregister(*args, **kwargs):
-                pass
-            @staticmethod
-            def install_signal_handlers():
-                pass
-            @staticmethod
-            def kill_all(*args, **kwargs):
-                pass
-        pg = _PG()
-
-logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
-logger = logging.getLogger(__name__)
-
-# RAM estimation: separate 4-bit vs FP16 to avoid selecting huge FP16 models
-MODEL_RAM_REQUIREMENTS_4BIT = {
-    "0.5B": 1,   "1B": 2,    "3B": 4,    "4B": 5,
-    "7B": 8,     "8x7B": 16, "24B": 20,  "30B": 24,
-    "70B": 40,   "480B": 180,
-}
-
-MODEL_RAM_REQUIREMENTS_FP16 = {
-    "0.5B": 2,   "1B": 4,    "3B": 8,    "4B": 12,
-    "7B": 16,    "8x7B": 180, "24B": 48,  "30B": 60,
-    "70B": 140,  "480B": 960,
-}
-
-SERVER_BASE_URL = "http://localhost:8001"  # Different port to avoid conflicts
-SERVER_PORT = 8001
-
-
-def extract_model_size(model_name: str) -> str:
-    """Extract model size from model name."""
-    # Match patterns like "30B", "8x7B", "480B", "0.5B", "3.2B", "Phi-3-mini" etc.
-    size_patterns = [
-        r'(\d+x\d+B)',  # MoE models like "8x7B"
-        r'(\d+\.?\d*B)',  # Standard like "30B", "0.5B", "3.2B"
-        r'(mini|small|medium|large)',  # Qualitative sizes
-    ]
-    
-    for pattern in size_patterns:
-        match = re.search(pattern, model_name, re.IGNORECASE)
-        if match:
-            size = match.group(1).lower()
-            # Map qualitative sizes to quantitative
-            if size == 'mini':
-                return '3B'  # Phi-3-mini is ~4B params
-            elif size == 'small':
-                return '1B'
-            elif size == 'medium':
-                return '7B'  
-            elif size == 'large':
-                return '30B'
-            return size.upper()
-    
-    return "3B"  # Default fallback
-
-
-def is_quantized_4bit(model_name: str) -> bool:
-    name = model_name.lower()
-    return (
-        "4bit" in name or
-        "q4" in name or
-        "int4" in name
-    )
-
-
-def estimate_required_ram_gb(model_name: str, size_str: str) -> int:
-    """Estimate RAM using show-based disk size and quantization-aware maps."""
-    info = get_model_info_via_show(model_name)
-    q4 = is_quantized_4bit(model_name)
-
-    try:
-        if q4:
-            factor = float(os.getenv("MLXK_TEST_FACTOR_4BIT", os.getenv("MLXK_TEST_DISK_TO_RAM_FACTOR", "0.6")))
-        else:
-            factor = float(os.getenv("MLXK_TEST_FACTOR_FP16", os.getenv("MLXK_TEST_DISK_TO_RAM_FACTOR", "0.6")))
-        factor = max(0.1, min(2.0, factor))
-    except Exception:
-        factor = 0.6
-
-    disk_ram_est = None
-    if info and info.get("size_gb") is not None:
-        disk_ram_est = max(1, math.ceil(info["size_gb"] * factor))
-
-    map_est = None
-    if size_str:
-        if q4:
-            map_est = MODEL_RAM_REQUIREMENTS_4BIT.get(size_str)
-        else:
-            map_est = MODEL_RAM_REQUIREMENTS_FP16.get(size_str)
-
-    if disk_ram_est is not None and map_est is not None:
-        return max(disk_ram_est, map_est)
-    if disk_ram_est is not None:
-        return disk_ram_est
-    if map_est is not None:
-        return map_est
-    return 999
-
-
-def get_available_ram_gb() -> int:
-    """Get available system RAM in GB (with optional safety margin)."""
-    try:
-        total = int(psutil.virtual_memory().total / (1024**3))
-        available = int(psutil.virtual_memory().available / (1024**3))
-        safety_factor = float(os.getenv("MLXK_TEST_RAM_SAFETY", "1.0"))
-        safety_factor = max(0.1, min(1.0, safety_factor))
-        safe_usable = min(int(available * safety_factor), total - 4)
-        return max(1, safe_usable)
-    except Exception:
-        return 8  # Conservative fallback
-
-
-def get_suitable_models(available_models: List[str]) -> List[str]:
-    """Filter models based on available RAM."""
-    available_ram = get_available_ram_gb()
-    logger.info(f"Available RAM: {available_ram}GB")
-    
-    suitable = []
-    for model in available_models:
-        size = extract_model_size(model)
-        required_ram = estimate_required_ram_gb(model, size)
-
-        if required_ram <= available_ram:
-            suitable.append(model)
-            logger.info(f"✓ {model} ({size}, {required_ram}GB) - Suitable")
-        else:
-            logger.info(f"✗ {model} ({size}, {required_ram}GB) - Too large")
-    
-    return suitable
-
-
-def get_cached_models() -> List[str]:
-    """Get list of cached MLX models."""
-    try:
-        result = subprocess.run(
-            ["mlxk", "list", "--framework", "mlx"],
-            capture_output=True, text=True, timeout=10
-        )
-        if result.returncode != 0:
-            return []
-        
-        models = []
-        for line in result.stdout.split('\n'):
-            line = line.strip()
-            if line and not line.startswith('MODEL') and not line.startswith('NAME'):
-                # Extract model name from table format
-                parts = line.split()
-                if len(parts) >= 1 and not parts[0] in ['MODEL', 'NAME']:
-                    models.append(parts[0])
-        
-        return models
-    except Exception as e:
-        logger.warning(f"Failed to get cached models: {e}")
-        return []
-
-
-def parse_size_to_gb(size_str: str) -> float:
-    try:
-        parts = size_str.strip().split()
-        if len(parts) < 2:
-            return None
-        value = float(parts[0])
-        unit = parts[1].upper()
-        if unit.startswith('KB'):
-            return value / (1024 ** 2)
-        if unit.startswith('MB'):
-            return value / 1024
-        if unit.startswith('GB'):
-            return value
-        if unit.startswith('TB'):
-            return value * 1024
-        return None
-    except Exception:
-        return None
-
-
-@lru_cache(maxsize=128)
-def get_model_info_via_show(model_name: str) -> dict:
-    """Use `mlxk show` to fetch size and quantization details for a model."""
-    try:
-        res = subprocess.run(["mlxk", "show", model_name], capture_output=True, text=True, timeout=15)
-        if res.returncode != 0:
-            return {}
-        size_gb = None
-        quant_info = None
-        for raw in res.stdout.splitlines():
-            line = raw.strip()
-            if line.startswith("Size:"):
-                size_text = line.split("Size:", 1)[1].strip()
-                val = parse_size_to_gb(size_text)
-                if val is not None:
-                    size_gb = val
-            elif line.startswith("Quantization:"):
-                quant_info = line.split("Quantization:", 1)[1].strip()
-        return {"size_gb": size_gb, "quantization": quant_info}
-    except Exception:
-        return {}
-
-
-def extract_context_length_from_model(model_name: str) -> int:
-    """Extract context length from a real model's config."""
-    try:
-        result = subprocess.run(
-            ["mlxk", "show", model_name, "--config"],
-            capture_output=True, text=True, timeout=10
-        )
-        if result.returncode != 0:
-            return 4096
-            
-        # Extract JSON from the output (it comes after "Config:")
-        config_text = result.stdout
-        
-        # Find the JSON part after "Config:"
-        config_start = config_text.find("Config:")
-        if config_start == -1:
-            return 4096
-            
-        json_text = config_text[config_start + 7:].strip()  # Skip "Config:"
-        
-        try:
-            config = json.loads(json_text)
-            context_keys = [
-                "max_position_embeddings",
-                "n_positions", 
-                "context_length",
-                "max_sequence_length",
-                "seq_len"
-            ]
-            
-            for key in context_keys:
-                if key in config:
-                    return config[key]
-                    
-            return 4096
-        except json.JSONDecodeError:
-            return 4096
-            
-    except Exception:
-        return 4096
-
-
-class MLXKnifeServer:
-    """Manages MLX Knife server lifecycle for testing."""
-    
-    def __init__(self, port: int = SERVER_PORT):
-        self.port = port
-        self.process = None
-        self.base_url = f"http://localhost:{port}"
-    
-    def start(self) -> bool:
-        """Start the MLX Knife server."""
-        try:
-            # Ensure signal handlers are installed for robust cleanup (server-only)
-            try:
-                pg.install_signal_handlers()
-            except Exception:
-                pass
-            cmd = [
-                "mlxk", "server", 
-                "--host", "127.0.0.1",
-                "--port", str(self.port),
-                "--max-tokens", "1000",  # Conservative default for testing
-                "--log-level", "warning"
-            ]
-            
-            self.process = subprocess.Popen(
-                cmd,
-                stdout=subprocess.PIPE,
-                stderr=subprocess.PIPE,
-                text=True,
-                preexec_fn=os.setsid if hasattr(os, "setsid") else None
-            )
-            # Track for robust cleanup on Ctrl-C
-            pg.register_popen(self.process, label="mlxk-server")
-            
-            # Wait for server to start
-            for attempt in range(30):
-                try:
-                    response = requests.get(f"{self.base_url}/v1/models", timeout=2)
-                    if response.status_code == 200:
-                        logger.info(f"MLX Knife server started on port {self.port}")
-                        return True
-                except requests.RequestException:
-                    pass
-                
-                if self.process.poll() is not None:
-                    logger.error("Server process died during startup")
-                    return False
-                
-                time.sleep(1)
-            
-            logger.error("Server failed to start within timeout")
-            return False
-            
-        except Exception as e:
-            logger.error(f"Failed to start server: {e}")
-            return False
-    
-    def stop(self):
-        """Stop the MLX Knife server."""
-        if self.process:
-            try:
-                # Try graceful shutdown first
-                try:
-                    self.process.terminate()
-                except Exception:
-                    pass
-                try:
-                    self.process.wait(timeout=10)
-                except subprocess.TimeoutExpired:
-                    # Force kill if not responding
-                    try:
-                        if hasattr(os, "killpg"):
-                            os.killpg(os.getpgid(self.process.pid), signal.SIGKILL)
-                    except Exception:
-                        pass
-                    try:
-                        self.process.kill()
-                        self.process.wait(timeout=3)
-                    except Exception:
-                        pass
-            except Exception as e:
-                logger.warning(f"Error stopping server: {e}")
-            finally:
-                try:
-                    pg.unregister(self.process.pid)
-                except Exception:
-                    pass
-                self.process = None
-    
-    def chat_completion(self, model: str, messages: List[Dict], max_tokens: int = None) -> Dict:
-        """Send chat completion request."""
-        payload = {
-            "model": model,
-            "messages": messages,
-            "temperature": 0.3,
-            "stream": False
-        }
-        if max_tokens:
-            payload["max_tokens"] = max_tokens
-        
-        response = requests.post(
-            f"{self.base_url}/v1/chat/completions",
-            json=payload,
-            timeout=60
-        )
-        response.raise_for_status()
-        return response.json()
-
-
-@pytest.fixture(scope="module")
-def mlx_server():
-    """Provide MLX Knife server for the test session."""
-    server = MLXKnifeServer()
-    
-    if not server.start():
-        pytest.skip("Failed to start MLX Knife server")
-    
-    try:
-        yield server
-    finally:
-        server.stop()
-
-
-@pytest.fixture(scope="module") 
-def available_models():
-    """Get available models suitable for current system."""
-    all_models = get_cached_models()
-    if not all_models:
-        pytest.skip("No MLX models found in cache")
-    
-    suitable = get_suitable_models(all_models)
-    if not suitable:
-        pytest.skip("No suitable models found for current RAM")
-    
-    return suitable
-
-
-@pytest.mark.server
-class TestIssue15TokenLimitVsStopTokenRace:
-    """Test Issue #15: Token-Limit vs Stop-Token Race Condition Resolution."""
-    
-    def test_model_context_length_extraction(self, available_models):
-        """Test that we can extract context length from real models."""
-        model = available_models[0]
-        context_length = extract_context_length_from_model(model)
-        
-        assert context_length >= 512, f"Context length too small for {model}: {context_length}"
-        assert context_length <= 1048576, f"Context length unrealistic for {model}: {context_length}"  # 1M tokens max
-        
-        logger.info(f"Model {model} has context length: {context_length}")
-    
-    def test_realistic_token_limits_prevent_race_condition(self, mlx_server, available_models):
-        """Test that realistic token limits prevent race conditions."""
-        model = available_models[0]
-        context_length = extract_context_length_from_model(model)
-        
-        # Request tokens close to but under the expected server limit (context/2)
-        server_limit = context_length // 2
-        test_tokens = min(server_limit - 100, 500)  # Conservative test
-        
-        messages = [{"role": "user", "content": "Write a short story about a robot."}]
-        
-        response = mlx_server.chat_completion(model, messages, max_tokens=test_tokens)
-        
-        assert "choices" in response
-        assert len(response["choices"]) > 0
-        choice = response["choices"][0]
-        assert "message" in choice
-        assert "content" in choice["message"]
-        
-        content = choice["message"]["content"]
-        assert len(content) > 0, "No content generated"
-        
-        # The key test: model should generate reasonable content within limits
-        # without being cut off mid-sentence due to race conditions
-        logger.info(f"Generated {len(content)} characters with {test_tokens} token limit")
-
-
-@pytest.mark.server  
-class TestIssue16InteractiveVsServerTokenPolicies:
-    """Test Issue #16: Interactive vs Server Token Limit Policies Resolution."""
-    
-    def test_server_mode_uses_dos_protection_limits(self, mlx_server, available_models):
-        """Test that server mode uses DoS protection (context/2)."""
-        model = available_models[0]
-        context_length = extract_context_length_from_model(model)
-        server_limit = context_length // 2
-        
-        # Request more tokens than server limit should allow, but not too excessive for testing
-        excessive_tokens = min(server_limit + 200, 800)  # Keep reasonable for testing
-        
-        messages = [{"role": "user", "content": "Write a brief summary of machine learning."}]
-        
-        # This should work without errors - the server should internally
-        # limit tokens to the DoS protection limit
-        response = mlx_server.chat_completion(model, messages, max_tokens=excessive_tokens)
-        
-        assert "choices" in response
-        assert len(response["choices"]) > 0
-        choice = response["choices"][0]
-        assert "message" in choice
-        assert "content" in choice["message"]
-        
-        content = choice["message"]["content"]
-        assert len(content) > 0
-        
-        # The response should be successful, proving the server handles
-        # excessive token requests gracefully
-        logger.info(f"Server handled excessive token request ({excessive_tokens}) gracefully")
-        logger.info(f"Model context: {context_length}, Server limit: {server_limit}, Generated content length: {len(content)}")
-    
-    def test_server_honors_reasonable_token_requests(self, mlx_server, available_models):
-        """Test that server honors reasonable token requests."""
-        model = available_models[0] 
-        context_length = extract_context_length_from_model(model)
-        server_limit = context_length // 2
-        
-        # Request reasonable number of tokens (well under limit)
-        reasonable_tokens = min(server_limit // 4, 200)
-        
-        messages = [{"role": "user", "content": "Say hello."}]
-        
-        response = mlx_server.chat_completion(model, messages, max_tokens=reasonable_tokens)
-        
-        assert "choices" in response
-        assert len(response["choices"]) > 0
-        choice = response["choices"][0]
-        assert "message" in choice
-        assert "content" in choice["message"]
-        
-        content = choice["message"]["content"]
-        assert len(content) > 0
-        assert "hello" in content.lower() or "hi" in content.lower()
-        
-        logger.info(f"Server honored reasonable token request ({reasonable_tokens})")
-    
-    def test_model_capabilities_vs_hardcoded_limits(self, available_models):
-        """Test that models with different context lengths get appropriate limits."""
-        if len(available_models) < 2:
-            pytest.skip("Need multiple models to compare context lengths")
-        
-        model_contexts = []
-        for model in available_models[:3]:  # Test up to 3 models
-            context_length = extract_context_length_from_model(model)
-            model_contexts.append((model, context_length))
-        
-        # Verify that different models have different context lengths
-        # (or at least our system recognizes their individual capabilities)
-        contexts = [ctx for _, ctx in model_contexts]
-        
-        # At minimum, verify context extraction worked
-        for model, context in model_contexts:
-            assert context >= 1024, f"Model {model} context too small: {context}"
-            logger.info(f"Model {model}: {context} tokens context")
-        
-        # The key insight: No hardcoded 500/2000 token limits!
-        # Each model gets limits based on its actual capabilities
-        for model, context in model_contexts:
-            server_limit = context // 2
-            # Server limits should be much higher than old hardcoded limits
-            # for models with large context windows
-            if context >= 4096:
-                assert server_limit >= 2048, f"Model {model} should have server limit >= 2048, got {server_limit}"
diff --git a/tests/integration/test_lock_cleanup_bug.py b/tests/integration/test_lock_cleanup_bug.py
deleted file mode 100644
index 73be0a2..0000000
--- a/tests/integration/test_lock_cleanup_bug.py
+++ /dev/null
@@ -1,99 +0,0 @@
-#!/usr/bin/env python3
-"""
-Integration test for lock cleanup bug.
-This test reproduces the real bug found in Issue #24.
-"""
-
-from pathlib import Path
-import pytest
-
-from mlx_knife.cache_utils import _cleanup_model_locks
-
-
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestLockCleanupBug:
-    """Integration tests for lock cleanup functionality."""
-
-    def test_lock_cleanup_path_bug(self, temp_cache_dir, patch_model_cache):
-        """Test that reproduces the lock cleanup path bug.
-        
-        The bug: _cleanup_model_locks uses MODEL_CACHE.parent instead of MODEL_CACHE,
-        causing it to look for locks in the wrong directory.
-        
-        HF Cache structure:
-        cache_root/
-        └── hub/                    ← MODEL_CACHE
-            ├── .locks/             ← Correct location  
-            └── models--name/       
-            
-        Bug: looks in cache_root/.locks/ instead of cache_root/hub/.locks/
-        """
-        hub_cache = temp_cache_dir / "hub"
-        
-        with patch_model_cache(hub_cache):
-            # Create test model structure
-            model_name = "test-org/broken-model"
-            cache_dir_name = "models--test-org--broken-model"
-            
-            # Create model directory (not needed for lock cleanup, but realistic)
-            model_dir = hub_cache / cache_dir_name
-            model_dir.mkdir()
-            
-            # Create lock files in CORRECT location: hub/.locks/
-            locks_dir = hub_cache / ".locks" / cache_dir_name
-            locks_dir.mkdir(parents=True)
-            (locks_dir / "download.lock").touch()
-            (locks_dir / "process.lock").touch()
-            (locks_dir / "huggingface.lock").write_text("PID:12345")
-            (locks_dir / "another.lock").touch()
-            
-            # Verify setup
-            assert locks_dir.exists(), "Lock directory should exist"
-            lock_files = list(locks_dir.iterdir())
-            assert len(lock_files) == 4, f"Should have 4 lock files, got {len(lock_files)}"
-            
-            # This should clean up the locks, but currently fails due to path bug
-            _cleanup_model_locks(model_name, force=True)
-            
-            # BUG: Lock directory still exists because function looks in wrong path
-            # This assertion will FAIL until the bug is fixed
-            assert not locks_dir.exists(), (
-                f"❌ BUG REPRODUCED: Lock directory still exists at {locks_dir}. "
-                f"The _cleanup_model_locks function is looking in the wrong path."
-            )
-
-    def test_lock_cleanup_empty_directory(self, temp_cache_dir, patch_model_cache):
-        """Test that _cleanup_model_locks handles empty lock directories gracefully."""
-        hub_cache = temp_cache_dir / "hub"
-        
-        with patch_model_cache(hub_cache):
-            model_name = "test-org/empty-locks"
-            cache_dir_name = "models--test-org--empty-locks"
-            
-            # Create empty lock directory
-            locks_dir = hub_cache / ".locks" / cache_dir_name
-            locks_dir.mkdir(parents=True)
-            
-            assert locks_dir.exists()
-            assert len(list(locks_dir.iterdir())) == 0
-            
-            # Should handle empty directory gracefully (no-op)
-            _cleanup_model_locks(model_name, force=True)
-            
-            # Empty directory should still exist (function returns early)
-            # This will also fail due to path bug, but for different reason
-
-    def test_lock_cleanup_nonexistent_locks(self, temp_cache_dir, patch_model_cache):
-        """Test that _cleanup_model_locks handles missing lock directories gracefully."""
-        hub_cache = temp_cache_dir / "hub"
-        
-        with patch_model_cache(hub_cache):
-            model_name = "test-org/no-locks"
-            
-            # Don't create any lock directory
-            
-            # Should handle gracefully (no-op)
-            _cleanup_model_locks(model_name, force=True)
-            
-            # This should pass (no error thrown)
-            assert True, "Function should handle missing lock directories gracefully"
\ No newline at end of file
diff --git a/tests/integration/test_process_lifecycle.py b/tests/integration/test_process_lifecycle.py
deleted file mode 100644
index 790cb55..0000000
--- a/tests/integration/test_process_lifecycle.py
+++ /dev/null
@@ -1,270 +0,0 @@
-"""
-High Priority Tests: Process Lifecycle Management
-
-Tests ensure clean process handling and resource management:
-- No zombie processes after normal exit or interruption
-- Proper signal handling (SIGTERM, SIGKILL, SIGINT)
-- Resource management (file handles, sockets, memory)
-- Clean streaming interruption
-"""
-import pytest
-import subprocess
-import signal
-import time
-import psutil
-import os
-from pathlib import Path
-
-
-@pytest.mark.timeout(30)
-class TestProcessLifecycle:
-    """Test process lifecycle management and cleanup."""
-    
-    def test_no_zombie_processes_normal_exit(self, mlx_knife_process, process_monitor):
-        """Ensure normal exit leaves no background processes."""
-        # Start a simple command that should exit cleanly
-        proc = mlx_knife_process(["list"])
-        main_pid = proc.pid
-        
-        # Track child processes before termination
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # Wait for normal completion
-        return_code = proc.wait(timeout=10)
-        
-        # Verify main process exited normally
-        assert return_code == 0
-        
-        # Verify no child processes remain
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=5)
-        
-        # Double-check: no processes should be running
-        for child in children_before:
-            assert not child.is_running(), f"Zombie process detected: PID {child.pid}"
-
-    def test_no_zombie_processes_sigint(self, mlx_knife_process, process_monitor, temp_cache_dir):
-        """Ensure SIGINT (Ctrl+C) kills all child processes."""
-        # Create a mock model for a longer-running command
-        mock_model_cache = self._create_simple_mock_model(temp_cache_dir)
-        
-        # Start a command that would run longer (health check)
-        proc = mlx_knife_process(["health"])
-        main_pid = proc.pid
-        
-        # Give it a moment to start and potentially spawn children
-        time.sleep(0.5)
-        
-        # Track child processes
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # Send SIGINT (Ctrl+C equivalent)
-        proc.send_signal(signal.SIGINT)
-        
-        # Wait for termination
-        try:
-            return_code = proc.wait(timeout=10)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Process did not respond to SIGINT within timeout")
-        
-        # Verify process was interrupted
-        assert return_code != 0  # Should not exit normally
-        
-        # Verify all child processes are cleaned up
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=5)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Child process survived SIGINT: PID {child.pid}"
-
-    def test_no_zombie_processes_sigterm(self, mlx_knife_process, process_monitor, temp_cache_dir):
-        """Ensure SIGTERM leads to graceful shutdown."""
-        # Create a mock model
-        mock_model_cache = self._create_simple_mock_model(temp_cache_dir)
-        
-        # Start health check command
-        proc = mlx_knife_process(["health"])
-        main_pid = proc.pid
-        
-        time.sleep(0.5)
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # Send SIGTERM
-        proc.send_signal(signal.SIGTERM)
-        
-        try:
-            return_code = proc.wait(timeout=10)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Process did not respond to SIGTERM within timeout")
-        
-        # Verify graceful shutdown
-        assert return_code != 0  # Interrupted
-        
-        # Verify cleanup
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=5)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Child process survived SIGTERM: PID {child.pid}"
-
-    def test_process_cleanup_after_sigkill(self, mlx_knife_process, process_monitor, temp_cache_dir):
-        """Test cleanup after SIGKILL (should kill immediately)."""
-        mock_model_cache = self._create_simple_mock_model(temp_cache_dir)
-        
-        proc = mlx_knife_process(["health"])
-        main_pid = proc.pid
-        
-        time.sleep(0.5)
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # SIGKILL should kill immediately
-        proc.send_signal(signal.SIGKILL)
-        
-        try:
-            return_code = proc.wait(timeout=5)
-        except subprocess.TimeoutExpired:
-            pytest.fail("Process did not die from SIGKILL")
-        
-        # SIGKILL has specific return code
-        assert return_code == -signal.SIGKILL
-        
-        # Child processes should be cleaned up by OS
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=5)
-
-    def test_download_worker_cleanup(self, mlx_knife_process, process_monitor, temp_cache_dir, patch_model_cache):
-        """Ensure download workers don't become zombies - uses isolated cache."""
-        # This test simulates download interruption with Phi-3-mini in ISOLATED cache
-        # Any broken download will be auto-cleaned, user cache stays pristine
-        
-        with patch_model_cache(temp_cache_dir / "hub"):
-            proc = mlx_knife_process(["pull", "mlx-community/Phi-3-mini-4k-instruct-4bit", "--no-progress"])
-            main_pid = proc.pid
-            
-            # Let download start
-            time.sleep(2.0)
-            
-            children_before = process_monitor["get_process_tree"](main_pid)
-            
-            # Interrupt the download
-            proc.send_signal(signal.SIGINT)
-            
-            try:
-                return_code = proc.wait(timeout=15)
-            except subprocess.TimeoutExpired:
-                proc.kill()
-                pytest.fail("Download process did not respond to interruption")
-            
-            # Verify cleanup - this is critical for download workers
-            assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-            
-            for child in children_before:
-                if child.is_running():
-                    # Give more details about surviving process
-                    try:
-                        cmd = " ".join(child.cmdline())
-                        pytest.fail(f"Download worker survived: PID {child.pid}, CMD: {cmd}")
-                    except (psutil.NoSuchProcess, psutil.AccessDenied):
-                        pass  # Process died while we were checking
-                        
-            print("✓ Download interrupt test completed - any broken Phi-3-mini in isolated cache will be auto-cleaned")
-
-    def test_streaming_interruption_cleanup(self, mlx_knife_process, process_monitor, temp_cache_dir, patch_model_cache):
-        """Test clean cancellation of token generation streaming - uses tiny test model for isolation."""
-        # Use tiny-random-gpt2 for streaming tests to avoid dependencies on user cache
-        test_model = "hf-internal-testing/tiny-random-gpt2"
-        test_prompt = "Write a long story about a cat and a dog."
-        
-        with patch_model_cache(temp_cache_dir / "hub"):
-            # First download the model for this isolated test
-            from mlx_knife.hf_download import pull_model
-            from unittest.mock import patch
-            
-            with patch('builtins.input', return_value='y'):
-                pull_model(test_model)
-                
-            proc = mlx_knife_process(["run", test_model, test_prompt])
-            
-            # Let it start generating, then interrupt
-            time.sleep(2)  # Give it time to start
-            
-            # Send SIGINT (Ctrl+C) to interrupt gracefully
-            proc.send_signal(signal.SIGINT)
-            
-            try:
-                stdout, stderr = proc.communicate(timeout=10)
-                # Should terminate gracefully
-                assert proc.returncode is not None, "Process didn't terminate after SIGINT"
-            except subprocess.TimeoutExpired:
-                # If it doesn't respond to SIGINT, force kill
-                proc.kill()
-                stdout, stderr = proc.communicate()
-                pytest.fail("Process didn't respond to SIGINT - cleanup may have failed")
-            
-            # Check that we got some output before interruption
-            assert len(stdout) >= 0, "Process should handle interruption gracefully"
-            
-            print("✓ Streaming interrupt test completed - test model in isolated cache will be auto-cleaned")
-
-    def test_file_handle_management(self, mlx_knife_process, temp_cache_dir):
-        """Verify no file handle leaks after process termination."""
-        # Get initial file descriptor count
-        initial_fds = len(os.listdir("/proc/self/fd")) if os.path.exists("/proc/self/fd") else 0
-        
-        mock_model_cache = self._create_simple_mock_model(temp_cache_dir)
-        
-        # Run several operations
-        for _ in range(3):
-            proc = mlx_knife_process(["list"])
-            proc.wait(timeout=10)
-        
-        # Check file descriptors haven't grown significantly
-        if os.path.exists("/proc/self/fd"):
-            final_fds = len(os.listdir("/proc/self/fd"))
-            # Allow some tolerance for test framework overhead
-            assert final_fds <= initial_fds + 5, f"Potential file handle leak: {initial_fds} -> {final_fds}"
-
-    def _create_simple_mock_model(self, temp_cache_dir: Path) -> Path:
-        """Helper to create a simple mock model for testing."""
-        cache_name = "models--test--model"
-        model_dir = temp_cache_dir / cache_name / "snapshots" / "main"
-        model_dir.mkdir(parents=True, exist_ok=True)
-        
-        (model_dir / "config.json").write_text('{"model_type": "test"}')
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-        (model_dir / "model.safetensors").write_bytes(b"fake_model_data" * 100)
-        
-        return model_dir
-
-
-@pytest.mark.timeout(60)
-class TestResourceManagement:
-    """Test resource management and memory cleanup."""
-    
-    def test_memory_cleanup_after_operations(self, mlx_knife_process, temp_cache_dir):
-        """Verify memory is properly released after operations."""
-        # This is a basic test - real memory testing would require more sophisticated tools
-        mock_model_cache = self._create_simple_mock_model(temp_cache_dir)
-        
-        # Run operations and ensure they complete without hanging
-        operations = [
-            ["list"],
-            ["health"],
-            ["show", "test/model"]  # This should gracefully handle non-existent model
-        ]
-        
-        for op in operations:
-            proc = mlx_knife_process(op)
-            return_code = proc.wait(timeout=15)
-            # Operations should complete (may fail, but should not hang)
-            assert return_code is not None, f"Operation {op} hung"
-
-    def _create_simple_mock_model(self, temp_cache_dir: Path) -> Path:
-        """Helper to create a simple mock model for testing."""
-        cache_name = "models--test--model"
-        model_dir = temp_cache_dir / cache_name / "snapshots" / "main"
-        model_dir.mkdir(parents=True, exist_ok=True)
-        
-        (model_dir / "config.json").write_text('{"model_type": "test"}')
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-        (model_dir / "model.safetensors").write_bytes(b"fake_model_data" * 100)
-        
-        return model_dir
\ No newline at end of file
diff --git a/tests/integration/test_real_model_lifecycle.py b/tests/integration/test_real_model_lifecycle.py
deleted file mode 100644
index 1f88bd1..0000000
--- a/tests/integration/test_real_model_lifecycle.py
+++ /dev/null
@@ -1,349 +0,0 @@
-"""
-Integration tests for real model lifecycle using tiny real models.
-
-This replaces heavily mocked tests with comprehensive integration tests using
-hf-internal-testing/tiny-random-gpt2 (112k params, ~500KB) to test:
-- Real file system operations
-- Real path resolution logic  
-- Real framework detection
-- Real lock cleanup (our main bug from Issue #23)
-- End-to-end model lifecycle: pull → list → show → rm
-
-Strategy: ONE pull for all tests to be efficient, then comprehensive testing
-of the full pipeline with real files and directories.
-"""
-import pytest
-import os
-import shutil
-from pathlib import Path
-from unittest.mock import patch
-from mlx_knife.hf_download import pull_model
-from mlx_knife.cache_utils import (
-    list_models, show_model, rm_model, find_matching_models,
-    resolve_single_model, is_model_healthy, detect_framework,
-    hf_to_cache_dir, MODEL_CACHE
-)
-
-
-class TestRealModelLifecycle:
-    """Test complete model lifecycle with real tiny model in isolated cache."""
-    
-    TEST_MODEL = "hf-internal-testing/tiny-random-gpt2"
-    EXPECTED_SIZE_RANGE = (10_000_000, 15_000_000)  # ~12.5MB expected
-    
-    @staticmethod
-    def get_current_model_cache():
-        """Get the current model cache path (resolves HF_HOME dynamically)."""
-        cache_root = Path(os.environ.get("HF_HOME", Path.home() / ".cache/huggingface"))
-        return cache_root / "hub"
-    
-    @pytest.fixture(scope="class", autouse=True)
-    def setup_isolated_model(self, class_temp_cache_dir):
-        """Download test model to isolated cache before all tests in this class."""
-        print(f"\n=== Downloading {self.TEST_MODEL} to isolated test cache ===")
-        print(f"Test cache location: {class_temp_cache_dir}")
-        
-        # Patch MODEL_CACHE to point to our isolated cache
-        from mlx_knife import cache_utils
-        original_model_cache = cache_utils.MODEL_CACHE
-        cache_utils.MODEL_CACHE = class_temp_cache_dir / "hub"
-        
-        try:
-            # Pull the tiny test model (patch input to auto-confirm)
-            with patch('builtins.input', return_value='y'):
-                pull_model(self.TEST_MODEL)
-            
-            # Verify model exists in isolated cache
-            cache_dir_name = hf_to_cache_dir(self.TEST_MODEL)
-            model_cache_path = cache_utils.MODEL_CACHE / cache_dir_name
-            
-            if not model_cache_path.exists():
-                print(f"HF_HOME: {os.environ.get('HF_HOME', 'not set')}")
-                print(f"Expected cache path: {model_cache_path}")
-                print(f"Cache contents: {list(cache_utils.MODEL_CACHE.iterdir()) if cache_utils.MODEL_CACHE.exists() else 'does not exist'}")
-                pytest.fail(f"Model download failed - cache directory not found: {model_cache_path}")
-                
-            print(f"✅ Successfully downloaded {self.TEST_MODEL}")
-            print(f"📁 Model cached at: {model_cache_path}")
-            print(f"🔒 Using isolated test cache (user cache untouched)")
-            
-            # Fixture runs for all tests in this class
-            yield
-            
-        finally:
-            # Restore original MODEL_CACHE
-            cache_utils.MODEL_CACHE = original_model_cache
-            print(f"\n=== Test cache cleanup and MODEL_CACHE restored ===")
-    
-    def test_01_model_downloaded_successfully(self):
-        """Test that real model download created proper file structure."""
-        from mlx_knife import cache_utils
-        cache_dir_name = hf_to_cache_dir(self.TEST_MODEL)
-        model_cache_path = cache_utils.MODEL_CACHE / cache_dir_name
-        
-        # Verify top-level structure exists
-        assert model_cache_path.exists(), f"Model cache directory missing: {model_cache_path}"
-        assert (model_cache_path / "snapshots").exists(), "Snapshots directory missing"
-        assert (model_cache_path / "refs").exists(), "Refs directory missing"
-        
-        # Verify refs/main exists and points to a hash
-        refs_main = model_cache_path / "refs" / "main"
-        assert refs_main.exists(), "refs/main missing"
-        
-        commit_hash = refs_main.read_text().strip()
-        assert len(commit_hash) >= 8, f"Invalid commit hash: {commit_hash}"
-        
-        # Verify snapshot directory exists for the hash
-        snapshot_dir = model_cache_path / "snapshots" / commit_hash
-        assert snapshot_dir.exists(), f"Snapshot directory missing: {snapshot_dir}"
-        
-        # Verify essential model files exist
-        config_json = snapshot_dir / "config.json"
-        assert config_json.exists(), "config.json missing"
-        
-        # Check file size is reasonable (tiny model should be ~500KB total)
-        total_size = sum(f.stat().st_size for f in snapshot_dir.rglob("*") if f.is_file())
-        assert self.EXPECTED_SIZE_RANGE[0] <= total_size <= self.EXPECTED_SIZE_RANGE[1], \
-            f"Model size {total_size} outside expected range {self.EXPECTED_SIZE_RANGE}"
-        
-        print(f"✓ Real model downloaded: {total_size:,} bytes in {snapshot_dir}")
-    
-    def test_02_list_shows_downloaded_model(self):
-        """Test that list command shows our real downloaded model."""
-        # Use list with health check to verify model is detected and healthy
-        import io
-        import contextlib
-        
-        stdout_capture = io.StringIO()
-        with contextlib.redirect_stdout(stdout_capture):
-            list_models(show_all=True, show_health=True)  # Show all models with health status
-        
-        output = stdout_capture.getvalue()
-        
-        # Verify our test model appears in the output
-        assert self.TEST_MODEL in output or "tiny-random-gpt2" in output, \
-            f"Test model not found in list output: {output}"
-        
-        print(f"✓ Model appears in list output with health status")
-    
-    def test_03_show_detects_real_framework(self):
-        """Test that show command detects framework for real model."""
-        import io
-        import contextlib
-        
-        stdout_capture = io.StringIO()
-        with contextlib.redirect_stdout(stdout_capture):
-            show_model(self.TEST_MODEL)
-        
-        output = stdout_capture.getvalue()
-        
-        # Verify show command produced output about our model
-        assert self.TEST_MODEL in output or "tiny-random-gpt2" in output, \
-            f"Model not found in show output: {output}"
-        
-        # Should have framework detection
-        assert "Framework:" in output, f"Framework detection missing: {output}"
-        
-        # Should have health status
-        assert "Health:" in output, f"Health status missing: {output}"
-        
-        # Should show size information
-        assert any(keyword in output.lower() for keyword in ["size", "gb", "mb", "kb"]), \
-            f"Size information missing: {output}"
-        
-        print(f"✓ Show command detected framework and health for real model")
-    
-    def test_04_find_matching_works_with_real_model(self):
-        """Test that fuzzy matching works with real model."""
-        # Test exact match
-        exact_matches = find_matching_models(self.TEST_MODEL)
-        assert len(exact_matches) >= 1, f"Exact match failed for {self.TEST_MODEL}"
-        
-        # Test partial match
-        partial_matches = find_matching_models("tiny-random")
-        assert len(partial_matches) >= 1, f"Partial match failed for 'tiny-random'"
-        
-        # Verify our model is in the matches
-        model_names = [match[1] for match in partial_matches]
-        assert any(self.TEST_MODEL in name for name in model_names), \
-            f"Test model not found in partial matches: {model_names}"
-        
-        print(f"✓ Fuzzy matching works: {len(partial_matches)} matches for 'tiny-random'")
-    
-    def test_05_resolve_real_model_paths(self):
-        """Test that path resolution works with real model."""
-        # Test exact model resolution
-        model_path, resolved_name, commit_hash = resolve_single_model(self.TEST_MODEL)
-        
-        assert model_path is not None, f"Failed to resolve model path for {self.TEST_MODEL}"
-        assert model_path.exists(), f"Resolved path does not exist: {model_path}"
-        assert resolved_name == self.TEST_MODEL, f"Name resolution incorrect: {resolved_name}"
-        assert commit_hash is not None, f"Commit hash not resolved"
-        assert len(commit_hash) >= 8, f"Invalid commit hash: {commit_hash}"
-        
-        # Test fuzzy resolution
-        fuzzy_path, fuzzy_name, fuzzy_hash = resolve_single_model("tiny-random")
-        
-        assert fuzzy_path is not None, f"Fuzzy resolution failed for 'tiny-random'"
-        assert fuzzy_path.exists(), f"Fuzzy resolved path does not exist: {fuzzy_path}"
-        
-        # Both should resolve to same model
-        assert fuzzy_path == model_path, f"Fuzzy and exact paths differ: {fuzzy_path} vs {model_path}"
-        
-        print(f"✓ Path resolution works: {model_path}")
-    
-    def test_06_health_check_on_real_model(self):
-        """Test health checking on real model files."""
-        # Resolve model to get path
-        model_path, _, _ = resolve_single_model(self.TEST_MODEL)
-        assert model_path is not None, "Model resolution failed"
-        
-        # Test health check
-        is_healthy = is_model_healthy(self.TEST_MODEL)
-        
-        # Real downloaded model should be healthy
-        assert is_healthy, f"Real model reported as unhealthy: {self.TEST_MODEL}"
-        
-        # Test framework detection
-        framework = detect_framework(model_path, self.TEST_MODEL)
-        assert framework is not None, f"Framework detection failed for real model"
-        assert isinstance(framework, str), f"Framework should be string: {framework}"
-        assert len(framework) > 0, f"Empty framework detected: {framework}"
-        
-        print(f"✓ Health check passed, framework: {framework}")
-        
-        # Also test using show command for health verification
-        import io
-        import contextlib
-        
-        stdout_capture = io.StringIO()
-        with contextlib.redirect_stdout(stdout_capture):
-            show_model(self.TEST_MODEL)
-        
-        show_output = stdout_capture.getvalue()
-        assert "Health:" in show_output, f"Health status missing in show output: {show_output}"
-        
-        print(f"✓ Show command also reports health status correctly")
-    
-    def test_07_rm_cleans_locks_and_model(self):
-        """Test that rm command cleans both model AND locks (Issue #23 fix)."""
-        # Verify model exists before deletion
-        model_path, _, _ = resolve_single_model(self.TEST_MODEL)
-        assert model_path is not None, "Model should exist before deletion"
-        assert model_path.exists(), f"Model path should exist before deletion: {model_path}"
-        
-        # Get model cache directory and expected locks directory  
-        from mlx_knife import cache_utils
-        cache_dir_name = hf_to_cache_dir(self.TEST_MODEL)
-        model_cache_path = cache_utils.MODEL_CACHE / cache_dir_name
-        locks_dir = cache_utils.MODEL_CACHE / ".locks" / cache_dir_name
-        
-        # Create some test lock files if they don't exist
-        if not locks_dir.exists():
-            locks_dir.mkdir(parents=True)
-            (locks_dir / "test.lock").touch()
-        
-        lock_files_before = list(locks_dir.iterdir()) if locks_dir.exists() else []
-        
-        print(f"Before deletion:")
-        print(f"  Model cache: {model_cache_path.exists()}")
-        print(f"  Locks dir: {locks_dir.exists()}")  
-        print(f"  Lock files: {len(lock_files_before)}")
-        
-        # Remove model with force=True (no prompts)
-        rm_model(self.TEST_MODEL, force=True)
-        
-        # Verify BOTH model and locks are cleaned up
-        model_exists_after = model_cache_path.exists()
-        locks_exist_after = locks_dir.exists()
-        
-        print(f"After deletion:")
-        print(f"  Model cache: {model_exists_after}")
-        print(f"  Locks dir: {locks_exist_after}")
-        
-        # Issue #23 fix: Both should be deleted
-        assert not model_exists_after, f"Model cache should be deleted: {model_cache_path}"
-        assert not locks_exist_after, f"Locks directory should be deleted: {locks_dir}"
-        
-        print(f"✓ rm command cleaned both model and locks (Issue #23 fix verified)")
-    
-    def test_08_model_completely_removed(self):
-        """Test end-to-end verification that model is completely gone."""
-        # Verify model no longer appears in list
-        import io
-        import contextlib
-        
-        stdout_capture = io.StringIO()
-        with contextlib.redirect_stdout(stdout_capture):
-            list_models(show_all=True)  # Show all models, not just MLX ones
-        
-        output = stdout_capture.getvalue()
-        
-        # Our test model should NOT appear in output anymore
-        assert self.TEST_MODEL not in output, \
-            f"Model still appears in list after deletion: {output}"
-        assert "tiny-random-gpt2" not in output, \
-            f"Model name still appears in list after deletion: {output}"
-        
-        # Verify resolution fails
-        model_path, resolved_name, commit_hash = resolve_single_model(self.TEST_MODEL)
-        assert model_path is None, f"Model path should be None after deletion: {model_path}"
-        assert resolved_name is None, f"Resolved name should be None after deletion: {resolved_name}"
-        
-        # Verify fuzzy matching also fails  
-        matches = find_matching_models("tiny-random")
-        model_names = [match[1] for match in matches] if matches else []
-        assert not any(self.TEST_MODEL in name for name in model_names), \
-            f"Model still found in fuzzy matches: {model_names}"
-        
-        print(f"✓ Model completely removed from cache and indexes")
-
-
-class TestIntegrationTestSelfCheck:
-    """Meta-test: Verify integration tests are working properly."""
-    
-    def test_integration_test_downloads_real_files(self):
-        """Verify this integration test actually downloaded real files."""
-        # This test runs after TestRealModelLifecycle, so model should be cleaned up
-        # But we can verify the test ran by checking if we have network access
-        # and that the model we tried to download is a real HF model
-        
-        model = TestRealModelLifecycle.TEST_MODEL
-        assert "/" in model, f"Model name should have org/repo format: {model}"
-        assert "tiny" in model.lower(), f"Should use tiny model for tests: {model}"
-        assert "gpt2" in model.lower(), f"Should use GPT2 for compatibility: {model}"
-        
-        # Verify size expectations are reasonable for integration tests
-        min_size, max_size = TestRealModelLifecycle.EXPECTED_SIZE_RANGE
-        assert min_size < max_size, "Size range should be valid"
-        assert max_size < 20_000_000, "Test model should be reasonably small for CI efficiency"
-        
-        print(f"✓ Integration test configuration validated: {model}")
-    
-    def test_integration_vs_unit_test_coverage(self):
-        """Verify integration tests cover areas missed by unit tests."""
-        # This integration test should cover:
-        # 1. Real file system operations (not mocked)
-        # 2. Real path resolution logic  
-        # 3. Real framework detection
-        # 4. Real lock cleanup (Issue #23)
-        # 5. End-to-end workflows
-        
-        # Count methods in TestRealModelLifecycle
-        test_methods = [method for method in dir(TestRealModelLifecycle) 
-                       if method.startswith('test_')]
-        
-        # Should have comprehensive lifecycle coverage
-        assert len(test_methods) >= 7, f"Should have comprehensive test coverage: {len(test_methods)} tests"
-        
-        # Should test specific functionality
-        method_names = ' '.join(test_methods)
-        assert 'download' in method_names, "Should test downloading"
-        assert 'list' in method_names, "Should test listing" 
-        assert 'show' in method_names, "Should test showing"
-        assert 'resolve' in method_names, "Should test resolution"
-        assert 'health' in method_names, "Should test health checks"
-        assert 'rm' in method_names or 'remove' in method_names, "Should test removal"
-        assert 'lock' in method_names, "Should test lock cleanup (Issue #23)"
-        
-        print(f"✓ Integration tests provide comprehensive lifecycle coverage: {len(test_methods)} tests")
\ No newline at end of file
diff --git a/tests/integration/test_run_command_advanced.py b/tests/integration/test_run_command_advanced.py
deleted file mode 100644
index b64d0d0..0000000
--- a/tests/integration/test_run_command_advanced.py
+++ /dev/null
@@ -1,431 +0,0 @@
-"""
-Advanced Tests for Run Command
-
-Tests the most problematic aspects of the run command:
-- Process lifecycle during model execution
-- Memory management with model loading/unloading  
-- Streaming interruption handling
-- Error conditions and recovery
-"""
-import pytest
-import subprocess
-import signal
-import time
-import threading
-from pathlib import Path
-
-
-@pytest.mark.timeout(120)
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestRunCommandProcessLifecycle:
-    """Test process management during model execution."""
-    
-    def test_run_command_normal_completion(self, mlx_knife_process, process_monitor, mock_model_cache):
-        """Test run command completes normally and cleans up."""
-        # Create a mock model (won't actually run, but tests process handling)
-        mock_model_cache("test-model", healthy=True)
-        
-        proc = mlx_knife_process(["run", "test-model", "Hello"])
-        main_pid = proc.pid
-        
-        # Track child processes
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        try:
-            # Wait for completion (will likely fail due to mock model, but should not hang)
-            return_code = proc.wait(timeout=30)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung during execution")
-        
-        # Should complete (success or failure, but not hang)
-        assert return_code is not None, "Run command did not complete"
-        
-        # Verify child process cleanup
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Run command left zombie process: PID {child.pid}"
-
-    def test_run_command_sigint_during_execution(self, mlx_knife_process, process_monitor, mock_model_cache):
-        """Test interruption during model execution."""
-        mock_model_cache("test-model", healthy=True)
-        
-        proc = mlx_knife_process(["run", "test-model", "This is a longer prompt that might take time"])
-        main_pid = proc.pid
-        
-        # Give it time to start
-        time.sleep(2)
-        
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # Send interrupt
-        proc.send_signal(signal.SIGINT)
-        
-        try:
-            return_code = proc.wait(timeout=20)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command did not respond to SIGINT")
-        
-        # Should exit on interrupt
-        assert return_code is not None
-        assert return_code != 0  # Should not exit normally
-        
-        # Clean up child processes
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Run child process survived SIGINT: PID {child.pid}"
-
-    def test_run_command_sigterm_handling(self, mlx_knife_process, process_monitor, mock_model_cache):
-        """Test SIGTERM during model execution."""
-        mock_model_cache("test-model", healthy=True)
-        
-        proc = mlx_knife_process(["run", "test-model", "Test prompt"])
-        main_pid = proc.pid
-        
-        time.sleep(2)
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # Send SIGTERM
-        proc.send_signal(signal.SIGTERM)
-        
-        try:
-            return_code = proc.wait(timeout=20)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command did not respond to SIGTERM")
-        
-        assert return_code is not None
-        assert return_code != 0
-        
-        # Cleanup verification
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Run child survived SIGTERM: PID {child.pid}"
-
-    def test_run_command_model_loading_failure(self, mlx_knife_process, process_monitor):
-        """Test process cleanup when model loading fails."""
-        # Use nonexistent model to trigger loading failure
-        proc = mlx_knife_process(["run", "nonexistent-model-12345", "Test prompt"])
-        main_pid = proc.pid
-        
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        try:
-            return_code = proc.wait(timeout=20)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung on model loading failure")
-        
-        # Should fail gracefully
-        assert return_code is not None
-        assert return_code != 0  # Should fail on missing model
-        
-        # Should not leave zombies even on failure
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=5)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Process survived model loading failure: PID {child.pid}"
-
-
-@pytest.mark.timeout(90)
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestRunCommandMemoryManagement:
-    """Test memory management during run command execution."""
-    
-    def test_run_command_memory_cleanup_after_completion(self, mlx_knife_process, mock_model_cache):
-        """Test memory is released after run command completes."""
-        mock_model_cache("test-model", healthy=True)
-        
-        # Run command multiple times to test memory cleanup
-        for i in range(3):
-            proc = mlx_knife_process(["run", "test-model", f"Test prompt {i}"])
-            
-            try:
-                return_code = proc.wait(timeout=25)
-            except subprocess.TimeoutExpired:
-                proc.kill()
-                pytest.fail(f"Run command {i} hung")
-            
-            # Should complete (may fail, but should not hang)
-            assert return_code is not None, f"Run command {i} did not complete"
-
-    def test_run_command_memory_cleanup_on_interruption(self, mlx_knife_process, process_monitor, mock_model_cache):
-        """Test memory cleanup when run is interrupted."""
-        mock_model_cache("test-model", healthy=True)
-        
-        proc = mlx_knife_process(["run", "test-model", "Longer test prompt for interruption"])
-        main_pid = proc.pid
-        
-        # Let it start
-        time.sleep(3)
-        
-        # Interrupt
-        proc.send_signal(signal.SIGINT)
-        
-        try:
-            return_code = proc.wait(timeout=15)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command did not handle interruption")
-        
-        # Verify cleanup
-        assert return_code is not None
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-
-    def test_run_command_handles_corrupted_model(self, mlx_knife_process, mock_model_cache):
-        """Test run command handles corrupted models gracefully."""
-        # Create corrupted model
-        mock_model_cache("broken-model", healthy=False, corruption_type="truncated_safetensors")
-        
-        proc = mlx_knife_process(["run", "broken-model", "Test prompt"])
-        
-        try:
-            return_code = proc.wait(timeout=20)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung on corrupted model")
-        
-        # Should fail gracefully on corrupted model
-        assert return_code is not None
-        assert return_code != 0  # Should fail
-
-
-@pytest.mark.timeout(60)
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestRunCommandStreamingAndOutput:
-    """Test streaming and output handling in run command."""
-    
-    def test_run_command_streaming_interruption(self, mlx_knife_process):
-        """Test interruption during token streaming with real MLX model."""
-        test_model = "Phi-3-mini-4k-instruct-4bit"
-        # Use prompt that would generate substantial output
-        test_prompt = "Explain machine learning in detail with examples."
-        
-        proc = mlx_knife_process(["run", test_model, test_prompt])
-        
-        # Let streaming start, then interrupt
-        time.sleep(3)  # Allow generation to begin
-        
-        # Send interrupt signal
-        proc.send_signal(signal.SIGINT)
-        
-        try:
-            stdout, stderr = proc.communicate(timeout=15)
-            # Should handle interruption gracefully
-            assert proc.returncode is not None, "Process should terminate after interrupt"
-            # Should have generated some output before interruption
-            assert len(stdout) > 0, "Should have some output before interruption"
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            stdout, stderr = proc.communicate()
-            pytest.fail("Process didn't respond to interruption signal")
-
-    def test_run_command_output_handling(self, mlx_knife_process, mock_model_cache):
-        """Test that run command handles output correctly."""
-        mock_model_cache("test-model", healthy=True)
-        
-        proc = mlx_knife_process(["run", "test-model", "Hello"])
-        
-        try:
-            stdout, stderr = proc.communicate(timeout=20)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung during output")
-        
-        # Should produce some output (even if error message)
-        total_output = len(stdout) + len(stderr)
-        assert total_output > 0, "Run command produced no output"
-
-    def test_run_command_long_prompt_handling(self, mlx_knife_process, mock_model_cache):
-        """Test run command with very long prompts."""
-        mock_model_cache("test-model", healthy=True)
-        
-        # Create long prompt
-        long_prompt = "This is a test prompt. " * 100  # ~2500 characters
-        
-        proc = mlx_knife_process(["run", "test-model", long_prompt])
-        
-        try:
-            return_code = proc.wait(timeout=25)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung on long prompt")
-        
-        # Should handle long prompt without hanging
-        assert return_code is not None
-
-    def test_run_command_special_characters(self, mlx_knife_process, mock_model_cache):
-        """Test run command handles special characters in prompts."""
-        mock_model_cache("test-model", healthy=True)
-        
-        special_prompts = [
-            "Hello 世界",  # Unicode
-            "Test with \"quotes\" and 'apostrophes'",  # Quotes
-            "Newlines\nand\ttabs",  # Whitespace
-            "emoji 🚀 test",  # Emoji
-        ]
-        
-        for prompt in special_prompts:
-            proc = mlx_knife_process(["run", "test-model", prompt])
-            
-            try:
-                return_code = proc.wait(timeout=20)
-            except subprocess.TimeoutExpired:
-                proc.kill()
-                pytest.fail(f"Run command hung on special characters: {prompt[:20]}...")
-            
-            # Should handle special characters gracefully
-            assert return_code is not None
-
-
-@pytest.mark.timeout(45)
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestRunCommandErrorConditions:
-    """Test run command error handling."""
-    
-    def test_run_command_insufficient_memory(self, mlx_knife_process, mock_model_cache):
-        """Test behavior when system might be low on memory."""
-        mock_model_cache("large-model", healthy=True)
-        
-        # We can't actually simulate low memory, but we can test the process handles errors
-        proc = mlx_knife_process(["run", "large-model", "Test prompt"])
-        
-        try:
-            return_code = proc.wait(timeout=25)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung during error condition")
-        
-        # Should complete (success or failure)
-        assert return_code is not None
-
-    def test_run_command_missing_dependencies(self, mlx_knife_process):
-        """Test run command when model dependencies might be missing."""
-        # Try to run with invalid model to test error handling
-        proc = mlx_knife_process(["run", "invalid/missing-model", "Test"])
-        
-        try:
-            return_code = proc.wait(timeout=15)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung on missing dependencies")
-        
-        # Should fail gracefully
-        assert return_code is not None
-        assert return_code != 0
-
-    def test_run_command_multiple_concurrent_executions(self, mlx_knife_process, mock_model_cache):
-        """Test multiple concurrent run commands don't interfere."""
-        mock_model_cache("test-model", healthy=True)
-        
-        processes = []
-        
-        # Start multiple run commands
-        for i in range(3):
-            proc = mlx_knife_process(["run", "test-model", f"Concurrent test {i}"])
-            processes.append(proc)
-        
-        # Wait for all to complete
-        for i, proc in enumerate(processes):
-            try:
-                return_code = proc.wait(timeout=30)
-            except subprocess.TimeoutExpired:
-                proc.kill()
-                pytest.fail(f"Concurrent run command {i} hung")
-            
-            # Each should complete independently
-            assert return_code is not None, f"Concurrent run {i} did not complete"
-
-
-@pytest.mark.timeout(60)
-@pytest.mark.usefixtures("temp_cache_dir")
-class TestRunCommandContextAwareLimits:
-    """Test context-aware token limits in Issues #15 and #16 resolution."""
-    
-    def test_context_length_extraction_from_real_model(self, mlx_knife_process, mock_model_cache):
-        """Test that context length is correctly extracted from real model configs."""
-        # Create a mock model with realistic config.json
-        model_path = mock_model_cache("test-model", healthy=True)
-        
-        # Add custom config.json with specific context length
-        config_content = {
-            "max_position_embeddings": 4096,
-            "hidden_size": 768,
-            "num_attention_heads": 12
-        }
-        import json
-        (model_path / "config.json").write_text(json.dumps(config_content))
-        
-        # Test that the model context length is accessible
-        # This is an indirect test - we test that the run command uses model-aware limits
-        # by checking that it doesn't hang with realistic models
-        proc = mlx_knife_process([
-            "run", "test-model", "Test prompt",
-            "--max-tokens", "8000",  # Request more than typical model context
-            "--verbose"
-        ])
-        
-        try:
-            # Should complete within timeout (won't actually generate due to mock)
-            return_code = proc.wait(timeout=30)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Run command hung with high max-tokens")
-            
-        # Should complete (may fail due to mock model, but shouldn't hang)
-        assert return_code is not None, "Run command did not complete with context-aware limits"
-        
-    def test_server_vs_interactive_token_policies(self, mock_model_cache):
-        """Test that server mode uses DoS protection while interactive mode uses full context."""
-        # This test validates the architectural decision:
-        # - Server mode: context_length / 2 (DoS protection)
-        # - Interactive mode: full context_length
-        
-        from mlx_knife.mlx_runner import MLXRunner, get_model_context_length
-        import tempfile
-        import json
-        import os
-        
-        # Create a temporary model directory with config
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {"max_position_embeddings": 4096}
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            # Test context length extraction
-            context_length = get_model_context_length(temp_dir)
-            assert context_length == 4096, "Context length extraction failed"
-            
-            # Test MLXRunner effective token calculation
-            runner = MLXRunner(temp_dir, verbose=False)
-            runner._context_length = 4096
-            
-            # Interactive mode should use full context
-            interactive_tokens = runner.get_effective_max_tokens(8000, interactive=True)
-            assert interactive_tokens == 4096, f"Interactive mode should use full context: {interactive_tokens}"
-            
-            # Server mode should use half context (DoS protection)
-            server_tokens = runner.get_effective_max_tokens(8000, interactive=False)
-            assert server_tokens == 2048, f"Server mode should use half context: {server_tokens}"
-            
-            # User requests smaller than limits should be honored
-            small_interactive = runner.get_effective_max_tokens(1000, interactive=True)
-            assert small_interactive == 1000, "Small requests should be honored in interactive mode"
-            
-            small_server = runner.get_effective_max_tokens(1000, interactive=False)  
-            assert small_server == 1000, "Small requests should be honored in server mode"
-            
-            # Test None behavior (new CLI default=None logic)
-            # Interactive mode with None should use full context
-            none_interactive = runner.get_effective_max_tokens(None, interactive=True)
-            assert none_interactive == 4096, "None in interactive mode should use full context"
-            
-            # Server mode with None should use server limit
-            none_server = runner.get_effective_max_tokens(None, interactive=False)  
-            assert none_server == 2048, "None in server mode should use server limit (context/2)"
\ No newline at end of file
diff --git a/tests/integration/test_server_functionality.py b/tests/integration/test_server_functionality.py
deleted file mode 100644
index 79c09f0..0000000
--- a/tests/integration/test_server_functionality.py
+++ /dev/null
@@ -1,555 +0,0 @@
-"""
-High Priority Tests: Server Functionality
-
-Tests for the OpenAI-compatible API server:
-- Server startup and shutdown
-- Process lifecycle during server operations  
-- API endpoint availability
-- Request handling and response format
-- Server interruption and cleanup
-"""
-import pytest
-import subprocess
-import time
-import requests
-import signal
-import json
-from pathlib import Path
-
-
-@pytest.mark.timeout(60)
-class TestServerLifecycle:
-    """Test server startup, operation, and shutdown."""
-    
-    def test_server_startup_shutdown(self, mlx_knife_process, process_monitor):
-        """Test server starts and shuts down cleanly."""
-        # Start server
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8001"])
-        main_pid = proc.pid
-        
-        # Give server time to start
-        time.sleep(3)
-        
-        # Check if server is responsive (basic health check)
-        try:
-            response = requests.get("http://127.0.0.1:8001/health", timeout=5)
-            server_started = response.status_code == 200
-        except requests.exceptions.RequestException:
-            # Server might not have health endpoint, that's OK
-            server_started = proc.poll() is None  # Process still running
-        
-        # Track child processes
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # Shutdown server
-        proc.send_signal(signal.SIGINT)
-        
-        try:
-            return_code = proc.wait(timeout=15)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Server did not shutdown within timeout")
-        
-        # Verify clean shutdown
-        assert return_code is not None, "Server process did not terminate"
-        
-        # Verify all child processes cleaned up
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Server child process survived: PID {child.pid}"
-
-    def test_server_sigterm_handling(self, mlx_knife_process, process_monitor):
-        """Test server responds to SIGTERM gracefully."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8002"])
-        main_pid = proc.pid
-        
-        time.sleep(3)
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # Send SIGTERM
-        proc.send_signal(signal.SIGTERM)
-        
-        try:
-            return_code = proc.wait(timeout=15)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            pytest.fail("Server did not respond to SIGTERM")
-        
-        # Should exit gracefully
-        assert return_code is not None
-        
-        # Clean up child processes
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-        
-        for child in children_before:
-            assert not child.is_running(), f"Server child survived SIGTERM: PID {child.pid}"
-
-    def test_server_sigkill_cleanup(self, mlx_knife_process, process_monitor):
-        """Test cleanup after SIGKILL."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8003"])
-        main_pid = proc.pid
-        
-        time.sleep(3)
-        children_before = process_monitor["get_process_tree"](main_pid)
-        
-        # SIGKILL should kill immediately
-        proc.send_signal(signal.SIGKILL)
-        
-        try:
-            return_code = proc.wait(timeout=10)
-        except subprocess.TimeoutExpired:
-            pytest.fail("Process did not die from SIGKILL")
-        
-        assert return_code == -signal.SIGKILL
-        
-        # Child processes should be cleaned up by OS
-        assert process_monitor["wait_for_cleanup"](main_pid, timeout=10)
-
-    def test_server_port_binding_conflicts(self, mlx_knife_process):
-        """Test server handles port conflicts gracefully."""
-        # Start first server
-        proc1 = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8004"])
-        time.sleep(3)
-        
-        # Try to start second server on same port
-        proc2 = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8004"])
-        
-        try:
-            # Second server should fail quickly
-            return_code2 = proc2.wait(timeout=10)
-            assert return_code2 != 0, "Second server should fail on port conflict"
-        except subprocess.TimeoutExpired:
-            proc2.kill()
-            pytest.fail("Second server did not fail quickly on port conflict")
-        finally:
-            # Clean up first server
-            if proc1.poll() is None:
-                proc1.send_signal(signal.SIGINT)
-                proc1.wait(timeout=10)
-
-    def test_server_invalid_arguments(self, mlx_knife_process):
-        """Test server handles invalid arguments gracefully."""
-        invalid_configs = [
-            ["server", "--port", "99999"],  # Invalid port
-            ["server", "--host", "invalid-host"],  # Invalid host
-            ["server", "--max-tokens", "-1"],  # Invalid max tokens
-        ]
-        
-        for config in invalid_configs:
-            proc = mlx_knife_process(config)
-            try:
-                return_code = proc.wait(timeout=10)
-                # Should fail gracefully, not hang
-                assert return_code is not None, f"Server hung on invalid config: {config}"
-                assert return_code != 0, f"Server should fail on invalid config: {config}"
-            except subprocess.TimeoutExpired:
-                proc.kill()
-                pytest.fail(f"Server hung on invalid config: {config}")
-
-
-@pytest.mark.timeout(90)
-class TestServerAPI:
-    """Test server API functionality."""
-    
-    def test_server_health_endpoint(self, mlx_knife_process):
-        """Test server health/status endpoint if available."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8005"])
-        
-        # Wait for server to start
-        time.sleep(4)
-        
-        try:
-            # Try common health endpoints
-            health_endpoints = [
-                "http://127.0.0.1:8005/health",
-                "http://127.0.0.1:8005/v1/models",
-                "http://127.0.0.1:8005/",
-            ]
-            
-            server_responsive = False
-            for endpoint in health_endpoints:
-                try:
-                    response = requests.get(endpoint, timeout=5)
-                    if response.status_code in [200, 404]:  # 404 is OK, means server is running
-                        server_responsive = True
-                        break
-                except requests.exceptions.RequestException:
-                    continue
-            
-            # Server should be responsive to at least one endpoint
-            assert server_responsive, "Server not responsive to any health endpoints"
-            
-        finally:
-            # Clean up
-            if proc.poll() is None:
-                proc.send_signal(signal.SIGINT)
-                proc.wait(timeout=15)
-
-    def test_server_openai_models_endpoint(self, mlx_knife_process):
-        """Test OpenAI-compatible /v1/models endpoint."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8006"])
-        
-        time.sleep(4)
-        
-        try:
-            response = requests.get("http://127.0.0.1:8006/v1/models", timeout=10)
-            
-            # Should respond (may be empty list if no models)
-            assert response.status_code == 200, f"Models endpoint failed: {response.status_code}"
-            
-            # Should return valid JSON
-            try:
-                data = response.json()
-                assert isinstance(data, dict), "Models endpoint should return JSON object"
-                # OpenAI format typically has 'data' field
-                if 'data' in data:
-                    assert isinstance(data['data'], list), "Models data should be a list"
-            except json.JSONDecodeError:
-                pytest.fail("Models endpoint returned invalid JSON")
-                
-        except requests.exceptions.RequestException as e:
-            pytest.fail(f"Failed to connect to models endpoint: {e}")
-        finally:
-            if proc.poll() is None:
-                proc.send_signal(signal.SIGINT)
-                proc.wait(timeout=15)
-
-    def test_server_chat_completions_endpoint(self, mlx_knife_process):
-        """Test OpenAI-compatible /v1/chat/completions endpoint structure."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8007"])
-        
-        time.sleep(4)
-        
-        try:
-            # Test with minimal valid request
-            payload = {
-                "model": "test-model",
-                "messages": [{"role": "user", "content": "Hello"}],
-                "max_tokens": 10
-            }
-            
-            response = requests.post(
-                "http://127.0.0.1:8007/v1/chat/completions",
-                json=payload,
-                timeout=15
-            )
-            
-            # Should respond (may be error if no models, but shouldn't hang)
-            assert response.status_code is not None, "Chat completions endpoint hung"
-            
-            # Should return JSON response
-            try:
-                data = response.json()
-                assert isinstance(data, dict), "Chat completions should return JSON object"
-                
-                if response.status_code == 200:
-                    # Valid response should have expected fields
-                    assert 'choices' in data or 'error' in data
-                elif response.status_code == 400:
-                    # Bad request should have error message
-                    assert 'error' in data
-                    
-            except json.JSONDecodeError:
-                pytest.fail("Chat completions returned invalid JSON")
-                
-        except requests.exceptions.RequestException as e:
-            pytest.fail(f"Failed to connect to chat completions endpoint: {e}")
-        finally:
-            if proc.poll() is None:
-                proc.send_signal(signal.SIGINT)
-                proc.wait(timeout=15)
-
-    @pytest.mark.server
-    def test_issue_19_server_token_limits_regression(self, mlx_knife_process):
-        """
-        Regression test for Issue #19: Server output truncation at ~1000 words.
-        
-        Tests that server respects --max-tokens parameter and doesn't truncate
-        responses prematurely due to hardcoded 2000 token default.
-        """
-        # Test with low max-tokens (should truncate early)
-        proc_low = mlx_knife_process([
-            "server", "--host", "127.0.0.1", "--port", "8008", 
-            "--max-tokens", "100"  # Very low limit
-        ])
-        
-        time.sleep(4)
-        
-        try:
-            # Long-form prompt that should trigger Issue #19 behavior
-            # Based on real user scenario that exposed the original truncation bug
-            trilogy_prompt = """Here is the outline for a fantasy trilogy "EMBERS OF THE FORGOTTEN":
-
-**MAIN CHARACTERS:**
-1. Kaelen Veyra - The Exiled Flame Herald (32, war poet, controls Soulfire)
-2. Sylra D'Tharn - The Shadow Warrior (28, assassin, uses Emotionweave)  
-3. Lord Morvath - The Unforgotten King (45, tragic villain with Grief-Crown)
-
-**TRILOGY STRUCTURE:**
-- Book I: "Embers of the Forgotten" - The flame that remembers
-- Book II: "The Lovers' Crucible" - The fire that doesn't burn
-- Book III: "The Fire That Binds" - The flame that connects
-
-**THEMES:** Love as power not weakness, memory as healing, emotions as connection
-
-**YOUR TASK:** Write the complete first chapter of Book I: "The Poet Who Burned" 
-- Focus on Kaelen's exile from Celestine after his beloved Lirien's execution
-- Include his arrival at Veyra (Valley of Faces) with 30 lost masks
-- Show his Soulfire powers and emotional depth
-- Use poetic, mythic language with deep inner rhythm
-- Target: 2000+ words with full character development and dialogue
-- End with the mysterious mask whispering: "You were here - a thousand years ago"
-
-Write the complete chapter now."""
-
-            payload_long = {
-                "model": "test-model", 
-                "messages": [{"role": "user", "content": trilogy_prompt}],
-                "stream": False,
-                "temperature": 0.7
-            }
-            
-            response_low = requests.post(
-                "http://127.0.0.1:8008/v1/chat/completions",
-                json=payload_long,
-                timeout=30
-            )
-            
-            # Should respond with some content but truncated
-            if response_low.status_code == 200:
-                data_low = response_low.json()
-                if 'choices' in data_low and data_low['choices']:
-                    content_low = data_low['choices'][0].get('message', {}).get('content', '')
-                    # With max-tokens=100, content should be short
-                    assert len(content_low.split()) < 200, f"Low token limit not enforced: {len(content_low.split())} words"
-                    
-        except (requests.exceptions.RequestException, json.JSONDecodeError):
-            # If no model available, test structure is still validated
-            pass
-        finally:
-            if proc_low.poll() is None:
-                proc_low.send_signal(signal.SIGINT)
-                proc_low.wait(timeout=15)
-        
-        # Test with high max-tokens (should allow longer responses)  
-        proc_high = mlx_knife_process([
-            "server", "--host", "127.0.0.1", "--port", "8009",
-            "--max-tokens", "10000"  # High limit
-        ])
-        
-        time.sleep(4)
-        
-        try:
-            response_high = requests.post(
-                "http://127.0.0.1:8009/v1/chat/completions",
-                json=payload_long,
-                timeout=60
-            )
-            
-            # Should allow longer responses
-            if response_high.status_code == 200:
-                data_high = response_high.json()
-                if 'choices' in data_high and data_high['choices']:
-                    content_high = data_high['choices'][0].get('message', {}).get('content', '')
-                    # High token limit should allow more content (if model available)
-                    # This test validates server respects the --max-tokens parameter
-                    assert isinstance(content_high, str), "Response content should be string"
-                    
-        except (requests.exceptions.RequestException, json.JSONDecodeError):
-            pass
-        finally:
-            if proc_high.poll() is None:
-                proc_high.send_signal(signal.SIGINT)
-                proc_high.wait(timeout=15)
-
-    def test_server_startup_token_limit_messages(self, mlx_knife_process):
-        """Test that server startup shows correct token limit configuration."""
-        # Test default (None) shows dynamic limits message
-        proc_default = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8010"])
-        time.sleep(4)
-        
-        try:
-            # Stop server first to avoid blocking read
-            if proc_default.poll() is None:
-                proc_default.send_signal(signal.SIGINT)
-                proc_default.wait(timeout=15)
-            
-            # Now safely read stdout after server shutdown
-            stdout_data = proc_default.stdout.read() if proc_default.stdout else b""
-            stdout_text = stdout_data.decode('utf-8', errors='ignore')
-            
-            # Should show dynamic limits message when no --max-tokens specified
-            if stdout_text.strip():  # Only check if we got output
-                assert "model-aware dynamic limits" in stdout_text, f"Expected dynamic limits message, got: {stdout_text}"
-            
-        except Exception:
-            # If no stdout capture available, test passes (infrastructure limitation)
-            pass
-                
-        # Test explicit --max-tokens shows numeric value
-        proc_explicit = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8011", "--max-tokens", "5000"])
-        time.sleep(4)
-        
-        try:
-            # Stop server first to avoid blocking read
-            if proc_explicit.poll() is None:
-                proc_explicit.send_signal(signal.SIGINT)
-                proc_explicit.wait(timeout=15)
-            
-            # Now safely read stdout after server shutdown
-            stdout_data = proc_explicit.stdout.read() if proc_explicit.stdout else b""
-            stdout_text = stdout_data.decode('utf-8', errors='ignore')
-            
-            # Should show explicit numeric value
-            if stdout_text.strip():  # Only check if we got output
-                assert "5000" in stdout_text, f"Expected '5000' in startup message, got: {stdout_text}"
-            
-        except Exception:
-            pass
-
-    def test_server_streaming_endpoint(self, mlx_knife_process):
-        """Test streaming functionality if available."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8008"])
-        
-        time.sleep(4)
-        
-        try:
-            # Test streaming request
-            payload = {
-                "model": "test-model", 
-                "messages": [{"role": "user", "content": "Hi"}],
-                "max_tokens": 5,
-                "stream": True
-            }
-            
-            response = requests.post(
-                "http://127.0.0.1:8008/v1/chat/completions",
-                json=payload,
-                timeout=20,
-                stream=True
-            )
-            
-            # Should respond to streaming request
-            assert response.status_code is not None, "Streaming endpoint hung"
-            
-            # Should handle streaming gracefully (may error if no model)
-            if response.status_code == 200:
-                # Should return SSE format or similar
-                assert 'text/plain' in response.headers.get('content-type', '') or \
-                       'text/event-stream' in response.headers.get('content-type', '') or \
-                       'application/json' in response.headers.get('content-type', '')
-                       
-        except requests.exceptions.RequestException as e:
-            pytest.fail(f"Streaming endpoint connection failed: {e}")
-        finally:
-            if proc.poll() is None:
-                proc.send_signal(signal.SIGINT)
-                proc.wait(timeout=15)
-
-
-@pytest.mark.timeout(45)
-class TestServerResourceManagement:
-    """Test server resource management."""
-    
-    def test_server_memory_cleanup_after_shutdown(self, mlx_knife_process):
-        """Test that server cleans up memory after shutdown."""
-        # Start and stop server multiple times
-        for i in range(3):
-            proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", f"800{9+i}"])
-            
-            time.sleep(2)
-            
-            # Shutdown cleanly
-            proc.send_signal(signal.SIGINT)
-            return_code = proc.wait(timeout=15)
-            
-            assert return_code is not None, f"Server {i} did not shutdown"
-
-    def test_server_handles_multiple_requests(self, mlx_knife_process):
-        """Test server can handle multiple concurrent requests without hanging."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8012"])
-        
-        time.sleep(4)
-        
-        try:
-            # Send multiple requests concurrently
-            import threading
-            import queue
-            
-            results = queue.Queue()
-            
-            def make_request(endpoint):
-                try:
-                    response = requests.get(f"http://127.0.0.1:8012{endpoint}", timeout=10)
-                    results.put(("success", response.status_code))
-                except Exception as e:
-                    results.put(("error", str(e)))
-            
-            # Start multiple threads
-            threads = []
-            endpoints = ["/v1/models", "/v1/models", "/v1/models"]
-            
-            for endpoint in endpoints:
-                thread = threading.Thread(target=make_request, args=(endpoint,))
-                threads.append(thread)
-                thread.start()
-            
-            # Wait for all threads
-            for thread in threads:
-                thread.join(timeout=20)
-                assert not thread.is_alive(), "Request thread hung"
-            
-            # Check results
-            success_count = 0
-            while not results.empty():
-                result_type, result_value = results.get()
-                if result_type == "success":
-                    success_count += 1
-            
-            # At least some requests should succeed
-            assert success_count > 0, "No requests succeeded"
-            
-        finally:
-            if proc.poll() is None:
-                proc.send_signal(signal.SIGINT)
-                proc.wait(timeout=15)
-
-    def test_server_request_interruption(self, mlx_knife_process):
-        """Test server handles request interruption cleanly."""
-        proc = mlx_knife_process(["server", "--host", "127.0.0.1", "--port", "8013"])
-        
-        time.sleep(4)
-        
-        try:
-            # Start a request and interrupt it
-            import threading
-            
-            def make_slow_request():
-                try:
-                    requests.get("http://127.0.0.1:8013/v1/models", timeout=2)
-                except:
-                    pass  # Expected to timeout/fail
-            
-            # Start request in background
-            request_thread = threading.Thread(target=make_slow_request)
-            request_thread.start()
-            
-            # Give request time to start
-            time.sleep(1)
-            
-            # Shutdown server while request is in progress
-            proc.send_signal(signal.SIGINT)
-            return_code = proc.wait(timeout=15)
-            
-            # Server should shutdown cleanly even with active requests
-            assert return_code is not None, "Server hung during request interruption"
-            
-            # Request thread should complete
-            request_thread.join(timeout=10)
-            assert not request_thread.is_alive(), "Request thread hung after server shutdown"
-            
-        finally:
-            if proc.poll() is None:
-                proc.kill()
-                proc.wait()
\ No newline at end of file
diff --git a/tests/support/__init__.py b/tests/support/__init__.py
deleted file mode 100644
index 5db9c06..0000000
--- a/tests/support/__init__.py
+++ /dev/null
@@ -1,2 +0,0 @@
-"""Shared test support utilities (process guards, helpers)."""
-
diff --git a/tests/support/process_guard.py b/tests/support/process_guard.py
deleted file mode 100644
index 9dcc271..0000000
--- a/tests/support/process_guard.py
+++ /dev/null
@@ -1,164 +0,0 @@
-"""
-Process guard for pytest integration tests.
-
-Tracks spawned server subprocesses and ensures they are terminated on
-Ctrl-C (SIGINT), SIGTERM, normal test teardown, and at interpreter exit.
-
-Usage:
-- Call `register_popen(proc, label)` after starting a subprocess.
-- Optionally `unregister(pid)` after clean termination.
-- Handlers are installed automatically when importing this module, but
-  can also be installed explicitly via `install_signal_handlers()`.
-"""
-from __future__ import annotations
-
-import atexit
-import os
-import signal
-import threading
-import time
-from typing import Dict, Optional
-
-import psutil
-
-_registry_lock = threading.RLock()
-_registry: Dict[int, Dict[str, Optional[int]]] = {}
-_handlers_installed = False
-
-
-def _safe_get_pgid(pid: int) -> Optional[int]:
-    try:
-        return os.getpgid(pid)
-    except Exception:
-        return None
-
-
-def register_popen(proc, label: str = "tracked-proc") -> None:
-    """Register a subprocess.Popen for guarded cleanup."""
-    if proc is None:
-        return
-    pid = getattr(proc, "pid", None)
-    if not pid:
-        return
-    pgid = _safe_get_pgid(pid)
-    with _registry_lock:
-        _registry[pid] = {"label": label, "pgid": pgid}
-
-
-def unregister(pid: int) -> None:
-    with _registry_lock:
-        _registry.pop(pid, None)
-
-
-def _kill_pid_tree(pid: int, timeout: float = 1.0) -> None:
-    """Terminate a process and its children, escalating if needed."""
-    try:
-        proc = psutil.Process(pid)
-    except psutil.NoSuchProcess:
-        return
-
-    # Try to terminate children first
-    children = proc.children(recursive=True)
-    for ch in children:
-        try:
-            ch.terminate()
-        except psutil.NoSuchProcess:
-            pass
-
-    # Terminate main process
-    try:
-        proc.terminate()
-    except psutil.NoSuchProcess:
-        return
-
-    t0 = time.time()
-    while time.time() - t0 < timeout:
-        if not proc.is_running():
-            return
-        time.sleep(0.1)
-
-    # Escalate to kill
-    for ch in children:
-        try:
-            ch.kill()
-        except psutil.NoSuchProcess:
-            pass
-    try:
-        proc.kill()
-    except psutil.NoSuchProcess:
-        pass
-
-
-def kill_all(label_filter: Optional[str] = None) -> None:
-    """Kill all tracked processes (optionally filtered by label)."""
-    with _registry_lock:
-        items = list(_registry.items())
-
-    for pid, meta in items:
-        label = (meta or {}).get("label")
-        pgid = (meta or {}).get("pgid")
-        if label_filter and label != label_filter:
-            continue
-        # Try process group termination first (POSIX)
-        if pgid and pgid > 0 and hasattr(os, "killpg"):
-            try:
-                os.killpg(pgid, signal.SIGTERM)
-                # Give the group a moment
-                time.sleep(0.2)
-            except Exception:
-                pass
-        # Fallback to individual tree kill with short timeout
-        _kill_pid_tree(pid, timeout=0.8)
-        # Final escalation: SIGKILL the group if still around
-        if pgid and pgid > 0 and hasattr(os, "killpg"):
-            try:
-                os.killpg(pgid, signal.SIGKILL)
-            except Exception:
-                pass
-        unregister(pid)
-
-
-def _signal_handler_factory(prev_handler):
-    def _handler(signum, frame):
-        # Best-effort kill of tracked server processes
-        try:
-            kill_all()
-        finally:
-            # Chain to previous handler behavior
-            if callable(prev_handler):
-                try:
-                    prev_handler(signum, frame)
-                    return
-                except Exception:
-                    # If previous handler was Python's default raising KeyboardInterrupt,
-                    # re-raise to allow pytest to handle interruption.
-                    raise
-            # If default/ignore, re-send signal to self to honor semantics
-            try:
-                signal.signal(signum, signal.SIG_DFL)
-                os.kill(os.getpid(), signum)
-            except Exception:
-                pass
-    return _handler
-
-
-def install_signal_handlers() -> None:
-    global _handlers_installed
-    if _handlers_installed:
-        return
-    if os.environ.get("MLXK_TEST_DISABLE_PROCESS_GUARD"):
-        _handlers_installed = True
-        return
-    # Chain SIGINT and SIGTERM
-    for sig in (signal.SIGINT, signal.SIGTERM):
-        try:
-            prev = signal.getsignal(sig)
-            signal.signal(sig, _signal_handler_factory(prev))
-        except Exception:
-            pass
-    atexit.register(lambda: kill_all())
-    _handlers_installed = True
-
-
-# Note: Do NOT auto-install on import. Tests that need the guard should call
-# install_signal_handlers() explicitly to avoid interfering with non-server runs.
diff --git a/tests/unit/test_cache_utils.py b/tests/unit/test_cache_utils.py
deleted file mode 100644
index 84641c6..0000000
--- a/tests/unit/test_cache_utils.py
+++ /dev/null
@@ -1,902 +0,0 @@
-"""
-Unit tests for cache_utils.py module.
-
-Tests the core model management functions:
-- Model discovery and metadata extraction
-- Health checking logic
-- Cache operations
-"""
-import pytest
-import tempfile
-import shutil
-import json
-from pathlib import Path
-from unittest.mock import patch, MagicMock, call
-
-# Import the module under test
-from mlx_knife.cache_utils import (
-    expand_model_name,
-    hf_to_cache_dir, 
-    cache_dir_to_hf,
-    is_model_healthy,
-    detect_framework,
-    list_models,
-    find_matching_models,
-    resolve_single_model
-)
-
-
-class TestModelNameExpansion:
-    """Test model name expansion logic."""
-    
-    def test_expand_short_names(self):
-        """Test expansion of common short model names."""
-        test_cases = [
-            ("Phi-3-mini", "mlx-community/Phi-3-mini-4k-instruct-4bit"),
-            ("Mistral-7B", "mlx-community/Mistral-7B-Instruct-v0.3-4bit"),
-            ("Llama-3-8B", "mlx-community/Meta-Llama-3-8B-Instruct-4bit"),
-        ]
-        
-        for short_name, expected in test_cases:
-            try:
-                result = expand_model_name(short_name)
-                # Should either expand correctly or return the original name
-                assert isinstance(result, str)
-                assert len(result) > 0
-            except Exception as e:
-                pytest.fail(f"expand_model_name failed for {short_name}: {e}")
-
-    def test_expand_full_names(self):
-        """Test that full model names are returned unchanged."""
-        full_names = [
-            "mlx-community/Phi-3-mini-4k-instruct-4bit",
-            "microsoft/Phi-3-mini-4k-instruct",
-            "meta-llama/Llama-2-7b-chat-hf"
-        ]
-        
-        for full_name in full_names:
-            try:
-                result = expand_model_name(full_name)
-                # Should return the name as-is or expand it
-                assert isinstance(result, str)
-                assert len(result) > 0
-            except Exception as e:
-                pytest.fail(f"expand_model_name failed for {full_name}: {e}")
-
-    def test_expand_invalid_names(self):
-        """Test handling of invalid or nonsense model names."""
-        invalid_names = [
-            "definitely-not-a-model-12345",
-            "",
-            "   ",
-            "invalid/model/with/too/many/slashes"
-        ]
-        
-        for invalid_name in invalid_names:
-            try:
-                result = expand_model_name(invalid_name)
-                # Should handle gracefully - either return input or raise appropriate error
-                if result is not None:
-                    assert isinstance(result, str)
-            except Exception:
-                # It's OK to raise exceptions for invalid names
-                pass
-
-
-class TestCacheDirectoryConversion:
-    """Test cache directory name conversion functions."""
-    
-    def test_hf_to_cache_dir(self):
-        """Test HuggingFace model name to cache directory conversion."""
-        test_cases = [
-            ("microsoft/Phi-3-mini-4k-instruct", "models--microsoft--Phi-3-mini-4k-instruct"),
-            ("meta-llama/Llama-2-7b", "models--meta-llama--Llama-2-7b"),
-            ("simple-model", "models--simple-model"),
-        ]
-        
-        for hf_name, expected_cache_dir in test_cases:
-            try:
-                result = hf_to_cache_dir(hf_name)
-                assert isinstance(result, str)
-                # Should follow HF cache naming convention
-                assert result.startswith("models--")
-                assert "--" in result
-            except Exception as e:
-                pytest.fail(f"hf_to_cache_dir failed for {hf_name}: {e}")
-
-    def test_cache_dir_to_hf(self):
-        """Test cache directory to HuggingFace model name conversion."""
-        test_cases = [
-            ("models--microsoft--Phi-3-mini-4k-instruct", "microsoft/Phi-3-mini-4k-instruct"),
-            ("models--meta-llama--Llama-2-7b", "meta-llama/Llama-2-7b"),
-            ("models--simple-model", "simple-model"),
-        ]
-        
-        for cache_dir, expected_hf_name in test_cases:
-            try:
-                result = cache_dir_to_hf(cache_dir)
-                assert isinstance(result, str)
-                # Should reverse the cache directory format
-                assert "/" in result or len(result.split("--")) == 1
-            except Exception as e:
-                pytest.fail(f"cache_dir_to_hf failed for {cache_dir}: {e}")
-
-    def test_round_trip_conversion(self):
-        """Test that conversion functions are inverses."""
-        test_names = [
-            "microsoft/Phi-3-mini-4k-instruct",
-            "simple-model",
-            "org/model-name-with-dashes"
-        ]
-        
-        for original_name in test_names:
-            try:
-                cache_dir = hf_to_cache_dir(original_name)
-                recovered_name = cache_dir_to_hf(cache_dir)
-                
-                assert recovered_name == original_name, \
-                    f"Round trip failed: {original_name} -> {cache_dir} -> {recovered_name}"
-            except Exception as e:
-                pytest.fail(f"Round trip conversion failed for {original_name}: {e}")
-
-
-class TestModelHealthCheck:
-    """Test model health checking logic."""
-    
-    def test_healthy_model_structure(self, temp_cache_dir):
-        """Test health check on properly structured model."""
-        # Create a healthy model structure
-        model_dir = temp_cache_dir / "models--test--model" / "snapshots" / "main"
-        model_dir.mkdir(parents=True)
-        
-        # Create required files
-        (model_dir / "config.json").write_text('{"model_type": "test", "architectures": ["TestModel"]}')
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0", "tokenizer": {}}')
-        (model_dir / "model.safetensors").write_bytes(b"fake_model_weights" * 100)
-        
-        try:
-            is_healthy = is_model_healthy(str(model_dir))
-            # Should be True for healthy model
-            assert isinstance(is_healthy, bool)
-        except Exception as e:
-            pytest.fail(f"Health check failed on healthy model: {e}")
-
-    def test_missing_config_detection(self, temp_cache_dir):
-        """Test detection of missing config.json."""
-        model_dir = temp_cache_dir / "models--test--model" / "snapshots" / "main"
-        model_dir.mkdir(parents=True)
-        
-        # Missing config.json
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-        (model_dir / "model.safetensors").write_bytes(b"fake_weights")
-        
-        try:
-            is_healthy = is_model_healthy(str(model_dir))
-            # Should detect missing config
-            assert isinstance(is_healthy, bool)
-            # Likely should be False, but depends on implementation
-        except Exception as e:
-            # It's OK to raise exception for missing config
-            pass
-
-    def test_missing_tokenizer_detection(self, temp_cache_dir):
-        """Test detection of missing tokenizer.json."""
-        model_dir = temp_cache_dir / "models--test--model" / "snapshots" / "main"
-        model_dir.mkdir(parents=True)
-        
-        # Missing tokenizer.json
-        (model_dir / "config.json").write_text('{"model_type": "test"}')
-        (model_dir / "model.safetensors").write_bytes(b"fake_weights")
-        
-        try:
-            is_healthy = is_model_healthy(str(model_dir))
-            assert isinstance(is_healthy, bool)
-        except Exception as e:
-            # OK to raise exception for missing tokenizer
-            pass
-
-    def test_missing_model_weights(self, temp_cache_dir):
-        """Test detection of missing model weights."""
-        model_dir = temp_cache_dir / "models--test--model" / "snapshots" / "main"
-        model_dir.mkdir(parents=True)
-        
-        # Missing model files
-        (model_dir / "config.json").write_text('{"model_type": "test"}')
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-        # No .safetensors files
-        
-        try:
-            is_healthy = is_model_healthy(str(model_dir))
-            assert isinstance(is_healthy, bool)
-        except Exception as e:
-            # OK to raise exception for missing weights
-            pass
-
-    def test_lfs_pointer_detection(self, temp_cache_dir):
-        """Test detection of LFS pointer files."""
-        model_dir = temp_cache_dir / "models--test--model" / "snapshots" / "main"
-        model_dir.mkdir(parents=True)
-        
-        (model_dir / "config.json").write_text('{"model_type": "test"}')
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-        
-        # Create LFS pointer file instead of actual weights
-        lfs_content = (
-            "version https://git-lfs.github.com/spec/v1\n"
-            "oid sha256:abc123def456\n"
-            "size 1000000000\n"
-        )
-        (model_dir / "model.safetensors").write_text(lfs_content)
-        
-        try:
-            is_healthy = is_model_healthy(str(model_dir))
-            # Should detect LFS pointer as unhealthy
-            assert isinstance(is_healthy, bool)
-        except Exception as e:
-            # OK to raise exception for LFS pointers
-            pass
-
-    def test_nonexistent_directory(self):
-        """Test health check on nonexistent directory."""
-        nonexistent_path = "/this/path/definitely/does/not/exist"
-        
-        try:
-            is_healthy = is_model_healthy(nonexistent_path)
-            # Should handle gracefully
-            assert isinstance(is_healthy, bool)
-            assert is_healthy is False  # Nonexistent should be unhealthy
-        except Exception:
-            # OK to raise exception for nonexistent path
-            pass
-
-
-class TestFrameworkDetection:
-    """Test model framework detection logic."""
-    
-    def test_mlx_model_detection(self, temp_cache_dir):
-        """Test detection of MLX-compatible models."""
-        model_dir = temp_cache_dir / "models--mlx-community--test-model" / "snapshots" / "main"
-        model_dir.mkdir(parents=True)
-        
-        # Create MLX model config
-        mlx_config = {
-            "model_type": "llama",
-            "architectures": ["LlamaForCausalLM"],
-            "quantization": {"group_size": 64, "bits": 4}  # MLX quantization
-        }
-        (model_dir / "config.json").write_text(json.dumps(mlx_config))
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-        (model_dir / "model.safetensors").write_bytes(b"mlx_weights")
-        
-        try:
-            from pathlib import Path
-            framework = detect_framework(Path(str(model_dir)), "mlx-community/test-model")
-            assert isinstance(framework, str)
-            # Should detect as MLX or compatible
-        except Exception as e:
-            pytest.fail(f"Framework detection failed on MLX model: {e}")
-
-    def test_pytorch_model_detection(self, temp_cache_dir):
-        """Test detection of PyTorch models."""
-        model_dir = temp_cache_dir / "models--pytorch--test-model" / "snapshots" / "main"
-        model_dir.mkdir(parents=True)
-        
-        # Create PyTorch model config
-        pytorch_config = {
-            "model_type": "bert",
-            "architectures": ["BertForSequenceClassification"],
-            "torch_dtype": "float32"
-        }
-        (model_dir / "config.json").write_text(json.dumps(pytorch_config))
-        (model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
-        (model_dir / "pytorch_model.bin").write_bytes(b"pytorch_weights")
-        
-        try:
-            from pathlib import Path
-            framework = detect_framework(Path(str(model_dir)), "pytorch/test-model")
-            assert isinstance(framework, str)
-        except Exception as e:
-            pytest.fail(f"Framework detection failed on PyTorch model: {e}")
-
-
-class TestModelListing:
-    """Test model listing functionality."""
-    
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    def test_list_models_empty_cache(self, mock_cache, temp_cache_dir):
-        """Test model listing in empty cache."""
-        mock_cache.__str__ = lambda: str(temp_cache_dir)
-        mock_cache.exists.return_value = True
-        mock_cache.glob.return_value = []
-        
-        try:
-            # list_models prints to stdout, so we test it doesn't crash
-            list_models(verbose=False)
-        except Exception as e:
-            pytest.fail(f"Model listing failed on empty cache: {e}")
-
-    def test_list_models_real_empty_cache(self, temp_cache_dir):
-        """Test Issue #21: list_models with real empty HF_HOME directory."""
-        import os
-        from mlx_knife.cache_utils import list_models
-        
-        # Create empty cache directory
-        empty_cache = temp_cache_dir / "empty_hf_cache" 
-        empty_cache.mkdir()
-        
-        # Set HF_HOME to empty directory and test
-        original_hf_home = os.environ.get('HF_HOME')
-        try:
-            os.environ['HF_HOME'] = str(empty_cache)
-            # Should not crash and should print helpful message
-            list_models()
-        except FileNotFoundError as e:
-            pytest.fail(f"Issue #21 regression: list_models crashed with empty cache: {e}")
-        finally:
-            if original_hf_home is not None:
-                os.environ['HF_HOME'] = original_hf_home
-            elif 'HF_HOME' in os.environ:
-                del os.environ['HF_HOME']
-
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    def test_list_models_basic_call(self, mock_cache, temp_cache_dir):
-        """Test basic model listing call."""
-        mock_cache.__str__ = lambda: str(temp_cache_dir)
-        mock_cache.exists.return_value = True
-        mock_cache.glob.return_value = []
-        
-        try:
-            # Test various parameter combinations
-            list_models(show_all=True)
-            list_models(framework_filter="MLX")
-            list_models(show_health=True)
-        except Exception as e:
-            pytest.fail(f"Model listing with parameters failed: {e}")
-
-
-class TestModelRemoval:
-    """Test rm_model functionality (Issue #23)."""
-    
-    def setup_method(self):
-        """Setup mock cache structure for each test."""
-        self.test_model_name = "microsoft/DialoGPT-small"
-        self.test_hash = "49c537161a457d5256512f9d2d38a87d81ae0f0e"
-        self.test_hash_short = "49c53716"
-    
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    @patch('mlx_knife.cache_utils.resolve_single_model') 
-    @patch('mlx_knife.cache_utils.shutil.rmtree')
-    @patch('builtins.input', return_value='y')
-    def test_rm_model_fixed_behavior_issue23(self, mock_input, mock_rmtree, mock_resolve, mock_cache, temp_cache_dir):
-        """Test fixed rm behavior - should delete model AND locks (Issue #23 resolved).
-        
-        Setup mocked directory structure as documented in CLAUDE.md:
-        hub/
-        ├── .locks/models--<name>/      # Per-model lock files  
-        └── models--<name>/             # Model data directory
-            ├── blobs/                  # Deduplicated file storage
-            ├── refs/main               # Points to current commit hash
-            └── snapshots/<hash>/       # Specific version
-        """
-        from mlx_knife.cache_utils import rm_model
-        
-        # Create real temp directories that mirror HF cache structure
-        # After fix: MODEL_CACHE points to hub/, locks are at hub/.locks/
-        hub_dir = temp_cache_dir / "hub"
-        model_dir = hub_dir / "models--microsoft--DialoGPT-small"
-        snapshots_dir = model_dir / "snapshots"
-        hash_dir = snapshots_dir / self.test_hash_short
-        refs_dir = model_dir / "refs" 
-        blobs_dir = model_dir / "blobs"
-        locks_dir = hub_dir / ".locks" / "models--microsoft--DialoGPT-small"
-        
-        # Create the directory structure (but don't populate with real files)
-        hash_dir.mkdir(parents=True)
-        refs_dir.mkdir(parents=True) 
-        blobs_dir.mkdir(parents=True)
-        locks_dir.mkdir(parents=True)
-        
-        # Create refs/main file pointing to hash
-        (refs_dir / "main").write_text(self.test_hash_short)
-        
-        # Create some mock lock files
-        (locks_dir / "file1.lock").touch()
-        (locks_dir / "file2.lock").touch()
-        
-        # Mock resolve_single_model to return our temp structure
-        mock_resolve.return_value = (model_dir, self.test_model_name, self.test_hash_short)
-        
-        # Mock MODEL_CACHE to point to hub directory (after fix: locks are at MODEL_CACHE/.locks/)
-        import mlx_knife.cache_utils
-        mlx_knife.cache_utils.MODEL_CACHE = hub_dir
-        
-        # Verify our test structure exists
-        assert model_dir.exists()
-        assert hash_dir.exists() 
-        assert (refs_dir / "main").exists()
-        assert locks_dir.exists()
-        assert len(list(locks_dir.iterdir())) == 2
-        
-        # Test current rm behavior - this should show Issue #23
-        rm_model(f"{self.test_model_name}@{self.test_hash_short}")
-        
-        # Verify what was actually deleted
-        # Fixed behavior: should delete model directory AND locks directory
-        assert mock_rmtree.call_count == 2
-        
-        # Verify both calls: model directory and locks directory
-        calls = [call[0][0] for call in mock_rmtree.call_args_list]
-        model_call = next((call for call in calls if "models--microsoft--DialoGPT-small" in str(call) and ".locks" not in str(call)), None)
-        locks_call = next((call for call in calls if ".locks" in str(call)), None)
-        
-        assert model_call is not None, "Should delete model directory"
-        assert locks_call is not None, "Should delete locks directory"
-    
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    @patch('mlx_knife.cache_utils.resolve_single_model') 
-    @patch('mlx_knife.cache_utils.shutil.rmtree')
-    def test_rm_model_force_parameter(self, mock_rmtree, mock_resolve, mock_cache, temp_cache_dir):
-        """Test rm_model with force=True skips all confirmations."""
-        from mlx_knife.cache_utils import rm_model
-        
-        # Create same temp structure as previous test (updated for fix)
-        hub_dir = temp_cache_dir / "hub"
-        model_dir = hub_dir / "models--microsoft--DialoGPT-small"
-        snapshots_dir = model_dir / "snapshots"
-        hash_dir = snapshots_dir / self.test_hash_short
-        locks_dir = hub_dir / ".locks" / "models--microsoft--DialoGPT-small"
-        
-        # Create the directory structure
-        hash_dir.mkdir(parents=True)
-        locks_dir.mkdir(parents=True)
-        (locks_dir / "file1.lock").touch()
-        (locks_dir / "file2.lock").touch()
-        
-        # Mock resolve_single_model to return our temp structure
-        mock_resolve.return_value = (model_dir, self.test_model_name, self.test_hash_short)
-        
-        # Mock MODEL_CACHE to point to hub directory (after fix)
-        import mlx_knife.cache_utils
-        mlx_knife.cache_utils.MODEL_CACHE = hub_dir
-        
-        # Test with force=True - should NOT call input() at all
-        with patch('builtins.input') as mock_input:
-            rm_model(f"{self.test_model_name}@{self.test_hash_short}", force=True)
-            
-            # Verify input() was never called (no prompts with force=True)
-            mock_input.assert_not_called()
-        
-        # Verify both model and locks were deleted
-        assert mock_rmtree.call_count == 2
-        calls = [call[0][0] for call in mock_rmtree.call_args_list]
-        model_call = next((call for call in calls if "models--microsoft--DialoGPT-small" in str(call) and ".locks" not in str(call)), None)
-        locks_call = next((call for call in calls if ".locks" in str(call)), None)
-        
-        assert model_call is not None, "Should delete model directory with force=True"
-        assert locks_call is not None, "Should delete locks directory with force=True"
-    
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    @patch('mlx_knife.cache_utils.resolve_single_model') 
-    @patch('mlx_knife.cache_utils.shutil.rmtree')
-    def test_rm_model_force_vs_interactive(self, mock_rmtree, mock_resolve, mock_cache, temp_cache_dir):
-        """Test that force=True behaves differently than interactive mode."""
-        from mlx_knife.cache_utils import rm_model
-        
-        # Create temp structure (updated for fix)
-        hub_dir = temp_cache_dir / "hub"
-        model_dir = hub_dir / "models--test--model"
-        snapshots_dir = model_dir / "snapshots"
-        hash_dir = snapshots_dir / "abc12345"
-        locks_dir = hub_dir / ".locks" / "models--test--model"
-        
-        hash_dir.mkdir(parents=True)
-        locks_dir.mkdir(parents=True)
-        (locks_dir / "test.lock").touch()
-        
-        mock_resolve.return_value = (model_dir, "test/model", None)
-        # Mock MODEL_CACHE to point to hub directory (after fix)
-        import mlx_knife.cache_utils
-        mlx_knife.cache_utils.MODEL_CACHE = hub_dir
-        
-        # Test 1: Interactive mode - user says no
-        mock_rmtree.reset_mock()
-        with patch('builtins.input', return_value='n'):
-            rm_model("test/model", force=False)
-            # Should NOT delete anything when user says no
-            mock_rmtree.assert_not_called()
-        
-        # Test 2: Force mode - no prompts, just delete
-        mock_rmtree.reset_mock()
-        with patch('builtins.input') as mock_input:
-            rm_model("test/model", force=True)
-            # Should NOT prompt user
-            mock_input.assert_not_called()
-            # Should delete both model and locks
-            assert mock_rmtree.call_count == 2
-    
-    
-    @patch('mlx_knife.cache_utils.resolve_single_model')
-    def test_rm_model_not_found(self, mock_resolve):
-        """Test rm behavior when model is not found."""
-        from mlx_knife.cache_utils import rm_model
-        
-        # Setup resolve to return None (not found)
-        mock_resolve.return_value = (None, None, None)
-        
-        # Should return early without error
-        result = rm_model("nonexistent/model@hash")
-        assert result is None
-
-
-class TestPartialNameFiltering:
-    """Test partial name filtering for list command (Issue 1)."""
-    
-    def test_find_matching_models_function(self):
-        """Test the find_matching_models helper function."""
-        with patch('mlx_knife.cache_utils.MODEL_CACHE') as mock_cache:
-            # Mock some model directories
-            mock_models = [
-                MagicMock(name="models--mlx-community--Phi-3-mini"),
-                MagicMock(name="models--mlx-community--Phi-3-medium"), 
-                MagicMock(name="models--other--Llama-3-8B"),
-            ]
-            
-            for i, mock_model in enumerate(mock_models):
-                mock_model.name = f"models--{'mlx-community' if i < 2 else 'other'}--{'Phi-3-mini' if i == 0 else 'Phi-3-medium' if i == 1 else 'Llama-3-8B'}"
-            
-            mock_cache.iterdir.return_value = mock_models
-            
-            # Test finding Phi-3 models
-            matches = find_matching_models("Phi-3")
-            assert len(matches) == 2
-            
-            # Test finding non-existent model
-            matches = find_matching_models("nonexistent")
-            assert len(matches) == 0
-    
-    def test_partial_matching_basic_functionality(self):
-        """Test basic partial matching logic without complex mocking."""
-        # Simple functional test of the helper functions
-        try:
-            # These functions exist and can be called
-            assert callable(find_matching_models)
-            # Function handles empty input gracefully
-            matches = find_matching_models("")
-            assert isinstance(matches, list)
-        except Exception as e:
-            pytest.fail(f"Basic functionality test failed: {e}")
-
-
-class TestSingleModelFuzzyMatching:
-    """Test fuzzy matching for single-model commands (Issue 2)."""
-    
-    def test_resolve_single_model_function_exists(self):
-        """Test that resolve_single_model function exists and is callable."""
-        try:
-            assert callable(resolve_single_model)
-            # Function handles invalid input gracefully 
-            result = resolve_single_model("definitely-nonexistent-model-12345")
-            assert isinstance(result, tuple)
-            assert len(result) == 3
-        except Exception as e:
-            pytest.fail(f"Function existence test failed: {e}")
-    
-    @patch('mlx_knife.cache_utils.get_model_path') 
-    @patch('mlx_knife.cache_utils.find_matching_models')
-    def test_resolve_single_model_ambiguous_fuzzy(self, mock_find, mock_get_path, capsys):
-        """Test ambiguous fuzzy match shows error."""
-        # Mock exact match fails, fuzzy finds multiple matches
-        mock_get_path.return_value = (None, None, None)
-        mock_find.return_value = [
-            (MagicMock(), "model-1"),
-            (MagicMock(), "model-2")
-        ]
-        
-        result = resolve_single_model("partial")
-        assert result[0] is None  # Should fail
-        
-        # Check that error message was printed
-        captured = capsys.readouterr()
-        assert "Multiple models match" in captured.out
-        assert "model-1" in captured.out
-        assert "model-2" in captured.out
-    
-    @patch('mlx_knife.cache_utils.get_model_path')
-    @patch('mlx_knife.cache_utils.find_matching_models')
-    def test_resolve_single_model_no_match(self, mock_find, mock_get_path, capsys):
-        """Test no match shows appropriate error."""
-        # Mock both exact and fuzzy matching fail
-        mock_get_path.return_value = (None, None, None)
-        mock_find.return_value = []
-        
-        result = resolve_single_model("nonexistent")
-        assert result[0] is None  # Should fail
-        
-        # Check error message
-        captured = capsys.readouterr()
-        assert "No models found matching" in captured.out
-
-
-class TestShowModelHealthConsistency:
-    """Test for Issue #7 - Health check inconsistency in show command with fuzzy model names."""
-    
-    @patch('mlx_knife.cache_utils.resolve_single_model')
-    @patch('mlx_knife.cache_utils.is_model_healthy')
-    @patch('mlx_knife.cache_utils.get_model_size')
-    @patch('mlx_knife.cache_utils.get_model_modified')
-    @patch('mlx_knife.cache_utils.detect_framework')
-    @patch('builtins.print')
-    def test_show_model_health_consistency_fuzzy_vs_full_name(self, mock_print, mock_framework, 
-                                                              mock_modified, mock_size, mock_healthy, 
-                                                              mock_resolve, temp_cache_dir):
-        """Test that fuzzy and full model names show identical health status.
-        
-        This is a regression test for Issue #7 where:
-        - mlxk show Phi-3 showed "CORRUPTED"  
-        - mlxk show mlx-community/Phi-3-mini-4k-instruct-4bit showed "OK"
-        for the same underlying model.
-        """
-        # Setup mock model path
-        mock_model_path = temp_cache_dir / "models--mlx-community--Phi-3-mini-4k-instruct-4bit" / "snapshots" / "abc123"
-        mock_model_path.mkdir(parents=True)
-        
-        # Mock resolve_single_model to return consistent results
-        # Both fuzzy "Phi-3" and full name should resolve to same model_name
-        mock_resolve.return_value = (
-            mock_model_path,
-            "mlx-community/Phi-3-mini-4k-instruct-4bit",  # Resolved full name
-            "abc123"
-        )
-        
-        # Mock other dependencies
-        mock_size.return_value = "4.2GB"
-        mock_modified.return_value = "2023-12-01 10:00:00"
-        mock_framework.return_value = "MLX"
-        
-        # Test both healthy and unhealthy scenarios
-        for health_status in [True, False]:
-            mock_healthy.return_value = health_status
-            mock_print.reset_mock()
-            
-            # Test fuzzy name
-            from mlx_knife.cache_utils import show_model
-            show_model("Phi-3")  # Fuzzy name
-            fuzzy_calls = [str(call) for call in mock_print.call_args_list]
-            
-            mock_print.reset_mock()
-            
-            # Test full name  
-            show_model("mlx-community/Phi-3-mini-4k-instruct-4bit")  # Full name
-            full_calls = [str(call) for call in mock_print.call_args_list]
-            
-            # Both should have identical health output
-            fuzzy_health_output = [call for call in fuzzy_calls if "Health:" in call]
-            full_health_output = [call for call in full_calls if "Health:" in call]
-            
-            assert len(fuzzy_health_output) == 1, f"Expected 1 health output for fuzzy name, got {len(fuzzy_health_output)}"
-            assert len(full_health_output) == 1, f"Expected 1 health output for full name, got {len(full_health_output)}"
-            assert fuzzy_health_output == full_health_output, f"Health status differs: fuzzy={fuzzy_health_output} vs full={full_health_output}"
-            
-            # Verify is_model_healthy was called with resolved model name (not original spec)
-            expected_calls = [call("mlx-community/Phi-3-mini-4k-instruct-4bit")] * 2
-            assert mock_healthy.call_args_list == expected_calls, f"is_model_healthy should be called with resolved name, got {mock_healthy.call_args_list}"
-            
-            # Reset for next iteration
-            mock_healthy.reset_mock()
-
-
-
-class TestIssue6RepositoryNameValidation:
-    """Test for Issue #6 - Add repository name length validation for HuggingFace Hub."""
-    
-    @patch('builtins.input', return_value='y')  # Mock user input to avoid stdin issues
-    def test_pull_model_rejects_long_names(self, mock_input, capsys):
-        """Test that repository names >96 characters are rejected."""
-        from mlx_knife.hf_download import pull_model
-        
-        # Create a name that exceeds 96 characters after expansion
-        # Use direct long name that doesn't get expanded but is >96 chars
-        long_model_name = "organization-name/very-long-model-name-that-definitely-exceeds-the-character-limit-for-repositories-on-hf-platform"
-        
-        result = pull_model(long_model_name)
-        
-        assert result is False
-        
-        captured = capsys.readouterr()
-        assert "Repository name exceeds HuggingFace Hub limit" in captured.out
-        assert "96 characters" in captured.out
-        assert "cannot exist on HuggingFace Hub" in captured.out
-
-
-class TestIssue13HashBasedDisambiguation:
-    """Test for Issue #13 - Hash-based disambiguation for ambiguous model names."""
-    
-    def test_hash_exists_in_local_cache_full_hash(self):
-        """Test hash_exists_in_local_cache returns full hash when exact match exists."""
-        with patch('mlx_knife.cache_utils.MODEL_CACHE') as mock_cache:
-            mock_hash_dir = MagicMock()
-            mock_hash_dir.exists.return_value = True
-            
-            mock_snapshots_dir = MagicMock()
-            mock_snapshots_dir.exists.return_value = True
-            mock_snapshots_dir.__truediv__.return_value = mock_hash_dir
-            
-            mock_base_dir = MagicMock()
-            mock_base_dir.exists.return_value = True
-            mock_base_dir.__truediv__.return_value = mock_snapshots_dir
-            
-            mock_cache.__truediv__.return_value = mock_base_dir
-            
-            from mlx_knife.cache_utils import hash_exists_in_local_cache
-            
-            full_hash = "a5339a4131f135d0fdc6a5c8b5bbed2753bbe0f3"
-            result = hash_exists_in_local_cache("mlx-community/Phi-3-mini", full_hash)
-            assert result == full_hash
-    
-    def test_hash_exists_in_local_cache_none_no_model(self):
-        """Test hash_exists_in_local_cache returns None when model doesn't exist."""
-        with patch('mlx_knife.cache_utils.MODEL_CACHE') as mock_cache:
-            mock_base_dir = MagicMock()
-            mock_base_dir.exists.return_value = False
-            mock_cache.__truediv__.return_value = mock_base_dir
-            
-            from mlx_knife.cache_utils import hash_exists_in_local_cache
-            
-            result = hash_exists_in_local_cache("nonexistent/model", "hash123")
-            assert result is None
-    
-    def test_hash_exists_in_local_cache_none_no_hash(self):
-        """Test hash_exists_in_local_cache returns None when hash doesn't exist."""
-        with patch('mlx_knife.cache_utils.MODEL_CACHE') as mock_cache:
-            mock_hash_dir = MagicMock()
-            mock_hash_dir.exists.return_value = False
-            
-            mock_snapshots_dir = MagicMock()
-            mock_snapshots_dir.exists.return_value = True
-            mock_snapshots_dir.__truediv__.return_value = mock_hash_dir
-            mock_snapshots_dir.iterdir.return_value = []  # No snapshots
-            
-            mock_base_dir = MagicMock()
-            mock_base_dir.exists.return_value = True
-            mock_base_dir.__truediv__.return_value = mock_snapshots_dir
-            
-            mock_cache.__truediv__.return_value = mock_base_dir
-            
-            from mlx_knife.cache_utils import hash_exists_in_local_cache
-            
-            result = hash_exists_in_local_cache("mlx-community/Phi-3-mini", "nonexistent")
-            assert result is None
-    
-    def test_hash_exists_in_local_cache_short_hash_resolution(self):
-        """Test hash_exists_in_local_cache resolves short hashes locally."""
-        with patch('mlx_knife.cache_utils.MODEL_CACHE') as mock_cache:
-            # Mock exact match fails
-            mock_hash_dir = MagicMock()
-            mock_hash_dir.exists.return_value = False
-            
-            # Mock snapshots directory with matching hash
-            mock_snapshot = MagicMock()
-            mock_snapshot.is_dir.return_value = True
-            mock_snapshot.name = "de2dfaf56839b7d0e834157d2401dee02726874d"
-            
-            mock_snapshots_dir = MagicMock()
-            mock_snapshots_dir.exists.return_value = True
-            mock_snapshots_dir.__truediv__.return_value = mock_hash_dir
-            mock_snapshots_dir.iterdir.return_value = [mock_snapshot]
-            
-            mock_base_dir = MagicMock()
-            mock_base_dir.exists.return_value = True
-            mock_base_dir.__truediv__.return_value = mock_snapshots_dir
-            
-            mock_cache.__truediv__.return_value = mock_base_dir
-            
-            from mlx_knife.cache_utils import hash_exists_in_local_cache
-            
-            result = hash_exists_in_local_cache("mlx-community/Llama-3.3-70B", "de2dfaf5")
-            assert result == "de2dfaf56839b7d0e834157d2401dee02726874d"
-    
-    @patch('mlx_knife.cache_utils.get_model_path')
-    @patch('mlx_knife.cache_utils.hash_exists_in_local_cache')
-    @patch('mlx_knife.cache_utils.find_matching_models')
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    def test_resolve_single_model_hash_disambiguation_success(self, mock_cache, mock_find, mock_hash_exists, mock_get_path):
-        """Test successful hash-based disambiguation when multiple models match."""
-        # Mock find_matching_models to return multiple matches
-        mock_find.return_value = [
-            (MagicMock(), "mlx-community/Llama-3.2-1B-Instruct-4bit"),
-            (MagicMock(), "mlx-community/Llama-3.3-70B-Instruct-4bit"),
-        ]
-        
-        # Mock hash_exists_in_local_cache to return full hash for second model only
-        def mock_hash_exists_side_effect(model_name, commit_hash):
-            if model_name == "mlx-community/Llama-3.3-70B-Instruct-4bit":
-                return "de2dfaf56839b7d0e834157d2401dee02726874d"
-            return None
-        mock_hash_exists.side_effect = mock_hash_exists_side_effect
-        
-        # Mock get_model_path to return success
-        mock_get_path.return_value = (MagicMock(), "mlx-community/Llama-3.3-70B-Instruct-4bit", "de2dfaf5")
-        
-        # Mock MODEL_CACHE behavior for exact match check
-        mock_base_dir = MagicMock()
-        mock_base_dir.exists.return_value = False
-        mock_cache.__truediv__.return_value = mock_base_dir
-        
-        from mlx_knife.cache_utils import resolve_single_model
-        
-        result = resolve_single_model("Llama@de2dfaf5")
-        
-        # Should successfully resolve to the second model
-        assert result[1] == "mlx-community/Llama-3.3-70B-Instruct-4bit"
-        assert result[2] == "de2dfaf5"
-        
-        # Verify hash_exists_in_local_cache was called for both models
-        assert mock_hash_exists.call_count == 2
-        
-        # Verify get_model_path was called with the resolved spec (full hash)
-        mock_get_path.assert_called_once_with("mlx-community/Llama-3.3-70B-Instruct-4bit@de2dfaf56839b7d0e834157d2401dee02726874d")
-    
-    @patch('mlx_knife.cache_utils.hash_exists_in_local_cache')
-    @patch('mlx_knife.cache_utils.find_matching_models')
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    def test_resolve_single_model_hash_disambiguation_no_match(self, mock_cache, mock_find, mock_hash_exists, capsys):
-        """Test hash-based disambiguation when hash doesn't exist in any model."""
-        # Mock find_matching_models to return multiple matches
-        mock_find.return_value = [
-            (MagicMock(), "mlx-community/Llama-3.2-1B-Instruct-4bit"),
-            (MagicMock(), "mlx-community/Llama-3.3-70B-Instruct-4bit"),
-        ]
-        
-        # Mock hash_exists_in_local_cache to return None for all models
-        mock_hash_exists.return_value = None
-        
-        # Mock MODEL_CACHE behavior for exact match check
-        mock_base_dir = MagicMock()
-        mock_base_dir.exists.return_value = False
-        mock_cache.__truediv__.return_value = mock_base_dir
-        
-        from mlx_knife.cache_utils import resolve_single_model
-        
-        result = resolve_single_model("Llama@nonexistent")
-        
-        # Should return None tuple
-        assert result == (None, None, None)
-        
-        # Check error message was printed
-        captured = capsys.readouterr()
-        assert "Hash 'nonexistent' not found in any model matching 'Llama'" in captured.out
-        assert "Available models:" in captured.out
-    
-    @patch('mlx_knife.cache_utils.find_matching_models')
-    @patch('mlx_knife.cache_utils.MODEL_CACHE')
-    def test_resolve_single_model_no_hash_multiple_matches(self, mock_cache, mock_find, capsys):
-        """Test traditional ambiguous model behavior without hash is preserved."""
-        # Mock find_matching_models to return multiple matches
-        mock_find.return_value = [
-            (MagicMock(), "mlx-community/Llama-3.2-1B-Instruct-4bit"),
-            (MagicMock(), "mlx-community/Llama-3.3-70B-Instruct-4bit"),
-        ]
-        
-        # Mock MODEL_CACHE behavior for exact match check
-        mock_base_dir = MagicMock()
-        mock_base_dir.exists.return_value = False
-        mock_cache.__truediv__.return_value = mock_base_dir
-        
-        from mlx_knife.cache_utils import resolve_single_model
-        
-        result = resolve_single_model("Llama")  # No hash specified
-        
-        # Should return None tuple
-        assert result == (None, None, None)
-        
-        # Check traditional error message was printed
-        captured = capsys.readouterr()
-        assert "Multiple models match 'Llama'. Please be more specific:" in captured.out
-
-
-# Add pytest fixture at module level
-@pytest.fixture
-def temp_cache_dir():
-    """Create temporary cache directory for testing."""
-    with tempfile.TemporaryDirectory() as temp_dir:
-        yield Path(temp_dir)
\ No newline at end of file
diff --git a/tests/unit/test_cli.py b/tests/unit/test_cli.py
deleted file mode 100644
index 9082064..0000000
--- a/tests/unit/test_cli.py
+++ /dev/null
@@ -1,205 +0,0 @@
-"""
-Unit tests for cli.py module.
-
-Tests the command-line interface functionality:
-- Argument parsing
-- Command dispatch
-- Help and version output
-"""
-import pytest
-import argparse
-from unittest.mock import patch, MagicMock
-import sys
-import os
-
-# Import the module under test
-from mlx_knife.cli import main
-
-
-class TestMainFunctionBasic:
-    """Test basic main function behavior without requiring parser creation."""
-    
-    def test_main_function_exists(self):
-        """Test that main function exists and is callable."""
-        try:
-            assert callable(main)
-        except Exception as e:
-            pytest.fail(f"Main function test failed: {e}")
-
-    def test_version_flag_via_main(self):
-        """Test version flag through main function."""
-        try:
-            with patch('sys.argv', ['mlxk', '--version']):
-                with pytest.raises(SystemExit) as exc_info:
-                    main()
-                # Version should exit cleanly
-                assert exc_info.value.code in [0, None]
-        except Exception as e:
-            # It's OK if version parsing isn't fully implemented yet
-            pass
-
-
-class TestMainFunction:
-    """Test main function behavior."""
-    
-    def test_main_with_help(self):
-        """Test main function with help argument."""
-        try:
-            with patch('sys.argv', ['mlxk', '--help']):
-                with pytest.raises(SystemExit) as exc_info:
-                    main()
-                # Help should exit with code 0
-                assert exc_info.value.code == 0 or exc_info.value.code is None
-        except Exception as e:
-            pytest.fail(f"Main function help test failed: {e}")
-
-    def test_main_with_invalid_command(self):
-        """Test main function with invalid command."""
-        try:
-            with patch('sys.argv', ['mlxk', 'invalid-command-xyz']):
-                with pytest.raises(SystemExit) as exc_info:
-                    main()
-                # Invalid command should exit with non-zero code
-                assert exc_info.value.code != 0
-        except Exception as e:
-            pytest.fail(f"Main function invalid command test failed: {e}")
-
-    @patch('mlx_knife.cache_utils.list_models')
-    def test_main_with_list_command(self, mock_list_models):
-        """Test main function with list command."""
-        try:
-            # Mock the list_models function to avoid actual cache interaction
-            mock_list_models.return_value = None
-            
-            with patch('sys.argv', ['mlxk', 'list']):
-                try:
-                    main()
-                except SystemExit as e:
-                    # List command might exit with 0 on success
-                    assert e.code == 0 or e.code is None
-        except Exception as e:
-            pytest.fail(f"Main function list command test failed: {e}")
-
-    @patch('mlx_knife.cache_utils.check_all_models_health')
-    def test_main_with_health_command(self, mock_health_check):
-        """Test main function with health command."""
-        try:
-            # Mock the health check function
-            mock_health_check.return_value = None
-            
-            with patch('sys.argv', ['mlxk', 'health']):
-                try:
-                    main()
-                except SystemExit as e:
-                    # Health command should exit gracefully
-                    assert e.code == 0 or e.code is None
-        except Exception as e:
-            pytest.fail(f"Main function health command test failed: {e}")
-
-    def test_main_no_arguments(self):
-        """Test main function with no arguments."""
-        try:
-            with patch('sys.argv', ['mlxk']):
-                # The CLI shows help when no args are provided - this is valid behavior
-                main()  # Should complete successfully showing help
-        except SystemExit as e:
-            # Also valid - some CLIs exit after showing help
-            pass
-        except Exception as e:
-            pytest.fail(f"Main function no arguments test failed: {e}")
-
-
-class TestErrorHandling:
-    """Test CLI error handling."""
-    
-    def test_keyboard_interrupt_handling(self):
-        """Test handling of KeyboardInterrupt (Ctrl+C)."""
-        try:
-            # Test that KeyboardInterrupt doesn't crash the CLI completely
-            with patch('sys.argv', ['mlxk', 'list']):
-                with patch('builtins.print', side_effect=KeyboardInterrupt()):
-                    try:
-                        main()
-                    except KeyboardInterrupt:
-                        # KeyboardInterrupt propagating up is acceptable
-                        pass
-                    except SystemExit:
-                        # Graceful exit is also acceptable
-                        pass
-        except Exception as e:
-            pytest.fail(f"Keyboard interrupt handling test failed: {e}")
-
-    def test_basic_command_robustness(self):
-        """Test that basic commands don't crash unexpectedly."""
-        try:
-            # Test that list command runs successfully (already working based on earlier test)
-            with patch('sys.argv', ['mlxk', 'list']):
-                main()  # Should work fine
-        except SystemExit:
-            # Exit is acceptable for some CLI implementations
-            pass
-        except Exception as e:
-            pytest.fail(f"Basic command robustness test failed: {e}")
-
-
-class TestHealthCommandDefaultBehavior:
-    """Test health command default behavior (Issue 3)."""
-    
-    @patch('mlx_knife.cli.check_all_models_health')
-    def test_health_command_without_args_calls_all(self, mock_check_all):
-        """Test that 'mlxk health' (no args) calls check_all_models_health."""
-        mock_check_all.return_value = True
-        
-        try:
-            with patch('sys.argv', ['mlxk', 'health']):
-                main()
-            
-            # Should have called check_all_models_health
-            assert mock_check_all.called
-            mock_check_all.assert_called_once()
-        except SystemExit:
-            # Exit is acceptable after running the command
-            assert mock_check_all.called
-        except Exception as e:
-            pytest.fail(f"Health command default behavior test failed: {e}")
-    
-    @patch('mlx_knife.cli.check_model_health')
-    @patch('mlx_knife.cli.check_all_models_health')
-    def test_health_command_with_specific_model(self, mock_check_all, mock_check_specific):
-        """Test that 'mlxk health model-name' calls check_model_health."""
-        mock_check_specific.return_value = True
-        
-        try:
-            with patch('sys.argv', ['mlxk', 'health', 'some-model']):
-                main()
-            
-            # Should have called check_model_health with the specific model
-            assert mock_check_specific.called
-            mock_check_specific.assert_called_once_with('some-model')
-            
-            # Should NOT have called check_all_models_health
-            assert not mock_check_all.called
-        except SystemExit:
-            # Exit is acceptable after running the command
-            assert mock_check_specific.called
-            assert not mock_check_all.called
-        except Exception as e:
-            pytest.fail(f"Health command specific model test failed: {e}")
-    
-    @patch('mlx_knife.cli.check_all_models_health')
-    def test_health_command_backward_compatibility_with_all_flag(self, mock_check_all):
-        """Test that 'mlxk health --all' still works for backward compatibility."""
-        mock_check_all.return_value = True
-        
-        try:
-            with patch('sys.argv', ['mlxk', 'health', '--all']):
-                main()
-            
-            # Should have called check_all_models_health  
-            assert mock_check_all.called
-            mock_check_all.assert_called_once()
-        except SystemExit:
-            # Exit is acceptable after running the command
-            assert mock_check_all.called
-        except Exception as e:
-            pytest.fail(f"Health command --all flag test failed: {e}")
\ No newline at end of file
diff --git a/tests/unit/test_health_multishard.py b/tests/unit/test_health_multishard.py
deleted file mode 100644
index 1fe46e8..0000000
--- a/tests/unit/test_health_multishard.py
+++ /dev/null
@@ -1,128 +0,0 @@
-"""
-Strict health checks for multi-shard/index models (Issue #27 backport).
-"""
-import json
-from pathlib import Path
-
-from mlx_knife.cache_utils import is_model_healthy
-
-
-def _write_json(p: Path, data: dict) -> None:
-    p.write_text(json.dumps(data))
-
-
-def _mk_snapshot(tmp_dir: Path, name: str = "models--org--model", snap: str = "main") -> Path:
-    d = tmp_dir / "hub" / name / "snapshots" / snap
-    d.mkdir(parents=True, exist_ok=True)
-    return d
-
-
-def test_index_complete_healthy(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    # Valid config
-    _write_json(d / "config.json", {"model_type": "test"})
-    # Two shards + index
-    (d / "model-00001-of-00002.safetensors").write_bytes(b"a" * 1024)
-    (d / "model-00002-of-00002.safetensors").write_bytes(b"b" * 1024)
-    _write_json(
-        d / "model.safetensors.index.json",
-        {"weight_map": {"w1": "model-00001-of-00002.safetensors", "w2": "model-00002-of-00002.safetensors"}},
-    )
-    assert is_model_healthy(str(d)) is True
-
-
-def test_index_missing_shard_unhealthy(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    # Only one shard present
-    (d / "model-00001-of-00002.safetensors").write_bytes(b"a" * 1024)
-    _write_json(
-        d / "model.safetensors.index.json",
-        {"weight_map": {"w1": "model-00001-of-00002.safetensors", "w2": "model-00002-of-00002.safetensors"}},
-    )
-    assert is_model_healthy(str(d)) is False
-
-
-def test_index_empty_shard_unhealthy(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    (d / "model-00001-of-00002.safetensors").write_bytes(b"")  # empty
-    (d / "model-00002-of-00002.safetensors").write_bytes(b"b" * 1024)
-    _write_json(
-        d / "model.safetensors.index.json",
-        {"weight_map": {"w1": "model-00001-of-00002.safetensors", "w2": "model-00002-of-00002.safetensors"}},
-    )
-    assert is_model_healthy(str(d)) is False
-
-
-def test_index_lfs_pointer_unhealthy(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    # One shard becomes an LFS pointer (small text with signature)
-    (d / "model-00001-of-00002.safetensors").write_text(
-        "version https://git-lfs.github.com/spec/v1\n"
-        "oid sha256:deadbeef\n"
-        "size 100\n"
-    )
-    (d / "model-00002-of-00002.safetensors").write_bytes(b"b" * 1024)
-    _write_json(
-        d / "model.safetensors.index.json",
-        {"weight_map": {"w1": "model-00001-of-00002.safetensors", "w2": "model-00002-of-00002.safetensors"}},
-    )
-    assert is_model_healthy(str(d)) is False
-
-
-def test_pattern_complete_without_index_unhealthy(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    # Two shards with pattern but no index
-    (d / "model-00001-of-00002.safetensors").write_bytes(b"a" * 1024)
-    (d / "model-00002-of-00002.safetensors").write_bytes(b"b" * 1024)
-    assert is_model_healthy(str(d)) is False
-
-
-def test_pattern_incomplete_unhealthy(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    # Only one pattern shard present, no index
-    (d / "model-00001-of-00002.safetensors").write_bytes(b"a" * 1024)
-    assert is_model_healthy(str(d)) is False
-
-
-def test_partial_marker_unhealthy(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    (d / "model.safetensors.partial").write_bytes(b"x")
-    assert is_model_healthy(str(d)) is False
-
-
-def test_single_file_safetensors_ok(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    (d / "model.safetensors").write_bytes(b"weights" * 256)
-    assert is_model_healthy(str(d)) is True
-
-
-def test_single_file_gguf_ok(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    (d / "model.q4_0.gguf").write_bytes(b"gguf-weights" * 256)
-    assert is_model_healthy(str(d)) is True
-
-
-def test_pytorch_index_complete_ok(temp_cache_dir: Path):
-    d = _mk_snapshot(temp_cache_dir)
-    _write_json(d / "config.json", {"model_type": "test"})
-    (d / "pytorch_model-00001-of-00002.bin").write_bytes(b"a" * 1024)
-    (d / "pytorch_model-00002-of-00002.bin").write_bytes(b"b" * 1024)
-    _write_json(
-        d / "pytorch_model.bin.index.json",
-        {
-            "weight_map": {
-                "w1": "pytorch_model-00001-of-00002.bin",
-                "w2": "pytorch_model-00002-of-00002.bin",
-            }
-        },
-    )
-    assert is_model_healthy(str(d)) is True
-
diff --git a/tests/unit/test_mlx_runner_memory.py b/tests/unit/test_mlx_runner_memory.py
deleted file mode 100644
index 3dcdc88..0000000
--- a/tests/unit/test_mlx_runner_memory.py
+++ /dev/null
@@ -1,551 +0,0 @@
-"""
-Unit tests for MLXRunner memory management robustness and context length handling.
-
-Tests context manager implementation, exception handling, cleanup guarantees,
-and model context length extraction without requiring actual MLX models.
-"""
-import json
-import os
-import tempfile
-import unittest
-from unittest.mock import MagicMock, patch, PropertyMock
-import gc
-
-
-class TestMLXRunnerMemoryManagement(unittest.TestCase):
-    """Test MLXRunner memory management robustness."""
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_context_manager_basic_flow(self, mock_load, mock_mx):
-        """Test basic context manager flow with successful execution."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup mocks
-        mock_model = MagicMock()
-        mock_tokenizer = MagicMock()
-        mock_tokenizer.eos_token = '</s>'
-        mock_tokenizer.eos_token_id = 2
-        mock_load.return_value = (mock_model, mock_tokenizer)
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024  # 1GB
-        
-        # Test successful context manager usage
-        with MLXRunner("test_model", verbose=False) as runner:
-            self.assertIsNotNone(runner.model)
-            self.assertIsNotNone(runner.tokenizer)
-            self.assertTrue(runner._model_loaded)
-            self.assertTrue(runner._context_entered)
-        
-        # After exiting context, model should be cleaned up
-        self.assertIsNone(runner.model)
-        self.assertIsNone(runner.tokenizer)
-        self.assertFalse(runner._model_loaded)
-        self.assertFalse(runner._context_entered)
-        
-        # Verify cleanup was called
-        mock_mx.clear_cache.assert_called()
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_context_manager_exception_in_load(self, mock_load, mock_mx):
-        """Test cleanup when exception occurs during model loading."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup mock to fail during load
-        mock_load.side_effect = RuntimeError("Model loading failed")
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024
-        
-        # Test that exception is propagated and cleanup happens
-        with self.assertRaises(RuntimeError) as cm:
-            with MLXRunner("test_model", verbose=False) as runner:
-                pass  # Should never reach here
-        
-        self.assertIn("Failed to load model", str(cm.exception))
-        
-        # Verify cleanup was called even on failure
-        mock_mx.clear_cache.assert_called()
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_context_manager_exception_in_body(self, mock_load, mock_mx):
-        """Test cleanup when exception occurs in context body."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup successful mocks
-        mock_model = MagicMock()
-        mock_tokenizer = MagicMock()
-        mock_tokenizer.eos_token = '</s>'
-        mock_tokenizer.eos_token_id = 2
-        mock_load.return_value = (mock_model, mock_tokenizer)
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024
-        
-        # Test exception in context body
-        with self.assertRaises(ValueError):
-            with MLXRunner("test_model", verbose=False) as runner:
-                self.assertTrue(runner._model_loaded)
-                raise ValueError("User error")
-        
-        # Cleanup should still happen
-        self.assertIsNone(runner.model)
-        self.assertIsNone(runner.tokenizer)
-        self.assertFalse(runner._model_loaded)
-        mock_mx.clear_cache.assert_called()
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_prevent_nested_context_usage(self, mock_load, mock_mx):
-        """Test that nested context manager usage is prevented."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup mocks
-        mock_model = MagicMock()
-        mock_tokenizer = MagicMock()
-        mock_tokenizer.eos_token = '</s>'
-        mock_tokenizer.eos_token_id = 2
-        mock_load.return_value = (mock_model, mock_tokenizer)
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024
-        
-        runner = MLXRunner("test_model", verbose=False)
-        
-        # First context should work
-        with runner:
-            self.assertTrue(runner._context_entered)
-            
-            # Nested context should fail
-            with self.assertRaises(RuntimeError) as cm:
-                with runner:
-                    pass
-            
-            self.assertIn("cannot be entered multiple times", str(cm.exception))
-        
-        # After exiting, should be able to use again
-        self.assertFalse(runner._context_entered)
-        
-        # Second usage should work
-        with runner:
-            self.assertTrue(runner._context_entered)
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_partial_loading_failure_cleanup(self, mock_load, mock_mx):
-        """Test cleanup when loading partially succeeds then fails."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup mock to partially succeed
-        mock_model = MagicMock()
-        mock_tokenizer = MagicMock()
-        
-        # Missing required attributes to trigger failure in _extract_stop_tokens
-        del mock_tokenizer.eos_token
-        del mock_tokenizer.eos_token_id
-        mock_tokenizer.encode.side_effect = Exception("Tokenizer error")
-        
-        mock_load.return_value = (mock_model, mock_tokenizer)
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024
-        
-        runner = MLXRunner("test_model", verbose=False)
-        
-        # Load should succeed even with tokenizer issues
-        try:
-            runner.load_model()
-            # Model should be loaded even if stop token extraction had issues
-            self.assertIsNotNone(runner.model)
-            self.assertIsNotNone(runner.tokenizer)
-        finally:
-            # Cleanup should work regardless
-            runner.cleanup()
-            self.assertIsNone(runner.model)
-            self.assertIsNone(runner.tokenizer)
-            mock_mx.clear_cache.assert_called()
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    def test_cleanup_idempotency(self, mock_mx):
-        """Test that cleanup can be called multiple times safely."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner.model = MagicMock()
-        runner.tokenizer = MagicMock()
-        runner._model_loaded = True
-        
-        # Call cleanup multiple times
-        for _ in range(3):
-            runner.cleanup()
-            self.assertIsNone(runner.model)
-            self.assertIsNone(runner.tokenizer)
-            self.assertFalse(runner._model_loaded)
-        
-        # Should have been called at least once
-        mock_mx.clear_cache.assert_called()
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_memory_baseline_tracking(self, mock_load, mock_mx):
-        """Test memory baseline is properly tracked."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup mocks
-        mock_model = MagicMock()
-        mock_tokenizer = MagicMock()
-        mock_tokenizer.eos_token = '</s>'
-        mock_tokenizer.eos_token_id = 2
-        mock_load.return_value = (mock_model, mock_tokenizer)
-        
-        # Simulate memory growth during loading
-        memory_values = [
-            1 * 1024**3,  # 1GB baseline
-            5 * 1024**3,  # 5GB after loading
-            5 * 1024**3,  # 5GB when querying stats
-        ]
-        mock_mx.get_active_memory.side_effect = memory_values
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner.load_model()
-        
-        # Check baseline was captured
-        self.assertEqual(runner._memory_baseline, 1.0)  # 1GB
-        
-        # Check memory usage calculation
-        memory_stats = runner.get_memory_usage()
-        self.assertEqual(memory_stats["model_gb"], 4.0)  # 5GB - 1GB = 4GB
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_generate_without_loading(self, mock_load, mock_mx):
-        """Test that generate methods fail gracefully without loaded model."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        runner = MLXRunner("test_model", verbose=False)
-        
-        # Try to generate without loading
-        with self.assertRaises(RuntimeError) as cm:
-            list(runner.generate_streaming("test prompt"))
-        self.assertIn("Model not loaded", str(cm.exception))
-        
-        with self.assertRaises(RuntimeError) as cm:
-            runner.generate_batch("test prompt")
-        self.assertIn("Model not loaded", str(cm.exception))
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_server_usage_without_context_manager(self, mock_load, mock_mx):
-        """Test server-style usage without context manager."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup mocks
-        mock_model = MagicMock()
-        mock_tokenizer = MagicMock()
-        mock_tokenizer.eos_token = '</s>'
-        mock_tokenizer.eos_token_id = 2
-        mock_load.return_value = (mock_model, mock_tokenizer)
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024
-        
-        # Server style: manual load and cleanup
-        runner = MLXRunner("test_model", verbose=False)
-        
-        try:
-            runner.load_model()
-            self.assertTrue(runner._model_loaded)
-            self.assertIsNotNone(runner.model)
-            
-            # Simulate server keeping model loaded
-            # and potentially switching models
-            runner.cleanup()
-            self.assertFalse(runner._model_loaded)
-            self.assertIsNone(runner.model)
-            
-            # Load again (simulating model switch)
-            runner.load_model()
-            self.assertTrue(runner._model_loaded)
-            
-        finally:
-            # Ensure cleanup happens
-            runner.cleanup()
-            self.assertFalse(runner._model_loaded)
-    
-    @patch('mlx_knife.mlx_runner.mx')
-    @patch('mlx_knife.mlx_runner.load')
-    def test_exception_during_cleanup(self, mock_load, mock_mx):
-        """Test that cleanup handles exceptions gracefully."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        # Setup mocks
-        mock_model = MagicMock()
-        mock_tokenizer = MagicMock()
-        mock_tokenizer.eos_token = '</s>'
-        mock_tokenizer.eos_token_id = 2
-        mock_load.return_value = (mock_model, mock_tokenizer)
-        mock_mx.get_active_memory.return_value = 1024 * 1024 * 1024
-        
-        # Make clear_cache raise an exception
-        mock_mx.clear_cache.side_effect = Exception("Cache clear failed")
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner.load_model()
-        
-        # Cleanup should complete even if mx.clear_cache fails
-        runner.cleanup()  # Should not raise
-        
-        # State should still be cleaned
-        self.assertIsNone(runner.model)
-        self.assertIsNone(runner.tokenizer)
-        self.assertFalse(runner._model_loaded)
-
-
-class TestModelContextLength(unittest.TestCase):
-    """Test model context length extraction functionality."""
-    
-    def test_get_model_context_length_with_max_position_embeddings(self):
-        """Test context length extraction from max_position_embeddings."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {
-                "max_position_embeddings": 4096,
-                "hidden_size": 768,
-                "num_attention_heads": 12
-            }
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 4096)
-    
-    def test_get_model_context_length_with_n_positions(self):
-        """Test context length extraction from n_positions (GPT-style)."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {
-                "n_positions": 2048,
-                "n_embd": 512,
-                "n_head": 8
-            }
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 2048)
-    
-    def test_get_model_context_length_with_context_length(self):
-        """Test context length extraction from context_length field."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {
-                "context_length": 8192,
-                "hidden_size": 1024
-            }
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 8192)
-    
-    def test_get_model_context_length_with_max_sequence_length(self):
-        """Test context length extraction from max_sequence_length."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {
-                "max_sequence_length": 32768,
-                "d_model": 2048
-            }
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 32768)
-    
-    def test_get_model_context_length_with_seq_len(self):
-        """Test context length extraction from seq_len field."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {
-                "seq_len": 16384,
-                "embedding_size": 1536
-            }
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 16384)
-    
-    def test_get_model_context_length_priority_order(self):
-        """Test that max_position_embeddings takes priority over other fields."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {
-                "max_position_embeddings": 4096,  # Should be used (first in priority)
-                "n_positions": 2048,
-                "context_length": 8192,
-                "max_sequence_length": 16384,
-                "seq_len": 1024
-            }
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 4096)
-    
-    def test_get_model_context_length_missing_config_file(self):
-        """Test default context length when config.json is missing."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            # No config.json file created
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 4096)  # Default fallback
-    
-    def test_get_model_context_length_invalid_json(self):
-        """Test default context length when config.json is malformed."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            
-            # Write invalid JSON
-            with open(config_path, 'w') as f:
-                f.write("{ invalid json content")
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 4096)  # Default fallback
-    
-    def test_get_model_context_length_empty_config(self):
-        """Test default context length when config.json has no context fields."""
-        from mlx_knife.mlx_runner import get_model_context_length
-        
-        with tempfile.TemporaryDirectory() as temp_dir:
-            config_path = os.path.join(temp_dir, "config.json")
-            config = {
-                "hidden_size": 768,
-                "num_attention_heads": 12,
-                "model_type": "test_model"
-            }
-            
-            with open(config_path, 'w') as f:
-                json.dump(config, f)
-            
-            context_length = get_model_context_length(temp_dir)
-            self.assertEqual(context_length, 4096)  # Default fallback
-
-
-class TestMLXRunnerContextAwareLimits(unittest.TestCase):
-    """Test MLXRunner context-aware token limits."""
-    
-    @patch('mlx_knife.mlx_runner.get_model_context_length')
-    def test_get_effective_max_tokens_interactive_mode(self, mock_get_context):
-        """Test effective max tokens in interactive mode (uses full context)."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        mock_get_context.return_value = 4096
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner._context_length = 4096
-        
-        # Interactive mode: should use full context length
-        effective = runner.get_effective_max_tokens(8000, interactive=True)
-        self.assertEqual(effective, 4096)  # Limited by model context
-        
-        effective = runner.get_effective_max_tokens(2000, interactive=True)  
-        self.assertEqual(effective, 2000)  # User request is smaller
-    
-    @patch('mlx_knife.mlx_runner.get_model_context_length')
-    def test_get_effective_max_tokens_server_mode(self, mock_get_context):
-        """Test effective max tokens in server mode (uses half context for DoS protection)."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        mock_get_context.return_value = 4096
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner._context_length = 4096
-        
-        # Server mode: should use half context length
-        effective = runner.get_effective_max_tokens(8000, interactive=False)
-        self.assertEqual(effective, 2048)  # Limited by server limit (4096 / 2)
-        
-        effective = runner.get_effective_max_tokens(1000, interactive=False)
-        self.assertEqual(effective, 1000)  # User request is smaller
-    
-    @patch('mlx_knife.mlx_runner.get_model_context_length')
-    def test_get_effective_max_tokens_no_context_length(self, mock_get_context):
-        """Test effective max tokens when context length is unknown."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner._context_length = None  # Context length unknown
-        
-        # Should fallback to requested tokens
-        effective = runner.get_effective_max_tokens(1500, interactive=True)
-        self.assertEqual(effective, 1500)
-        
-        effective = runner.get_effective_max_tokens(2500, interactive=False)
-        self.assertEqual(effective, 2500)
-    
-    @patch('mlx_knife.mlx_runner.get_model_context_length')
-    def test_get_effective_max_tokens_none_interactive_mode(self, mock_get_context):
-        """Test that None (no --max-tokens) uses full context in interactive mode."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        mock_get_context.return_value = 4096
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner._context_length = 4096
-        
-        # None (user didn't specify --max-tokens) should use full context
-        effective = runner.get_effective_max_tokens(None, interactive=True)
-        self.assertEqual(effective, 4096)
-        
-        # Explicit values should still be respected
-        effective = runner.get_effective_max_tokens(500, interactive=True)
-        self.assertEqual(effective, 500)  # Now 500 is treated as explicit user choice
-    
-    @patch('mlx_knife.mlx_runner.get_model_context_length')
-    def test_get_effective_max_tokens_none_server_mode(self, mock_get_context):
-        """Test that None uses server default in server mode."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        mock_get_context.return_value = 4096
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner._context_length = 4096
-        
-        # None in server mode should use server limit (context / 2)
-        effective = runner.get_effective_max_tokens(None, interactive=False)
-        self.assertEqual(effective, 2048)  # 4096 / 2
-    
-    @patch('mlx_knife.mlx_runner.get_model_context_length')
-    def test_get_effective_max_tokens_none_unknown_context(self, mock_get_context):
-        """Test None behavior when context length is unknown."""
-        from mlx_knife.mlx_runner import MLXRunner
-        
-        runner = MLXRunner("test_model", verbose=False)
-        runner._context_length = None
-        
-        # Interactive mode: should use 4096 fallback when None
-        effective = runner.get_effective_max_tokens(None, interactive=True)
-        self.assertEqual(effective, 4096)
-        
-        # Server mode: should use 2048 fallback when None
-        effective = runner.get_effective_max_tokens(None, interactive=False)
-        self.assertEqual(effective, 2048)
-
-
-if __name__ == '__main__':
-    unittest.main()
\ No newline at end of file
diff --git a/tests/unit/test_model_card_detection.py b/tests/unit/test_model_card_detection.py
deleted file mode 100644
index 8c801b1..0000000
--- a/tests/unit/test_model_card_detection.py
+++ /dev/null
@@ -1,150 +0,0 @@
-import json
-from pathlib import Path
-
-import pytest
-
-from mlx_knife.cache_utils import detect_framework, detect_model_type, run_model
-
-
-def _make_base(temp_cache_dir: Path, org: str, name: str) -> Path:
-    base = temp_cache_dir / "hub" / f"models--{org}--{name}" / "snapshots" / "main"
-    base.mkdir(parents=True, exist_ok=True)
-    return base
-
-
-def test_readme_only_mlx_chat_detection(temp_cache_dir: Path):
-    base = _make_base(temp_cache_dir, "private", "my-mlx-chat")
-    # Minimal config to look like a model snapshot
-    (base / "config.json").write_text(json.dumps({"model_type": "llama"}))
-    # README with front matter
-    readme = """---
-tags: [mlx, chat]
-pipeline_tag: text-generation
-library_name: mlx
----
-
-# Model Card
-"""
-    (base / "README.md").write_text(readme)
-
-    framework = detect_framework(base.parent.parent, "private/my-mlx-chat")
-    model_type = detect_model_type(base.parent.parent, "private/my-mlx-chat")
-    assert framework == "MLX"
-    assert model_type == "chat"
-
-
-def test_tokenizer_only_chat_type(temp_cache_dir: Path):
-    base = _make_base(temp_cache_dir, "someone", "no-readme")
-    (base / "config.json").write_text(json.dumps({"model_type": "llama"}))
-    (base / "tokenizer_config.json").write_text(json.dumps({"chat_template": "{{ bos_token }} {{ messages }}"}))
-
-    framework = detect_framework(base.parent.parent, "someone/no-readme")
-    model_type = detect_model_type(base.parent.parent, "someone/no-readme")
-    # Framework via fallback; likely Tokenizer/PyTorch/Unknown depending on size and files
-    assert model_type == "chat"
-    assert framework in {"Tokenizer", "PyTorch", "Unknown", "GGUF", "MLX"}
-
-
-def test_no_hints_fallbacks(temp_cache_dir: Path):
-    base = _make_base(temp_cache_dir, "other", "plain-model")
-    (base / "config.json").write_text(json.dumps({"model_type": "bert"}))
-    (base / "pytorch_model.bin").write_bytes(b"weights")
-
-    framework = detect_framework(base.parent.parent, "other/plain-model")
-    model_type = detect_model_type(base.parent.parent, "other/plain-model")
-    assert framework in {"PyTorch", "Tokenizer", "Unknown"}
-    assert model_type == "base"
-
-
-def test_run_model_accepts_mlx_via_readme(monkeypatch, temp_cache_dir: Path):
-    base = _make_base(temp_cache_dir, "org", "mlxish")
-    (base / "config.json").write_text(json.dumps({"model_type": "llama"}))
-    (base / "README.md").write_text("""---
-tags: [mlx, chat]
-pipeline_tag: text-generation
----
-""")
-
-    # Patch resolve_single_model to return our base
-    from mlx_knife import cache_utils as cu
-
-    def fake_resolve(spec):
-        return base, "org/mlxish", "main"
-
-    called = {"ok": False}
-
-    def fake_run_enhanced(**kwargs):
-        called["ok"] = True
-
-    monkeypatch.setattr(cu, "resolve_single_model", fake_resolve)
-    import mlx_knife.mlx_runner as mr
-    monkeypatch.setattr(mr, "run_model_enhanced", fake_run_enhanced, raising=False)
-
-    # Should not raise or exit; should call enhanced runner
-    run_model("org/mlxish", prompt="hi", interactive=False)
-    assert called["ok"] is True
-
-
-def _create_model_with_readme(temp_cache_dir: Path, org: str, name: str, readme_front_matter: str) -> Path:
-    base = temp_cache_dir / "hub" / f"models--{org}--{name}" / "snapshots" / "main"
-    base.mkdir(parents=True, exist_ok=True)
-    (base / "config.json").write_text(json.dumps({"model_type": "llama"}))
-    (base / "model.safetensors").write_bytes(b"weights" * 100)
-    (base / "README.md").write_text(readme_front_matter)
-    return base
-
-
-def test_list_filters_non_chat_by_default(temp_cache_dir: Path, patch_model_cache, capsys):
-    # Create a chat-capable MLX model via README front matter
-    chat_front_matter = """---
-tags: [mlx, chat]
-pipeline_tag: text-generation
-library_name: mlx
----
-"""
-    _create_model_with_readme(temp_cache_dir, "org", "chat-model", chat_front_matter)
-
-    # Create a non-chat MLX model (embedding) via README front matter
-    embed_front_matter = """---
-tags: [mlx, embedding]
-pipeline_tag: sentence-similarity
-library_name: mlx
----
-"""
-    _create_model_with_readme(temp_cache_dir, "org", "embed-model", embed_front_matter)
-
-    from mlx_knife.cache_utils import list_models
-    with patch_model_cache(temp_cache_dir / "hub"):
-        list_models()  # default strict view
-    out = capsys.readouterr().out
-    assert "org/chat-model" in out
-    assert "org/embed-model" not in out  # non-chat should be hidden in strict view
-
-
-def test_list_all_includes_non_chat_with_type_column(temp_cache_dir: Path, patch_model_cache, capsys):
-    # Reuse the same setup as previous test
-    chat_front_matter = """---
-tags: [mlx, chat]
-pipeline_tag: text-generation
-library_name: mlx
----
-"""
-    _create_model_with_readme(temp_cache_dir, "org2", "chat-model2", chat_front_matter)
-
-    embed_front_matter = """---
-tags: [mlx, embedding]
-pipeline_tag: sentence-similarity
-library_name: mlx
----
-"""
-    _create_model_with_readme(temp_cache_dir, "org2", "embed-model2", embed_front_matter)
-
-    from mlx_knife.cache_utils import list_models
-    with patch_model_cache(temp_cache_dir / "hub"):
-        list_models(show_all=True)
-    out = capsys.readouterr().out
-    # Header contains TYPE column in --all mode
-    assert "TYPE" in out.splitlines()[0]
-    # Both models appear
-    assert "org2/chat-model2" in out
-    assert "org2/embed-model2" in out
diff --git a/tests_2.0/__init__.py b/tests_2.0/__init__.py
new file mode 100644
index 0000000..fac5d8c
--- /dev/null
+++ b/tests_2.0/__init__.py
@@ -0,0 +1 @@
+# MLX-Knife 2.0 Tests
\ No newline at end of file
diff --git a/tests_2.0/conftest.py b/tests_2.0/conftest.py
new file mode 100644
index 0000000..3cf35b6
--- /dev/null
+++ b/tests_2.0/conftest.py
@@ -0,0 +1,741 @@
+from __future__ import annotations
+
+"""Test fixtures for MLX-Knife 2.0 isolated testing."""
+
+# Ensure lightweight stubs are used for heavy deps (mlx, mlx_lm) during unit tests
+import sys
+from pathlib import Path
+_stubs_path = Path(__file__).parent / "stubs"
+if str(_stubs_path) not in sys.path:
+    sys.path.insert(0, str(_stubs_path))
+
+import os
+import tempfile
+import pytest
+from pathlib import Path
+from typing import Generator
+from contextlib import contextmanager
+import shutil
+import random
+import json as _json
+
+TEST_SENTINEL = "models--TEST-CACHE-SENTINEL--mlxk2-safety-check"
+
+
+def assert_is_test_cache(cache_path: Path):
+    """Ensure operations run against the isolated test cache only."""
+    path_str = str(cache_path)
+    if "/var/folders/" not in path_str or "mlxk2_test_" not in path_str:
+        raise RuntimeError(f"WARNING: Unexpected cache path - should be test cache: {path_str}")
+    sentinel_dir = cache_path / TEST_SENTINEL
+    if not sentinel_dir.exists():
+        raise RuntimeError(f"MISSING CANARY: Test cache sentinel not found in {cache_path}")
+
+
+@pytest.fixture
+def isolated_cache() -> Generator[Path, None, None]:
+    """Create isolated cache for MLX-Knife 2.0 tests - NEVER touches user cache."""
+    with tempfile.TemporaryDirectory(prefix="mlxk2_test_") as temp_dir:
+        cache_path = Path(temp_dir) / "test_cache"
+        cache_path.mkdir()
+        
+        # Create hub subdirectory (HuggingFace standard structure)
+        hub_path = cache_path / "hub"
+        hub_path.mkdir()
+        
+        # Store original HF_HOME and expose it to user-copy helpers as MLXK2_USER_HF_HOME
+        old_hf_home = os.environ.get("HF_HOME")
+        injected_user_hf_home = False
+        if not os.environ.get("MLXK2_USER_HF_HOME"):
+            # Prefer original HF_HOME if provided
+            if old_hf_home:
+                os.environ["MLXK2_USER_HF_HOME"] = old_hf_home
+                injected_user_hf_home = True
+            else:
+                # Fall back to common default: ~/.cache/huggingface
+                default_hf = Path.home() / ".cache" / "huggingface"
+                if (default_hf / "hub").exists():
+                    os.environ["MLXK2_USER_HF_HOME"] = str(default_hf)
+                    injected_user_hf_home = True
+
+        # Point HF_HOME to the isolated test cache (code under test will use this)
+        os.environ["HF_HOME"] = str(cache_path)
+        
+        # CRITICAL: Patch MODEL_CACHE to use our isolated cache
+        from mlxk2.core import cache
+        original_cache = cache.MODEL_CACHE
+        cache.MODEL_CACHE = hub_path
+        
+        # SAFETY CANARY: Create sentinel model to verify we're in test cache
+        sentinel_dir = hub_path / TEST_SENTINEL
+        sentinel_snapshot = sentinel_dir / "snapshots" / "test123456789abcdef0123456789abcdef0123"
+        sentinel_snapshot.mkdir(parents=True)
+        (sentinel_snapshot / "config.json").write_text('{"model_type": "test_sentinel", "test_cache": true}')
+        # Enable strict deletion safety inside tests
+        old_strict = os.environ.get("MLXK2_STRICT_TEST_DELETE")
+        os.environ["MLXK2_STRICT_TEST_DELETE"] = "1"
+        
+        try:
+            yield hub_path  # Return hub path (where models-- directories go)
+        finally:
+            # Restore everything
+            cache.MODEL_CACHE = original_cache
+            if old_hf_home:
+                os.environ["HF_HOME"] = old_hf_home
+            elif "HF_HOME" in os.environ:
+                del os.environ["HF_HOME"]
+            # Remove injected MLXK2_USER_HF_HOME if we set it
+            if injected_user_hf_home:
+                # Only remove if it matches our injected values to avoid
+                # deleting a user-provided variable
+                injected_vals = set()
+                if old_hf_home:
+                    injected_vals.add(old_hf_home)
+                injected_vals.add(str(Path.home() / ".cache" / "huggingface"))
+                if os.environ.get("MLXK2_USER_HF_HOME") in injected_vals:
+                    del os.environ["MLXK2_USER_HF_HOME"]
+            # Restore strict delete flag
+            if old_strict is not None:
+                os.environ["MLXK2_STRICT_TEST_DELETE"] = old_strict
+            elif "MLXK2_STRICT_TEST_DELETE" in os.environ:
+                del os.environ["MLXK2_STRICT_TEST_DELETE"]
+
+
+@pytest.fixture 
+def mock_models(isolated_cache):
+    """Create realistic mock models in isolated cache."""
+    
+    def create_model(hf_name: str, commit_hash: str = "abcdef123456789", healthy: bool = True):
+        """Create a mock model with proper directory structure."""
+        from mlxk2.core.cache import hf_to_cache_dir
+        
+        cache_dir_name = hf_to_cache_dir(hf_name)
+        model_base_dir = isolated_cache / cache_dir_name
+        
+        # Create snapshots directory
+        snapshots_dir = model_base_dir / "snapshots"
+        snapshot_dir = snapshots_dir / commit_hash
+        snapshot_dir.mkdir(parents=True)
+        
+        if healthy:
+            # Create healthy model files
+            (snapshot_dir / "config.json").write_text('{"model_type": "test", "hidden_size": 768}')
+            (snapshot_dir / "tokenizer.json").write_text('{"version": "1.0"}')
+            (snapshot_dir / "model.safetensors").write_bytes(b"fake_model_weights" * 1000)
+        else:
+            # Create corrupted model (missing files)
+            (snapshot_dir / "config.json").write_text('invalid json {')
+        
+        return model_base_dir, snapshot_dir
+    
+    # Pre-create diverse test models for framework detection
+    models_created = {}
+    
+    # MLX models (detected by "mlx-community" in name)
+    models_created["mlx-community/Phi-3-mini-4k-instruct-4bit"] = create_model(
+        "mlx-community/Phi-3-mini-4k-instruct-4bit", 
+        "e9675aa3def456789abcdef0123456789abcdef0"
+    )
+    
+    models_created["mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit"] = create_model(
+        "mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit",
+        "e9675aa3def456789abcdef0123456789abcdef0"  # Same short hash for testing
+    )
+    
+    # Second Qwen model for ambiguous matching tests (mock only - different hash)
+    models_created["Qwen/Qwen3-Coder-480B-A35B-Instruct"] = create_model(
+        "Qwen/Qwen3-Coder-480B-A35B-Instruct", 
+        "beef1234567890abcdef1234567890abcdefbeef"  # Different hash from above
+    )
+    
+    # PyTorch models (detected by .safetensors files)
+    pytorch_model = create_model(
+        "microsoft/DialoGPT-small",
+        "fedcba987654321fedcba987654321fedcba98"
+    )
+    # Add safetensors file for PyTorch detection
+    (pytorch_model[1] / "model.safetensors").write_bytes(b"fake_safetensors" * 100)
+    models_created["microsoft/DialoGPT-small"] = pytorch_model
+    
+    # GGUF model (detected by .gguf files) 
+    gguf_model = create_model(
+        "TheBloke/Llama-2-7B-Chat-GGUF",
+        "1234567890abcdef1234567890abcdef12345678"
+    )
+    # Add GGUF file
+    (gguf_model[1] / "q4_0.gguf").write_bytes(b"fake_gguf_model" * 200)
+    models_created["TheBloke/Llama-2-7B-Chat-GGUF"] = gguf_model
+    
+    # Embeddings model (different model_type in config)
+    embed_model = create_model(
+        "sentence-transformers/all-MiniLM-L6-v2",
+        "abcd1234567890abcdef1234567890abcdef12"
+    )
+    # Override config for embeddings
+    (embed_model[1] / "config.json").write_text('{"model_type": "bert", "task": "feature-extraction"}')
+    models_created["sentence-transformers/all-MiniLM-L6-v2"] = embed_model
+    
+    # Corrupted model for testing tolerance
+    models_created["corrupted/model"] = create_model(
+        "corrupted/model",
+        "corrupted123456789abcdef0123456789abcdef0",
+        healthy=False
+    )
+    
+    return models_created
+
+
+@pytest.fixture
+def create_corrupted_cache_entry(isolated_cache):
+    """Create corrupted cache entries for testing naming tolerance."""
+    
+    def create_corrupted(cache_name: str):
+        """Create a corrupted cache directory name (violates naming rules)."""
+        corrupted_dir = isolated_cache / cache_name
+        snapshots_dir = corrupted_dir / "snapshots" / "main"  
+        snapshots_dir.mkdir(parents=True)
+        
+        # Create minimal files so it's detected as model
+        (snapshots_dir / "config.json").write_text('{"model_type": "corrupted"}')
+        
+        return corrupted_dir
+    
+    return create_corrupted
+
+
+def test_list_models(cache_path):
+    """Test-specific list_models that uses exact cache path provided.
+    
+    This ensures test operations use the same cache consistently.
+    """
+    from mlxk2.core.cache import cache_dir_to_hf
+    
+    # Centralized safety check
+    assert_is_test_cache(cache_path)
+    
+    models = []
+    
+    if not cache_path.exists():
+        return {
+            "status": "success",
+            "command": "list",
+            "data": {
+                "models": models,
+                "count": 0
+            },
+            "error": None
+        }
+    
+    # Find all model directories in the provided cache path
+    for model_dir in cache_path.iterdir():
+        if not model_dir.is_dir() or not model_dir.name.startswith("models--"):
+            continue
+            
+        hf_name = cache_dir_to_hf(model_dir.name)
+        
+        # Get hashes from snapshots
+        hashes = []
+        snapshots_dir = model_dir / "snapshots"
+        if snapshots_dir.exists():
+            for snapshot_dir in snapshots_dir.iterdir():
+                if snapshot_dir.is_dir() and len(snapshot_dir.name) == 40:
+                    hashes.append(snapshot_dir.name)
+        
+        models.append({
+            "name": hf_name,
+            "hashes": sorted(hashes),
+            "cached": True
+        })
+    
+    # Sort by name for consistent output
+    models.sort(key=lambda x: x["name"])
+    
+    return {
+        "status": "success", 
+        "command": "list",
+        "data": {
+            "models": models,
+            "count": len(models)
+        },
+        "error": None
+    }
+
+
+def test_resolve_model_for_operation(cache_path, model_query):
+    """Test-specific model resolution that uses exact cache path provided.
+    
+    This ensures model resolution uses the same cache as other test operations.
+    """
+    # Centralized safety check
+    assert_is_test_cache(cache_path)
+    
+    from mlxk2.core.cache import cache_dir_to_hf
+    
+    # Parse @hash syntax if present
+    if "@" in model_query:
+        model_name, requested_hash = model_query.split("@", 1)
+        requested_hash = requested_hash.lower()
+    else:
+        model_name = model_query
+        requested_hash = None
+    
+    # Find matching models in the provided cache path
+    matching_models = []
+    
+    if not cache_path.exists():
+        return None, None, []
+    
+    for model_dir in cache_path.iterdir():
+        if not model_dir.is_dir() or not model_dir.name.startswith("models--"):
+            continue
+            
+        hf_name = cache_dir_to_hf(model_dir.name)
+        
+        # Skip sentinel model
+        if "TEST-CACHE-SENTINEL" in hf_name:
+            continue
+        
+        # Check for name match (exact, partial, fuzzy)
+        name_matches = False
+        if model_name.lower() == hf_name.lower():
+            name_matches = True  # Exact match
+        elif model_name.lower() in hf_name.lower():
+            name_matches = True  # Partial match
+        elif any(part.lower() in hf_name.lower() for part in model_name.split("-")):
+            name_matches = True  # Fuzzy match
+        
+        if name_matches:
+            # Get available hashes
+            snapshots_dir = model_dir / "snapshots"
+            available_hashes = []
+            if snapshots_dir.exists():
+                for snapshot_dir in snapshots_dir.iterdir():
+                    if snapshot_dir.is_dir() and len(snapshot_dir.name) == 40:
+                        available_hashes.append(snapshot_dir.name)
+            
+            # Check hash match if requested
+            if requested_hash:
+                hash_match = any(h.lower().startswith(requested_hash) for h in available_hashes)
+                if hash_match:
+                    matching_models.append(hf_name)
+            else:
+                matching_models.append(hf_name)
+    
+    # Return resolution results
+    if len(matching_models) == 0:
+        return None, requested_hash, []
+    elif len(matching_models) == 1:
+        return matching_models[0], requested_hash, None
+    else:
+        # Ambiguous - return choices
+        return None, requested_hash, matching_models
+
+
+def test_health_check_operation(cache_path, model_query=None):
+    """Test-specific health check that uses exact cache path provided.
+    
+    This ensures health check uses the same cache as other test operations.
+    """
+    # Centralized safety check
+    assert_is_test_cache(cache_path)
+    
+    from mlxk2.core.cache import cache_dir_to_hf
+    import json
+    
+    healthy_models = []
+    unhealthy_models = []
+    
+    if not cache_path.exists():
+        return {
+            "status": "success",
+            "command": "health",
+            "data": {
+                "healthy": [],
+                "unhealthy": [],
+                "summary": {"total": 0, "healthy_count": 0, "unhealthy_count": 0}
+            },
+            "error": None
+        }
+    
+    # Check all models in cache path
+    for model_dir in cache_path.iterdir():
+        if not model_dir.is_dir() or not model_dir.name.startswith("models--"):
+            continue
+            
+        hf_name = cache_dir_to_hf(model_dir.name)
+        
+        # Skip sentinel model
+        if "TEST-CACHE-SENTINEL" in hf_name:
+            continue
+        
+        # Filter by model_query if specified (supports @hash syntax)
+        if model_query:
+            # Parse @hash syntax if present
+            if "@" in model_query:
+                query_name, requested_hash = model_query.split("@", 1)
+                requested_hash = requested_hash.lower()
+                
+                # Check name match
+                name_matches = (query_name.lower() in hf_name.lower())
+                if not name_matches:
+                    continue
+                
+                # Check hash match
+                snapshots_dir = model_dir / "snapshots"
+                hash_matches = False
+                if snapshots_dir.exists():
+                    for snapshot_dir in snapshots_dir.iterdir():
+                        if snapshot_dir.is_dir() and len(snapshot_dir.name) == 40:
+                            if snapshot_dir.name.lower().startswith(requested_hash):
+                                hash_matches = True
+                                break
+                
+                if not hash_matches:
+                    continue
+            else:
+                # Simple name filtering
+                if model_query.lower() not in hf_name.lower():
+                    continue
+        
+        # Check model health
+        is_healthy = True
+        health_issues = []
+        
+        # Check snapshots directory
+        snapshots_dir = model_dir / "snapshots"
+        if not snapshots_dir.exists():
+            is_healthy = False
+            health_issues.append("Missing snapshots directory")
+        else:
+            # Check for at least one valid snapshot
+            valid_snapshots = []
+            for snapshot_dir in snapshots_dir.iterdir():
+                if snapshot_dir.is_dir() and len(snapshot_dir.name) == 40:
+                    # Check for config.json
+                    config_file = snapshot_dir / "config.json"
+                    if config_file.exists():
+                        try:
+                            with open(config_file, 'r') as f:
+                                json.load(f)
+                            valid_snapshots.append(snapshot_dir.name)
+                        except (json.JSONDecodeError, IOError):
+                            health_issues.append(f"Invalid config.json in {snapshot_dir.name}")
+                    else:
+                        health_issues.append(f"Missing config.json in {snapshot_dir.name}")
+            
+            if not valid_snapshots:
+                is_healthy = False
+                health_issues.append("No valid snapshots found")
+        
+        # Categorize model
+        model_info = {
+            "name": hf_name,
+            "issues": health_issues
+        }
+        
+        if is_healthy:
+            healthy_models.append(model_info)
+        else:
+            unhealthy_models.append(model_info)
+    
+    return {
+        "status": "success",
+        "command": "health", 
+        "data": {
+            "healthy": healthy_models,
+            "unhealthy": unhealthy_models,
+            "summary": {
+                "total": len(healthy_models) + len(unhealthy_models),
+                "healthy_count": len(healthy_models),
+                "unhealthy_count": len(unhealthy_models)
+            }
+        },
+        "error": None
+    }
+
+
+@contextmanager
+def atomic_cache_context(cache_path: Path, expected_context="test"):
+    """Atomic cache switching context manager.
+    
+    Temporarily switches HF_HOME to use specific cache, with verification.
+    """
+    from mlxk2.core.cache import verify_cache_context
+    
+    # Store original HF_HOME
+    original_hf_home = os.environ.get("HF_HOME")
+    
+    try:
+        # Switch to specified cache
+        if cache_path:
+            os.environ["HF_HOME"] = str(cache_path.parent)  # cache_path is hub/, we need parent
+        
+        # Verify we're in the right context
+        verify_cache_context(expected_context)
+        
+        yield cache_path
+        
+    finally:
+        # Restore original HF_HOME
+        if original_hf_home:
+            os.environ["HF_HOME"] = original_hf_home
+        elif "HF_HOME" in os.environ:
+            del os.environ["HF_HOME"]
+
+
+@contextmanager  
+def user_cache_context():
+    """Context manager for user cache operations."""
+    # User cache doesn't need HF_HOME changes - it's the default
+    from mlxk2.core.cache import get_current_model_cache, verify_cache_context
+    
+    # Just verify we're in user cache context
+    verify_cache_context("user")
+    
+    yield get_current_model_cache()
+
+
+@pytest.fixture
+def copy_user_model_to_isolated(isolated_cache):
+    """Utility to copy a real user-cache model into the isolated test cache.
+
+    Safety:
+    - Read-only on user cache.
+    - Requires explicit env var MLXK2_USER_HF_HOME pointing to the user HF_HOME.
+    - Skips if user cache or model not present.
+
+    Usage:
+    >>> copier = copy_user_model_to_isolated
+    >>> path = copier('mlx-community/Phi-3-mini-4k-instruct-4bit', mutations=['remove_config'])
+    """
+    from mlxk2.core.cache import hf_to_cache_dir
+
+    # IMPORTANT: Do NOT use HF_HOME here because the isolated_cache fixture
+    # overrides HF_HOME to point to the test cache. We need the real user cache,
+    # which must be provided via MLXK2_USER_HF_HOME explicitly.
+    user_hf_home = os.environ.get("MLXK2_USER_HF_HOME")
+    if not user_hf_home:
+        pytest.skip("MLXK2_USER_HF_HOME not set; skip user->isolated copy")
+
+    user_hub = Path(user_hf_home) / "hub"
+    if not user_hub.exists():
+        pytest.skip(f"User hub path not found: {user_hub}")
+
+    def mutate_model_dir(model_dir: Path, mutations):
+        if not mutations:
+            return
+        # Normalize list
+        if isinstance(mutations, str):
+            mutations_list = [mutations]
+        else:
+            mutations_list = list(mutations)
+
+        # Find a snapshot dir (prefer any 40-char hex dir)
+        snapshots = model_dir / "snapshots"
+        snap_dirs = [d for d in snapshots.iterdir() if d.is_dir() and len(d.name) == 40] if snapshots.exists() else []
+        target_snap = snap_dirs[0] if snap_dirs else None
+
+        # Helper: load index
+        def _load_index():
+            if target_snap is None:
+                return None
+            sft_idx = target_snap / "model.safetensors.index.json"
+            pt_idx = target_snap / "pytorch_model.bin.index.json"
+            for idx in (sft_idx, pt_idx):
+                if idx.exists():
+                    try:
+                        return _json.loads(idx.read_text())
+                    except Exception:
+                        return None
+            return None
+
+        # Helper: get referenced shard paths
+        def _referenced_shards():
+            index = _load_index()
+            if not index or not isinstance(index.get("weight_map"), dict) or target_snap is None:
+                return []
+            files = sorted(set(index["weight_map"].values()))
+            return [target_snap / f for f in files]
+
+        for m in mutations_list:
+            if m == 'remove_config' and target_snap is not None:
+                cfg = target_snap / "config.json"
+                if cfg.exists():
+                    cfg.unlink()
+            elif m == 'truncate_weight' and target_snap is not None:
+                # Truncate first weight-like file
+                candidates = list(target_snap.glob("**/*.safetensors")) or list(target_snap.glob("**/*.gguf"))
+                if candidates:
+                    p = candidates[0]
+                    p.write_bytes(b"")
+            elif m == 'remove_snapshot' and target_snap is not None:
+                shutil.rmtree(target_snap, ignore_errors=True)
+                target_snap = None
+            elif m == 'drop_random_files' and target_snap is not None:
+                files = [f for f in target_snap.rglob("*") if f.is_file()]
+                for f in random.sample(files, k=min(len(files), max(1, len(files)//4))):
+                    try:
+                        f.unlink()
+                    except Exception:
+                        pass
+            elif m == 'inject_invalid_config' and target_snap is not None:
+                (target_snap / "config.json").write_text('invalid json {')
+            elif m == 'add_partial_tmp' and target_snap is not None:
+                (target_snap / ".partial.tmp").write_bytes(b"downloading...")
+            elif m == 'delete_indexed_shard' and target_snap is not None:
+                # Delete one referenced shard (if index exists)
+                refs = _referenced_shards()
+                if refs:
+                    try:
+                        refs[0].unlink(missing_ok=True)
+                    except Exception:
+                        pass
+            elif m == 'truncate_indexed_shard' and target_snap is not None:
+                refs = _referenced_shards()
+                if refs:
+                    refs[0].write_bytes(b"")
+            elif m == 'lfsify_indexed_shard' and target_snap is not None:
+                refs = _referenced_shards()
+                if refs:
+                    lfs_content = (
+                        "version https://git-lfs.github.com/spec/v1\n"
+                        "oid sha256:123\nsize 123\n"
+                    )
+                    refs[0].write_text(lfs_content)
+            elif m == 'remove_index' and target_snap is not None:
+                idx = target_snap / "model.safetensors.index.json"
+                if idx.exists():
+                    idx.unlink()
+
+    def _latest_snapshot_dir(model_dir: Path) -> Path | None:
+        snaps = model_dir / "snapshots"
+        if not snaps.exists():
+            return None
+        dirs = [d for d in snaps.iterdir() if d.is_dir()]
+        if not dirs:
+            return None
+        return max(dirs, key=lambda p: p.stat().st_mtime)
+
+    def copier(hf_name: str, *, mutations=None) -> Path:
+        src = user_hub / hf_to_cache_dir(hf_name)
+        if not src.exists():
+            pytest.skip(f"User model not found: {hf_name} -> {src}")
+
+        dst = isolated_cache / hf_to_cache_dir(hf_name)
+        if dst.exists():
+            shutil.rmtree(dst)
+
+        # Minimal copy strategy (implicit):
+        # - If an index exists, copy the index and the N smallest referenced shards (default N=1).
+        # - Otherwise, copy shards matching the safetensors pattern and limit to N (default N=1).
+        subset_count = int(os.environ.get("MLXK2_SUBSET_COUNT", "1"))
+        min_free_mb = int(os.environ.get("MLXK2_MIN_FREE_MB", "512"))
+
+        # Create dst structure minimally
+        (dst / "snapshots").mkdir(parents=True, exist_ok=True)
+        src_snap = _latest_snapshot_dir(src)
+        if src_snap is None:
+            pytest.skip("Source model has no snapshots")
+        dst_snap = (dst / "snapshots" / src_snap.name)
+        dst_snap.mkdir(parents=True, exist_ok=True)
+
+        # Decide which files to copy
+        selected: list[Path] = []
+        sft_idx = src_snap / "model.safetensors.index.json"
+        pt_idx = src_snap / "pytorch_model.bin.index.json"
+        idx = sft_idx if sft_idx.exists() else (pt_idx if pt_idx.exists() else None)
+        if idx is not None and idx.exists():
+            try:
+                index = _json.loads(idx.read_text())
+                wm = index.get("weight_map") or {}
+                shard_names = sorted(set(wm.values()))
+            except Exception:
+                shard_names = []
+            # pick N smallest shards by size to minimize copy volume
+            shard_paths = [src_snap / name for name in shard_names]
+            shard_paths = [p for p in shard_paths if p.exists()]
+            shard_paths.sort(key=lambda p: p.stat().st_size)
+            for p in shard_paths[:max(0, subset_count)]:
+                selected.append(p)
+            selected.append(idx)
+        else:
+            # pattern subset: pick shards by filename pattern
+            import re
+            rgx = re.compile(r"model-\d{5}-of-\d{5}\.safetensors$")
+            shard_files = [p for p in src_snap.iterdir() if p.is_file() and rgx.search(p.name)]
+            shard_files.sort()
+            selected.extend(shard_files[:subset_count])
+            # include index if present (unlikely in this branch but safe)
+            if sft_idx.exists():
+                selected.append(sft_idx)
+            elif pt_idx.exists():
+                selected.append(pt_idx)
+        # Always include config.json if present
+        cfg = src_snap / "config.json"
+        if cfg.exists():
+            selected.append(cfg)
+
+        # Disk space check (on the test cache volume)
+        total_bytes = 0
+        for p in selected:
+            try:
+                total_bytes += p.stat().st_size
+            except FileNotFoundError:
+                pass
+        free_bytes = shutil.disk_usage(str(isolated_cache)).free
+        if free_bytes < total_bytes + (min_free_mb * 1024 * 1024):
+            pytest.skip(f"Not enough free space for subset copy: need ~{(total_bytes/1e6):.1f}MB + safety, have {(free_bytes/1e6):.1f}MB")
+
+        # Copy selected files
+        for p in selected:
+            rel = p.relative_to(src_snap)
+            dst_file = dst_snap / rel
+            dst_file.parent.mkdir(parents=True, exist_ok=True)
+            if p.exists():
+                shutil.copy2(p, dst_file)
+
+        # Also place index file at model root so tests can detect it without network
+        if idx is not None and idx.exists():
+            try:
+                shutil.copy2(idx, dst / idx.name)
+            except Exception:
+                pass
+
+        mutate_model_dir(dst, mutations)
+
+        # Optional: bootstrap index files into the ISOLATED cache (never user cache)
+        # Enable with MLXK2_BOOTSTRAP_INDEX=1 to reduce SKIPs for Issue #27 when the
+        # selected model doesn't ship an index in your user cache.
+        try_bootstrap = os.environ.get("MLXK2_BOOTSTRAP_INDEX") == "1"
+        if try_bootstrap:
+            # Quick existence check at model root (tests look here first)
+            root_sft = dst / "model.safetensors.index.json"
+            root_pt = dst / "pytorch_model.bin.index.json"
+            if not root_sft.exists() and not root_pt.exists():
+                try:
+                    # Use hf snapshot_download with allow_patterns to fetch ONLY index files
+                    # into the isolated HF_HOME (set by isolated_cache fixture).
+                    from huggingface_hub import snapshot_download
+                    _ = snapshot_download(
+                        repo_id=hf_name,
+                        allow_patterns=[
+                            "**/model.safetensors.index.json",
+                            "**/pytorch_model.bin.index.json",
+                        ],
+                        local_files_only=False,
+                        resume_download=True,
+                        token=(os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN")),
+                    )
+                    # Copy any fetched index up to model root so tests can detect it
+                    fetched = list((dst / "snapshots").rglob("*index.json"))
+                    for f in fetched:
+                        try:
+                            shutil.copy2(f, dst / f.name)
+                        except Exception:
+                            pass
+                except Exception:
+                    # Ignore bootstrap failures; tests will skip as before
+                    pass
+        return dst
+
+    return copier
diff --git a/tests_2.0/conftest_runner.py b/tests_2.0/conftest_runner.py
new file mode 100644
index 0000000..a65c5e6
--- /dev/null
+++ b/tests_2.0/conftest_runner.py
@@ -0,0 +1,83 @@
+"""
+Fixtures for MLXRunner testing - solves mock complexity issues.
+"""
+
+import pytest
+import tempfile
+from pathlib import Path
+from unittest.mock import Mock, patch
+from contextlib import contextmanager
+
+
+@pytest.fixture
+def temp_cache_dir():
+    """Isolated cache directory for testing"""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        yield Path(tmpdir)
+
+
+@contextmanager
+def mock_mlx_runner_environment(temp_cache_dir, model_name="test-model", context_length=8192):
+    """Complete mock environment for MLXRunner that handles all dependencies."""
+    
+    # Create proper directory structure
+    model_cache_dir = temp_cache_dir / f"models--{model_name}"
+    snapshots_dir = model_cache_dir / "snapshots" / "abc123"
+    snapshots_dir.mkdir(parents=True)
+    
+    # Create mock config.json
+    config_path = snapshots_dir / "config.json"
+    config_path.write_text(f'{{"max_position_embeddings": {context_length}}}')
+    
+    with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve, \
+         patch('mlxk2.core.runner.get_current_model_cache') as mock_cache, \
+         patch('mlxk2.core.runner.hf_to_cache_dir') as mock_hf_to_cache, \
+         patch('mlxk2.core.runner.load') as mock_load, \
+         patch('mlxk2.core.runner.generate_step') as mock_gen_step:
+        
+        # Setup return values
+        mock_resolve.return_value = (model_name, None, None)
+        mock_cache.return_value = temp_cache_dir
+        mock_hf_to_cache.return_value = f"models--{model_name}"
+        
+        # Setup model and tokenizer mocks
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.pad_token = None
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        
+        # Common encode/decode behavior
+        mock_tokenizer.encode.return_value = [1, 2, 3, 4, 5]
+        mock_tokenizer.decode.side_effect = lambda tokens: " ".join(f"token{t}" for t in tokens)
+        
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        # Setup generation step mock
+        mock_gen_step.return_value = iter([
+            (Mock(item=lambda: 1), Mock()),
+            (Mock(item=lambda: 2), Mock()),
+            (Mock(item=lambda: 3), Mock())
+        ])
+        
+        yield {
+            'mock_resolve': mock_resolve,
+            'mock_cache': mock_cache,
+            'mock_hf_to_cache': mock_hf_to_cache,
+            'mock_load': mock_load,
+            'mock_model': mock_model,
+            'mock_tokenizer': mock_tokenizer,
+            'mock_gen_step': mock_gen_step,
+            'temp_cache_dir': temp_cache_dir,
+            'model_path': snapshots_dir
+        }
+
+
+@pytest.fixture
+def mock_runner_env(temp_cache_dir):
+    """Fixture version of mock_mlx_runner_environment."""
+    with mock_mlx_runner_environment(temp_cache_dir) as env:
+        yield env
diff --git a/tests_2.0/live/test_clone_live.py b/tests_2.0/live/test_clone_live.py
new file mode 100644
index 0000000..a01ae56
--- /dev/null
+++ b/tests_2.0/live/test_clone_live.py
@@ -0,0 +1,204 @@
+"""Opt-in live clone test.
+
+Runs only when explicitly selected via markers/env, per TESTING.md mini‑matrix.
+Validates ADR-007 Phase 1 compliance: real pull→temp cache→APFS same-volume clone→workspace.
+
+Enable with ALL required env vars:
+- MLXK2_ENABLE_ALPHA_FEATURES=1 (clone is alpha)
+- MLXK2_LIVE_CLONE=1 (enable live test)
+- HF_TOKEN=<your_token> (for model access)
+- MLXK2_LIVE_CLONE_MODEL=<model_name> (e.g., "mlx-community/bge-small-en-v1.5-4bit")
+- MLXK2_LIVE_CLONE_WORKSPACE=<workspace_path> (must be on same volume as HF_HOME for APFS)
+
+Run:
+- pytest -m live_clone -v
+- or umbrella: pytest -m wet -v
+
+ADR-007 Phase 1 Requirements:
+- Same volume: workspace and HF_HOME cache must be on same volume
+- APFS filesystem: required for copy-on-write optimization
+- User cache safety: never touched, always use temp cache isolation
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import shutil
+import sys
+from pathlib import Path
+
+import pytest
+
+
+# Environment validation
+alpha_enabled = os.environ.get("MLXK2_ENABLE_ALPHA_FEATURES") == "1"
+live_enabled = os.environ.get("MLXK2_LIVE_CLONE") == "1"
+hf_token_present = bool(os.environ.get("HF_TOKEN"))
+model = os.environ.get("MLXK2_LIVE_CLONE_MODEL")
+workspace = os.environ.get("MLXK2_LIVE_CLONE_WORKSPACE")
+
+pytestmark = [
+    pytest.mark.wet,
+    pytest.mark.live_clone,
+    pytest.mark.skipif(
+        not (alpha_enabled and live_enabled and hf_token_present and model and workspace),
+        reason=(
+            "Live clone disabled. Set MLXK2_ENABLE_ALPHA_FEATURES=1, MLXK2_LIVE_CLONE=1, "
+            "HF_TOKEN, MLXK2_LIVE_CLONE_MODEL, and MLXK2_LIVE_CLONE_WORKSPACE to enable."
+        ),
+    ),
+]
+
+
+def _run_cli(argv: list[str], capsys) -> str:
+    """Run CLI command and return captured output."""
+    from mlxk2.cli import main as cli_main
+    old_argv = sys.argv[:]
+    sys.argv = argv[:]
+    try:
+        with pytest.raises(SystemExit):
+            cli_main()
+    finally:
+        sys.argv = old_argv
+    captured = capsys.readouterr()
+    return captured.out.strip()
+
+
+def test_live_clone_workflow_adr007_phase1(capsys, tmp_path):
+    """Test complete live clone workflow following ADR-007 Phase 1 constraints.
+
+    This test validates:
+    1. Real HuggingFace model pull → temp cache
+    2. Real APFS volume detection and same-volume validation
+    3. Real health check integration with health_from_cache
+    4. Real APFS copy-on-write clone → workspace
+    5. User cache safety (never touched)
+
+    Expected workflow:
+    - Pull model to isolated temp cache (not user cache)
+    - Validate same volume constraint (workspace + HF_HOME)
+    - Health check via health_from_cache (full _check_snapshot_health)
+    - APFS clone with copy-on-write optimization
+    - Clean workspace output ready for development
+    """
+    # Ensure clean workspace
+    workspace_path = Path(workspace)
+    if workspace_path.exists():
+        shutil.rmtree(workspace_path)
+
+    # Run live clone operation
+    result_json = _run_cli([
+        "mlxk2", "clone", model, workspace, "--json"
+    ], capsys)
+
+    # Parse JSON response
+    try:
+        result = json.loads(result_json)
+    except json.JSONDecodeError as e:
+        pytest.fail(f"Invalid JSON response: {e}\nOutput: {result_json}")
+
+    # Validate successful clone
+    assert result["status"] == "success", f"Clone failed: {result.get('error', 'Unknown error')}"
+    assert result["command"] == "clone"
+    assert result["data"]["model"] == model
+    assert result["data"]["target_dir"] == str(Path(workspace).resolve())
+
+    # ADR-007 Phase 1 validation
+    assert result["data"]["health_check"] is True, "Health check should be enabled by default"
+    assert "clone_status" in result["data"], "Clone status should be reported"
+
+    # Verify workspace was created and contains model files
+    assert workspace_path.exists(), f"Workspace {workspace} was not created"
+    assert workspace_path.is_dir(), f"Workspace {workspace} is not a directory"
+
+    # Validate essential model files are present
+    config_file = workspace_path / "config.json"
+    assert config_file.exists(), "config.json missing from workspace"
+
+    # Verify at least one weight file exists
+    weight_files = (
+        list(workspace_path.glob("*.safetensors")) +
+        list(workspace_path.glob("*.bin")) +
+        list(workspace_path.glob("*.gguf"))
+    )
+    assert weight_files, "No weight files found in workspace"
+
+    # Verify files are real (not LFS pointers)
+    for weight_file in weight_files[:1]:  # Check first weight file
+        assert weight_file.stat().st_size > 200, f"Weight file {weight_file.name} appears to be LFS pointer"
+
+    print(f"✅ Live clone test successful: {model} → {workspace}")
+    print(f"📁 Workspace files: {len(list(workspace_path.iterdir()))} items")
+    print(f"⚖️  Weight files: {len(weight_files)} files")
+
+
+def test_live_clone_health_check_integration(capsys, tmp_path):
+    """Test that health check integration works with real models.
+
+    This validates that health_from_cache properly integrates with
+    _check_snapshot_health for real model validation.
+    """
+    from mlxk2.operations.health import health_from_cache
+    from mlxk2.core.cache import get_current_cache_root
+
+    # Note: This test assumes the previous test ran and workspace exists
+    workspace_path = Path(workspace)
+    if not workspace_path.exists():
+        pytest.skip(f"Workspace {workspace} not found - run full clone test first")
+
+    # For this test, we create a temporary cache and copy the workspace
+    # to simulate the temp cache state during clone operation
+    temp_cache = tmp_path / "temp_cache_health_test"
+    temp_cache.mkdir()
+
+    # Create model structure in temp cache (simulate clone operation state)
+    from mlxk2.core.cache import hf_to_cache_dir
+    model_cache_dir = temp_cache / hf_to_cache_dir(model)
+    snapshots_dir = model_cache_dir / "snapshots"
+    snapshot_dir = snapshots_dir / "test_snapshot"
+    snapshot_dir.mkdir(parents=True)
+
+    # Copy workspace content to simulate temp cache snapshot
+    for item in workspace_path.iterdir():
+        if item.is_file():
+            shutil.copy2(item, snapshot_dir)
+        elif item.is_dir():
+            shutil.copytree(item, snapshot_dir / item.name)
+
+    # Test health_from_cache integration
+    healthy, message = health_from_cache(model, temp_cache)
+
+    assert healthy is True, f"Health check failed: {message}"
+    assert "healthy" in message.lower() or "complete" in message.lower(), f"Unexpected health message: {message}"
+
+    print(f"✅ Health check integration successful: {message}")
+
+
+def test_live_clone_workspace_validation(capsys):
+    """Test workspace validation with real filesystem constraints."""
+    # Test that workspace directory must be empty or non-existent
+    workspace_path = Path(workspace)
+
+    if workspace_path.exists():
+        # Create a dummy file to make workspace non-empty
+        dummy_file = workspace_path / "dummy.txt"
+        dummy_file.write_text("test")
+
+        # Clone should fail with non-empty workspace
+        result_json = _run_cli([
+            "mlxk2", "clone", model, workspace, "--json"
+        ], capsys)
+
+        try:
+            result = json.loads(result_json)
+        except json.JSONDecodeError:
+            pytest.fail(f"Invalid JSON response: {result_json}")
+
+        assert result["status"] == "error", "Clone should fail with non-empty workspace"
+        assert "not empty" in result["error"]["message"].lower(), "Error should mention non-empty workspace"
+
+        # Clean up dummy file
+        dummy_file.unlink()
+
+        print("✅ Workspace validation successful: non-empty workspace properly rejected")
\ No newline at end of file
diff --git a/tests_2.0/live/test_list_human_live.py b/tests_2.0/live/test_list_human_live.py
new file mode 100644
index 0000000..ad99193
--- /dev/null
+++ b/tests_2.0/live/test_list_human_live.py
@@ -0,0 +1,85 @@
+"""Opt-in live E2E test for human list rendering using the real HF cache.
+
+Per TESTING.md mini‑matrix, this test is collected by default but
+only runs when explicitly selected with the `live_list` marker.
+
+Run:
+- pytest -m live_list -v
+- umbrella: pytest -m wet -v
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from typing import List, Dict
+
+import pytest
+
+pytestmark = [pytest.mark.wet, pytest.mark.live_list]
+
+
+def _run_cli(argv: List[str], capsys) -> str:
+    from mlxk2.cli import main as cli_main
+    old_argv = sys.argv[:]
+    sys.argv = argv[:]
+    try:
+        with pytest.raises(SystemExit):
+            cli_main()
+    finally:
+        sys.argv = old_argv
+    out = capsys.readouterr().out
+    return out
+
+
+def _json_models(capsys) -> List[Dict]:
+    out = _run_cli(["mlxk2", "list", "--json"], capsys)
+    data = json.loads(out)
+    assert data["status"] == "success" and data["command"] == "list"
+    return data["data"]["models"]
+
+
+def _display_name_for_default(name: str) -> str:
+    # In compact default view, we strip mlx-community/ prefix
+    return name.split("/", 1)[1] if name.startswith("mlx-community/") else name
+
+
+def test_live_list_human_variants(capsys, request):
+    # Only run when explicitly selected with -m live_list
+    selected = request.config.getoption("-m") or ""
+    if "live_list" not in selected:
+        pytest.skip("Run with -m live_list to enable this end-to-end test")
+    models = _json_models(capsys)
+
+    mlx = [m for m in models if m.get("framework") == "MLX"]
+    mlx_chat = [m for m in mlx if m.get("model_type") == "chat"]
+    mlx_base = [m for m in mlx if m.get("model_type") == "base"]
+    other = [m for m in models if m.get("framework") != "MLX"]
+
+    # Fail if the cache doesn't have the necessary models
+    assert mlx_chat, "Need at least one MLX chat model in HF cache"
+    assert mlx_base, "Need at least one MLX base model in HF cache"
+
+    chat_name = mlx_chat[0]["name"]
+    base_name = mlx_base[0]["name"]
+
+    # Default list: only MLX chat
+    out_default = _run_cli(["mlxk2", "list"], capsys)
+    assert _display_name_for_default(chat_name) in out_default
+    assert _display_name_for_default(base_name) not in out_default
+
+    # Verbose: all MLX (chat + base)
+    out_verbose = _run_cli(["mlxk2", "list", "--verbose"], capsys)
+    assert chat_name in out_verbose
+    assert base_name in out_verbose
+
+    # All: all frameworks
+    out_all = _run_cli(["mlxk2", "list", "--all"], capsys)
+    assert _display_name_for_default(chat_name) in out_all or chat_name in out_all
+    assert _display_name_for_default(base_name) in out_all or base_name in out_all
+
+    if other:
+        other_name = other[0]["name"]
+        # Non-MLX names are never stripped by default rule
+        assert other_name in out_all
+
diff --git a/tests_2.0/live/test_push_live.py b/tests_2.0/live/test_push_live.py
new file mode 100644
index 0000000..80648da
--- /dev/null
+++ b/tests_2.0/live/test_push_live.py
@@ -0,0 +1,64 @@
+"""Opt-in live push test.
+
+Runs only when explicitly selected via markers/env, per TESTING.md mini‑matrix.
+
+Enable with BOTH:
+- MLXK2_LIVE_PUSH=1
+- HF_TOKEN=<your_write_token>
+- MLXK2_LIVE_REPO=org/model (target repo)
+- MLXK2_LIVE_WORKSPACE=/abs/path/to/workspace (folder to push)
+
+Run:
+- pytest -m live_push -v
+- or umbrella: pytest -m wet -v
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+
+import pytest
+
+
+live_enabled = os.environ.get("MLXK2_LIVE_PUSH") == "1"
+hf_token_present = bool(os.environ.get("HF_TOKEN"))
+repo = os.environ.get("MLXK2_LIVE_REPO")
+workspace = os.environ.get("MLXK2_LIVE_WORKSPACE")
+
+pytestmark = [
+    pytest.mark.wet,
+    pytest.mark.live_push,
+    pytest.mark.skipif(
+        not (live_enabled and hf_token_present and repo and workspace),
+        reason=(
+            "Live push disabled. Set MLXK2_LIVE_PUSH=1, HF_TOKEN, MLXK2_LIVE_REPO, "
+            "and MLXK2_LIVE_WORKSPACE to enable."
+        ),
+    ),
+]
+
+
+def _run_cli(argv: list[str], capsys) -> str:
+    from mlxk2.cli import main as cli_main
+    old_argv = sys.argv[:]
+    sys.argv = argv[:]
+    try:
+        with pytest.raises(SystemExit):
+            cli_main()
+    finally:
+        sys.argv = old_argv
+    out = capsys.readouterr().out
+    return out
+
+
+def test_live_push_json_success(capsys):
+    out = _run_cli(["mlxk2", "push", "--private", workspace, repo, "--json"], capsys)
+    data = json.loads(out)
+    assert data["command"] == "push"
+    assert data["status"] in {"success", "error"}
+    if data["status"] == "error":
+        # Provide a helpful hint on failure and skip instead of failing the suite
+        pytest.skip(f"Live push error: {data['error']}")
+
diff --git a/tests_2.0/spec/test_cli_commands_json_flag.py b/tests_2.0/spec/test_cli_commands_json_flag.py
new file mode 100644
index 0000000..65f325e
--- /dev/null
+++ b/tests_2.0/spec/test_cli_commands_json_flag.py
@@ -0,0 +1,25 @@
+import sys
+import json
+import pytest
+
+
+@pytest.mark.spec
+def test_cli_list_accepts_json_after_command(monkeypatch, capsys, isolated_cache):
+    from mlxk2 import cli
+
+    monkeypatch.setenv("PYTHONWARNINGS", "ignore")
+    monkeypatch.setenv("PYTHONDONTWRITEBYTECODE", "1")
+
+    # Ensure we pass --json after the subcommand, as users would
+    monkeypatch.setattr(sys, "argv", ["mlxk2", "list", "--json"]) 
+    with pytest.raises(SystemExit) as exc:
+        cli.main()
+    assert exc.value.code == 0
+
+    out = capsys.readouterr().out.strip()
+    data = json.loads(out)
+    assert data["status"] == "success"
+    assert data["command"] == "list"
+    assert data["error"] is None
+    assert "data" in data and "models" in data["data"] and "count" in data["data"]
+
diff --git a/tests_2.0/spec/test_cli_version_output.py b/tests_2.0/spec/test_cli_version_output.py
new file mode 100644
index 0000000..3ebd56e
--- /dev/null
+++ b/tests_2.0/spec/test_cli_version_output.py
@@ -0,0 +1,27 @@
+import sys
+import json
+import pytest
+
+
+@pytest.mark.spec
+def test_cli_version_json_output(monkeypatch, capsys):
+    from mlxk2 import __version__
+    from mlxk2.spec import JSON_API_SPEC_VERSION
+    from mlxk2 import cli
+
+    monkeypatch.setenv("PYTHONWARNINGS", "ignore")
+    monkeypatch.setenv("PYTHONDONTWRITEBYTECODE", "1")
+
+    monkeypatch.setattr(sys, "argv", ["mlxk2", "--version", "--json"]) 
+    with pytest.raises(SystemExit) as exc:
+        cli.main()
+    assert exc.value.code == 0
+
+    out = capsys.readouterr().out.strip()
+    data = json.loads(out)
+    assert data["status"] == "success"
+    assert data["command"] == "version"
+    assert data["error"] is None
+    assert data["data"]["cli_version"] == __version__
+    assert data["data"]["json_api_spec_version"] == JSON_API_SPEC_VERSION
+
diff --git a/tests_2.0/spec/test_code_outputs_validate_against_schema.py b/tests_2.0/spec/test_code_outputs_validate_against_schema.py
new file mode 100644
index 0000000..fbf8424
--- /dev/null
+++ b/tests_2.0/spec/test_code_outputs_validate_against_schema.py
@@ -0,0 +1,192 @@
+from __future__ import annotations
+
+"""Validate actual command outputs against the JSON schema.
+
+This complements the doc example validation by checking the live outputs
+returned from operations and the CLI, using the isolated test cache.
+If jsonschema is not installed locally, these tests are skipped.
+"""
+
+import json
+from pathlib import Path
+import sys
+import pytest
+
+
+def _load_schema():
+    try:
+        import jsonschema  # noqa: F401
+    except Exception:
+        pytest.skip("jsonschema not installed; skipping schema validation tests", allow_module_level=True)
+
+    schema_path = Path("docs/json-api-schema.json")
+    assert schema_path.exists(), "Schema file docs/json-api-schema.json missing"
+    return json.loads(schema_path.read_text(encoding="utf-8"))
+
+
+def _get_validator():
+    try:
+        from jsonschema import Draft7Validator
+    except Exception:
+        pytest.skip("jsonschema not available", allow_module_level=True)
+    return Draft7Validator(_load_schema())
+
+
+@pytest.mark.spec
+def test_list_output_matches_schema(mock_models, isolated_cache):
+    from mlxk2.operations.list import list_models
+    validator = _get_validator()
+
+    data = list_models()
+    errors = sorted(validator.iter_errors(data), key=lambda e: e.path)
+    assert not errors, f"list output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_show_outputs_match_schema(mock_models, isolated_cache):
+    from mlxk2.operations.show import show_model_operation
+    validator = _get_validator()
+
+    name = "mlx-community/Phi-3-mini-4k-instruct-4bit"
+
+    base = show_model_operation(name)
+    files = show_model_operation(name, include_files=True, include_config=False)
+    cfg = show_model_operation(name, include_files=False, include_config=True)
+
+    for label, payload in ("base", base), ("files", files), ("config", cfg):
+        errors = sorted(_get_validator().iter_errors(payload), key=lambda e: e.path)
+        assert not errors, f"show ({label}) output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_health_output_matches_schema(mock_models, isolated_cache):
+    from mlxk2.operations.health import health_check_operation
+    validator = _get_validator()
+
+    data = health_check_operation()
+    errors = sorted(validator.iter_errors(data), key=lambda e: e.path)
+    assert not errors, f"health output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_rm_output_matches_schema(monkeypatch, mock_models, isolated_cache):
+    from mlxk2.operations.rm import rm_operation
+    validator = _get_validator()
+
+    # Delete an existing model in the isolated cache
+    name = "mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit"
+    res = rm_operation(name, force=True)
+    errors = sorted(validator.iter_errors(res), key=lambda e: e.path)
+    assert not errors, f"rm output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_pull_output_matches_schema_already_exists():
+    """Test pull response schema with static example data."""
+    validator = _get_validator()
+
+    # Static example of pull operation response for already-cached model
+    pull_response = {
+        "status": "success",
+        "command": "pull",
+        "error": None,
+        "data": {
+            "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+            "download_status": "already_exists",
+            "message": "Model mlx-community/Phi-3-mini-4k-instruct-4bit already exists in cache",
+            "expanded_name": None
+        }
+    }
+
+    errors = sorted(validator.iter_errors(pull_response), key=lambda e: e.path)
+    assert not errors, f"pull output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_version_output_matches_schema(monkeypatch, capsys):
+    from mlxk2 import cli
+    validator = _get_validator()
+
+    monkeypatch.setattr(sys, "argv", ["mlxk2", "--version", "--json"])
+    with pytest.raises(SystemExit) as exc:
+        cli.main()
+    assert exc.value.code == 0
+
+    out = capsys.readouterr().out.strip()
+    payload = json.loads(out)
+    errors = sorted(validator.iter_errors(payload), key=lambda e: e.path)
+    assert not errors, f"version output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_clone_output_matches_schema():
+    """Test clone response schema with static example data."""
+    validator = _get_validator()
+
+    # Static example of clone operation response
+    clone_response = {
+        "status": "success",
+        "command": "clone",
+        "error": None,
+        "data": {
+            "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+            "target_dir": "./workspace",
+            "message": "Cloned mlx-community/Phi-3-mini-4k-instruct-4bit to ./workspace",
+            "commit_hash": "a1b2c3d4e5f6789012345678901234567890abcd"
+        }
+    }
+
+    errors = sorted(validator.iter_errors(clone_response), key=lambda e: e.path)
+    assert not errors, f"clone output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_push_output_matches_schema():
+    """Test push response schema with static example data."""
+    validator = _get_validator()
+
+    # Static example of push operation response (matches schema requirements)
+    push_response = {
+        "status": "success",
+        "command": "push",
+        "error": None,
+        "data": {
+            "repo_id": "user/custom-model",
+            "branch": "main",
+            "repo_url": "https://huggingface.co/user/custom-model",
+            "uploaded_files_count": 5,
+            "experimental": False,
+            "disclaimer": "Push completed successfully"
+        }
+    }
+
+    errors = sorted(validator.iter_errors(push_response), key=lambda e: e.path)
+    assert not errors, f"push output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+@pytest.mark.spec
+def test_run_output_matches_schema():
+    """Test run response schema with static example data."""
+    validator = _get_validator()
+
+    # Static example of run operation response (non-streaming)
+    run_response = {
+        "status": "success",
+        "command": "run",
+        "error": None,
+        "data": {
+            "model": "mlx-community/Phi-3-mini-4k-instruct-4bit",
+            "prompt": "Hello world",
+            "response": "Hello! How can I help you today?",
+            "tokens_generated": 8,
+            "generation_time_s": 0.95
+        }
+    }
+
+    errors = sorted(validator.iter_errors(run_response), key=lambda e: e.path)
+    assert not errors, f"run output invalid: {errors[0].message} at {'/'.join(map(str, errors[0].path)) or '<root>'}"
+
+
+# NOTE: serve/server commands don't produce JSON output - they run as server processes
+# Only error cases would produce JSON, which are covered by general error handling
+
diff --git a/tests_2.0/spec/test_push_error_matches_schema.py b/tests_2.0/spec/test_push_error_matches_schema.py
new file mode 100644
index 0000000..8a23083
--- /dev/null
+++ b/tests_2.0/spec/test_push_error_matches_schema.py
@@ -0,0 +1,46 @@
+"""Validate push(error) output (missing HF_TOKEN) against the JSON schema.
+
+Offline test: no network; ensures error envelope conforms to schema.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+from pathlib import Path
+
+import pytest
+
+# Skip all tests if push is not enabled
+# Push tests now run by default (alpha features included in standard test suite)
+
+import pytest
+
+from mlxk2.operations.push import push_operation
+
+
+def _load_validator():
+    try:
+        from jsonschema import Draft7Validator  # type: ignore
+    except Exception:
+        pytest.skip("jsonschema not available", allow_module_level=True)
+    schema_path = Path("docs/json-api-schema.json")
+    schema = json.loads(schema_path.read_text(encoding="utf-8"))
+    return Draft7Validator(schema)
+
+
+def test_push_missing_token_matches_schema(tmp_path, monkeypatch):
+    validator = _load_validator()
+    # Ensure no token
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "README.md").write_text("x")
+
+    res = push_operation(str(ws), "user/repo", branch="main")
+    assert res["status"] == "error"
+    assert res["command"] == "push"
+    # Validate against schema (top-level error is globally defined)
+    errors = sorted(e.message for e in validator.iter_errors(res))
+    assert not errors, f"Schema validation errors for push error: {errors}"
+
diff --git a/tests_2.0/spec/test_push_output_matches_schema.py b/tests_2.0/spec/test_push_output_matches_schema.py
new file mode 100644
index 0000000..e74f417
--- /dev/null
+++ b/tests_2.0/spec/test_push_output_matches_schema.py
@@ -0,0 +1,85 @@
+"""Validate push(success) output against the JSON schema without network.
+
+We monkeypatch a fake `huggingface_hub` module into sys.modules so that
+`push_operation` can run to a success path offline.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+
+import pytest
+
+# Skip all tests if push is not enabled
+# Push tests now run by default (alpha features included in standard test suite)
+from pathlib import Path
+from types import SimpleNamespace
+
+import pytest
+
+from mlxk2.operations.push import push_operation
+
+
+def _load_validator():
+    try:
+        from jsonschema import Draft7Validator  # type: ignore
+    except Exception:
+        pytest.skip("jsonschema not available", allow_module_level=True)
+    schema_path = Path("docs/json-api-schema.json")
+    schema = json.loads(schema_path.read_text(encoding="utf-8"))
+    return Draft7Validator(schema)
+
+
+class _FakeHfApi:
+    def __init__(self, token: str | None = None) -> None:
+        self.token = token
+
+    def repo_info(self, repo_id: str, repo_type: str, revision: str):
+        # Pretend repo + branch exist
+        return {"id": repo_id, "type": repo_type, "rev": revision}
+
+    def create_repo(self, repo_id: str, repo_type: str, private: bool, exist_ok: bool):
+        return {"created": True}
+
+
+def _install_fake_hf_module(monkeypatch):
+    class _Errors(SimpleNamespace):
+        class HfHubHTTPError(Exception):
+            pass
+
+        class RepositoryNotFoundError(Exception):
+            pass
+
+        class RevisionNotFoundError(Exception):
+            pass
+
+    def upload_folder(**kwargs):
+        # Emulate successful upload return with commit_id attribute
+        return SimpleNamespace(commit_id="abcdef1234567890abcdef1234567890abcdef12")
+
+    fake = SimpleNamespace(HfApi=_FakeHfApi, upload_folder=upload_folder, errors=_Errors)
+    # Use monkeypatch to ensure automatic restoration after each test
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake)
+    monkeypatch.setitem(sys.modules, "huggingface_hub.errors", _Errors)
+
+
+def test_push_success_shape_matches_schema(tmp_path, monkeypatch):
+    validator = _load_validator()
+    # Prepare workspace
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "README.md").write_text("ok")
+    (ws / ".hfignore").write_text(".DS_Store\n__pycache__/\n")
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+
+    # Fake HF module
+    _install_fake_hf_module(monkeypatch)
+
+    res = push_operation(str(ws), "user/repo", create=False, private=False, branch="main", commit_message="t")
+    assert res["status"] == "success"
+    assert res["command"] == "push"
+    # Validate against schema
+    errors = sorted(e.message for e in validator.iter_errors(res))
+    assert not errors, f"Schema validation errors for push success: {errors}"
diff --git a/tests_2.0/spec/test_spec_doc_examples_validate.py b/tests_2.0/spec/test_spec_doc_examples_validate.py
new file mode 100644
index 0000000..5a4d9a2
--- /dev/null
+++ b/tests_2.0/spec/test_spec_doc_examples_validate.py
@@ -0,0 +1,92 @@
+"""Validate JSON examples in docs/json-api-specification.md against the schema.
+
+This ensures the Spec document examples stay in sync with the current schema.
+If jsonschema is not installed locally, these tests are skipped.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+import json
+import re
+import pytest
+
+
+def _load_schema():
+    try:
+        import jsonschema  # noqa: F401
+    except Exception:
+        pytest.skip("jsonschema not installed; skipping schema validation tests", allow_module_level=True)
+
+    schema_path = Path("docs/json-api-schema.json")
+    assert schema_path.exists(), "Schema file docs/json-api-schema.json missing"
+    return json.loads(schema_path.read_text(encoding="utf-8"))
+
+
+def _iter_json_blocks(md_text: str):
+    # Capture fenced code blocks marked as json
+    # ```json\n ... \n```
+    pattern = re.compile(r"```json\n(.*?)\n```", re.DOTALL)
+    for m in pattern.finditer(md_text):
+        block = m.group(1).strip()
+        if not block:
+            continue
+        yield block
+
+
+@pytest.mark.spec
+def test_spec_document_examples_validate_against_schema():
+    schema = _load_schema()
+    try:
+        from jsonschema import Draft7Validator
+    except Exception:
+        pytest.skip("jsonschema not available", allow_module_level=True)
+
+    validator = Draft7Validator(schema)
+    md_path = Path("docs/json-api-specification.md")
+    assert md_path.exists(), "Spec document missing"
+    text = md_path.read_text(encoding="utf-8")
+
+    had_errors = []
+    validated = 0
+    skipped = 0
+    for idx, block in enumerate(_iter_json_blocks(text), start=1):
+        # Skip illustrative/pseudo examples (contain non-JSON constructs)
+        if "/*" in block or "|" in block or "... omitted" in block:
+            skipped += 1
+            continue
+
+        try:
+            data = json.loads(block)
+        except Exception:
+            # Treat unparsable fenced blocks as illustrative and skip
+            skipped += 1
+            continue
+
+        errors = sorted(validator.iter_errors(data), key=lambda e: e.path)
+        validated += 1
+        if errors:
+            first = errors[0]
+            path = "/".join(map(str, first.path)) or "<root>"
+            had_errors.append(f"Example #{idx} invalid at {path}: {first.message}")
+
+    # Ensure we validated at least one real example
+    assert validated > 0, "No valid JSON examples found to validate in the spec document"
+
+    if had_errors:
+        import os
+        verbose = os.environ.get("MLXK2_SPEC_VALIDATION_VERBOSE") == "1"
+        if verbose:
+            joined = "\n".join(had_errors)
+        else:
+            MAX_SHOW = 5
+            shown = had_errors[:MAX_SHOW]
+            joined = "\n".join(shown)
+            if len(had_errors) > MAX_SHOW:
+                joined += f"\n... and {len(had_errors) - MAX_SHOW} more. Set MLXK2_SPEC_VALIDATION_VERBOSE=1 to see all."
+
+        pytest.fail(
+            "Spec examples do not match the current schema.\n"
+            + joined
+            + "\nUpdate docs examples or docs/json-api-schema.json accordingly."
+        )
diff --git a/tests_2.0/spec/test_spec_version_sync.py b/tests_2.0/spec/test_spec_version_sync.py
new file mode 100644
index 0000000..5968cc9
--- /dev/null
+++ b/tests_2.0/spec/test_spec_version_sync.py
@@ -0,0 +1,26 @@
+"""Ensures the code’s spec version matches docs/json-api-specification.md.
+
+This enforces discipline: Spec, code, and tests must evolve together.
+"""
+
+from pathlib import Path
+import re
+import pytest
+
+from mlxk2.spec import JSON_API_SPEC_VERSION
+
+
+@pytest.mark.spec
+def test_spec_version_matches_docs():
+    docs_path = Path("docs/json-api-specification.md")
+    assert docs_path.exists(), "Spec document missing"
+    content = docs_path.read_text(encoding="utf-8")
+
+    # Extract the version from the first lines like: **Specification Version:** 0.1.2
+    m = re.search(r"\*\*Specification Version:\*\*\s*([0-9]+\.[0-9]+\.[0-9]+)", content)
+    assert m, "Could not parse spec version from docs"
+    docs_version = m.group(1)
+
+    assert (
+        docs_version == JSON_API_SPEC_VERSION
+    ), f"Spec version mismatch: docs={docs_version} code={JSON_API_SPEC_VERSION}"
diff --git a/tests_2.0/stubs/mlx/core.py b/tests_2.0/stubs/mlx/core.py
new file mode 100644
index 0000000..97dcca8
--- /dev/null
+++ b/tests_2.0/stubs/mlx/core.py
@@ -0,0 +1,39 @@
+"""Lightweight test stub for mlx.core to avoid native deps in unit tests.
+
+Only implements the minimal API surface used by the 2.0 unit tests and runner:
+- zeros(n)
+- array(x)
+- clear_cache()
+- get_active_memory()
+"""
+
+class _Array:
+    def __init__(self, data):
+        self._data = data
+
+    def item(self):
+        # mimic behavior of mx.array([...]).item() -> first element
+        if isinstance(self._data, (list, tuple)):
+            return self._data[0]
+        return self._data
+
+
+def zeros(n):
+    # Return a simple Python list as a stand-in
+    return [0] * (n if isinstance(n, int) else 1)
+
+
+def array(x):
+    # Wrap in simple array-like with .item()
+    return _Array(x if isinstance(x, (list, tuple)) else [x])
+
+
+def clear_cache():
+    # No-op for tests
+    return None
+
+
+def get_active_memory():
+    # Return a tiny deterministic value (bytes)
+    return 0
+
diff --git a/tests_2.0/stubs/mlx_lm/__init__.py b/tests_2.0/stubs/mlx_lm/__init__.py
new file mode 100644
index 0000000..38ea0d0
--- /dev/null
+++ b/tests_2.0/stubs/mlx_lm/__init__.py
@@ -0,0 +1,4 @@
+def load(model_path, adapter_path=None):
+    # Placeholder; tests patch mlxk2.core.runner.load directly
+    return object(), object()
+
diff --git a/tests_2.0/stubs/mlx_lm/generate.py b/tests_2.0/stubs/mlx_lm/generate.py
new file mode 100644
index 0000000..1c5d562
--- /dev/null
+++ b/tests_2.0/stubs/mlx_lm/generate.py
@@ -0,0 +1,5 @@
+def generate_step(prompt, model, max_tokens, sampler=None, logits_processors=None):
+    # Minimal stub generator: yield nothing by default
+    if False:
+        yield (0, None)
+
diff --git a/tests_2.0/stubs/mlx_lm/sample_utils.py b/tests_2.0/stubs/mlx_lm/sample_utils.py
new file mode 100644
index 0000000..daa4444
--- /dev/null
+++ b/tests_2.0/stubs/mlx_lm/sample_utils.py
@@ -0,0 +1,9 @@
+def make_repetition_penalty(*args, **kwargs):
+    # Return a simple callable or marker; runner only checks presence
+    return lambda *a, **k: None
+
+
+def make_sampler(*args, **kwargs):
+    # Return a simple callable representing sampler
+    return lambda *a, **k: None
+
diff --git a/tests_2.0/test_adr004_error_logging.py b/tests_2.0/test_adr004_error_logging.py
new file mode 100644
index 0000000..4dc94f1
--- /dev/null
+++ b/tests_2.0/test_adr004_error_logging.py
@@ -0,0 +1,402 @@
+"""
+Tests for ADR-004: Enhanced Error Handling & Logging.
+
+Covers:
+- Error envelope structure and serialization
+- Error type to HTTP status mapping
+- Request ID propagation
+- Log redaction (HF_TOKEN, paths)
+- Structured logging (plain and JSON modes)
+"""
+
+import json
+import os
+import re
+from pathlib import Path
+
+import pytest
+
+from mlxk2.errors import (
+    ErrorType,
+    MLXKError,
+    error_envelope,
+    success_envelope,
+    model_not_found_error,
+    validation_error,
+    server_shutdown_error,
+    internal_error,
+    access_denied_error,
+    ERROR_TYPE_TO_HTTP_STATUS,
+)
+from mlxk2.logging import MLXKLogger
+from mlxk2.context import (
+    generate_request_id,
+    set_request_id,
+    get_request_id,
+    clear_request_id,
+    RequestContext,
+)
+
+
+# ============================================================================
+# Error Envelope Tests
+# ============================================================================
+
+def test_mlxk_error_structure():
+    """Error should serialize to correct structure."""
+    error = MLXKError(
+        type=ErrorType.MODEL_NOT_FOUND,
+        message="Model not found",
+        detail="Additional details",
+        retryable=False
+    )
+
+    error_dict = error.to_dict()
+    assert error_dict["type"] == "model_not_found"
+    assert error_dict["message"] == "Model not found"
+    assert error_dict["detail"] == "Additional details"
+    assert error_dict["retryable"] is False
+
+
+def test_mlxk_error_minimal():
+    """Error should work with minimal fields."""
+    error = MLXKError(
+        type=ErrorType.INTERNAL_ERROR,
+        message="Something went wrong"
+    )
+
+    error_dict = error.to_dict()
+    assert error_dict["type"] == "internal_error"
+    assert error_dict["message"] == "Something went wrong"
+    assert "detail" not in error_dict
+    assert "retryable" not in error_dict
+
+
+def test_error_envelope_structure():
+    """Error envelope should have correct structure."""
+    error = MLXKError(
+        type=ErrorType.VALIDATION_ERROR,
+        message="Invalid request"
+    )
+
+    envelope = error_envelope(error, request_id="test-request-id")
+
+    assert envelope["status"] == "error"
+    assert envelope["error"]["type"] == "validation_error"
+    assert envelope["error"]["message"] == "Invalid request"
+    assert envelope["request_id"] == "test-request-id"
+
+
+def test_success_envelope_structure():
+    """Success envelope should have correct structure."""
+    data = {"models": ["model1", "model2"]}
+    envelope = success_envelope(data, request_id="test-request-id")
+
+    assert envelope["status"] == "success"
+    assert envelope["data"] == data
+    assert envelope["request_id"] == "test-request-id"
+
+
+def test_error_envelope_with_data():
+    """Error envelope can include additional data field."""
+    error = MLXKError(
+        type=ErrorType.AMBIGUOUS_MATCH,
+        message="Multiple matches found"
+    )
+    data = {"candidates": ["model1", "model2"]}
+
+    envelope = error_envelope(error, request_id="req-123", data=data)
+
+    assert envelope["status"] == "error"
+    assert envelope["data"] == data
+    assert envelope["request_id"] == "req-123"
+
+
+# ============================================================================
+# HTTP Status Mapping Tests (ADR-004 Specification)
+# ============================================================================
+
+def test_error_type_to_http_status_mapping():
+    """All error types should map to correct HTTP status codes."""
+    expected_mappings = {
+        ErrorType.ACCESS_DENIED: 403,
+        ErrorType.MODEL_NOT_FOUND: 404,
+        ErrorType.AMBIGUOUS_MATCH: 400,
+        ErrorType.DOWNLOAD_FAILED: 503,
+        ErrorType.VALIDATION_ERROR: 400,
+        ErrorType.PUSH_OPERATION_FAILED: 500,
+        ErrorType.SERVER_SHUTDOWN: 503,
+        ErrorType.INTERNAL_ERROR: 500,
+    }
+
+    for error_type, expected_status in expected_mappings.items():
+        error = MLXKError(type=error_type, message="test")
+        assert error.to_http_status() == expected_status
+        assert ERROR_TYPE_TO_HTTP_STATUS[error_type] == expected_status
+
+
+def test_common_error_constructors():
+    """Common error constructors should create correct error types."""
+    # model_not_found
+    error = model_not_found_error("test-model")
+    assert error.type == ErrorType.MODEL_NOT_FOUND
+    assert "test-model" in error.message
+    assert error.retryable is False
+
+    # validation_error
+    error = validation_error("Invalid input")
+    assert error.type == ErrorType.VALIDATION_ERROR
+    assert error.message == "Invalid input"
+    assert error.retryable is False
+
+    # server_shutdown
+    error = server_shutdown_error()
+    assert error.type == ErrorType.SERVER_SHUTDOWN
+    assert error.retryable is True
+
+    # internal_error
+    error = internal_error("Unexpected error", detail={"stack": "..."})
+    assert error.type == ErrorType.INTERNAL_ERROR
+    assert error.detail == {"stack": "..."}
+    assert error.retryable is None  # Unknown
+
+    # access_denied
+    error = access_denied_error("No permission")
+    assert error.type == ErrorType.ACCESS_DENIED
+    assert error.retryable is False
+
+
+# ============================================================================
+# Request ID Tests
+# ============================================================================
+
+def test_generate_request_id():
+    """generate_request_id should return valid UUID4."""
+    request_id = generate_request_id()
+    # UUID4 format: 8-4-4-4-12 hex characters
+    uuid_pattern = re.compile(
+        r'^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$',
+        re.IGNORECASE
+    )
+    assert uuid_pattern.match(request_id), f"Invalid UUID4: {request_id}"
+
+
+def test_request_id_context_propagation():
+    """Request ID should propagate through context."""
+    # Initially None
+    assert get_request_id() is None
+
+    # Set and retrieve
+    set_request_id("test-id-123")
+    assert get_request_id() == "test-id-123"
+
+    # Clear
+    clear_request_id()
+    assert get_request_id() is None
+
+
+def test_request_context_manager():
+    """RequestContext should manage request_id lifecycle."""
+    # Initially None
+    assert get_request_id() is None
+
+    with RequestContext() as request_id:
+        # Inside context, request_id is set
+        assert request_id is not None
+        assert get_request_id() == request_id
+
+    # After context, request_id is cleared
+    assert get_request_id() is None
+
+
+def test_request_context_nesting():
+    """RequestContext should handle nesting correctly."""
+    with RequestContext() as outer_id:
+        assert get_request_id() == outer_id
+
+        with RequestContext() as inner_id:
+            assert get_request_id() == inner_id
+            assert inner_id != outer_id
+
+        # After inner context, outer_id is restored
+        assert get_request_id() == outer_id
+
+    # After outer context, None is restored
+    assert get_request_id() is None
+
+
+def test_request_context_with_explicit_id():
+    """RequestContext should accept explicit request_id."""
+    explicit_id = "my-custom-id"
+
+    with RequestContext(request_id=explicit_id) as request_id:
+        assert request_id == explicit_id
+        assert get_request_id() == explicit_id
+
+
+# ============================================================================
+# Log Redaction Tests (ADR-004 Security Requirement)
+# ============================================================================
+
+def test_logger_redacts_hf_token():
+    """Logger should redact HF tokens from messages."""
+    logger = MLXKLogger("test")
+
+    # Test token redaction
+    message = "Using token hf_AbCdEfGhIjKlMnOpQrStUvWxYz123456 for auth"
+    redacted = logger._redact(message)
+
+    assert "hf_AbCdEfGhIjKlMnOpQrStUvWxYz123456" not in redacted
+    assert "[REDACTED_TOKEN]" in redacted
+
+
+def test_logger_redacts_home_directory():
+    """Logger should redact user home directory paths."""
+    logger = MLXKLogger("test")
+
+    home_dir = str(Path.home())
+    message = f"Loading model from {home_dir}/models/test"
+    redacted = logger._redact(message)
+
+    # Should replace home directory with ~
+    assert home_dir not in redacted
+    assert "~/models/test" in redacted
+
+
+def test_logger_redacts_multiple_tokens():
+    """Logger should redact multiple tokens in same message."""
+    logger = MLXKLogger("test")
+
+    message = "Token1: hf_TokenAAAAAAAAAAAAAAAAAAAAAAAAAAAA, Token2: hf_TokenBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
+    redacted = logger._redact(message)
+
+    assert "hf_TokenAAAAAAAAAAAAAAAAAAAAAAAAAAAA" not in redacted
+    assert "hf_TokenBBBBBBBBBBBBBBBBBBBBBBBBBBBB" not in redacted
+    assert redacted.count("[REDACTED_TOKEN]") == 2
+
+
+# ============================================================================
+# Structured Logging Tests
+# ============================================================================
+
+def test_logger_plain_text_mode(capsys):
+    """Logger should output plain text by default."""
+    # Ensure JSON mode is off
+    os.environ.pop("MLXK2_LOG_JSON", None)
+
+    logger = MLXKLogger("test-plain")
+    logger.info("Test message")
+
+    captured = capsys.readouterr()
+    assert "Test message" in captured.err
+    # Should NOT be JSON
+    assert not captured.err.strip().startswith("{")
+
+
+def test_logger_json_mode(capsys):
+    """Logger should output JSON when MLXK2_LOG_JSON=1."""
+    # Enable JSON mode
+    os.environ["MLXK2_LOG_JSON"] = "1"
+
+    try:
+        logger = MLXKLogger("test-json")
+        logger.info("Test message", request_id="req-123", model="test-model")
+
+        captured = capsys.readouterr()
+        log_line = captured.err.strip()
+
+        # Should be valid JSON
+        log_entry = json.loads(log_line)
+        assert log_entry["msg"] == "Test message"
+        assert log_entry["level"] == "INFO"
+        assert log_entry["request_id"] == "req-123"
+        assert log_entry["model"] == "test-model"
+        assert "ts" in log_entry  # Timestamp should be present
+
+    finally:
+        # Cleanup
+        os.environ.pop("MLXK2_LOG_JSON", None)
+
+
+def test_logger_levels(capsys):
+    """Logger should support different log levels."""
+    logger = MLXKLogger("test-levels")
+
+    logger.info("Info message")
+    logger.warning("Warning message")
+    logger.error("Error message", error_key="test_error")
+
+    captured = capsys.readouterr()
+    assert "Info message" in captured.err
+    assert "WARN" in captured.err
+    assert "Warning message" in captured.err
+    assert "ERROR" in captured.err
+    assert "Error message" in captured.err
+
+
+def test_logger_debug_requires_verbose(capsys):
+    """DEBUG logs should only appear when verbose=True."""
+    logger = MLXKLogger("test-debug")
+
+    # Verbose off (default)
+    logger.debug("Debug message 1")
+    captured = capsys.readouterr()
+    assert "Debug message 1" not in captured.err
+
+    # Verbose on
+    logger.set_verbose(True)
+    logger.debug("Debug message 2")
+    captured = capsys.readouterr()
+    assert "Debug message 2" in captured.err
+
+
+def test_logger_error_rate_limiting(capsys):
+    """Logger should rate-limit duplicate errors (max 1/5s)."""
+    logger = MLXKLogger("test-ratelimit")
+
+    # First error should be logged
+    logger.error("Repeated error", error_key="duplicate_error")
+    captured = capsys.readouterr()
+    assert "Repeated error" in captured.err
+
+    # Immediate duplicate should be suppressed
+    logger.error("Repeated error", error_key="duplicate_error")
+    captured = capsys.readouterr()
+    assert captured.err == ""  # Suppressed
+
+    # Different error key should be logged
+    logger.error("Different error", error_key="different_error")
+    captured = capsys.readouterr()
+    assert "Different error" in captured.err
+
+
+# ============================================================================
+# Integration Tests (Error Envelope + Request ID + Logging)
+# ============================================================================
+
+def test_error_envelope_includes_request_id():
+    """Error envelope should include request_id when available."""
+    with RequestContext() as request_id:
+        error = model_not_found_error("test-model")
+        envelope = error_envelope(error, request_id=get_request_id())
+
+        assert envelope["request_id"] == request_id
+
+
+def test_logger_uses_request_id(capsys):
+    """Logger should include request_id in JSON logs."""
+    os.environ["MLXK2_LOG_JSON"] = "1"
+
+    try:
+        logger = MLXKLogger("test-request-id")
+
+        with RequestContext() as request_id:
+            logger.info("Test message", request_id=request_id)
+
+        captured = capsys.readouterr()
+        log_entry = json.loads(captured.err.strip())
+
+        assert log_entry["request_id"] == request_id
+
+    finally:
+        os.environ.pop("MLXK2_LOG_JSON", None)
diff --git a/tests_2.0/test_cli_log_json_flag.py b/tests_2.0/test_cli_log_json_flag.py
new file mode 100644
index 0000000..d136f4e
--- /dev/null
+++ b/tests_2.0/test_cli_log_json_flag.py
@@ -0,0 +1,57 @@
+"""
+Test for --log-json CLI flag (ADR-004 improvement).
+
+Minimal test: Verify flag sets MLXK2_LOG_JSON environment variable.
+"""
+
+import os
+import sys
+from unittest.mock import patch, MagicMock
+
+
+def test_serve_log_json_flag_sets_env_var():
+    """--log-json flag should set MLXK2_LOG_JSON=1 environment variable."""
+    # Mock start_server to prevent actual server start
+    with patch('mlxk2.operations.serve.start_server') as mock_start_server:
+        # Simulate CLI invocation: mlxk2 serve --log-json
+        test_args = ['mlxk2', 'serve', '--log-json']
+
+        with patch.object(sys, 'argv', test_args):
+            # Clear MLXK2_LOG_JSON before test
+            os.environ.pop('MLXK2_LOG_JSON', None)
+
+            # Import and run CLI
+            from mlxk2.cli import main
+
+            try:
+                main()
+            except SystemExit:
+                pass  # Ignore exit (server would run indefinitely)
+
+            # Verify environment variable was set
+            assert os.environ.get('MLXK2_LOG_JSON') == '1', \
+                "MLXK2_LOG_JSON should be set to '1' when --log-json flag is present"
+
+            # Verify start_server was called
+            assert mock_start_server.called, "start_server should have been called"
+
+
+def test_serve_without_log_json_flag():
+    """Without --log-json, MLXK2_LOG_JSON should remain unset."""
+    with patch('mlxk2.operations.serve.start_server') as mock_start_server:
+        test_args = ['mlxk2', 'serve']
+
+        with patch.object(sys, 'argv', test_args):
+            # Clear MLXK2_LOG_JSON before test
+            os.environ.pop('MLXK2_LOG_JSON', None)
+
+            from mlxk2.cli import main
+
+            try:
+                main()
+            except SystemExit:
+                pass
+
+            # Verify environment variable was NOT set
+            assert os.environ.get('MLXK2_LOG_JSON') != '1', \
+                "MLXK2_LOG_JSON should not be set without --log-json flag"
diff --git a/tests_2.0/test_cli_push_args.py b/tests_2.0/test_cli_push_args.py
new file mode 100644
index 0000000..b45e377
--- /dev/null
+++ b/tests_2.0/test_cli_push_args.py
@@ -0,0 +1,118 @@
+"""CLI-arg tests for experimental push (offline)."""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+from pathlib import Path
+from types import SimpleNamespace
+
+import pytest
+
+# Skip all tests if push is not enabled
+# Push tests now run by default (alpha features included in standard test suite)
+
+
+def _run_cli(argv: list[str], capsys):
+    from mlxk2.cli import main as cli_main
+
+    # Replace sys.argv and run
+    old_argv = sys.argv[:]
+    sys.argv = argv[:]
+    try:
+        with pytest.raises(SystemExit):
+            cli_main()
+    finally:
+        sys.argv = old_argv
+    out = capsys.readouterr().out
+    return out
+
+
+def test_cli_push_missing_args_json_error(capsys, monkeypatch):
+    # Missing required positional args but with --json should emit JSON error
+    monkeypatch.setenv("MLXK2_ENABLE_ALPHA_FEATURES", "1")
+    out = _run_cli(["mlxk2", "push", "--private", "--json"], capsys)
+    data = json.loads(out)
+    assert data["status"] == "error"
+    assert data["command"] is None
+    assert isinstance(data["error"], dict)
+
+
+def test_cli_push_workspace_missing_json_error(tmp_path, monkeypatch, capsys):
+    # Provide missing workspace; ensure JSON error and specific error type
+    monkeypatch.setenv("MLXK2_ENABLE_ALPHA_FEATURES", "1")
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    missing = str(tmp_path / "nope")
+    out = _run_cli(["mlxk2", "push", "--private", missing, "user/repo", "--json"], capsys)
+    data = json.loads(out)
+    assert data["status"] == "error"
+    assert data["command"] == "push"
+    assert data["error"]["type"] == "workspace_not_found"
+
+
+def _install_fake_hf(monkeypatch, mode: str):
+    class _Errors:
+        class HfHubHTTPError(Exception):
+            pass
+
+        class RepositoryNotFoundError(Exception):
+            pass
+
+        class RevisionNotFoundError(Exception):
+            pass
+
+    class _Api:
+        def __init__(self, token=None):
+            self.token = token
+
+        def repo_info(self, repo_id: str, repo_type: str, revision: str):
+            return {"id": repo_id, "type": repo_type, "rev": revision}
+
+    def upload_folder(**kwargs):  # type: ignore
+        if mode == "no_changes":
+            # Return an object without commit_id
+            return SimpleNamespace()
+        else:
+            return SimpleNamespace(commit_id="abcdef1234567890abcdef1234567890abcdef12")
+
+    fake = SimpleNamespace(HfApi=_Api, upload_folder=upload_folder, errors=_Errors)
+    # Use monkeypatch to ensure automatic restoration after each test
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake)
+    monkeypatch.setitem(sys.modules, "huggingface_hub.errors", _Errors)
+
+
+def test_cli_push_no_changes_json_output(tmp_path, monkeypatch, capsys):
+    # Setup workspace
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "x.txt").write_text("x")
+    monkeypatch.setenv("MLXK2_ENABLE_ALPHA_FEATURES", "1")
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+
+    _install_fake_hf(monkeypatch, mode="no_changes")
+
+    out = _run_cli(["mlxk2", "push", "--private", str(ws), "user/repo", "--json"], capsys)
+    data = json.loads(out)
+    assert data["status"] == "success"
+    assert data["command"] == "push"
+    assert data["data"]["no_changes"] is True
+    assert data["data"]["uploaded_files_count"] == 0
+
+
+def test_cli_push_with_changes_json_output(tmp_path, monkeypatch, capsys):
+    # Setup workspace
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "x.txt").write_text("x")
+    monkeypatch.setenv("MLXK2_ENABLE_ALPHA_FEATURES", "1")
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+
+    _install_fake_hf(monkeypatch, mode="with_changes")
+
+    out = _run_cli(["mlxk2", "push", "--private", str(ws), "user/repo", "--json"], capsys)
+    data = json.loads(out)
+    assert data["status"] == "success"
+    assert data["command"] == "push"
+    assert data["data"]["no_changes"] is False
+    assert isinstance(data["data"]["commit_sha"], str)
diff --git a/tests_2.0/test_clone_operation.py b/tests_2.0/test_clone_operation.py
new file mode 100644
index 0000000..e3d5b29
--- /dev/null
+++ b/tests_2.0/test_clone_operation.py
@@ -0,0 +1,1028 @@
+"""Tests for clone operation following ADR-007 Phase 1: Same-Volume APFS strategy.
+
+Tests for the new isolated temp cache + APFS CoW implementation that replaces
+the deprecated ADR-006 approach.
+
+Key test areas:
+- APFS filesystem validation
+- Temp cache creation with sentinel safety
+- Volume-aware placement
+- APFS copy-on-write cloning
+- Temp cache cleanup safety
+- JSON API 0.1.4 compliance
+"""
+
+import json
+import os
+import shutil
+import subprocess
+import tempfile
+import time
+from pathlib import Path
+from unittest.mock import patch, MagicMock, mock_open
+import pytest
+
+from mlxk2.operations.clone import (
+    clone_operation,
+    _validate_apfs_filesystem,
+    _is_apfs_filesystem,
+    _create_temp_cache_same_volume,
+    _get_volume_mount_point,
+    _resolve_latest_snapshot,
+    _apfs_clone_directory,
+    _cleanup_temp_cache_safe,
+    FilesystemError
+)
+
+
+class TestAPFSFilesystemValidation:
+    """Test suite for APFS filesystem requirement validation."""
+
+    def test_validate_apfs_filesystem_success(self):
+        """Test APFS validation passes on APFS filesystem."""
+        test_path = Path("/tmp/test")
+
+        with patch('mlxk2.operations.clone._is_apfs_filesystem', return_value=True):
+            # Should not raise exception
+            _validate_apfs_filesystem(test_path)
+
+    def test_validate_apfs_filesystem_failure(self):
+        """Test APFS validation fails on non-APFS filesystem."""
+        test_path = Path("/tmp/test")
+
+        with patch('mlxk2.operations.clone._is_apfs_filesystem', return_value=False):
+            with pytest.raises(FilesystemError) as exc_info:
+                _validate_apfs_filesystem(test_path)
+
+            assert "APFS required for clone operations" in str(exc_info.value)
+            assert str(test_path) in str(exc_info.value)
+
+    def test_is_apfs_filesystem_true(self):
+        """Test APFS detection returns True - real test on Phase 1 APFS system."""
+        # Test current working directory - should be APFS on Phase 1 developer system
+        result = _is_apfs_filesystem(Path.cwd())
+        assert result is True
+
+        # Test HF_HOME if set - should also be APFS on Phase 1 system
+        hf_home = os.environ.get('HF_HOME')
+        if hf_home:
+            result = _is_apfs_filesystem(Path(hf_home))
+            assert result is True
+
+    def test_is_apfs_filesystem_false(self):
+        """Test APFS detection returns False for non-APFS."""
+        test_path = Path("/mnt/nfs")
+
+        with patch('subprocess.run') as mock_run:
+            mock_result = MagicMock()
+            mock_result.stdout = '/dev/nfs on /mnt/nfs (nfs, local, nodev, nosuid)\n'
+            mock_run.return_value = mock_result
+
+            result = _is_apfs_filesystem(test_path)
+
+            assert result is False
+
+    def test_is_apfs_filesystem_error_fallback(self):
+        """Test APFS detection safely falls back on subprocess error."""
+        test_path = Path("/invalid/path")
+
+        with patch('subprocess.run') as mock_run:
+            mock_run.side_effect = subprocess.CalledProcessError(1, 'stat')
+
+            result = _is_apfs_filesystem(test_path)
+
+            assert result is False  # Safe fallback
+
+
+class TestVolumeDetection:
+    """Test suite for volume mount point detection."""
+
+    def test_get_volume_mount_point_root(self):
+        """Test volume detection at filesystem root."""
+        test_path = Path("/")
+
+        result = _get_volume_mount_point(test_path)
+
+        assert result == Path("/")
+
+    def test_get_volume_mount_point_same_volume(self):
+        """Test volume detection with same device."""
+        test_path = Path("/Users/test/workspace")
+
+        with patch.object(Path, 'stat') as mock_stat:
+            # All paths have same device ID (same volume)
+            mock_stat.return_value.st_dev = 12345
+
+            result = _get_volume_mount_point(test_path)
+
+            # Should traverse to root
+            assert result == Path("/")
+
+    def test_get_volume_mount_point_mount_boundary(self):
+        """Test volume detection at mount boundary."""
+        test_path = Path("/Volumes/External/workspace")
+
+        # Create path-specific mocks
+        external_stat = MagicMock()
+        external_stat.st_dev = 67890
+        volumes_stat = MagicMock()
+        volumes_stat.st_dev = 12345
+        workspace_stat = MagicMock()
+        workspace_stat.st_dev = 67890
+
+        def mock_stat_for_path(self):
+            if str(self) == "/Volumes/External":
+                return external_stat
+            elif str(self) == "/Volumes":
+                return volumes_stat
+            else:  # workspace or other paths
+                return workspace_stat
+
+        with patch.object(Path, 'stat', mock_stat_for_path):
+            result = _get_volume_mount_point(test_path)
+
+            assert result == Path("/Volumes/External")
+
+    def test_get_volume_mount_point_permission_error(self):
+        """Test volume detection handles permission errors."""
+        test_path = Path("/restricted/path")
+
+        with patch.object(Path, 'stat') as mock_stat:
+            mock_stat.side_effect = PermissionError("Permission denied")
+
+            result = _get_volume_mount_point(test_path)
+
+            # Should fall back to filesystem root
+            assert result == Path("/")
+
+
+class TestRealFilesystemValidation:
+    """Test suite for real filesystem validation without mocks."""
+
+    def test_real_apfs_detection_system_volume(self):
+        """Test APFS detection on real system volume (should be APFS on modern macOS)."""
+        # Test current working directory (should be on system volume)
+        current_path = Path.cwd()
+
+        # This should work without exceptions
+        is_apfs = _is_apfs_filesystem(current_path)
+
+        # On modern macOS, system volume should be APFS
+        # But we don't assert True to avoid false failures on older systems
+        assert isinstance(is_apfs, bool)
+
+    def test_real_apfs_detection_user_home(self):
+        """Test APFS detection on user home directory."""
+        home_path = Path.home()
+
+        # This should work without exceptions
+        is_apfs = _is_apfs_filesystem(home_path)
+        assert isinstance(is_apfs, bool)
+
+    def test_real_volume_mount_point_detection_system(self):
+        """Test real volume mount point detection for system paths."""
+        # Test various system paths
+        test_paths = [
+            Path.cwd(),
+            Path.home(),
+            Path("/Users"),
+            Path("/tmp"),
+        ]
+
+        for path in test_paths:
+            if path.exists():
+                mount_point = _get_volume_mount_point(path)
+
+                # Mount point should be a valid path
+                assert isinstance(mount_point, Path)
+                assert mount_point.exists()
+                assert mount_point.is_dir()
+                # Mount point should be an ancestor of the original path
+                assert path.resolve().is_relative_to(mount_point) or path.resolve() == mount_point
+
+    def test_real_same_volume_detection_consistency(self):
+        """Test that same-volume detection works consistently for Phase 1."""
+        # Test current working directory
+        current_path = Path.cwd()
+        current_mount = _get_volume_mount_point(current_path)
+
+        # Test a subdirectory in the same location
+        if current_path.is_dir():
+            subdir_mount = _get_volume_mount_point(current_path / "subdir_test_path")
+
+            # Should have same mount point (same volume)
+            assert current_mount == subdir_mount
+            print(f"Volume consistency test - Mount: {current_mount}")
+
+    def test_real_apfs_validation_phase1_requirement(self):
+        """Test APFS validation for Phase 1 requirement on current workspace."""
+        # Phase 1: Only test current working directory (realistic workspace)
+        current_path = Path.cwd()
+
+        # APFS detection should work without errors
+        is_apfs = _is_apfs_filesystem(current_path)
+        assert isinstance(is_apfs, bool)
+
+        if is_apfs:
+            # If APFS, validation should pass
+            try:
+                _validate_apfs_filesystem(current_path)
+                print(f"✅ APFS validation passed for: {current_path}")
+            except FilesystemError:
+                pytest.fail("APFS validation failed on APFS filesystem")
+        else:
+            # If not APFS, validation should fail (Phase 1 requirement)
+            with pytest.raises(FilesystemError) as exc_info:
+                _validate_apfs_filesystem(current_path)
+            assert "APFS required" in str(exc_info.value)
+            print(f"⚠️ Non-APFS detected: {current_path} - Phase 1 will reject")
+
+
+class TestTempCacheCreation:
+    """Test suite for temporary cache creation with sentinel safety."""
+
+    def test_create_temp_cache_same_volume(self, tmp_path):
+        """Test temp cache creation on same volume."""
+        target_workspace = tmp_path / "workspace"
+
+        with patch('mlxk2.operations.clone._get_volume_mount_point') as mock_volume:
+            mock_volume.return_value = tmp_path
+
+            temp_cache = _create_temp_cache_same_volume(target_workspace)
+
+            # Verify temp cache is on same volume
+            assert temp_cache.parent == tmp_path
+            assert temp_cache.exists()
+            assert temp_cache.is_dir()
+
+            # Verify sentinel file exists
+            sentinel = temp_cache / ".mlxk2_temp_cache_sentinel"
+            assert sentinel.exists()
+            assert "mlxk2_temp_cache_created_" in sentinel.read_text()
+
+            # Cleanup
+            shutil.rmtree(temp_cache)
+
+    def test_create_temp_cache_unique_names(self, tmp_path):
+        """Test temp cache gets unique names."""
+        target_workspace = tmp_path / "workspace"
+
+        with patch('mlxk2.operations.clone._get_volume_mount_point') as mock_volume:
+            mock_volume.return_value = tmp_path
+
+            cache1 = _create_temp_cache_same_volume(target_workspace)
+            cache2 = _create_temp_cache_same_volume(target_workspace)
+
+            assert cache1 != cache2
+            assert cache1.exists()
+            assert cache2.exists()
+
+            # Cleanup
+            shutil.rmtree(cache1)
+            shutil.rmtree(cache2)
+
+    def test_create_temp_cache_includes_pid(self, tmp_path):
+        """Test temp cache name includes process ID."""
+        target_workspace = tmp_path / "workspace"
+
+        with patch('mlxk2.operations.clone._get_volume_mount_point') as mock_volume:
+            mock_volume.return_value = tmp_path
+
+            temp_cache = _create_temp_cache_same_volume(target_workspace)
+
+            assert str(os.getpid()) in temp_cache.name
+
+            # Cleanup
+            shutil.rmtree(temp_cache)
+
+
+class TestSentinelSafetyMechanism:
+    """Test suite for sentinel-based safety mechanism."""
+
+    def test_cleanup_temp_cache_safe_with_sentinel(self, tmp_path):
+        """Test cleanup succeeds when sentinel exists."""
+        temp_cache = tmp_path / "temp_cache"
+        temp_cache.mkdir()
+
+        # Create sentinel
+        sentinel = temp_cache / ".mlxk2_temp_cache_sentinel"
+        sentinel.write_text("mlxk2_temp_cache_created_123456789")
+
+        result = _cleanup_temp_cache_safe(temp_cache)
+
+        assert result is True
+        assert not temp_cache.exists()
+
+    def test_cleanup_temp_cache_safe_without_sentinel(self, tmp_path):
+        """Test cleanup refuses when sentinel missing."""
+        temp_cache = tmp_path / "temp_cache"
+        temp_cache.mkdir()
+
+        # No sentinel file
+
+        with patch('mlxk2.operations.clone.logger') as mock_logger:
+            result = _cleanup_temp_cache_safe(temp_cache)
+
+            assert result is False
+            assert temp_cache.exists()  # Should not be deleted
+            mock_logger.warning.assert_called_once()
+            assert "no sentinel found" in mock_logger.warning.call_args[0][0]
+
+    def test_cleanup_temp_cache_safe_protects_user_cache(self, tmp_path):
+        """Test cleanup protects user cache directories without sentinel."""
+        # Simulate user cache directory structure
+        user_cache = tmp_path / ".cache" / "huggingface" / "hub"
+        user_cache.mkdir(parents=True)
+        (user_cache / "important_model").mkdir()
+
+        with patch('mlxk2.operations.clone.logger') as mock_logger:
+            result = _cleanup_temp_cache_safe(user_cache)
+
+            assert result is False
+            assert user_cache.exists()
+            assert (user_cache / "important_model").exists()
+            mock_logger.warning.assert_called_once()
+
+    def test_cleanup_temp_cache_safe_handles_nonexistent(self):
+        """Test cleanup handles non-existent paths gracefully."""
+        nonexistent_path = Path("/nonexistent/temp/cache")
+
+        result = _cleanup_temp_cache_safe(nonexistent_path)
+
+        assert result is False
+
+
+# TestHFHomePatch class removed - _patch_hf_home function no longer exists
+# Clone operations now use pull_to_cache with explicit cache_dir parameter
+
+
+class TestSnapshotResolution:
+    """Test suite for latest snapshot resolution in temp cache."""
+
+    def test_resolve_latest_snapshot_success(self, tmp_path):
+        """Test successful snapshot resolution."""
+        temp_cache = tmp_path / "temp_cache"
+        model_name = "mlx-community/Phi-3-mini"
+
+        # Create mock cache structure
+        cache_dir = temp_cache / "models--mlx-community--Phi-3-mini"
+        snapshots_dir = cache_dir / "snapshots"
+        snapshots_dir.mkdir(parents=True)
+
+        # Create mock snapshots with different timestamps
+        snapshot1 = snapshots_dir / "abc123"
+        snapshot2 = snapshots_dir / "def456"
+        snapshot1.mkdir()
+        snapshot2.mkdir()
+
+        # Set different modification times
+        os.utime(snapshot1, (1000, 1000))
+        os.utime(snapshot2, (2000, 2000))  # More recent
+
+        with patch('mlxk2.operations.clone.hf_to_cache_dir') as mock_hf_to_cache:
+            mock_hf_to_cache.return_value = "models--mlx-community--Phi-3-mini"
+
+            result = _resolve_latest_snapshot(temp_cache, model_name)
+
+            assert result == snapshot2  # Should return most recent
+
+    def test_resolve_latest_snapshot_no_cache(self, tmp_path):
+        """Test snapshot resolution when cache doesn't exist."""
+        temp_cache = tmp_path / "temp_cache"
+        model_name = "nonexistent/model"
+
+        with patch('mlxk2.operations.clone.hf_to_cache_dir') as mock_hf_to_cache:
+            mock_hf_to_cache.return_value = "models--nonexistent--model"
+
+            result = _resolve_latest_snapshot(temp_cache, model_name)
+
+            assert result is None
+
+    def test_resolve_latest_snapshot_no_snapshots(self, tmp_path):
+        """Test snapshot resolution when snapshots directory is empty."""
+        temp_cache = tmp_path / "temp_cache"
+        model_name = "empty/model"
+
+        # Create cache structure but no snapshots
+        cache_dir = temp_cache / "models--empty--model"
+        snapshots_dir = cache_dir / "snapshots"
+        snapshots_dir.mkdir(parents=True)
+
+        with patch('mlxk2.operations.clone.hf_to_cache_dir') as mock_hf_to_cache:
+            mock_hf_to_cache.return_value = "models--empty--model"
+
+            result = _resolve_latest_snapshot(temp_cache, model_name)
+
+            assert result is None
+
+
+class TestAPFSCloneDirectory:
+    """Test suite for APFS copy-on-write directory cloning."""
+
+    def test_apfs_clone_directory_success(self, tmp_path):
+        """Test successful APFS directory cloning."""
+        source = tmp_path / "source"
+        target = tmp_path / "target"
+
+        # Create source structure
+        source.mkdir()
+        (source / "file1.txt").write_text("content1")
+        (source / "subdir").mkdir()
+        (source / "subdir" / "file2.txt").write_text("content2")
+
+        with patch('subprocess.run') as mock_run:
+            mock_run.return_value = MagicMock()  # Successful subprocess calls
+
+            result = _apfs_clone_directory(source, target)
+
+            assert result is True
+            # Verify cp -c was called for each file
+            assert mock_run.call_count == 2
+
+            # Check calls used clonefile (-c flag)
+            for call in mock_run.call_args_list:
+                args = call[0][0]
+                assert args[0] == 'cp'
+                assert '-c' in args
+
+    def test_apfs_clone_directory_subprocess_error(self, tmp_path):
+        """Test APFS cloning handles subprocess errors."""
+        source = tmp_path / "source"
+        target = tmp_path / "target"
+
+        source.mkdir()
+        (source / "file.txt").write_text("content")
+
+        with patch('subprocess.run') as mock_run:
+            mock_run.side_effect = subprocess.CalledProcessError(1, 'cp')
+
+            result = _apfs_clone_directory(source, target)
+
+            assert result is False
+
+    def test_apfs_clone_directory_creates_target_structure(self, tmp_path):
+        """Test APFS cloning creates target directory structure."""
+        source = tmp_path / "source"
+        target = tmp_path / "target"
+
+        # Create nested source structure
+        (source / "deep" / "nested" / "path").mkdir(parents=True)
+        (source / "deep" / "nested" / "path" / "file.txt").write_text("content")
+
+        with patch('subprocess.run') as mock_run:
+            mock_run.return_value = MagicMock()
+
+            result = _apfs_clone_directory(source, target)
+
+            assert result is True
+            # Verify target directory structure would be created
+            call_args = mock_run.call_args_list[0][0][0]
+            target_file = call_args[-1]  # Last argument should be target file
+            assert "deep/nested/path/file.txt" in target_file
+
+
+class TestCloneOperationIntegration:
+    """Integration tests for complete clone operation workflow."""
+
+    def test_clone_operation_success_workflow(self, tmp_path):
+        """Test complete successful clone operation workflow."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "mlx-community/Phi-3-mini"
+
+        # Create real temp cache with sentinel for realistic cleanup test
+        real_temp_cache = tmp_path / "temp_cache"
+        real_temp_cache.mkdir()
+        sentinel = real_temp_cache / ".mlxk2_temp_cache_sentinel"
+        sentinel.write_text("mlxk2_temp_cache_created_test")
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory') as mock_clone:
+
+            # Use real temp cache
+            mock_create_cache.return_value = real_temp_cache
+            mock_health.return_value = (True, "Model is healthy")
+
+            mock_pull.return_value = {
+                "status": "success",
+                "data": {"model": model_spec}
+            }
+
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            mock_health.return_value = (True, "Model is healthy")
+            mock_clone.return_value = True
+
+            result = clone_operation(model_spec, target_dir)
+
+            # Debug: Print result if error
+            if result["status"] != "success":
+                print(f"Clone operation failed: {result}")
+
+            # Verify success response
+            assert result["status"] == "success"
+            assert result["command"] == "clone"
+            assert result["data"]["model"] == model_spec
+            assert result["data"]["clone_status"] == "success"
+            assert result["error"] is None
+
+            # Verify workflow steps
+            mock_create_cache.assert_called_once()
+            mock_pull.assert_called_once_with(model_spec, real_temp_cache)
+            mock_resolve.assert_called_once()
+            # Health check is currently skipped in clone operation (TODO: implement health_check_to_cache)
+            # mock_health.assert_called_once_with(model_spec)
+            mock_clone.assert_called_once()
+
+            # Verify real cleanup happened (temp cache should be deleted)
+            assert not real_temp_cache.exists()
+
+    def test_clone_operation_apfs_validation_failure(self, tmp_path):
+        """Test clone operation fails APFS validation."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "any/model"
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem') as mock_validate:
+            mock_validate.side_effect = FilesystemError("APFS required")
+
+            result = clone_operation(model_spec, target_dir)
+
+            assert result["status"] == "error"
+            assert result["data"]["clone_status"] == "filesystem_error"
+            assert "APFS required" in result["error"]["message"]
+
+    def test_clone_operation_pull_failure(self, tmp_path):
+        """Test clone operation handles pull failure."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "invalid/model"
+
+        # Create real temp cache for cleanup test
+        real_temp_cache = tmp_path / "temp_cache"
+        real_temp_cache.mkdir()
+        sentinel = real_temp_cache / ".mlxk2_temp_cache_sentinel"
+        sentinel.write_text("mlxk2_temp_cache_created_test")
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull:
+
+            mock_create_cache.return_value = real_temp_cache
+
+            mock_pull.return_value = {
+                "status": "error",
+                "error": {"message": "Model not found"}
+            }
+
+            result = clone_operation(model_spec, target_dir)
+
+            assert result["status"] == "error"
+            assert result["data"]["clone_status"] == "pull_failed"
+            assert "Pull operation failed" in result["error"]["message"]
+
+            # Verify real cleanup happened even on failure
+            assert not real_temp_cache.exists()
+
+    def test_clone_operation_health_check_failure(self, tmp_path):
+        """Test clone operation handles health check failure."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "corrupted/model"
+
+        # Create real temp cache
+        real_temp_cache = tmp_path / "temp_cache"
+        real_temp_cache.mkdir()
+        sentinel = real_temp_cache / ".mlxk2_temp_cache_sentinel"
+        sentinel.write_text("mlxk2_temp_cache_created_test")
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health:
+
+            mock_create_cache.return_value = real_temp_cache
+
+            mock_pull.return_value = {"status": "success", "data": {"model": model_spec}}
+
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            mock_health.return_value = (False, "Model is corrupted")
+
+            result = clone_operation(model_spec, target_dir)
+
+            assert result["status"] == "error"
+            assert result["data"]["clone_status"] == "health_check_failed"
+            assert "Model failed health check" in result["error"]["message"]
+
+            # Verify real cleanup happened
+            assert not real_temp_cache.exists()
+
+    def test_clone_operation_target_not_empty(self, tmp_path):
+        """Test clone operation fails when target directory not empty."""
+        target_dir = tmp_path / "workspace"
+        target_dir.mkdir()
+        (target_dir / "existing_file.txt").write_text("content")
+
+        result = clone_operation("any/model", str(target_dir))
+
+        assert result["status"] == "error"
+        assert result["data"]["clone_status"] == "error"
+        assert "not empty" in result["error"]["message"]
+
+    def test_clone_operation_target_is_file(self, tmp_path):
+        """Test clone operation fails when target exists as file."""
+        target_file = tmp_path / "workspace.txt"
+        target_file.write_text("content")
+
+        result = clone_operation("any/model", str(target_file))
+
+        assert result["status"] == "error"
+        assert result["data"]["clone_status"] == "error"
+        assert "not a directory" in result["error"]["message"]
+
+    def test_clone_operation_apfs_clone_failure(self, tmp_path):
+        """Test clone operation handles APFS clone failure."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "test/model"
+
+        # Create real temp cache
+        real_temp_cache = tmp_path / "temp_cache"
+        real_temp_cache.mkdir()
+        sentinel = real_temp_cache / ".mlxk2_temp_cache_sentinel"
+        sentinel.write_text("mlxk2_temp_cache_created_test")
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory') as mock_clone:
+
+            mock_create_cache.return_value = real_temp_cache
+
+            mock_pull.return_value = {"status": "success", "data": {"model": model_spec}}
+
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            mock_health.return_value = (True, "Model is healthy")
+            mock_clone.return_value = False  # Clone fails
+
+            result = clone_operation(model_spec, target_dir)
+
+            assert result["status"] == "error"
+            assert result["data"]["clone_status"] == "filesystem_error"
+            assert "APFS clone operation failed" in result["error"]["message"]
+
+            # Verify real cleanup happened
+            assert not real_temp_cache.exists()
+
+
+@pytest.mark.spec
+class TestCloneJSONAPICompliance:
+    """Test suite for JSON API 0.1.4 compliance."""
+
+    def test_clone_success_response_schema(self, tmp_path):
+        """Test successful clone response matches JSON API 0.1.4 schema."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "mlx-community/Phi-3-mini"
+
+        # Create real temp cache for JSON schema tests
+        real_temp_cache = tmp_path / "temp_cache"
+        real_temp_cache.mkdir()
+        sentinel = real_temp_cache / ".mlxk2_temp_cache_sentinel"
+        sentinel.write_text("mlxk2_temp_cache_created_test")
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory'):
+
+            mock_create_cache.return_value = real_temp_cache
+            mock_pull.return_value = {"status": "success", "data": {"model": model_spec}}
+            mock_health.return_value = (True, "Model is healthy")
+
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            result = clone_operation(model_spec, target_dir)
+
+            # Validate top-level structure
+            assert isinstance(result, dict)
+            assert set(result.keys()) == {"status", "command", "error", "data"}
+
+            # Validate success response
+            assert result["status"] == "success"
+            assert result["command"] == "clone"
+            assert result["error"] is None
+
+            # Validate data section (per JSON API 0.1.4)
+            data = result["data"]
+            required_fields = {"model", "clone_status", "message", "target_dir", "health_check"}
+            assert set(data.keys()) >= required_fields
+
+            assert data["model"] == model_spec
+            assert data["clone_status"] == "success"
+            assert isinstance(data["message"], str)
+            assert isinstance(data["target_dir"], str)
+            assert isinstance(data["health_check"], bool)
+
+    def test_clone_error_response_schema(self, tmp_path):
+        """Test error clone response matches JSON API 0.1.4 schema."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "invalid/model"
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem') as mock_validate:
+            mock_validate.side_effect = FilesystemError("APFS required")
+
+            result = clone_operation(model_spec, target_dir)
+
+            # Validate error response structure
+            assert result["status"] == "error"
+            assert result["command"] == "clone"
+            assert result["error"] is not None
+
+            # Validate error section
+            error = result["error"]
+            assert "type" in error
+            assert "message" in error
+            assert isinstance(error["type"], str)
+            assert isinstance(error["message"], str)
+
+            # Validate data section still present
+            assert "data" in result
+            assert "clone_status" in result["data"]
+            assert result["data"]["clone_status"] == "filesystem_error"
+
+    def test_clone_response_no_extra_fields(self, tmp_path):
+        """Test clone response doesn't include fields not in JSON API 0.1.4."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "test/model"
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume'), \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory'), \
+             patch('mlxk2.operations.clone._cleanup_temp_cache_safe'):
+
+            mock_pull.return_value = {"status": "success", "data": {"model": model_spec}}
+            mock_health.return_value = (True, "Model is healthy")
+
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            result = clone_operation(model_spec, target_dir)
+
+            # Should not include cache-related fields not in API
+            data = result["data"]
+            assert "cache_cleanup" not in data
+            assert "cache_preserved" not in data
+            assert "copy_method" not in data
+
+
+class TestCloneCoreFeatures:
+    """Test suite for core clone features and scenarios."""
+
+    def test_clone_same_model_twice_different_versions(self, tmp_path):
+        """Test cloning same model multiple times always gets latest version.
+
+        This test validates the core ADR-007 improvement over ADR-006:
+        - User cache is preserved (no destructive deletion)
+        - Each clone gets fresh pull (latest version)
+        - No version conflicts or outdated snapshots
+        """
+        target_dir1 = str(tmp_path / "workspace1")
+        target_dir2 = str(tmp_path / "workspace2")
+        model_spec = "org/model"
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory'), \
+             patch('mlxk2.operations.clone._cleanup_temp_cache_safe') as mock_cleanup:
+
+            # Setup different temp caches for each clone
+            temp_cache1 = tmp_path / "temp_cache_1"
+            temp_cache1.mkdir()  # Create directory so .exists() returns True
+            temp_cache2 = tmp_path / "temp_cache_2"
+            temp_cache2.mkdir()  # Create directory so .exists() returns True
+            mock_create_cache.side_effect = [temp_cache1, temp_cache2]
+            mock_health.return_value = (True, "Model is healthy")
+
+            # Setup side effects for both clones
+            snapshot1 = MagicMock()
+            snapshot1.exists.return_value = True
+            snapshot2 = MagicMock()
+            snapshot2.exists.return_value = True
+
+            mock_pull.side_effect = [
+                {"status": "success", "data": {"model": "org/model@abc123"}},
+                {"status": "success", "data": {"model": "org/model@def456"}}
+            ]
+            mock_resolve.side_effect = [snapshot1, snapshot2]
+
+            result1 = clone_operation(model_spec, target_dir1)
+            result2 = clone_operation(model_spec, target_dir2)
+
+            # Both should succeed
+            assert result1["status"] == "success"
+            assert result2["status"] == "success"
+
+            # Each gets the version that was current at pull time
+            assert result1["data"]["model"] == "org/model@abc123"
+            assert result2["data"]["model"] == "org/model@def456"
+
+            # Verify separate temp caches were used (isolation)
+            assert mock_create_cache.call_count == 2
+            assert mock_cleanup.call_count == 2
+
+            # Verify each pull was independent (fresh download)
+            assert mock_pull.call_count == 2
+            for call in mock_pull.call_args_list:
+                assert call[0][0] == model_spec  # Same model spec
+
+    def test_clone_preserves_user_cache_with_existing_model(self, tmp_path):
+        """Test clone preserves user cache when model already exists locally.
+
+        ADR-007 core principle: User cache is NEVER touched during clone operations.
+        """
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "existing/model"
+
+        # Simulate existing user cache (this stays untouched)
+        user_cache = tmp_path / "user_cache"
+        user_cache.mkdir()
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory'), \
+             patch('mlxk2.operations.clone._cleanup_temp_cache_safe') as mock_cleanup:
+
+            # Different temp cache (not user cache)
+            temp_cache = tmp_path / "temp_cache"
+            temp_cache.mkdir()  # Create directory so .exists() returns True
+            mock_create_cache.return_value = temp_cache
+            mock_health.return_value = (True, "Model is healthy")
+
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            mock_pull.return_value = {"status": "success", "data": {"model": model_spec}}
+
+            result = clone_operation(model_spec, target_dir)
+
+            assert result["status"] == "success"
+
+            # User cache should still exist (untouched)
+            assert user_cache.exists()
+
+            # Only temp cache should be cleaned up
+            mock_cleanup.assert_called_once_with(temp_cache)
+            # User cache path never passed to cleanup
+            assert all(call[0][0] != user_cache for call in mock_cleanup.call_args_list)
+
+
+class TestCloneEdgeCases:
+    """Test suite for edge cases and error conditions."""
+
+    def test_clone_operation_without_health_check(self, tmp_path):
+        """Test clone operation with health check disabled."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "test/model"
+
+        # Create mock temp cache
+        temp_cache = tmp_path / "temp_cache"
+        temp_cache.mkdir()
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory') as mock_clone, \
+             patch('mlxk2.operations.clone._cleanup_temp_cache_safe'):
+
+            mock_create_cache.return_value = temp_cache
+            mock_pull.return_value = {"status": "success", "data": {"model": model_spec}}
+
+            # Mock snapshot resolution
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            mock_clone.return_value = True
+
+            result = clone_operation(model_spec, target_dir, health_check=False)
+
+            assert result["status"] == "success"
+            assert result["data"]["health_check"] is False
+
+            # Health check should not be called
+            mock_health.assert_not_called()
+
+    def test_clone_operation_temp_cache_not_found(self, tmp_path):
+        """Test clone operation when temp cache snapshot not found."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "test/model"
+
+        # Create real temp cache
+        real_temp_cache = tmp_path / "temp_cache"
+        real_temp_cache.mkdir()
+        sentinel = real_temp_cache / ".mlxk2_temp_cache_sentinel"
+        sentinel.write_text("mlxk2_temp_cache_created_test")
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve:
+
+            mock_create_cache.return_value = real_temp_cache
+
+            mock_pull.return_value = {"status": "success", "data": {"model": model_spec}}
+            mock_resolve.return_value = None  # Snapshot not found
+
+            result = clone_operation(model_spec, target_dir)
+
+            assert result["status"] == "error"
+            assert result["data"]["clone_status"] == "cache_not_found"
+            assert "Temp cache snapshot not found" in result["error"]["message"]
+
+            # Verify real cleanup happened
+            assert not real_temp_cache.exists()
+
+    def test_clone_operation_target_existing_empty(self, tmp_path):
+        """Test clone operation with existing empty target directory."""
+        target_dir = tmp_path / "workspace"
+        target_dir.mkdir()  # Create empty directory
+
+        # Create mock temp cache
+        temp_cache = tmp_path / "temp_cache"
+        temp_cache.mkdir()
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem'), \
+             patch('mlxk2.operations.clone._validate_same_volume'), \
+             patch('mlxk2.operations.clone._create_temp_cache_same_volume') as mock_create_cache, \
+             patch('mlxk2.operations.clone.pull_to_cache') as mock_pull, \
+             patch('mlxk2.operations.clone._resolve_latest_snapshot') as mock_resolve, \
+             patch('mlxk2.operations.health.health_from_cache') as mock_health, \
+             patch('mlxk2.operations.clone._apfs_clone_directory') as mock_clone, \
+             patch('mlxk2.operations.clone._cleanup_temp_cache_safe'):
+
+            mock_create_cache.return_value = temp_cache
+            mock_pull.return_value = {"status": "success", "data": {"model": "test/model"}}
+            mock_health.return_value = (True, "Model is healthy")
+
+            # Mock snapshot resolution
+            mock_snapshot = MagicMock()
+            mock_snapshot.exists.return_value = True
+            mock_resolve.return_value = mock_snapshot
+
+            mock_clone.return_value = True
+
+            result = clone_operation("test/model", str(target_dir))
+
+            # Should succeed with empty directory
+            assert result["status"] == "success"
+
+    def test_clone_operation_unexpected_exception(self, tmp_path):
+        """Test clone operation handles unexpected exceptions."""
+        target_dir = str(tmp_path / "workspace")
+        model_spec = "test/model"
+
+        with patch('mlxk2.operations.clone._validate_apfs_filesystem') as mock_validate:
+            mock_validate.side_effect = RuntimeError("Unexpected error")
+
+            result = clone_operation(model_spec, target_dir)
+
+            assert result["status"] == "error"
+            assert result["data"]["clone_status"] == "error"
+            assert result["error"]["type"] == "CloneOperationError"
+            assert "Unexpected error" in result["error"]["message"]
\ No newline at end of file
diff --git a/tests_2.0/test_ctrl_c_handling.py b/tests_2.0/test_ctrl_c_handling.py
new file mode 100644
index 0000000..1959146
--- /dev/null
+++ b/tests_2.0/test_ctrl_c_handling.py
@@ -0,0 +1,442 @@
+"""
+Ctrl-C interruption handling tests for Step 1.1/1.2.
+Tests graceful interruption during generation and interactive mode.
+"""
+
+import pytest
+import signal
+import time
+from unittest.mock import Mock, patch, call
+from io import StringIO
+
+from mlxk2.core.runner import MLXRunner
+from mlxk2.operations.run import run_model, interactive_chat
+
+
+@pytest.fixture
+def mock_runner_with_interruption():
+    """Mock runner that can simulate interruption scenarios."""
+    mock_runner = Mock()
+    
+    # Track interruption state
+    mock_runner._interrupted = False
+    
+    def simulate_generation_with_interruption():
+        """Generator that checks for interruption"""
+        tokens = ["Token1", "Token2", "Token3", "Token4", "Token5"]
+        for i, token in enumerate(tokens):
+            if mock_runner._interrupted:
+                yield "\n[Generation interrupted by user]"
+                break
+            yield token
+    
+    mock_runner.generate_streaming.side_effect = lambda *args, **kwargs: simulate_generation_with_interruption()
+    mock_runner.generate_batch.return_value = "Complete response"
+    mock_runner._format_conversation.return_value = "Formatted conversation"
+    
+    return mock_runner
+
+
+class TestMLXRunnerInterruption:
+    """Test interruption handling in MLXRunner core."""
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    def test_signal_handler_setup(self, mock_cache, mock_resolve, mock_load):
+        """Test that signal handler is properly set up"""
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        mock_load.return_value = (Mock(), Mock())
+        
+        with patch('signal.signal') as mock_signal:
+            with MLXRunner("test-model") as runner:
+                # Should have set up SIGINT handler
+                mock_signal.assert_called_with(signal.SIGINT, runner._handle_interrupt)
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    def test_interrupt_flag_setting(self, mock_cache, mock_resolve, mock_load):
+        """Test that interrupt handler sets the flag correctly"""
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        mock_load.return_value = (Mock(), Mock())
+        
+        with MLXRunner("test-model") as runner:
+            # Initially not interrupted
+            assert runner._interrupted is False
+            
+            # Simulate signal
+            runner._handle_interrupt(signal.SIGINT, None)
+            
+            # Should be marked as interrupted
+            assert runner._interrupted is True
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    @patch('mlxk2.core.runner.generate_step')
+    def test_streaming_interruption_detection(self, mock_gen, mock_cache, mock_resolve, mock_load):
+        """Test that streaming generation checks for interruption"""
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.encode.return_value = [1, 2, 3]
+        mock_tokenizer.decode.side_effect = ["Hello", " world", "!"]
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        # Mock generation that yields multiple tokens
+        mock_gen.return_value = iter([
+            (Mock(item=lambda: 1), Mock()),
+            (Mock(item=lambda: 2), Mock()),
+            (Mock(item=lambda: 3), Mock())
+        ])
+        
+        with MLXRunner("test-model") as runner:
+            # Start generation
+            generator = runner.generate_streaming("test prompt")
+            
+            # Get first token
+            first_token = next(generator)
+            assert first_token == "Hello"
+            
+            # Simulate interruption
+            runner._interrupted = True
+            
+            # Next token should be interruption message
+            second_token = next(generator)
+            assert "interrupted" in second_token.lower()
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    @patch('mlxk2.core.runner.generate_step')
+    def test_batch_interruption_detection(self, mock_gen, mock_cache, mock_resolve, mock_load):
+        """Test that batch generation also checks for interruption"""
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.encode.return_value = [1, 2, 3]
+        mock_tokenizer.decode.return_value = "Partial response"
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        def interrupted_generation():
+            """Generator that gets interrupted"""
+            yield (Mock(item=lambda: 1), Mock())
+            # Simulation: interruption happens here
+            yield (Mock(item=lambda: 2), Mock())
+        
+        mock_gen.return_value = interrupted_generation()
+        
+        with MLXRunner("test-model") as runner:
+            # Set interruption before batch generation
+            runner._interrupted = True
+            
+            result = runner.generate_batch("test prompt")
+            
+            # Should handle interruption gracefully (empty or partial result)
+            assert isinstance(result, str)
+
+
+class TestRunCommandInterruption:
+    """Test interruption handling in run command operations."""
+    
+    def test_single_shot_streaming_interruption(self, mock_runner_with_interruption):
+        """Test interruption during single-shot streaming generation"""
+        with patch('mlxk2.operations.run.MLXRunner') as mock_runner_class:
+            mock_runner_class.return_value.__enter__.return_value = mock_runner_with_interruption
+            mock_runner_class.return_value.__exit__.return_value = None
+            
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                # Start generation
+                with patch('time.sleep', side_effect=[None, None]) as mock_sleep:
+                    # Simulate interruption during generation
+                    original_side_effect = mock_runner_with_interruption.generate_streaming.side_effect
+                    def interrupt_after_delay(*args, **kwargs):
+                        # Interrupt after first token
+                        mock_runner_with_interruption._interrupted = True
+                        # Continue with original generation behavior
+                        return original_side_effect()
+                    
+                    mock_runner_with_interruption.generate_streaming.side_effect = interrupt_after_delay
+                    
+                    result = run_model(
+                        model_spec="test-model",
+                        prompt="test prompt",
+                        stream=True,
+                        json_output=False
+                    )
+            
+            output = fake_out.getvalue()
+            assert "interrupted" in output.lower()
+    
+    def test_interactive_mode_interruption(self, mock_runner_with_interruption):
+        """Test interruption during interactive mode"""
+        with patch('mlxk2.operations.run.MLXRunner') as mock_runner_class:
+            mock_runner_class.return_value.__enter__.return_value = mock_runner_with_interruption
+            mock_runner_class.return_value.__exit__.return_value = None
+            
+            # Simulate Ctrl-C during input
+            with patch('builtins.input', side_effect=KeyboardInterrupt()):
+                with patch('sys.stdout', new=StringIO()) as fake_out:
+                    result = run_model(
+                        model_spec="test-model",
+                        prompt=None,  # Interactive mode
+                        stream=True,
+                        json_output=False
+                    )
+            
+            output = fake_out.getvalue()
+            assert "interrupted" in output.lower() or "goodbye" in output.lower()
+    
+    def test_interactive_chat_keyboard_interrupt(self, mock_runner_with_interruption):
+        """Test direct keyboard interrupt handling in interactive_chat"""
+        with patch('builtins.input', side_effect=KeyboardInterrupt()):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_with_interruption, stream=True)
+        
+        output = fake_out.getvalue()
+        assert "interrupted" in output.lower() or "goodbye" in output.lower()
+    
+    def test_generation_interruption_in_interactive_mode(self, mock_runner_with_interruption):
+        """Test interruption during generation in interactive mode"""
+        # Set up mock to interrupt during generation
+        def interrupt_during_generation(messages):
+            mock_runner_with_interruption._interrupted = True
+            return "Formatted conversation"
+        
+        mock_runner_with_interruption._format_conversation.side_effect = interrupt_during_generation
+        
+        with patch('builtins.input', side_effect=["test message", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_with_interruption, stream=True)
+        
+        output = fake_out.getvalue()
+        assert "interrupted" in output.lower()
+
+
+class TestInterruptionRecovery:
+    """Test recovery and cleanup after interruption."""
+    
+    def test_interruption_flag_reset(self, mock_runner_with_interruption):
+        """Test that interruption flag can be reset for subsequent operations"""
+        # Simulate interruption
+        mock_runner_with_interruption._interrupted = True
+        
+        # Reset flag (simulating what would happen in real scenario)
+        mock_runner_with_interruption._interrupted = False
+        
+        # Should be able to generate normally again
+        def normal_generation():
+            if not mock_runner_with_interruption._interrupted:
+                return iter(["Normal", " response"])
+            else:
+                return iter(["Interrupted"])
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = normal_generation
+        
+        tokens = list(mock_runner_with_interruption.generate_streaming())
+        assert tokens == ["Normal", " response"]
+    
+    def test_multiple_interruptions(self, mock_runner_with_interruption):
+        """Test handling of multiple interruptions"""
+        interruption_count = 0
+        
+        def multi_interrupt_generation():
+            nonlocal interruption_count
+            interruption_count += 1
+            tokens = [f"Token{i}" for i in range(5)]
+            
+            for i, token in enumerate(tokens):
+                if i == 2:  # Interrupt at third token
+                    mock_runner_with_interruption._interrupted = True
+                
+                if mock_runner_with_interruption._interrupted:
+                    yield f"\n[Generation interrupted by user - attempt {interruption_count}]"
+                    break
+                yield token
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = multi_interrupt_generation
+        
+        # First interruption
+        tokens1 = list(mock_runner_with_interruption.generate_streaming())
+        assert any("interrupted" in token.lower() for token in tokens1)
+        
+        # Reset for second attempt
+        mock_runner_with_interruption._interrupted = False
+        
+        # Second interruption
+        tokens2 = list(mock_runner_with_interruption.generate_streaming())
+        assert any("interrupted" in token.lower() for token in tokens2)
+        
+        assert interruption_count == 2
+    
+    def test_clean_interruption_message(self, mock_runner_with_interruption):
+        """Test that interruption message is clean and informative"""
+        def generate_with_interruption():
+            yield "Starting"
+            mock_runner_with_interruption._interrupted = True
+            yield "\n[Generation interrupted by user]"
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = generate_with_interruption
+        
+        tokens = list(mock_runner_with_interruption.generate_streaming())
+        
+        # Should have starting token and clean interruption message
+        assert "Starting" in tokens
+        
+        interruption_msg = [t for t in tokens if "interrupted" in t.lower()][0]
+        assert interruption_msg == "\n[Generation interrupted by user]"
+        assert interruption_msg.startswith("\n")  # Proper formatting
+
+
+class TestInterruptionEdgeCases:
+    """Test edge cases in interruption handling."""
+    
+    def test_interruption_before_generation_starts(self, mock_runner_with_interruption):
+        """Test interruption that happens before generation begins"""
+        # Set interrupted flag before generation
+        mock_runner_with_interruption._interrupted = True
+        
+        def immediate_interruption():
+            if mock_runner_with_interruption._interrupted:
+                yield "\n[Generation interrupted by user]"
+                return
+            yield "This should not appear"
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = immediate_interruption
+        
+        tokens = list(mock_runner_with_interruption.generate_streaming())
+        
+        assert len(tokens) == 1
+        assert "interrupted" in tokens[0].lower()
+        assert "This should not appear" not in tokens
+    
+    def test_interruption_after_generation_complete(self, mock_runner_with_interruption):
+        """Test that interruption flag doesn't affect completed generation"""
+        def complete_then_interrupt():
+            # Complete generation first
+            for token in ["Complete", " response"]:
+                yield token
+            
+            # Interrupt after completion (shouldn't affect output)
+            mock_runner_with_interruption._interrupted = True
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = complete_then_interrupt
+        
+        tokens = list(mock_runner_with_interruption.generate_streaming())
+        
+        # Should have complete response, no interruption message
+        assert tokens == ["Complete", " response"]
+    
+    def test_interruption_with_empty_generation(self, mock_runner_with_interruption):
+        """Test interruption when generation produces no tokens"""
+        def empty_generation():
+            mock_runner_with_interruption._interrupted = True
+            # Check interruption immediately
+            if mock_runner_with_interruption._interrupted:
+                yield "\n[Generation interrupted by user]"
+                return
+            
+            # This would be empty generation
+            return
+            yield  # unreachable
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = empty_generation
+        
+        tokens = list(mock_runner_with_interruption.generate_streaming())
+        
+        assert len(tokens) == 1
+        assert "interrupted" in tokens[0].lower()
+
+
+class TestInterruptionCompatibility:
+    """Test interruption compatibility with other features."""
+    
+    def test_interruption_with_chat_template(self, mock_runner_with_interruption):
+        """Test interruption works with chat template formatting"""
+        mock_runner_with_interruption._format_conversation.return_value = "Human: test\n\nAssistant: "
+        
+        def interrupt_after_template():
+            # Interrupt immediately after template formatting
+            mock_runner_with_interruption._interrupted = True
+            yield "\n[Generation interrupted by user]"
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = interrupt_after_template
+        
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_with_interruption, stream=True)
+        
+        output = fake_out.getvalue()
+        assert "interrupted" in output.lower()
+        
+        # Should have called format_conversation despite interruption
+        mock_runner_with_interruption._format_conversation.assert_called()
+    
+    def test_interruption_with_json_output(self, mock_runner_with_interruption):
+        """Test interruption handling with JSON output mode"""
+        with patch('mlxk2.operations.run.MLXRunner') as mock_runner_class:
+            mock_runner_class.return_value.__enter__.return_value = mock_runner_with_interruption
+            mock_runner_class.return_value.__exit__.return_value = None
+            
+            # Simulate interruption during generation
+            mock_runner_with_interruption._interrupted = True
+            
+            result = run_model(
+                model_spec="test-model",
+                prompt="test prompt",
+                stream=False,
+                json_output=True
+            )
+            
+            # Should return some result, even if interrupted
+            assert isinstance(result, str)
+    
+    def test_interruption_preserves_conversation_history(self, mock_runner_with_interruption):
+        """Test that interruption doesn't corrupt conversation history"""
+        conversation_calls = []
+        
+        def track_conversations(messages):
+            conversation_calls.append(len(messages))
+            if len(conversation_calls) == 2:  # Interrupt on second call
+                mock_runner_with_interruption._interrupted = True
+            return "Formatted conversation"
+        
+        mock_runner_with_interruption._format_conversation.side_effect = track_conversations
+        
+        # Mock interrupted generation for second message
+        generation_calls = 0
+        def selective_interruption():
+            nonlocal generation_calls
+            generation_calls += 1
+            if generation_calls == 2:  # Second generation gets interrupted
+                yield "\n[Generation interrupted by user]"
+            else:
+                yield "Normal response"
+        
+        mock_runner_with_interruption.generate_streaming.side_effect = selective_interruption
+        
+        with patch('builtins.input', side_effect=["first", "second", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_with_interruption, stream=True, prepare_next_prompt=False)
+        
+        # Should have processed both messages despite interruption
+        assert len(conversation_calls) == 2
+        assert conversation_calls[0] == 1  # First message
+        assert conversation_calls[1] == 3  # First + response + second message
diff --git a/tests_2.0/test_detection_readme_tokenizer.py b/tests_2.0/test_detection_readme_tokenizer.py
new file mode 100644
index 0000000..81ea82a
--- /dev/null
+++ b/tests_2.0/test_detection_readme_tokenizer.py
@@ -0,0 +1,87 @@
+"""Tests for lenient MLX detection (Issue #31 port) in 2.0.
+
+Covers:
+- Framework=MLX via README front-matter (tags/library_name) for non-mlx-community repos.
+- Type=chat via tokenizer chat_template hints.
+- Consistency between list and show outputs.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Tuple
+
+from mlxk2.core.cache import hf_to_cache_dir
+from mlxk2.operations.list import list_models
+from mlxk2.operations.show import show_model_operation
+
+
+def _mk_snapshot(cache_hub: Path, repo_id: str, hash40: str) -> Tuple[Path, Path]:
+    base = cache_hub / hf_to_cache_dir(repo_id)
+    snap = base / "snapshots" / hash40
+    snap.mkdir(parents=True, exist_ok=True)
+    # Minimal healthy files
+    (snap / "config.json").write_text('{"model_type": "test"}', encoding="utf-8")
+    (snap / "model.safetensors").write_bytes(b"w" * 1024)
+    return base, snap
+
+
+def test_framework_mlx_from_front_matter(isolated_cache):
+    repo = "custom-org/FrontMatter-Model"
+    h = "0123456789abcdef0123456789abcdef01234567"
+    base, snap = _mk_snapshot(isolated_cache, repo, h)
+
+    # README front-matter indicating MLX
+    (snap / "README.md").write_text(
+        """---
+library_name: mlx
+tags: [mlx, chat]
+---
+
+# Dummy
+""",
+        encoding="utf-8",
+    )
+
+    out = list_models()
+    models = {m["name"]: m for m in out["data"]["models"]}
+    assert repo in models, f"Model not listed: {repo}"
+    assert models[repo]["framework"] == "MLX"
+
+    s = show_model_operation(repo)
+    assert s["status"] == "success"
+    assert s["data"]["model"]["framework"] == "MLX"
+
+
+def test_type_chat_from_tokenizer_chat_template(isolated_cache):
+    repo = "custom-org/Tokenizer-Chat-Model"
+    h = "89abcdef0123456789abcdef0123456789abcdef"
+    base, snap = _mk_snapshot(isolated_cache, repo, h)
+
+    # No chat/instruct in name → rely on tokenizer chat_template
+    (snap / "tokenizer_config.json").write_text(
+        '{"chat_template": "{{ bos_token }}{{ eos_token }}"}', encoding="utf-8"
+    )
+
+    # Also put a front-matter not mentioning mlx to ensure chat comes from tokenizer
+    (snap / "README.md").write_text(
+        """---
+tags: [test]
+---
+""",
+        encoding="utf-8",
+    )
+
+    out = list_models()
+    models = {m["name"]: m for m in out["data"]["models"]}
+    assert repo in models, f"Model not listed: {repo}"
+    m = models[repo]
+    assert m["model_type"] == "chat"
+    assert "chat" in (m.get("capabilities") or [])
+
+    s = show_model_operation(repo)
+    assert s["status"] == "success"
+    ms = s["data"]["model"]
+    assert ms["model_type"] == "chat"
+    assert "chat" in (ms.get("capabilities") or [])
+
diff --git a/tests_2.0/test_edge_cases_adr002.py b/tests_2.0/test_edge_cases_adr002.py
new file mode 100644
index 0000000..0f61aa7
--- /dev/null
+++ b/tests_2.0/test_edge_cases_adr002.py
@@ -0,0 +1,267 @@
+"""ADR-002 Edge Cases Validation Tests for MLX-Knife 2.0.
+
+These tests validate critical edge cases learned from 1.x development,
+as documented in docs/ADR/ADR-002-edge-cases.md
+"""
+
+import pytest
+import tempfile
+from pathlib import Path
+from mlxk2.core.cache import hf_to_cache_dir, cache_dir_to_hf
+from mlxk2.core.model_resolution import resolve_model_for_operation, parse_model_spec
+from mlxk2.operations.list import list_models
+from mlxk2.operations.health import health_check_operation
+
+
+class TestModelNameValidation:
+    """Test model name validation edge cases from ADR-002."""
+    
+    def test_96_char_limit_validation(self):
+        """Test HuggingFace 96 character model name limit."""
+        # Valid length model name (95 chars)
+        valid_name = "org/" + "a" * 91  # 95 total
+        assert len(valid_name) == 95
+        
+        # Invalid length model name (97 chars)  
+        invalid_name = "org/" + "a" * 93  # 97 total
+        assert len(invalid_name) == 97
+        
+        # Resolution should handle long names gracefully
+        resolved_name, commit_hash, ambiguous = resolve_model_for_operation(invalid_name)
+        # Should either reject or truncate, not crash
+        assert isinstance(resolved_name, (str, type(None)))
+        assert isinstance(ambiguous, list)
+    
+    def test_empty_and_whitespace_names(self):
+        """Test empty and whitespace-only model names."""
+        test_cases = ["", " ", "  ", "\t", "\n", "   \t\n   "]
+        
+        for test_name in test_cases:
+            resolved_name, commit_hash, ambiguous = resolve_model_for_operation(test_name)
+            # Should handle gracefully, not crash
+            assert resolved_name is None
+            # Ambiguous may return all models (fuzzy matching behavior) or empty list
+            assert isinstance(ambiguous, list)
+    
+    def test_invalid_characters_in_names(self):
+        """Test names with invalid characters."""
+        invalid_names = [
+            "org//model",  # Double slash
+            "org/model/",  # Trailing slash
+            "/org/model",  # Leading slash
+            "org//sub//model",  # Multiple double slashes
+            "org\\model",  # Backslash
+            "org<model>",  # Angle brackets
+        ]
+        
+        for name in invalid_names:
+            resolved_name, commit_hash, ambiguous = resolve_model_for_operation(name)
+            # Should handle gracefully, not crash
+            assert isinstance(resolved_name, (str, type(None)))
+            assert isinstance(ambiguous, list)
+
+
+class TestCacheDirectoryManagement:
+    """Test cache directory handling edge cases."""
+    
+    def test_round_trip_conversion_bijective(self):
+        """Test that HF name ↔ cache dir conversion is bijective."""
+        test_cases = [
+            "microsoft/DialoGPT-small",
+            "org/sub/model",
+            "single-model",
+            "deep/nested/path/model",
+            "org-with-dashes/model-with-dashes",
+        ]
+        
+        for hf_name in test_cases:
+            # Forward conversion
+            cache_dir = hf_to_cache_dir(hf_name)
+            
+            # Backward conversion  
+            recovered_name = cache_dir_to_hf(cache_dir)
+            
+            # Should be identical
+            assert recovered_name == hf_name, f"Round-trip failed: {hf_name} → {cache_dir} → {recovered_name}"
+    
+    def test_corrupted_cache_tolerance(self):
+        """Test tolerance for corrupted cache directory names."""
+        # Violate naming rules (triple dashes)
+        corrupted_cache_names = [
+            "models--org---corrupted",  # Triple dash
+            "models--org--model---bad",  # Triple dash at end
+            "models---bad--model",  # Triple dash at start
+        ]
+        
+        for cache_name in corrupted_cache_names:
+            # Should not crash, mechanical conversion
+            hf_name = cache_dir_to_hf(cache_name)
+            
+            # Should produce visible corruption (empty segments)
+            assert isinstance(hf_name, str)
+            # Corruption should be visible somehow (empty segments, leading/trailing dashes, etc.)
+            if "---" in cache_name:
+                corruption_indicators = ["/-", "//", hf_name.startswith("/"), hf_name.endswith("/"), 
+                                       hf_name.startswith("-"), hf_name.endswith("-")]
+                assert any(corruption_indicators), f"Corruption not visible in: {hf_name}"
+
+
+class TestHashSyntaxParsing:
+    """Test @hash syntax parsing edge cases."""
+    
+    def test_hash_syntax_parsing(self):
+        """Test parsing of @hash syntax."""
+        test_cases = [
+            ("Phi-3@abc", ("Phi-3", "abc")),
+            ("mlx-community/Model@def123", ("mlx-community/Model", "def123")),
+            ("Model@a", ("Model", "a")),  # Single char hash
+            ("Model@" + "a" * 40, ("Model", "a" * 40)),  # Long hash
+        ]
+        
+        for input_spec, expected in test_cases:
+            result = parse_model_spec(input_spec)
+            assert result == expected
+    
+    def test_invalid_hash_syntax(self):
+        """Test invalid @hash syntax handling."""
+        invalid_cases = [
+            "Model@",  # Empty hash
+            "Model@@abc",  # Double @
+            "@abc",  # No model name
+            "Model@hash@invalid",  # Multiple @
+        ]
+        
+        for invalid_spec in invalid_cases:
+            # Should parse without crashing, handle invalid parts gracefully
+            try:
+                model_name, commit_hash = parse_model_spec(invalid_spec)
+                # Should return reasonable values, not crash
+                assert isinstance(model_name, str)
+                assert isinstance(commit_hash, (str, type(None)))
+            except Exception as e:
+                # If it throws, should be a clear validation error
+                assert "invalid" in str(e).lower() or "format" in str(e).lower()
+
+
+class TestHealthCheckEdgeCases:
+    """Test health checking edge cases from ADR-002."""
+    
+    def test_lfs_pointer_detection_pattern(self, isolated_cache):
+        """Test LFS pointer detection logic."""
+        # Create fake LFS pointer file
+        test_model_dir = isolated_cache / "models--test--lfs-model" / "snapshots" / "main"
+        test_model_dir.mkdir(parents=True)
+        
+        # Create LFS pointer content
+        lfs_content = '''version https://git-lfs.github.com/spec/v1
+oid sha256:1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef
+size 123456789
+'''
+        lfs_file = test_model_dir / "model.safetensors"
+        lfs_file.write_text(lfs_content)
+        
+        # Health check should detect this as unhealthy/incomplete
+        result = health_check_operation("test/lfs-model")
+        
+        # Should complete without crashing
+        assert result["status"] == "success"
+        
+        # If LFS detection is implemented, should flag as unhealthy
+        # (This test documents the expected behavior)
+    
+    def test_missing_critical_files(self, isolated_cache):
+        """Test handling of models missing critical files."""
+        # Create model with missing config.json
+        incomplete_model_dir = isolated_cache / "models--test--incomplete" / "snapshots" / "main"  
+        incomplete_model_dir.mkdir(parents=True)
+        
+        # Only create tokenizer, no config or model files
+        (incomplete_model_dir / "tokenizer.json").write_text('{"version": "1.0"}')
+        
+        result = health_check_operation("test/incomplete")
+        
+        # Should handle gracefully
+        assert result["status"] == "success"
+        # Should identify as incomplete/unhealthy if detection is implemented
+    
+    def test_health_check_with_empty_cache(self):
+        """Test health check when no models are cached."""
+        result = health_check_operation()
+        
+        # Should handle empty cache gracefully
+        assert result["status"] == "success"
+        assert result["data"]["summary"]["total"] >= 0
+
+
+class TestForceFlag:
+    """Test force flag behavior in rm operations."""
+    
+    def test_force_flag_skips_all_confirmations(self, mock_models, isolated_cache):
+        """Test that -f flag skips ALL confirmations (Issue #23 regression)."""
+        from mlxk2.operations.rm import rm_operation
+        from conftest import test_list_models
+        
+        # Get available model from test cache
+        models = test_list_models(isolated_cache)["data"]["models"]
+        if not models:
+            pytest.skip("No models in test cache for force flag testing")
+        
+        target_model = models[0]["name"]
+        
+        # Force flag should work without any prompts
+        result = rm_operation(target_model, force=True)
+        
+        # Should either succeed or fail with clear reason (never prompt)
+        assert result["status"] in ["success", "error"]
+        
+        if result["status"] == "error":
+            # Error should not be about confirmation/prompts
+            error_msg = result["error"]["message"].lower()
+            # Check for interactive prompts (not system errors like "no such file")
+            forbidden_phrases = ["confirm", "prompt", "yes/no", "continue?", "are you sure"]
+            for phrase in forbidden_phrases:
+                assert phrase not in error_msg, f"Force flag still prompting: {error_msg}"
+
+
+class TestJSONErrorHandling:
+    """Test JSON error handling consistency."""
+    
+    def test_invalid_operations_return_valid_json(self):
+        """Test that all invalid operations return valid JSON."""
+        invalid_operations = [
+            lambda: resolve_model_for_operation("definitely-nonexistent-12345"),
+            lambda: health_check_operation("nonexistent-model"),
+            lambda: parse_model_spec("invalid@@syntax"),
+        ]
+        
+        for operation in invalid_operations:
+            try:
+                result = operation()
+                # Should return structured data, not throw
+                assert isinstance(result, (tuple, dict, list))
+            except Exception as e:
+                # If it throws, should be for a good reason with clear message
+                assert str(e), "Empty error message not allowed"
+    
+    def test_json_structure_consistency(self):
+        """Test that all operations return consistent JSON structure."""
+        # Test operations that return JSON
+        operations_to_test = [
+            list_models,
+            lambda: health_check_operation(),
+        ]
+        
+        for operation in operations_to_test:
+            result = operation()
+            
+            # Should have consistent JSON structure
+            assert "status" in result
+            assert result["status"] in ["success", "error"]
+            assert "data" in result or "error" in result
+            
+            if "error" in result and result["error"] is not None:
+                assert "message" in result["error"]
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
\ No newline at end of file
diff --git a/tests_2.0/test_health_multifile.py b/tests_2.0/test_health_multifile.py
new file mode 100644
index 0000000..de5bd56
--- /dev/null
+++ b/tests_2.0/test_health_multifile.py
@@ -0,0 +1,137 @@
+"""Deterministic tests for multi-file safetensors health (Issue #27 parity)."""
+
+import json
+from pathlib import Path
+
+
+def _write_idx(dir: Path, shards: list[str]):
+    idx = {
+        "metadata": {},
+        "weight_map": {f"layer{i}": shard for i, shard in enumerate(shards)}
+    }
+    (dir / "model.safetensors.index.json").write_text(json.dumps(idx))
+
+
+def test_multifile_index_missing_shard_is_unhealthy(isolated_cache):
+    snap = isolated_cache / "models--test--mf" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    shards = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
+    _write_idx(snap, shards)
+    # Create only one shard (subset)
+    (snap / shards[0]).write_bytes(b"ok")
+
+    from mlxk2.operations.health import health_check_operation
+    result = health_check_operation("test/mf")
+    assert result["status"] == "success"
+    assert any(m["name"] == "test/mf" and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+
+def test_multifile_index_empty_shard_is_unhealthy(isolated_cache):
+    snap = isolated_cache / "models--test--mf2" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    shards = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
+    _write_idx(snap, shards)
+    # Create both, but one empty
+    (snap / shards[0]).write_bytes(b"ok")
+    (snap / shards[1]).write_bytes(b"")
+
+    from mlxk2.operations.health import health_check_operation
+    result = health_check_operation("test/mf2")
+    assert result["status"] == "success"
+    assert any(m["name"] == "test/mf2" and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+
+def test_multifile_index_complete_is_healthy(isolated_cache):
+    snap = isolated_cache / "models--test--mf3" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    shards = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
+    _write_idx(snap, shards)
+    for s in shards:
+        (snap / s).write_bytes(b"ok")
+    # Minimal valid config
+    (snap / "config.json").write_text(json.dumps({"model_type": "test"}))
+
+
+def test_multifile_pattern_missing_shard_is_unhealthy(isolated_cache):
+    snap = isolated_cache / "models--test--mf4" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    # No index file; only pattern shards
+    shards = [
+        "model-00001-of-00003.safetensors",
+        # missing 00002
+        "model-00003-of-00003.safetensors",
+    ]
+    for s in shards:
+        (snap / s).write_bytes(b"ok")
+    (snap / "config.json").write_text(json.dumps({"model_type": "test"}))
+
+    from mlxk2.operations.health import health_check_operation
+    result = health_check_operation("test/mf4")
+    assert any(m["name"] == "test/mf4" and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+
+def test_multifile_pattern_complete_is_unhealthy_without_index(isolated_cache):
+    snap = isolated_cache / "models--test--mf5" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    shards = [
+        "model-00001-of-00003.safetensors",
+        "model-00002-of-00003.safetensors",
+        "model-00003-of-00003.safetensors",
+    ]
+    for s in shards:
+        (snap / s).write_bytes(b"ok")
+    (snap / "config.json").write_text(json.dumps({"model_type": "test"}))
+
+    from mlxk2.operations.health import health_check_operation
+    result = health_check_operation("test/mf5")
+    # Robust policy: without index, sharded safetensors are unhealthy even if complete
+    assert any(m["name"] == "test/mf5" and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+
+def test_partial_tmp_marker_is_unhealthy(isolated_cache):
+    snap = isolated_cache / "models--test--partial" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    # Single-file weight but with partial marker
+    (snap / "model.safetensors").write_bytes(b"ok")
+    (snap / ".partial.tmp").write_bytes(b"downloading")
+    (snap / "config.json").write_text(json.dumps({"model_type": "test"}))
+
+    from mlxk2.operations.health import health_check_operation
+    result = health_check_operation("test/partial")
+    assert any(m["name"] == "test/partial" and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+
+def _write_pt_idx(dir: Path, shards: list[str]):
+    idx = {
+        "metadata": {},
+        "weight_map": {f"layer{i}": shard for i, shard in enumerate(shards)}
+    }
+    (dir / "pytorch_model.bin.index.json").write_text(json.dumps(idx))
+
+
+def test_pytorch_index_missing_shard_is_unhealthy(isolated_cache):
+    snap = isolated_cache / "models--test--pt" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    shards = ["pytorch_model-00001-of-00002.bin", "pytorch_model-00002-of-00002.bin"]
+    _write_pt_idx(snap, shards)
+    # Create only one shard
+    (snap / shards[0]).write_bytes(b"ok")
+    (snap / "config.json").write_text(json.dumps({"model_type": "test"}))
+
+    from mlxk2.operations.health import health_check_operation
+    result = health_check_operation("test/pt")
+    assert any(m["name"] == "test/pt" and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+
+def test_pytorch_index_complete_is_healthy(isolated_cache):
+    snap = isolated_cache / "models--test--pt2" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    shards = ["pytorch_model-00001-of-00002.bin", "pytorch_model-00002-of-00002.bin"]
+    _write_pt_idx(snap, shards)
+    for s in shards:
+        (snap / s).write_bytes(b"ok")
+    (snap / "config.json").write_text(json.dumps({"model_type": "test"}))
+
+    from mlxk2.operations.health import health_check_operation
+    result = health_check_operation("test/pt2")
+    assert any(m["name"] == "test/pt2" and m["status"] == "healthy" for m in result["data"]["healthy"])
diff --git a/tests_2.0/test_human_output.py b/tests_2.0/test_human_output.py
new file mode 100644
index 0000000..d597fb5
--- /dev/null
+++ b/tests_2.0/test_human_output.py
@@ -0,0 +1,176 @@
+import re
+
+from mlxk2.output.human import render_list, render_health
+
+
+def sample_list_data():
+    return {
+        "status": "success",
+        "command": "list",
+        "data": {
+            "models": [
+                {
+                    "name": "mlx-community/TinyChat",
+                    "hash": "abcdef0123456789abcdef0123456789abcdef01",
+                    "size_bytes": 1_234_567,
+                    "last_modified": "2025-08-30T12:00:00Z",
+                    "framework": "MLX",
+                    "model_type": "chat",
+                    "capabilities": ["text-generation", "chat"],
+                    "health": "healthy",
+                    "cached": True,
+                },
+                {
+                    "name": "other-org/some-gguf",
+                    "hash": None,
+                    "size_bytes": 2_000,
+                    "last_modified": "2025-08-30T11:00:00Z",
+                    "framework": "GGUF",
+                    "model_type": "base",
+                    "capabilities": ["text-generation"],
+                    "health": "unhealthy",
+                    "cached": True,
+                },
+            ],
+            "count": 2,
+        },
+        "error": None,
+    }
+
+
+def test_list_human_compact_filters_and_headers():
+    out = render_list(sample_list_data(), show_health=False, show_all=False, verbose=False)
+    # No Framework column in compact mode
+    header = out.splitlines()[0]
+    assert "Framework" not in header
+    assert "Modified" in header
+    # Only MLX model should be shown, with compact name
+    assert "TinyChat" in out
+    assert "mlx-community/" not in out
+    assert "some-gguf" not in out
+
+
+def test_list_human_all_and_verbose_shows_framework_and_full_names():
+    out = render_list(sample_list_data(), show_health=False, show_all=True, verbose=True)
+    header = out.splitlines()[0]
+    assert "Framework" in header
+    assert "mlx-community/TinyChat" in out
+    assert "other-org/some-gguf" in out
+    # Framework labels present
+    assert "MLX" in out and "GGUF" in out
+
+
+def test_health_human_summary_and_entries():
+    data = {
+        "status": "success",
+        "command": "health",
+        "data": {
+            "healthy": [
+                {"name": "model-a", "status": "healthy", "reason": "ok"}
+            ],
+            "unhealthy": [
+                {"name": "model-b", "status": "unhealthy", "reason": "missing"}
+            ],
+            "summary": {"total": 2, "healthy_count": 1, "unhealthy_count": 1},
+        },
+        "error": None,
+    }
+    out = render_health(data)
+    assert "Summary: total 2, healthy 1, unhealthy 1" in out
+    assert "model-a" in out
+    assert "model-b" in out
+
+
+def test_list_human_filters_mlx_base_default():
+    from mlxk2.output.human import render_list
+
+    data = {
+        "status": "success",
+        "command": "list",
+        "data": {
+            "models": [
+                {
+                    "name": "org/MLXChat",
+                    "hash": "abcdef0123456789abcdef0123456789abcdef01",
+                    "size_bytes": 1000,
+                    "last_modified": "2025-08-30T12:00:00Z",
+                    "framework": "MLX",
+                    "model_type": "chat",
+                    "capabilities": ["text-generation", "chat"],
+                    "health": "healthy",
+                    "cached": True,
+                },
+                {
+                    "name": "org/MLXBase",
+                    "hash": "abcdef0123456789abcdef0123456789abcdef02",
+                    "size_bytes": 2000,
+                    "last_modified": "2025-08-30T12:00:00Z",
+                    "framework": "MLX",
+                    "model_type": "base",
+                    "capabilities": ["text-generation"],
+                    "health": "healthy",
+                    "cached": True,
+                },
+            ],
+            "count": 2,
+        },
+        "error": None,
+    }
+
+    # Default (compact) should hide MLX base
+    out_default = render_list(data, show_health=False, show_all=False, verbose=False)
+    assert "MLXChat" in out_default
+    assert "MLXBase" not in out_default
+
+    # Verbose (without --all) shows all MLX (chat + base)
+    out_verbose = render_list(data, show_health=False, show_all=False, verbose=True)
+    assert "MLXChat" in out_verbose
+    assert "MLXBase" in out_verbose
+
+
+def test_list_human_verbose_shows_all_mlx_only():
+    from mlxk2.output.human import render_list
+
+    data = {
+        "status": "success",
+        "command": "list",
+        "data": {
+            "models": [
+                {"name": "org/MLXChat", "hash": None, "size_bytes": 1, "last_modified": "2025-08-30T12:00:00Z", "framework": "MLX", "model_type": "chat", "capabilities": ["text-generation", "chat"], "health": "healthy", "cached": True},
+                {"name": "org/MLXBase", "hash": None, "size_bytes": 1, "last_modified": "2025-08-30T12:00:00Z", "framework": "MLX", "model_type": "base", "capabilities": ["text-generation"], "health": "healthy", "cached": True},
+                {"name": "org/OtherPT", "hash": None, "size_bytes": 1, "last_modified": "2025-08-30T12:00:00Z", "framework": "PyTorch", "model_type": "base", "capabilities": ["text-generation"], "health": "healthy", "cached": True},
+            ],
+            "count": 3,
+        },
+        "error": None,
+    }
+
+    out_verbose = render_list(data, show_health=False, show_all=False, verbose=True)
+    # Shows both MLX models (chat+base)
+    assert "MLXChat" in out_verbose
+    assert "MLXBase" in out_verbose
+    # Hides non-MLX
+    assert "OtherPT" not in out_verbose
+
+
+def test_list_human_all_shows_all_frameworks():
+    from mlxk2.output.human import render_list
+
+    data = {
+        "status": "success",
+        "command": "list",
+        "data": {
+            "models": [
+                {"name": "org/MLXChat", "hash": None, "size_bytes": 1, "last_modified": "2025-08-30T12:00:00Z", "framework": "MLX", "model_type": "chat", "capabilities": ["text-generation", "chat"], "health": "healthy", "cached": True},
+                {"name": "org/OtherGGUF", "hash": None, "size_bytes": 1, "last_modified": "2025-08-30T12:00:00Z", "framework": "GGUF", "model_type": "base", "capabilities": ["text-generation"], "health": "unhealthy", "cached": True},
+                {"name": "org/OtherPT", "hash": None, "size_bytes": 1, "last_modified": "2025-08-30T12:00:00Z", "framework": "PyTorch", "model_type": "base", "capabilities": ["text-generation"], "health": "healthy", "cached": True},
+            ],
+            "count": 3,
+        },
+        "error": None,
+    }
+
+    out_all = render_list(data, show_health=False, show_all=True, verbose=False)
+    assert "MLXChat" in out_all
+    assert "OtherGGUF" in out_all
+    assert "OtherPT" in out_all
diff --git a/tests_2.0/test_integration.py b/tests_2.0/test_integration.py
new file mode 100644
index 0000000..c23f5ab
--- /dev/null
+++ b/tests_2.0/test_integration.py
@@ -0,0 +1,171 @@
+"""Integration tests for MLX-Knife 2.0 with realistic cache scenarios."""
+
+import pytest
+from mlxk2.core.model_resolution import resolve_model_for_operation
+from mlxk2.operations.health import health_check_operation
+from mlxk2.operations.rm import rm_operation
+
+
+class TestModelResolutionIntegration:
+    """Test model resolution with realistic cache structures."""
+    
+    def test_short_name_expansion_with_cache(self, mock_models):
+        """Test that short names expand to mlx-community when model exists in cache."""
+        # Should find the cached mlx-community model
+        resolved_name, commit_hash, ambiguous = resolve_model_for_operation("Phi-3-mini")
+        
+        assert resolved_name == "mlx-community/Phi-3-mini-4k-instruct-4bit"
+        assert commit_hash is None
+        assert ambiguous is None
+    
+    def test_hash_syntax_resolution(self, mock_models, isolated_cache):
+        """Test @hash syntax finds correct model by short hash."""
+        # Short hash "e96" should match "e9675aa3def..."
+        from conftest import test_resolve_model_for_operation
+        resolved_name, commit_hash, ambiguous = test_resolve_model_for_operation(isolated_cache, "Qwen3@e96")
+        
+        # Should find one of the Qwen3 models (both have same short hash in our mock)
+        assert resolved_name is not None
+        assert "Qwen3" in resolved_name
+        assert commit_hash == "e96"
+        assert ambiguous is None
+    
+    def test_fuzzy_matching_partial_names(self, mock_models, isolated_cache):
+        """Test fuzzy matching finds models by partial names."""
+        from conftest import test_resolve_model_for_operation
+        resolved_name, commit_hash, ambiguous = test_resolve_model_for_operation(isolated_cache, "DialoGPT")
+        
+        assert resolved_name == "microsoft/DialoGPT-small"
+        assert commit_hash is None
+        assert ambiguous is None
+    
+    def test_ambiguous_matching_returns_choices(self, mock_models, isolated_cache):
+        """Test that ambiguous patterns return list of matches."""
+        # "Qwen" should match multiple models
+        from conftest import test_resolve_model_for_operation
+        resolved_name, commit_hash, ambiguous = test_resolve_model_for_operation(isolated_cache, "Qwen")
+        
+        assert resolved_name is None
+        assert ambiguous is not None
+        assert len(ambiguous) >= 2  # At least 2 Qwen models in mock
+        assert any("Qwen3-30B" in name for name in ambiguous)
+        assert any("Qwen3-Coder-480B" in name for name in ambiguous)
+    
+    def test_nonexistent_model_handling(self, mock_models):
+        """Test that nonexistent models are handled gracefully."""
+        resolved_name, commit_hash, ambiguous = resolve_model_for_operation("nonexistent-model")
+        
+        assert resolved_name is None
+        assert ambiguous == []  # Empty list, not None
+
+
+class TestHealthOperationIntegration:
+    """Test health operation with realistic models."""
+    
+    def test_health_check_all_models(self, mock_models, isolated_cache):
+        """Test health check on all cached models."""
+        from conftest import test_health_check_operation
+        result = test_health_check_operation(isolated_cache)
+        
+        assert result["status"] == "success"
+        assert result["data"]["summary"]["total"] >= 4  # At least our mock models
+        assert result["data"]["summary"]["healthy_count"] >= 3  # Healthy models
+        assert result["data"]["summary"]["unhealthy_count"] >= 1  # Corrupted model
+    
+    def test_health_check_specific_model_by_hash(self, mock_models, isolated_cache):
+        """Test health check on specific model using @hash syntax."""
+        from conftest import test_health_check_operation
+        result = test_health_check_operation(isolated_cache, "Qwen3@e96")
+        
+        assert result["status"] == "success" 
+        assert result["data"]["summary"]["total"] == 1
+        assert len(result["data"]["healthy"]) == 1
+        assert "Qwen3" in result["data"]["healthy"][0]["name"]
+    
+    def test_health_check_corrupted_model_detection(self, mock_models, isolated_cache):
+        """Test that corrupted models are properly detected."""
+        from conftest import test_health_check_operation
+        result = test_health_check_operation(isolated_cache, "corrupted")
+        
+        assert result["status"] == "success"
+        assert result["data"]["summary"]["unhealthy_count"] == 1
+        assert len(result["data"]["unhealthy"]) == 1
+        assert "corrupted" in result["data"]["unhealthy"][0]["name"].lower()
+
+
+class TestRmOperationIntegration:
+    """Test rm operation with realistic scenarios."""
+    
+    def test_rm_with_fuzzy_matching(self, mock_models, isolated_cache):
+        """Test rm finds model via fuzzy matching in isolated cache."""
+        # Get models from isolated cache
+        from conftest import test_list_models
+        result = test_list_models(isolated_cache)
+        available_models = result["data"]["models"]
+        
+        if not available_models:
+            pytest.skip("No models in test cache for rm testing")
+        
+        # Use first available model for testing
+        target_model = available_models[0]["name"]
+        
+        # Extract partial name for fuzzy matching
+        if "/" in target_model:
+            partial_name = target_model.split("/")[-1].split("-")[0]  # e.g., "DialoGPT" from "microsoft/DialoGPT-small"
+        else:
+            partial_name = target_model.split("-")[0]
+        
+        result = rm_operation(partial_name, force=True)
+        
+        # Should either succeed or be ambiguous
+        assert result["status"] in ["success", "error"]
+        
+        if result["status"] == "success":
+            assert "model" in result["data"]
+            assert result["data"]["action"] == "deleted"
+    
+    def test_rm_ambiguous_pattern_shows_choices(self, mock_models):
+        """Test rm shows choices for ambiguous patterns in isolated cache."""
+        # Create ambiguous scenario with multiple models starting with same prefix
+        result = rm_operation("m", force=False)  # "m" might match multiple models
+        
+        # Should either be ambiguous (error) or succeed (single match)
+        assert result["status"] in ["success", "error"]
+        
+        if result["status"] == "error" and "ambiguous" in result.get("error", {}).get("message", "").lower():
+            # Ambiguous case - should show choices
+            assert "matches" in result.get("data", {}) or "choices" in result.get("data", {})
+            choices = result["data"].get("matches", result["data"].get("choices", []))
+            assert len(choices) >= 2
+    
+    def test_rm_nonexistent_model(self, mock_models):
+        """Test rm handles nonexistent models gracefully."""
+        result = rm_operation("absolutely-does-not-exist-12345", force=True)
+        
+        assert result["status"] == "error"
+        error_msg = result["error"]["message"].lower()
+        assert "not found" in error_msg or "no matches" in error_msg or "no models found" in error_msg
+
+
+class TestCorruptedCacheHandling:
+    """Test handling of corrupted cache entries."""
+    
+    def test_corrupted_naming_tolerance(self, create_corrupted_cache_entry):
+        """Test that corrupted cache directory names are handled gracefully."""
+        # Create cache entry that violates naming rules
+        cache_path = create_corrupted_cache_entry("models--org--model---corrupted").parent
+        
+        from conftest import test_list_models
+        result = test_list_models(cache_path)
+        
+        # Should not crash, should show the corrupted entry
+        assert result["status"] == "success"
+        corrupted_models = [m for m in result["data"]["models"] if "/-" in m["name"]]
+        assert len(corrupted_models) >= 1  # At least our corrupted entry
+        
+        # Problem should be visible in name
+        assert any("/-" in model["name"] for model in corrupted_models)
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
\ No newline at end of file
diff --git a/tests_2.0/test_interactive_mode.py b/tests_2.0/test_interactive_mode.py
new file mode 100644
index 0000000..3b474d1
--- /dev/null
+++ b/tests_2.0/test_interactive_mode.py
@@ -0,0 +1,407 @@
+"""
+Interactive mode and conversation history tests for Step 1.1/1.2.
+Tests conversation tracking and chat template integration.
+"""
+
+import pytest
+from unittest.mock import Mock, patch
+from io import StringIO
+
+from mlxk2.operations.run import interactive_chat
+from mlxk2.core.runner import MLXRunner
+
+
+@pytest.fixture
+def mock_runner_interactive():
+    """Mock runner specifically for interactive mode tests."""
+    mock_runner = Mock()
+    
+    # Mock conversation formatting
+    def format_conversation(messages):
+        """Mock chat template application"""
+        if not messages:
+            return ""
+        
+        # Simulate actual chat template behavior
+        formatted_parts = []
+        for msg in messages:
+            role = msg["role"]
+            content = msg["content"]
+            if role == "user":
+                formatted_parts.append(f"Human: {content}")
+            elif role == "assistant":
+                formatted_parts.append(f"Assistant: {content}")
+        
+        return "\n\n".join(formatted_parts) + "\n\nAssistant: "
+    
+    mock_runner._format_conversation.side_effect = format_conversation
+    
+    # Mock generation methods
+    mock_runner.generate_streaming.return_value = iter(["Generated", " response"])
+    mock_runner.generate_batch.return_value = "Generated response"
+    
+    return mock_runner
+
+
+class TestInteractiveBasic:
+    """Basic interactive mode functionality."""
+    
+    def test_interactive_startup_message(self, mock_runner_interactive):
+        """Test that interactive mode shows startup message"""
+        with patch('builtins.input', side_effect=["quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive)
+        
+        output = fake_out.getvalue()
+        assert "Starting interactive chat" in output
+        assert "Type 'exit' or 'quit' to end" in output
+    
+    def test_interactive_exit_commands(self, mock_runner_interactive):
+        """Test various exit commands work"""
+        exit_commands = ["exit", "quit", "q"]
+        
+        for exit_cmd in exit_commands:
+            with patch('builtins.input', side_effect=[exit_cmd]):
+                with patch('sys.stdout', new=StringIO()) as fake_out:
+                    interactive_chat(mock_runner_interactive)
+            
+            output = fake_out.getvalue()
+            assert "Goodbye!" in output
+    
+    def test_interactive_streaming_mode(self, mock_runner_interactive):
+        """Test interactive mode with streaming enabled"""
+        with patch('builtins.input', side_effect=["test message", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive, stream=True)
+        
+        # Should call generate_streaming
+        mock_runner_interactive.generate_streaming.assert_called()
+        
+        # Should not call generate_batch
+        mock_runner_interactive.generate_batch.assert_not_called()
+        
+        output = fake_out.getvalue()
+        assert "Generated response" in output
+    
+    def test_interactive_batch_mode(self, mock_runner_interactive):
+        """Test interactive mode with streaming disabled"""
+        with patch('builtins.input', side_effect=["test message", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive, stream=False)
+        
+        # Should call generate_batch
+        mock_runner_interactive.generate_batch.assert_called()
+        
+        # Should not call generate_streaming
+        mock_runner_interactive.generate_streaming.assert_not_called()
+        
+        output = fake_out.getvalue()
+        assert "Generated response" in output
+
+
+class TestConversationHistory:
+    """Test conversation history tracking and management."""
+    
+    def test_conversation_history_accumulation(self, mock_runner_interactive):
+        """Test that conversation history grows correctly"""
+        conversation_history = []
+        
+        def capture_conversation(messages):
+            conversation_history.append(messages.copy())
+            return f"Formatted: {len(messages)} messages"
+        
+        mock_runner_interactive._format_conversation.side_effect = capture_conversation
+        
+        inputs = ["first message", "second message", "third message", "quit"]
+        
+        with patch('builtins.input', side_effect=inputs):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_interactive, stream=True)
+        
+        # Should have captured multiple conversation states
+        assert len(conversation_history) == 3
+        
+        # First conversation: 1 user message
+        assert len(conversation_history[0]) == 1
+        assert conversation_history[0][0]["role"] == "user"
+        assert conversation_history[0][0]["content"] == "first message"
+        
+        # Second conversation: user + assistant + user
+        assert len(conversation_history[1]) == 3
+        assert conversation_history[1][0]["role"] == "user"
+        assert conversation_history[1][0]["content"] == "first message"
+        assert conversation_history[1][1]["role"] == "assistant"
+        assert conversation_history[1][1]["content"] == "Generated response"
+        assert conversation_history[1][2]["role"] == "user"
+        assert conversation_history[1][2]["content"] == "second message"
+        
+        # Third conversation: full history
+        assert len(conversation_history[2]) == 5
+        assert conversation_history[2][4]["content"] == "third message"
+    
+    def test_conversation_message_roles(self, mock_runner_interactive):
+        """Test that message roles are correctly assigned"""
+        captured_messages = []
+        
+        def capture_messages(messages):
+            if messages:
+                captured_messages.extend(messages)
+            return "Formatted conversation"
+        
+        mock_runner_interactive._format_conversation.side_effect = capture_messages
+        
+        with patch('builtins.input', side_effect=["user input", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_interactive, prepare_next_prompt=True)
+        
+        # Should have user and assistant messages
+        user_messages = [msg for msg in captured_messages if msg["role"] == "user"]
+        assistant_messages = [msg for msg in captured_messages if msg["role"] == "assistant"]
+        
+        assert len(user_messages) == 1
+        assert len(assistant_messages) == 1
+        assert user_messages[0]["content"] == "user input"
+        assert assistant_messages[0]["content"] == "Generated response"
+    
+    def test_empty_input_ignored(self, mock_runner_interactive):
+        """Test that empty input doesn't affect conversation history"""
+        conversation_calls = []
+        
+        def capture_conversation(messages):
+            conversation_calls.append(len(messages))
+            return "Formatted conversation"
+        
+        mock_runner_interactive._format_conversation.side_effect = capture_conversation
+        
+        # Include empty strings and whitespace
+        inputs = ["", "  ", "\t", "actual message", "quit"]
+        
+        with patch('builtins.input', side_effect=inputs):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_interactive)
+        
+        # Should only process the non-empty message
+        assert len(conversation_calls) == 1
+        assert conversation_calls[0] == 1  # Only one message in history
+    
+    def test_response_stripping(self, mock_runner_interactive):
+        """Test that assistant responses are properly stripped"""
+        captured_responses = []
+        
+        def capture_history(messages):
+            # Capture assistant responses from history
+            for msg in messages:
+                if msg["role"] == "assistant":
+                    captured_responses.append(msg["content"])
+            return "Formatted conversation"
+        
+        mock_runner_interactive._format_conversation.side_effect = capture_history
+        
+        # Mock streaming with whitespace
+        mock_runner_interactive.generate_streaming.return_value = iter([
+            "  Response", " with", " whitespace  "
+        ])
+        
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_interactive, stream=True, prepare_next_prompt=True)
+        
+        # Response should be stripped when added to history
+        assert len(captured_responses) == 1
+        assert captured_responses[0] == "Response with whitespace"
+
+
+class TestChatTemplateIntegration:
+    """Test chat template usage in interactive mode."""
+    
+    def test_chat_template_called_with_history(self, mock_runner_interactive):
+        """Test that _format_conversation is called with proper history"""
+        with patch('builtins.input', side_effect=["hello", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_interactive)
+        
+        # Should call _format_conversation
+        mock_runner_interactive._format_conversation.assert_called()
+        
+        # Should be called with list of message dicts
+        call_args = mock_runner_interactive._format_conversation.call_args[0][0]
+        assert isinstance(call_args, list)
+        assert len(call_args) == 1
+        assert isinstance(call_args[0], dict)
+        assert "role" in call_args[0]
+        assert "content" in call_args[0]
+    
+    def test_formatted_prompt_used_for_generation(self, mock_runner_interactive):
+        """Test that formatted conversation is passed to generation"""
+        with patch('builtins.input', side_effect=["test input", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_interactive, stream=True)
+        
+        # Should call generate_streaming with formatted prompt
+        mock_runner_interactive.generate_streaming.assert_called()
+        call_args = mock_runner_interactive.generate_streaming.call_args
+        
+        # First argument should be the formatted conversation
+        assert call_args[0][0] == "Human: test input\n\nAssistant: "
+        
+        # Should disable chat template (already applied)
+        assert call_args[1]['use_chat_template'] is False
+    
+    def test_chat_template_fallback_behavior(self, mock_runner_interactive):
+        """Test behavior when chat template formatting fails"""
+        def failing_format(messages):
+            raise Exception("Template error")
+        
+        mock_runner_interactive._format_conversation.side_effect = failing_format
+        
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                # Should handle template errors gracefully
+                interactive_chat(mock_runner_interactive)
+        
+        output = fake_out.getvalue()
+        # Should show error but not crash
+        assert "ERROR" in output
+
+
+class TestInteractiveParameters:
+    """Test parameter passing in interactive mode."""
+    
+    def test_parameter_passing_streaming(self, mock_runner_interactive):
+        """Test that parameters are passed to streaming generation"""
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(
+                    mock_runner_interactive,
+                    stream=True,
+                    max_tokens=100,
+                    temperature=0.8,
+                    top_p=0.95,
+                    repetition_penalty=1.2
+                )
+        
+        call_args = mock_runner_interactive.generate_streaming.call_args[1]
+        assert call_args['max_tokens'] == 100
+        assert call_args['temperature'] == 0.8
+        assert call_args['top_p'] == 0.95
+        assert call_args['repetition_penalty'] == 1.2
+    
+    def test_parameter_passing_batch(self, mock_runner_interactive):
+        """Test that parameters are passed to batch generation"""
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(
+                    mock_runner_interactive,
+                    stream=False,
+                    max_tokens=200,
+                    temperature=0.9,
+                    top_p=0.85,
+                    repetition_penalty=1.3
+                )
+        
+        call_args = mock_runner_interactive.generate_batch.call_args[1]
+        assert call_args['max_tokens'] == 200
+        assert call_args['temperature'] == 0.9
+        assert call_args['top_p'] == 0.85
+        assert call_args['repetition_penalty'] == 1.3
+    
+    def test_use_chat_template_disabled(self, mock_runner_interactive):
+        """Test that use_chat_template is disabled in generation calls"""
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(
+                    mock_runner_interactive,
+                    stream=True,
+                    use_chat_template=True  # This should be overridden
+                )
+        
+        # Should disable chat template in generation call
+        call_args = mock_runner_interactive.generate_streaming.call_args[1]
+        assert call_args['use_chat_template'] is False
+
+
+class TestInteractiveErrorHandling:
+    """Test error handling in interactive mode."""
+    
+    def test_generation_error_recovery(self, mock_runner_interactive):
+        """Test that generation errors don't crash interactive mode"""
+        # First call fails, second succeeds
+        mock_runner_interactive.generate_streaming.side_effect = [
+            RuntimeError("Generation failed"),
+            iter(["Success"])
+        ]
+        
+        with patch('builtins.input', side_effect=["first", "second", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive, stream=True)
+        
+        output = fake_out.getvalue()
+        # Should show error for first, success for second
+        assert "ERROR" in output
+        assert "Success" in output
+    
+    def test_keyboard_interrupt_handling(self, mock_runner_interactive):
+        """Test Ctrl-C handling in interactive mode"""
+        with patch('builtins.input', side_effect=KeyboardInterrupt()):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive)
+        
+        output = fake_out.getvalue()
+        assert "interrupted" in output.lower() or "goodbye" in output.lower()
+    
+    def test_input_error_recovery(self, mock_runner_interactive):
+        """Test recovery from input errors"""
+        def failing_input(prompt):
+            if "You:" in prompt:
+                if not hasattr(failing_input, 'called'):
+                    failing_input.called = True
+                    raise EOFError("Input failed")
+                else:
+                    return "quit"
+            return prompt
+        
+        with patch('builtins.input', side_effect=failing_input):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive)
+        
+        # Should handle input errors gracefully
+        output = fake_out.getvalue()
+        assert "Starting interactive chat" in output
+
+
+class TestInteractiveUI:
+    """Test user interface elements of interactive mode."""
+    
+    def test_user_prompt_display(self, mock_runner_interactive):
+        """Test that user prompt is displayed correctly"""
+        with patch('builtins.input', side_effect=["test", "quit"]) as mock_input:
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner_interactive)
+        
+        # Should call input with "You: " prompt
+        mock_input.assert_called()
+        calls = [call.args[0] for call in mock_input.call_args_list]
+        assert "You: " in calls
+    
+    def test_assistant_prompt_display(self, mock_runner_interactive):
+        """Test that assistant prompt is displayed correctly"""
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive, stream=True)
+        
+        output = fake_out.getvalue()
+        assert "Assistant: " in output
+    
+    def test_response_formatting(self, mock_runner_interactive):
+        """Test that responses are formatted correctly"""
+        mock_runner_interactive.generate_streaming.return_value = iter([
+            "Token1", "Token2", "Token3"
+        ])
+        
+        with patch('builtins.input', side_effect=["test", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner_interactive, stream=True)
+        
+        output = fake_out.getvalue()
+        # Should include all tokens in output
+        assert "Token1Token2Token3" in output or "Token1 Token2 Token3" in output
diff --git a/tests_2.0/test_interruption_recovery.py b/tests_2.0/test_interruption_recovery.py
new file mode 100644
index 0000000..8c92beb
--- /dev/null
+++ b/tests_2.0/test_interruption_recovery.py
@@ -0,0 +1,211 @@
+"""
+Test for interruption recovery bug fix.
+Ensures that after Ctrl-C, subsequent generations work normally.
+"""
+
+import pytest
+from unittest.mock import Mock, patch
+from io import StringIO
+
+from mlxk2.core.runner import MLXRunner
+from mlxk2.operations.run import interactive_chat
+
+
+class TestInterruptionRecovery:
+    """Test recovery after interruption in interactive mode."""
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    def test_interruption_flag_reset_streaming(self, mock_cache, mock_resolve, mock_load):
+        """Test that interruption flag is reset for new streaming generation"""
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.encode.return_value = [1, 2, 3]
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        with patch('mlxk2.core.runner.generate_step') as mock_gen:
+            # Mock generation that yields tokens
+            mock_gen.return_value = iter([
+                (Mock(item=lambda: 1), Mock()),
+                (Mock(item=lambda: 2), Mock())
+            ])
+            mock_tokenizer.decode.side_effect = ["Hello", " world"]
+            
+            with MLXRunner("test-model") as runner:
+                # Simulate interruption
+                runner._interrupted = True
+                assert runner._interrupted is True
+                
+                # Start new generation - should reset flag
+                tokens = list(runner.generate_streaming("test prompt"))
+                
+                # Flag should be reset at start of generation
+                assert runner._interrupted is False
+                assert tokens == ["Hello", " world"]
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    def test_interruption_flag_reset_batch(self, mock_cache, mock_resolve, mock_load):
+        """Test that interruption flag is reset for new batch generation"""
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.encode.return_value = [1, 2, 3]
+        mock_tokenizer.decode.return_value = "Hello world"
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        with patch('mlxk2.core.runner.generate_step') as mock_gen:
+            mock_gen.return_value = iter([
+                (Mock(item=lambda: 1), Mock()),
+                (Mock(item=lambda: 2), Mock())
+            ])
+            
+            with MLXRunner("test-model") as runner:
+                # Simulate interruption
+                runner._interrupted = True
+                assert runner._interrupted is True
+                
+                # Start new generation - should reset flag
+                result = runner.generate_batch("test prompt")
+                
+                # Flag should be reset at start of generation
+                assert runner._interrupted is False
+                assert result == "Hello world"
+    
+    def test_interactive_mode_recovery_after_interruption(self):
+        """Test that interactive mode works after interruption"""
+        mock_runner = Mock()
+        
+        # Track interruption state and generation calls
+        generation_calls = []
+        
+        def mock_generation(prompt, **kwargs):
+            generation_calls.append(len(generation_calls))
+            if len(generation_calls) == 1:
+                # First call: simulate interruption
+                mock_runner._interrupted = True
+                return iter(["\n[Generation interrupted by user]"])
+            else:
+                # Subsequent calls: normal generation
+                mock_runner._interrupted = False
+                return iter(["Normal", " response"])
+        
+        mock_runner.generate_streaming.side_effect = mock_generation
+        mock_runner._format_conversation.return_value = "Formatted conversation"
+        
+        # Simulate user input: first prompt gets interrupted, second works normally
+        inputs = ["first prompt", "second prompt", "quit"]
+        
+        with patch('builtins.input', side_effect=inputs):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner, stream=True)
+        
+        output = fake_out.getvalue()
+        
+        # Should show interruption for first, normal response for second
+        assert "interrupted" in output.lower()
+        assert "Normal response" in output
+        
+        # Should have made two generation calls
+        assert len(generation_calls) == 2
+    
+    def test_multiple_interruptions_and_recoveries(self):
+        """Test multiple cycles of interruption and recovery"""
+        mock_runner = Mock()
+        
+        generation_calls = []
+        
+        def mock_generation(prompt, **kwargs):
+            call_num = len(generation_calls)
+            generation_calls.append(call_num)
+            
+            # Interrupt every other call
+            if call_num % 2 == 0:
+                mock_runner._interrupted = True
+                return iter(["\n[Generation interrupted by user]"])
+            else:
+                mock_runner._interrupted = False
+                return iter([f"Response {call_num}"])
+        
+        mock_runner.generate_streaming.side_effect = mock_generation
+        mock_runner._format_conversation.return_value = "Formatted conversation"
+        
+        # Multiple prompts with alternating interruptions
+        inputs = ["prompt1", "prompt2", "prompt3", "prompt4", "quit"]
+        
+        with patch('builtins.input', side_effect=inputs):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                interactive_chat(mock_runner, stream=True)
+        
+        output = fake_out.getvalue()
+        
+        # Should show interruptions and normal responses
+        assert "interrupted" in output.lower()
+        assert "Response 1" in output
+        assert "Response 3" in output
+        
+        # Should have made four generation calls
+        assert len(generation_calls) == 4
+    
+    def test_interruption_does_not_affect_conversation_history(self):
+        """Test that interruption doesn't corrupt conversation history"""
+        mock_runner = Mock()
+        
+        conversation_calls = []
+        
+        def track_conversation(messages):
+            conversation_calls.append([msg.copy() for msg in messages])
+            return "Formatted conversation"
+        
+        mock_runner._format_conversation.side_effect = track_conversation
+        
+        # First generation gets interrupted, second succeeds
+        generation_calls = []
+        def mock_generation(prompt, **kwargs):
+            call_num = len(generation_calls)
+            generation_calls.append(call_num)
+            
+            if call_num == 0:
+                # First call: interrupted
+                return iter(["\n[Generation interrupted by user]"])
+            else:
+                # Second call: normal
+                return iter(["Normal response"])
+        
+        mock_runner.generate_streaming.side_effect = mock_generation
+        
+        inputs = ["first prompt", "second prompt", "quit"]
+        
+        with patch('builtins.input', side_effect=inputs):
+            with patch('sys.stdout', new=StringIO()):
+                interactive_chat(mock_runner, stream=True)
+        
+        # Should have proper conversation progression
+        assert len(conversation_calls) == 2
+        
+        # First conversation: just user message
+        assert len(conversation_calls[0]) == 1
+        assert conversation_calls[0][0]["content"] == "first prompt"
+        
+        # Second conversation: user + interrupted response + new user message
+        assert len(conversation_calls[1]) == 3
+        assert conversation_calls[1][0]["content"] == "first prompt"
+        assert conversation_calls[1][1]["content"] == "[Generation interrupted by user]"
+        assert conversation_calls[1][2]["content"] == "second prompt"
diff --git a/tests_2.0/test_issue_27.py b/tests_2.0/test_issue_27.py
new file mode 100644
index 0000000..5e95a09
--- /dev/null
+++ b/tests_2.0/test_issue_27.py
@@ -0,0 +1,141 @@
+"""Exploratory tests for Issue #27 using real model copies in isolated cache.
+
+These tests are opt-in and require MLXK2_USER_HF_HOME to point to the user HF cache.
+They never modify the user cache; they copy selected models into the isolated test cache
+and then apply controlled mutations to simulate edge cases.
+"""
+
+import os
+import pytest
+
+# Allow selecting these tests via marker: -m issue27
+pytestmark = [pytest.mark.issue27]
+
+# Capture the original user cache root at import time (before fixtures may
+# override HF_HOME for isolation). This allows using either MLXK2_USER_HF_HOME
+# or HF_HOME as the source of truth for the user's cache path.
+_USER_CACHE_ROOT = os.environ.get("MLXK2_USER_HF_HOME") or os.environ.get("HF_HOME")
+
+
+requires_user_cache = pytest.mark.skipif(
+    not _USER_CACHE_ROOT,
+    reason="requires MLXK2_USER_HF_HOME or HF_HOME (user cache path)"
+)
+
+
+@requires_user_cache
+class TestIssue27Exploration:
+    def test_copy_real_model_and_list(self, copy_user_model_to_isolated):
+        # Choose a common model; allow override via env
+        model = os.environ.get(
+            "MLXK2_ISSUE27_MODEL", "mlx-community/Phi-3-mini-4k-instruct-4bit"
+        )
+        dst = copy_user_model_to_isolated(model)
+
+        # Verify list sees it via the regular operation
+        from mlxk2.operations.list import list_models
+        result = list_models()
+        assert result["status"] == "success"
+        names = [m["name"] for m in result["data"]["models"]]
+        assert model in names
+
+    def test_partial_download_simulation_health(self, copy_user_model_to_isolated):
+        model = os.environ.get(
+            "MLXK2_ISSUE27_MODEL", "mlx-community/Phi-3-mini-4k-instruct-4bit"
+        )
+        # Simulate partial/incomplete model state
+        copy_user_model_to_isolated(model, mutations=[
+            "remove_config", "truncate_weight", "add_partial_tmp"
+        ])
+
+        # Health should not crash and should report issues
+        from mlxk2.operations.health import health_check_operation
+        result = health_check_operation(model)
+        assert result["status"] == "success"
+        issues = result["data"]["unhealthy"]
+        # Either unhealthy includes this model, or health summaries remain consistent
+        if issues:
+            assert any(model in m.get("name", "") for m in issues)
+
+    def test_index_missing_shards_unhealthy(self, copy_user_model_to_isolated, monkeypatch):
+        model = os.environ.get(
+            "MLXK2_ISSUE27_INDEX_MODEL",
+            os.environ.get("MLXK2_ISSUE27_MODEL", "intfloat/multilingual-e5-large"),
+        )
+        # Force subset copy with 0 shards to minimize disk use
+        monkeypatch.setenv("MLXK2_SUBSET_COUNT", "0")
+        dst = copy_user_model_to_isolated(model)
+        sft_idx = dst / 'model.safetensors.index.json'
+        pt_idx = dst / 'pytorch_model.bin.index.json'
+        if not sft_idx.exists() and not pt_idx.exists():
+            pytest.skip('No safetensors/pytorch index found; skipping index-missing-shards test')
+
+        from mlxk2.operations.health import health_check_operation
+        result = health_check_operation(model)
+        assert any(m["name"].endswith(model.split('/')[-1]) or m["name"] == model for m in result["data"]["unhealthy"])
+
+    def test_index_delete_shard_is_unhealthy(self, copy_user_model_to_isolated):
+        model = os.environ.get(
+            "MLXK2_ISSUE27_INDEX_MODEL",
+            os.environ.get("MLXK2_ISSUE27_MODEL", "mistralai/Mistral-7B-Instruct-v0.2"),
+        )
+        dst = copy_user_model_to_isolated(model, mutations=['delete_indexed_shard'])
+        # If no index exists, skip this targeted test
+        if not (dst / 'model.safetensors.index.json').exists() and not (dst / 'pytorch_model.bin.index.json').exists():
+            pytest.skip('No safetensors/pytorch index found; skipping index-specific test')
+
+        from mlxk2.operations.health import health_check_operation
+        result = health_check_operation(model)
+        assert any(m["name"] == model and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+    def test_index_truncate_shard_is_unhealthy(self, copy_user_model_to_isolated):
+        model = os.environ.get(
+            "MLXK2_ISSUE27_INDEX_MODEL",
+            os.environ.get("MLXK2_ISSUE27_MODEL", "mistralai/Mistral-7B-Instruct-v0.2"),
+        )
+        dst = copy_user_model_to_isolated(model, mutations=['truncate_indexed_shard'])
+        if not (dst / 'model.safetensors.index.json').exists() and not (dst / 'pytorch_model.bin.index.json').exists():
+            pytest.skip('No safetensors/pytorch index found; skipping index-specific test')
+
+        from mlxk2.operations.health import health_check_operation
+        result = health_check_operation(model)
+        assert any(m["name"] == model and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+    def test_index_lfs_pointer_is_unhealthy(self, copy_user_model_to_isolated):
+        model = os.environ.get(
+            "MLXK2_ISSUE27_INDEX_MODEL",
+            os.environ.get("MLXK2_ISSUE27_MODEL", "mistralai/Mistral-7B-Instruct-v0.2"),
+        )
+        dst = copy_user_model_to_isolated(model, mutations=['lfsify_indexed_shard'])
+        if not (dst / 'model.safetensors.index.json').exists() and not (dst / 'pytorch_model.bin.index.json').exists():
+            pytest.skip('No safetensors/pytorch index found; skipping index-specific test')
+
+        from mlxk2.operations.health import health_check_operation
+        result = health_check_operation(model)
+        assert any(m["name"] == model and m["status"] == "unhealthy" for m in result["data"]["unhealthy"])
+
+    def test_user_cache_health_ok_readonly(self, monkeypatch):
+        """Read-only health OK check directly against user cache (no copy)."""
+        user_hf_home = _USER_CACHE_ROOT
+        if not user_hf_home:
+            pytest.skip("User cache root not set; set MLXK2_USER_HF_HOME or HF_HOME")
+
+        model = os.environ.get(
+            "MLXK2_ISSUE27_MODEL", "intfloat/multilingual-e5-large"
+        )
+        # Verify model exists in user cache
+        from pathlib import Path
+        from mlxk2.core.cache import hf_to_cache_dir
+        src = Path(user_hf_home) / "hub" / hf_to_cache_dir(model)
+        if not src.exists():
+            pytest.skip(f"Model not present in user cache: {src}")
+
+        # Point HF_HOME to user cache temporarily (read-only operation)
+        monkeypatch.setenv("HF_HOME", user_hf_home)
+        from mlxk2.operations.health import health_check_operation
+        result = health_check_operation(model)
+        assert result["status"] == "success"
+        assert any(
+            m.get("name") == model and m.get("status") == "healthy"
+            for m in result["data"]["healthy"]
+        ), f"Expected healthy for user model, got: {result}"
diff --git a/tests_2.0/test_issue_30_preflight.py b/tests_2.0/test_issue_30_preflight.py
new file mode 100644
index 0000000..722a57e
--- /dev/null
+++ b/tests_2.0/test_issue_30_preflight.py
@@ -0,0 +1,166 @@
+"""Tests for Issue #30: Gated Models Preflight Check"""
+
+import pytest
+from mlxk2.operations.pull import preflight_repo_access, pull_operation
+
+
+def test_preflight_private_model_without_token(monkeypatch):
+    """Test preflight check with a known private model without token.
+    
+    This is the core Issue #30 scenario: user tries to pull private/gated model
+    without setting HUGGINGFACE_HUB_TOKEN, should fail fast at preflight.
+    
+    Uses BrokeC/broken_model - a small private test model.
+    """
+    # Ensure no token is set for this test
+    # Ensure no tokens in environment
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+    monkeypatch.delenv("HUGGINGFACE_HUB_TOKEN", raising=False)
+    
+    try:
+        # Verify no token in environment (critical for test validity)
+        import os
+        assert "HF_TOKEN" not in os.environ
+        assert "HUGGINGFACE_HUB_TOKEN" not in os.environ
+
+        # Require huggingface_hub for this test (skip if missing)
+        hub = pytest.importorskip("huggingface_hub")
+        from huggingface_hub import HfApi
+        from huggingface_hub import errors as _hub_errors
+        GatedRepoError = _hub_errors.GatedRepoError
+        def _fake_model_info(self, repo_id, token=None):
+            raise GatedRepoError("Gated/private repository")
+        monkeypatch.setattr(HfApi, "model_info", _fake_model_info, raising=True)
+
+        success, error = preflight_repo_access("org/private-model")
+        
+        # Should fail fast without token
+        assert success is False
+        assert error is not None
+        assert isinstance(error, str)
+        # Should mention access/private/gated/denied
+        assert any(keyword in error.lower() for keyword in ["access", "private", "gated", "denied", "token"])
+        
+    finally:
+        pass
+
+
+def test_preflight_nonexistent_model(monkeypatch):
+    """Test preflight check with a non-existent model."""
+    # Require huggingface_hub for this test (skip if missing)
+    hub = pytest.importorskip("huggingface_hub")
+    from huggingface_hub import HfApi
+    from huggingface_hub import errors as _hub_errors
+    RepositoryNotFoundError = _hub_errors.RepositoryNotFoundError
+    def _fake_model_info(self, repo_id, token=None):
+        raise RepositoryNotFoundError("Not found")
+    monkeypatch.setattr(HfApi, "model_info", _fake_model_info, raising=True)
+
+    success, error = preflight_repo_access("definitely-not-existing-model-12345-xyz")
+    
+    assert success is False
+    assert error is not None
+    # HuggingFace returns "access denied" even for non-existent models (security feature)
+    assert any(keyword in error.lower() for keyword in ["not found", "access denied", "denied"])
+
+
+def test_preflight_integration_in_pull(isolated_cache, monkeypatch):
+    """Test that preflight check is properly integrated in pull operation.
+    
+    Uses isolated_cache fixture which creates:
+    - Temporary cache under /var/folders/.../mlxk2_test_XXXXX/
+    - Safety sentinel: models--TEST-CACHE-SENTINEL--mlxk2-safety-check
+    - Proper HF_HOME override and MODEL_CACHE patching
+    """
+    # Require huggingface_hub for this test (skip if missing)
+    hub = pytest.importorskip("huggingface_hub")
+    from huggingface_hub import HfApi
+    from huggingface_hub import errors as _hub_errors
+    RepositoryNotFoundError = _hub_errors.RepositoryNotFoundError
+    def _fake_model_info(self, repo_id, token=None):
+        raise RepositoryNotFoundError("Not found")
+    monkeypatch.setattr(HfApi, "model_info", _fake_model_info, raising=True)
+
+    # Test with a non-existent model - should fail at preflight stage
+    result = pull_operation("definitely-not-existing-model-12345-xyz")
+    
+    assert result["status"] == "error"
+    assert result["data"]["download_status"] == "access_denied"
+    assert result["error"]["type"] == "access_denied"
+    # HuggingFace returns "access denied" even for non-existent models
+    assert any(keyword in result["error"]["message"].lower() for keyword in ["not found", "access denied", "denied"])
+
+
+def test_preflight_graceful_degradation():
+    """Test that preflight check degrades gracefully on errors."""
+    # Test with empty model name - should handle gracefully
+    success, error = preflight_repo_access("")
+    
+    # Should either handle this gracefully or fail predictably
+    assert isinstance(success, bool)
+    if not success:
+        assert isinstance(error, str)
+        assert len(error) > 0
+
+
+def test_preflight_mock_gated_scenario():
+    """Test preflight behavior documentation for gated models."""
+    # Note: We can't easily test actual gated models without tokens
+    # This test documents the expected behavior
+    
+    # If we had a gated model, the expected flow would be:
+    # 1. preflight_repo_access("meta-llama/Llama-2-7b-hf") -> (False, "gated")
+    # 2. pull_operation should return access_denied without downloading anything
+    
+    # For now, we just verify the function exists and is importable
+    assert callable(preflight_repo_access)
+    
+    # The function should handle import errors gracefully
+    # (e.g., if huggingface_hub is not installed)
+    try:
+        success, error = preflight_repo_access("test-model")
+        # Should not crash, even if the model doesn't exist
+        assert isinstance(success, bool)
+        assert error is None or isinstance(error, str)
+    except Exception as e:
+        pytest.fail(f"preflight_repo_access should not crash: {e}")
+
+
+def test_preflight_prevents_cache_pollution(isolated_cache, monkeypatch):
+    """Test that preflight check prevents cache pollution.
+    
+    This is the core value of Issue #30: failed access should not leave
+    partial downloads in the cache.
+    """
+    from mlxk2.core.cache import MODEL_CACHE
+    from conftest import assert_is_test_cache
+    
+    # Verify we're using test cache (safety)
+    # MODEL_CACHE points to hub/, sentinel is in hub/, so check MODEL_CACHE directly
+    assert_is_test_cache(MODEL_CACHE)
+    
+    # Require huggingface_hub for this test (skip if missing)
+    hub = pytest.importorskip("huggingface_hub")
+    from huggingface_hub import HfApi
+    from huggingface_hub import errors as _hub_errors
+    GatedRepoError = _hub_errors.GatedRepoError
+    def _fake_model_info(self, repo_id, token=None):
+        raise GatedRepoError("Gated/private repository")
+    monkeypatch.setattr(HfApi, "model_info", _fake_model_info, raising=True)
+
+    # Attempt to pull a gated/private model
+    result = pull_operation("org/gated-model")
+    
+    # Should fail at preflight stage
+    assert result["status"] == "error"
+    assert result["data"]["download_status"] == "access_denied"
+    
+    # Cache should remain clean (no partial downloads)
+    cache_contents = list(MODEL_CACHE.iterdir())
+    # Only the sentinel should exist
+    sentinel_exists = any("TEST-CACHE-SENTINEL" in item.name for item in cache_contents)
+    assert sentinel_exists, "Test sentinel should exist"
+    
+    # No model directories should be created for the failed model
+    model_dirs = [item for item in cache_contents if "gated-model" in item.name]
+    assert len(model_dirs) == 0, "No partial model directories should exist after preflight failure"
diff --git a/tests_2.0/test_issue_37_private_org_regression.py b/tests_2.0/test_issue_37_private_org_regression.py
new file mode 100644
index 0000000..fab96ac
--- /dev/null
+++ b/tests_2.0/test_issue_37_private_org_regression.py
@@ -0,0 +1,170 @@
+"""Regression test for Issue #37 P0: Private/org MLX models rejected in run command.
+
+Beta.5 introduced runtime compatibility pre-flight check in run_model() that incorrectly
+passed snapshot path instead of cache root to detect_framework(), causing all non-mlx-community
+models to be detected as "Unknown framework" and rejected.
+
+This test verifies the fix by simulating a private-org MLX model (renamed from mlx-community/Phi-3).
+
+Opt-in via: pytest -m live_run
+Requires: mlx-community/Phi-3-mini-4k-instruct-4bit in user cache (MLXK2_USER_HF_HOME)
+"""
+
+from __future__ import annotations
+
+import os
+import pytest
+import shutil
+from pathlib import Path
+from mlxk2.operations.run import run_model
+from mlxk2.core.cache import hf_to_cache_dir
+
+# Opt-in marker: only run with pytest -m live_run
+pytestmark = [pytest.mark.live_run]
+
+# Skip if MLXK2_USER_HF_HOME not set (prevents running in standard pytest)
+_USER_CACHE_ROOT = os.environ.get("MLXK2_USER_HF_HOME") or os.environ.get("HF_HOME")
+requires_user_cache = pytest.mark.skipif(
+    not _USER_CACHE_ROOT,
+    reason="requires MLXK2_USER_HF_HOME or HF_HOME (opt-in via pytest -m live_run)"
+)
+
+
+@requires_user_cache
+def test_private_org_mlx_model_runs_without_unknown_framework_error(
+    copy_user_model_to_isolated, isolated_cache
+):
+    """Test that private/org MLX models are correctly detected and can run.
+
+    Workflow:
+    1. Copy mlx-community/Phi-3-mini-4k-instruct-4bit from user cache
+    2. Rename cache directory to simulate private-org model (test-org/phi3-mlx-instruct)
+    3. Run the model with a simple prompt
+    4. Verify no "Unknown framework" error occurs
+
+    This test requires:
+    - Phi-3-mini-4k-instruct-4bit in user cache (MLXK2_USER_HF_HOME)
+    - Run with: pytest -m live_run
+    """
+    # Step 1: Copy Phi-3 from user cache to isolated test cache
+    src_model_dir = copy_user_model_to_isolated("mlx-community/Phi-3-mini-4k-instruct-4bit")
+
+    # Step 2: Rename to simulate private-org model
+    # From: models--mlx-community--Phi-3-mini-4k-instruct-4bit
+    # To:   models--test-org--phi3-mlx-instruct
+    private_org_cache_name = "models--test-org--phi3-mlx-instruct"
+    private_org_dir = isolated_cache / private_org_cache_name
+
+    # Move the directory
+    shutil.move(str(src_model_dir), str(private_org_dir))
+
+    # Verify the renamed model exists
+    assert private_org_dir.exists(), "Private org model directory should exist after rename"
+    snapshots = private_org_dir / "snapshots"
+    assert snapshots.exists(), "Snapshots directory should exist"
+
+    # Step 3: Add README.md with MLX tags to ensure framework detection works
+    # (This is what a real private MLX model would have)
+    snapshot_dirs = [d for d in snapshots.iterdir() if d.is_dir()]
+    assert len(snapshot_dirs) > 0, "Should have at least one snapshot"
+
+    for snapshot_dir in snapshot_dirs:
+        readme = snapshot_dir / "README.md"
+        readme.write_text("""---
+tags: [mlx, chat]
+library_name: mlx
+---
+# Test Org Phi-3 MLX Model
+
+This is a test private-org MLX model for regression testing.
+""")
+
+    # Step 4: Run the model - this should NOT fail with "Unknown framework"
+    # Note: We use json_output=True to get structured error messages
+    result = run_model(
+        model_spec="test-org/phi3-mlx-instruct",
+        prompt="Hello",
+        json_output=True,
+        stream=False,
+        max_tokens=5,  # Keep it short for speed
+        verbose=False
+    )
+
+    # Step 5: Verify no "Unknown framework" or "Incompatible: PyTorch" errors
+    # Note: We're testing framework detection, not mlx_lm availability
+    if isinstance(result, str):
+        # The bug would manifest as one of these:
+        assert "Unknown framework" not in result, (
+            f"Private-org MLX model should not be rejected as 'Unknown framework'. "
+            f"Got result: {result}"
+        )
+        assert "Incompatible: PyTorch" not in result, (
+            f"Private-org MLX model should not be detected as PyTorch. "
+            f"Got result: {result}"
+        )
+        # If we get mlx_lm import errors, that's OK - it means framework detection worked!
+        # The model was recognized as MLX and pre-flight passed
+
+    # If we get here without assertions failing, the regression is fixed!
+    print(f"✓ Private-org MLX model 'test-org/phi3-mlx-instruct' runs successfully")
+
+
+@requires_user_cache
+def test_framework_detection_for_renamed_mlx_community_model(
+    copy_user_model_to_isolated, isolated_cache
+):
+    """Test that framework detection works correctly when cache root is passed.
+
+    This is a more focused unit-style test that verifies detect_framework()
+    receives the correct parameters from run_model().
+    """
+    from mlxk2.operations.common import detect_framework
+    from mlxk2.core.cache import get_current_model_cache, hf_to_cache_dir
+
+    # Copy and rename model
+    src_model_dir = copy_user_model_to_isolated("mlx-community/Phi-3-mini-4k-instruct-4bit")
+    private_org_cache_name = "models--acme--mlx-chat-model"
+    private_org_dir = isolated_cache / private_org_cache_name
+    shutil.move(str(src_model_dir), str(private_org_dir))
+
+    # Add MLX tags to README
+    snapshots = private_org_dir / "snapshots"
+    snapshot_dirs = [d for d in snapshots.iterdir() if d.is_dir()]
+    assert len(snapshot_dirs) > 0
+    snapshot_path = snapshot_dirs[0]
+
+    readme = snapshot_path / "README.md"
+    readme.write_text("""---
+tags: [mlx]
+library_name: mlx
+---
+# Acme MLX Model
+""")
+
+    # Test framework detection with CORRECT parameters (cache root + selected_path + fm)
+    from mlxk2.operations.common import read_front_matter
+    fm = read_front_matter(snapshot_path)  # Read the README we just wrote
+    framework = detect_framework(
+        hf_name="acme/mlx-chat-model",
+        model_root=private_org_dir,  # Cache root (models--acme--mlx-chat-model)
+        selected_path=snapshot_path,  # Snapshot path (snapshots/abc123...)
+        fm=fm  # Front-matter with MLX tags
+    )
+
+    assert framework == "MLX", (
+        f"Framework should be detected as MLX from README tags. Got: {framework}"
+    )
+
+    # Test with INCORRECT parameters (what Beta.5 bug did)
+    framework_buggy = detect_framework(
+        hf_name="acme/mlx-chat-model",
+        model_root=snapshot_path,  # BUG: Passing snapshot as root
+        selected_path=None
+    )
+
+    # With the bug, it would fall through to "Unknown" because:
+    # - Not mlx-community/* → no early return
+    # - README not in snapshot_path / "snapshots" (doesn't exist)
+    # - No GGUF/PyTorch detected
+    # This assertion documents the buggy behavior for reference
+    print(f"Buggy detection result: {framework_buggy} (should be Unknown without fix)")
diff --git a/tests_2.0/test_json_api_list.py b/tests_2.0/test_json_api_list.py
new file mode 100644
index 0000000..aecef78
--- /dev/null
+++ b/tests_2.0/test_json_api_list.py
@@ -0,0 +1,111 @@
+"""Tests for JSON API spec v0.1.2: list operation minimal model object.
+
+Covers: size_bytes, last_modified (ISO-8601 Z), framework, model_type,
+capabilities, health, hash selection, cached.
+"""
+
+from datetime import datetime
+from typing import Set
+import pytest
+
+from mlxk2.operations.list import list_models
+
+
+def _is_iso_utc_z(ts: str) -> bool:
+    try:
+        # Must end with 'Z' and be parseable
+        if not ts.endswith("Z"):
+            return False
+        # Strip Z, attempt parsing
+        datetime.fromisoformat(ts.replace("Z", ""))
+        return True
+    except Exception:
+        return False
+
+
+@pytest.mark.spec
+def test_list_minimal_model_object_fields(mock_models, isolated_cache):
+    """Each model entry returns the minimal model object with health."""
+    result = list_models()
+    assert result["status"] == "success"
+    assert result["command"] == "list"
+
+    models = result["data"]["models"]
+    assert isinstance(models, list)
+    assert result["data"]["count"] == len(models)
+
+    # Allowed enums
+    allowed_framework: Set[str] = {"MLX", "GGUF", "PyTorch", "Unknown"}
+    allowed_model_types: Set[str] = {"chat", "embedding", "base", "unknown"}
+
+    # Verify minimal fields and types
+    for m in models:
+        # Required fields
+        assert set([
+            "name", "hash", "size_bytes", "last_modified", "framework",
+            "model_type", "capabilities", "health", "cached"
+        ]).issubset(m.keys())
+
+        assert isinstance(m["name"], str) and "/" in m["name"]
+
+        # hash: 40-char or None
+        h = m["hash"]
+        assert (h is None) or (isinstance(h, str) and len(h) == 40)
+
+        # size_bytes integer >= 0
+        assert isinstance(m["size_bytes"], int)
+        assert m["size_bytes"] >= 0
+
+        # last_modified as ISO-8601 UTC Z
+        assert isinstance(m["last_modified"], str)
+        assert _is_iso_utc_z(m["last_modified"]) is True
+
+        # framework
+        assert m["framework"] in allowed_framework
+
+        # model_type + capabilities
+        assert m["model_type"] in allowed_model_types
+        assert isinstance(m["capabilities"], list)
+
+        # health
+        assert m["health"] in {"healthy", "unhealthy"}
+
+        # cached flag
+        assert m["cached"] is True
+
+        # Spec 0.1.2: no human-readable size; ensure we do not expose 'size' or internal paths
+        assert "size" not in m
+        assert "hashes" not in m
+
+
+@pytest.mark.spec
+def test_list_pattern_filter_case_insensitive(mock_models, isolated_cache):
+    """Pattern filters case-insensitively on model name."""
+    result = list_models(pattern="llama")
+    models = result["data"]["models"]
+    assert all("llama" in m["name"].lower() for m in models)
+
+    # A different pattern should yield different subset
+    result_q = list_models(pattern="Qwen")
+    models_q = result_q["data"]["models"]
+    assert all("qwen" in m["name"].lower() for m in models_q)
+    # Ensure partition is non-trivial in our fixture
+    assert set(m["name"].lower() for m in models).isdisjoint(
+        set(m["name"].lower() for m in models_q)
+    ) is True
+
+
+@pytest.mark.spec
+def test_list_empty_cache(isolated_cache):
+    """Empty cache yields empty list and count 0."""
+    # Remove all models (keep canary)
+    for d in isolated_cache.iterdir():
+        if d.is_dir() and d.name.startswith("models--") and "TEST-CACHE-SENTINEL" not in d.name:
+            # Safe in tests; strict delete is enforced by fixture env var
+            from shutil import rmtree
+            rmtree(d)
+
+    result = list_models()
+    assert result["status"] == "success"
+    assert result["data"]["models"] == []
+    assert result["data"]["count"] == 0
diff --git a/tests_2.0/test_json_api_show.py b/tests_2.0/test_json_api_show.py
new file mode 100644
index 0000000..096f4c8
--- /dev/null
+++ b/tests_2.0/test_json_api_show.py
@@ -0,0 +1,74 @@
+"""Tests for JSON API spec v0.1.2: show operation variants.
+
+Validates minimal model object and that --files and --config yield different
+optional data sections.
+"""
+
+from datetime import datetime
+import pytest
+
+from mlxk2.operations.show import show_model_operation
+
+
+def _is_iso_utc_z(ts: str) -> bool:
+    try:
+        if not ts.endswith("Z"):
+            return False
+        datetime.fromisoformat(ts.replace("Z", ""))
+        return True
+    except Exception:
+        return False
+
+
+@pytest.mark.spec
+def test_show_minimal_model_object(mock_models, isolated_cache):
+    name = "mlx-community/Phi-3-mini-4k-instruct-4bit"
+    res = show_model_operation(name)
+    assert res["status"] == "success"
+    assert res["command"] == "show"
+
+    model = res["data"]["model"]
+    assert set([
+        "name", "hash", "size_bytes", "last_modified", "framework",
+        "model_type", "capabilities", "health", "cached"
+    ]).issubset(model.keys())
+    assert model["name"] == name
+    assert (model["hash"] is None) or (isinstance(model["hash"], str) and len(model["hash"]) == 40)
+    assert isinstance(model["size_bytes"], int) and model["size_bytes"] > 0
+    assert _is_iso_utc_z(model["last_modified"]) is True
+    assert model["cached"] is True
+    # Ensure show does not expose human-readable size
+    assert "size" not in model
+
+    # Default branch returns metadata when available
+    assert "metadata" in res["data"]
+
+
+@pytest.mark.spec
+def test_show_with_files_and_config_are_different(mock_models, isolated_cache):
+    name = "mlx-community/Phi-3-mini-4k-instruct-4bit"
+
+    res_files = show_model_operation(name, include_files=True, include_config=False)
+    assert res_files["status"] == "success"
+    assert "files" in res_files["data"]
+    assert res_files["data"].get("metadata") is None
+    assert "config" not in res_files["data"]
+
+    files = res_files["data"]["files"]
+    assert isinstance(files, list) and len(files) > 0
+    # Validate file entry shape
+    first = files[0]
+    assert set(["name", "size", "type"]).issubset(first.keys())
+
+    res_config = show_model_operation(name, include_files=False, include_config=True)
+    assert res_config["status"] == "success"
+    assert "config" in res_config["data"]
+    assert res_config["data"].get("metadata") is None
+    assert "files" not in res_config["data"]
+
+    cfg = res_config["data"]["config"]
+    assert isinstance(cfg, dict) and len(cfg) > 0
+
+    # Compare that the two payloads differ in optional sections
+    assert ("files" in res_files["data"]) != ("files" in res_config["data"])  # XOR presence
+    assert ("config" in res_files["data"]) != ("config" in res_config["data"])  # XOR presence
diff --git a/tests_2.0/test_legacy_formats.py b/tests_2.0/test_legacy_formats.py
new file mode 100644
index 0000000..ea2b103
--- /dev/null
+++ b/tests_2.0/test_legacy_formats.py
@@ -0,0 +1,93 @@
+"""Tests for legacy model format detection (Issue #37).
+
+Note: These tests focus on legacy format detection only.
+Runtime compatibility tests for modern formats (Issue #36) are pending.
+"""
+
+import json
+from pathlib import Path
+
+
+def test_weights_numeric_safetensors_is_runtime_incompatible(isolated_cache):
+    """Legacy weights.00.safetensors format should be healthy but runtime incompatible."""
+    snap = isolated_cache / "models--test--legacy-weights" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+
+    # Create config.json (required for health check)
+    config = {
+        "model_type": "llama",
+        "architectures": ["LlamaForCausalLM"]
+    }
+    (snap / "config.json").write_text(json.dumps(config))
+
+    # Create legacy weight file
+    (snap / "weights.00.safetensors").write_bytes(b"fake_weights" * 100)
+
+    from mlxk2.operations.health import _check_snapshot_health, check_runtime_compatibility
+
+    # Health check should pass (files are complete)
+    healthy, reason = _check_snapshot_health(snap)
+    assert healthy is True, f"Expected healthy, got: {reason}"
+
+    # Runtime compatibility should fail due to legacy format
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+    assert compatible is False
+    assert "Legacy format not supported by mlx-lm" in reason
+
+
+def test_pytorch_model_numeric_safetensors_is_runtime_incompatible(isolated_cache):
+    """Legacy pytorch_model-00001.safetensors format should be runtime incompatible."""
+    snap = isolated_cache / "models--test--legacy-pytorch" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+
+    config = {
+        "model_type": "llama",
+        "architectures": ["LlamaForCausalLM"]
+    }
+    (snap / "config.json").write_text(json.dumps(config))
+
+    # Create legacy pytorch_model files
+    (snap / "pytorch_model-00001.safetensors").write_bytes(b"fake" * 100)
+    (snap / "pytorch_model-00002.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.health import _check_snapshot_health, check_runtime_compatibility
+
+    healthy, reason = _check_snapshot_health(snap)
+    assert healthy is True
+
+    # Runtime compatibility should fail due to legacy format
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+    assert compatible is False
+    assert "Legacy format not supported by mlx-lm" in reason
+
+
+def test_modern_model_safetensors_passes_legacy_gate(isolated_cache):
+    """Modern model.safetensors should pass the legacy format gate (Gate 2).
+
+    This test verifies that modern formats are NOT rejected by the legacy format check.
+    Full runtime compatibility (Gate 3: model_type check) is not tested here.
+    """
+    snap = isolated_cache / "models--test--modern" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+
+    config = {
+        "model_type": "llama",
+        "architectures": ["LlamaForCausalLM"]
+    }
+    (snap / "config.json").write_text(json.dumps(config))
+
+    # Create modern weight file
+    (snap / "model.safetensors").write_bytes(b"fake_weights" * 100)
+
+    from mlxk2.operations.health import _check_snapshot_health, check_runtime_compatibility
+
+    healthy, reason = _check_snapshot_health(snap)
+    assert healthy is True
+
+    # Should NOT be rejected by legacy format check (Gate 2)
+    # Note: May still fail at Gate 3 (model_type) if mlx-lm is not available
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    # If it failed, it should NOT be due to legacy format
+    if not compatible:
+        assert "Legacy format" not in reason, f"Should not fail due to legacy format, but got: {reason}"
diff --git a/tests_2.0/test_model_naming.py b/tests_2.0/test_model_naming.py
new file mode 100644
index 0000000..e467ebe
--- /dev/null
+++ b/tests_2.0/test_model_naming.py
@@ -0,0 +1,146 @@
+"""Tests for MLX-Knife 2.0 model naming rules and conversion.
+
+These tests document and verify the critical naming rules we discovered:
+1. Universal conversion: -- ↔ / (all occurrences)  
+2. Character constraints: single "-" extern, double "--" intern
+3. Corrupted cache tolerance: mechanical conversion, problems visible
+4. CLI compatibility: short names, @hash syntax, fuzzy matching
+"""
+
+import pytest
+import sys
+from pathlib import Path
+
+# Import MLX-Knife 2.0 modules
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from mlxk2.core.cache import hf_to_cache_dir, cache_dir_to_hf
+
+
+class TestNamingConversionRules:
+    """Test the fundamental -- ↔ / conversion rules."""
+    
+    def test_universal_conversion_rule(self):
+        """ALL -- ↔ / conversion (not just first occurrence)."""
+        # External → Internal: All "/" become "--"
+        assert hf_to_cache_dir("org/sub/model") == "models--org--sub--model"
+        assert hf_to_cache_dir("deep/nested/path/model") == "models--deep--nested--path--model"
+        
+        # Internal → External: All "--" become "/"  
+        assert cache_dir_to_hf("models--org--sub--model") == "org/sub/model"
+        assert cache_dir_to_hf("models--deep--nested--path--model") == "deep/nested/path/model"
+    
+    def test_bijective_conversion_clean_names(self):
+        """Clean names must convert bijectively (no information loss)."""
+        clean_names = [
+            "microsoft/DialoGPT-small",
+            "mlx-community/Phi-3-mini-4k-instruct-4bit",
+            "org-name/model-v1",  # Single dashes OK
+            "single-model",
+            "org/sub/model",  # Multi-level
+        ]
+        
+        for external in clean_names:
+            internal = hf_to_cache_dir(external)
+            recovered = cache_dir_to_hf(internal)
+            assert external == recovered, f"NOT BIJECTIVE: {external} → {internal} → {recovered}"
+    
+    def test_character_constraint_validation(self):
+        """Validate character constraints for clean conversion."""
+        # Clean external names: max 1 consecutive dash
+        valid_external = [
+            "org-name/model-v1",
+            "microsoft/DialoGPT-small"
+        ]
+        
+        for external in valid_external:
+            assert "--" not in external, f"Double dash in external name: {external}"
+            
+            internal = hf_to_cache_dir(external)
+            # Clean internal: max 2 consecutive dashes (separators only)
+            assert "---" not in internal, f"Triple dash in internal: {internal}"
+    
+    def test_corrupted_cache_mechanical_conversion(self):
+        """Corrupted cache entries get mechanical conversion (problems visible)."""
+        # These violate the clean naming rules but should convert gracefully
+        corrupted_cases = [
+            ("models--org--model---corrupted", "org/model/-corrupted"),  # Triple dash → empty segment
+            ("models--microsoft--DialogGPT---small", "microsoft/DialogGPT/-small"),  # Problem visible
+            ("models--org----model", "org//model"),  # Quadruple dash → empty segment
+        ]
+        
+        for corrupted_internal, expected_external in corrupted_cases:
+            result = cache_dir_to_hf(corrupted_internal)
+            assert result == expected_external, f"Mechanical conversion failed: {corrupted_internal}"
+            # Problem must be visible in result
+            assert ("/-" in result or "//" in result), f"Corruption not visible in: {result}"
+
+
+class TestModelResolutionLogic:
+    """Test CLI compatibility features: expansion, @hash, fuzzy matching."""
+    
+    def test_hash_syntax_parsing(self):
+        """@hash syntax must parse correctly."""
+        from mlxk2.core.model_resolution import parse_model_spec
+        
+        # With hash
+        model, hash_val = parse_model_spec("Qwen3@e96")
+        assert hash_val == "e96"
+        assert "@" not in model  # Hash removed from model name
+        
+        # Without hash  
+        model, hash_val = parse_model_spec("Phi-3-mini")
+        assert hash_val is None
+        assert model == "Phi-3-mini"  # Would be expanded by expand_model_name
+    
+    def test_short_name_expansion_logic(self):
+        """Short names should try mlx-community first, then return as-is."""
+        from mlxk2.core.model_resolution import expand_model_name
+        
+        # Names with org should not be expanded
+        assert expand_model_name("microsoft/DialoGPT-small") == "microsoft/DialoGPT-small"
+        
+        # Single names return as-is (no pattern forcing!)
+        assert expand_model_name("nonexistent-model") == "nonexistent-model"
+        
+        # NOTE: mlx-community expansion requires actual cache, tested in integration tests
+    
+    def test_fuzzy_matching_pattern(self):
+        """Fuzzy matching should be case-insensitive partial matching."""
+        from mlxk2.core.model_resolution import find_matching_models
+        
+        # Empty cache returns empty list
+        matches = find_matching_models("anything")
+        assert isinstance(matches, list)  # Should not crash
+        
+        # NOTE: Real fuzzy matching requires actual cache, tested in integration tests
+
+
+class TestErrorHandlingRobustness:
+    """Test that edge cases don't crash the system."""
+    
+    def test_empty_and_invalid_inputs(self):
+        """Empty or invalid inputs should not crash."""
+        # Empty strings
+        assert hf_to_cache_dir("") == "models--"
+        assert cache_dir_to_hf("models--") == ""
+        
+        # Invalid formats
+        assert cache_dir_to_hf("invalid-format") == "invalid-format"
+        assert cache_dir_to_hf("models--") == ""
+    
+    def test_resolution_with_invalid_inputs(self):
+        """Model resolution should handle invalid inputs gracefully."""
+        from mlxk2.core.model_resolution import resolve_model_for_operation
+        
+        # Should return some response, not crash
+        result = resolve_model_for_operation("")
+        assert result is not None
+        assert len(result) == 3  # (name, hash, matches)
+        
+        result = resolve_model_for_operation("nonexistent@invalidhash")  
+        assert result is not None
+        assert len(result) == 3
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
\ No newline at end of file
diff --git a/tests_2.0/test_push_dry_run.py b/tests_2.0/test_push_dry_run.py
new file mode 100644
index 0000000..8f98e90
--- /dev/null
+++ b/tests_2.0/test_push_dry_run.py
@@ -0,0 +1,121 @@
+"""Dry-run tests for experimental push (offline, no network).
+
+Covers repo-missing, existing-no-changes, and existing-with-changes cases.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+from pathlib import Path
+
+import pytest
+
+# Push tests now run by default (alpha features included in standard test suite)
+from types import SimpleNamespace
+
+import pytest
+
+from mlxk2.operations.push import push_operation, DEFAULT_PUSH_BRANCH
+from mlxk2.output.human import render_push
+
+
+def _install_fake_hf(monkeypatch, *, repo_exists: bool = True, branch_exists: bool = True, remote_files: list[str] | None = None):
+    class _Errors:
+        class HfHubHTTPError(Exception):
+            pass
+
+        class RepositoryNotFoundError(Exception):
+            pass
+
+        class RevisionNotFoundError(Exception):
+            pass
+
+    class _Api:
+        def __init__(self, token=None):
+            self.token = token
+
+        def repo_info(self, repo_id: str, repo_type: str, revision: str):
+            if not repo_exists:
+                raise _Errors.RepositoryNotFoundError("not found")
+            if not branch_exists:
+                raise _Errors.RevisionNotFoundError("rev not found")
+            return {"id": repo_id, "type": repo_type, "rev": revision}
+
+        def list_repo_files(self, repo_id: str, repo_type: str, revision: str):
+            return list(remote_files or [])
+
+        # create_repo is only called when create=True (not used in dry-run tests)
+        def create_repo(self, repo_id: str, repo_type: str, private: bool, exist_ok: bool):
+            return {"ok": True}
+
+    fake = SimpleNamespace(HfApi=_Api, upload_folder=None, errors=_Errors)
+    # Use monkeypatch to ensure automatic restoration after each test
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake)
+    monkeypatch.setitem(sys.modules, "huggingface_hub.errors", _Errors)
+
+
+def test_dry_run_repo_missing(tmp_path: Path, monkeypatch):
+    # Workspace with files; one ignored by default, one via .hfignore
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "keep.txt").write_text("x")
+    (ws / ".DS_Store").write_text("x")  # default ignore
+    (ws / "ignored.log").write_text("x")
+    (ws / ".hfignore").write_text("ignored.log\n")
+
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    _install_fake_hf(monkeypatch, repo_exists=False)
+
+    res = push_operation(str(ws), "org/model", branch=DEFAULT_PUSH_BRANCH, dry_run=True)
+    assert res["status"] == "success"
+    d = res["data"]
+    assert d.get("dry_run") is True
+    assert d.get("would_create_repo") is True
+    assert d.get("would_create_branch") is True
+    # Only keep.txt should be counted (others ignored)
+    assert d.get("dry_run_summary", {}).get("added") == 1
+    # Human line
+    line = render_push(res)
+    assert "dry-run:" in line
+
+
+def test_dry_run_existing_no_changes(tmp_path: Path, monkeypatch):
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "a.txt").write_text("1")
+    (ws / "b.txt").write_text("2")
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    _install_fake_hf(monkeypatch, repo_exists=True, branch_exists=True, remote_files=["a.txt", "b.txt"]) 
+
+    res = push_operation(str(ws), "org/model", branch=DEFAULT_PUSH_BRANCH, dry_run=True)
+    assert res["status"] == "success"
+    d = res["data"]
+    assert d.get("dry_run") is True
+    assert d.get("no_changes") is True
+    assert d.get("dry_run_summary", {}).get("added") == 0
+    assert d.get("dry_run_summary", {}).get("deleted") == 0
+    assert d.get("message") == "Dry-run: no changes"
+
+
+def test_dry_run_existing_with_changes(tmp_path: Path, monkeypatch):
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    # Local: a.txt (shared), new.txt (to add)
+    (ws / "a.txt").write_text("1")
+    (ws / "new.txt").write_text("x")
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    # Remote: a.txt (shared), gone.txt (to delete)
+    _install_fake_hf(monkeypatch, repo_exists=True, branch_exists=True, remote_files=["a.txt", "gone.txt"]) 
+
+    res = push_operation(str(ws), "org/model", branch=DEFAULT_PUSH_BRANCH, dry_run=True)
+    assert res["status"] == "success"
+    d = res["data"]
+    assert d.get("dry_run") is True
+    assert d.get("no_changes") is False
+    assert d.get("dry_run_summary", {}).get("added") == 1
+    assert d.get("dry_run_summary", {}).get("deleted") == 1
+    assert d.get("message") == "Dry-run: +1 ~? -1"
+    # Human line should reflect plan
+    line = render_push(res)
+    assert "dry-run: +1 ~? -1" in line
diff --git a/tests_2.0/test_push_extended.py b/tests_2.0/test_push_extended.py
new file mode 100644
index 0000000..9dd2d3b
--- /dev/null
+++ b/tests_2.0/test_push_extended.py
@@ -0,0 +1,303 @@
+"""Extended offline tests for experimental push.
+
+These tests monkeypatch a fake `huggingface_hub` to avoid network
+and validate:
+- no-op (no changes) behavior and message/log propagation
+- change summary (+/~/-) extraction from returned commit info
+- repo/branch existence handling (`--create`, missing branch tolerated)
+- .hfignore merge with default ignore patterns
+- human output rendering including --verbose extras
+"""
+
+from __future__ import annotations
+
+import os
+import pytest
+
+# Skip all tests if push is not enabled
+# Push tests now run by default (alpha features included in standard test suite)
+
+import logging
+import sys
+from pathlib import Path
+from types import SimpleNamespace
+
+import pytest
+
+from mlxk2.operations.push import push_operation, DEFAULT_PUSH_BRANCH
+from mlxk2.output.human import render_push
+
+
+class _Errors(SimpleNamespace):
+    class HfHubHTTPError(Exception):
+        pass
+
+    class RepositoryNotFoundError(Exception):
+        pass
+
+    class RevisionNotFoundError(Exception):
+        pass
+
+
+class _FakeHfApi:
+    def __init__(self, token: str | None = None) -> None:
+        self.token = token
+        self.created = False
+
+    def repo_info(self, repo_id: str, repo_type: str, revision: str):
+        # Default: repo + branch exist
+        return {"id": repo_id, "type": repo_type, "rev": revision}
+
+    def create_repo(self, repo_id: str, repo_type: str, private: bool, exist_ok: bool):
+        self.created = True
+        return {"created": True, "private": private}
+
+
+def _install_fake_hub(monkeypatch, *, mode: str, capture_patterns: dict | None = None):
+    """Install a fake huggingface_hub into sys.modules.
+
+    mode:
+      - "no_changes": upload returns object without commit_id and emits hub log
+      - "with_changes": upload returns commit and files ops
+    capture_patterns: optional dict to capture kwargs from upload_folder
+    """
+
+    api = _FakeHfApi
+
+    def upload_folder(**kwargs):  # type: ignore[override]
+        # Record ignore_patterns if requested
+        if capture_patterns is not None:
+            capture_patterns["ignore_patterns"] = list(kwargs.get("ignore_patterns") or [])
+
+        if mode == "no_changes":
+            # Emit a hub-like info message
+            logging.getLogger("huggingface_hub").info(
+                "No files have been modified since last commit. Skipping to prevent empty commit."
+            )
+            # Return object without commit id and without files
+            return SimpleNamespace()
+        elif mode == "with_changes":
+            files = [
+                SimpleNamespace(operation="add"),
+                SimpleNamespace(operation="update"),
+                SimpleNamespace(operation="delete"),
+            ]
+            return SimpleNamespace(
+                commit_id="abcdef1234567890abcdef1234567890abcdef12",
+                commit_url="https://huggingface.co/user/repo/commit/abcdef1",
+                files=files,
+            )
+        else:
+            return SimpleNamespace(commit_id="cafebabe" * 5)
+
+    fake = SimpleNamespace(HfApi=api, upload_folder=upload_folder, errors=_Errors)
+    sys.modules["huggingface_hub"] = fake  # type: ignore
+    sys.modules["huggingface_hub.errors"] = _Errors  # type: ignore
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake)
+    monkeypatch.setitem(sys.modules, "huggingface_hub.errors", _Errors)
+    return fake
+
+
+def test_push_no_changes_offline(tmp_path, monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "README.md").write_text("x")
+
+    _install_fake_hub(monkeypatch, mode="no_changes")
+
+    res = push_operation(str(ws), "user/repo", branch=DEFAULT_PUSH_BRANCH)
+    assert res["status"] == "success"
+    assert res["data"]["no_changes"] is True
+    assert res["data"]["uploaded_files_count"] == 0
+    # Hub message should be reflected in JSON message or hf_logs
+    msg = res["data"].get("message") or ""
+    logs = res["data"].get("hf_logs") or []
+    assert isinstance(logs, list)
+    assert ("No files have been modified" in msg) or any(
+        isinstance(l, str) and "No files have been modified" in l for l in logs
+    )
+
+    # Human output should show "no changes" and not duplicate hub logs
+    line = render_push(res)
+    assert "no changes" in line
+    assert "No files have been modified" not in line
+
+
+def test_push_with_changes_summary_and_url(tmp_path, monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "file.txt").write_text("x")
+
+    _install_fake_hub(monkeypatch, mode="with_changes")
+
+    res = push_operation(str(ws), "user/repo", branch=DEFAULT_PUSH_BRANCH)
+    assert res["status"] == "success"
+    assert res["data"]["no_changes"] is False
+    assert res["data"]["uploaded_files_count"] == 3
+    assert res["data"]["change_summary"] == {"added": 1, "modified": 1, "deleted": 1}
+    assert res["data"]["commit_url"].startswith("https://huggingface.co/")
+
+    # Human output with verbose includes URL
+    verbose_line = render_push(res, verbose=True)
+    assert "commit" in verbose_line and "http" in verbose_line
+
+
+def test_push_repo_not_found_requires_create(tmp_path, monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "file.txt").write_text("x")
+
+    # Fake API that raises repo not found
+    class _ApiMissing(_FakeHfApi):
+        def repo_info(self, repo_id: str, repo_type: str, revision: str):  # type: ignore[override]
+            raise _Errors.RepositoryNotFoundError()
+
+    def upload_folder(**kwargs):  # type: ignore
+        return SimpleNamespace(commit_id="deadbeefdeadbeefdeadbeefdeadbeefdeadbeef")
+
+    fake = SimpleNamespace(HfApi=_ApiMissing, upload_folder=upload_folder, errors=_Errors)
+    sys.modules["huggingface_hub"] = fake  # type: ignore
+    sys.modules["huggingface_hub.errors"] = _Errors  # type: ignore
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake)
+    monkeypatch.setitem(sys.modules, "huggingface_hub.errors", _Errors)
+
+    # Without --create → error
+    res = push_operation(str(ws), "user/repo", create=False, private=False, branch=DEFAULT_PUSH_BRANCH)
+    assert res["status"] == "error"
+    assert res["error"]["type"] == "repo_not_found"
+
+    # With --create → success and created_repo True
+    res2 = push_operation(str(ws), "user/repo", create=True, private=True, branch=DEFAULT_PUSH_BRANCH)
+    assert res2["status"] == "success"
+    assert res2["data"]["created_repo"] is True
+
+
+def test_push_branch_missing_is_tolerated(tmp_path, monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "file.txt").write_text("x")
+
+    class _ApiNoBranch(_FakeHfApi):
+        def repo_info(self, repo_id: str, repo_type: str, revision: str):  # type: ignore[override]
+            raise _Errors.RevisionNotFoundError()
+
+    def upload_folder(**kwargs):  # type: ignore
+        return SimpleNamespace(commit_id="feedfacefeedfacefeedfacefeedfacefeedface")
+
+    fake = SimpleNamespace(HfApi=_ApiNoBranch, upload_folder=upload_folder, errors=_Errors)
+    sys.modules["huggingface_hub"] = fake  # type: ignore
+    sys.modules["huggingface_hub.errors"] = _Errors  # type: ignore
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake)
+    monkeypatch.setitem(sys.modules, "huggingface_hub.errors", _Errors)
+
+    res = push_operation(str(ws), "user/repo", branch=DEFAULT_PUSH_BRANCH)
+    assert res["status"] == "success"
+    assert isinstance(res["data"].get("commit_sha"), str)
+
+
+def test_push_hfignore_is_merged_with_defaults(tmp_path, monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    # Create files and .hfignore
+    (ws / "README.md").write_text("x")
+    (ws / ".hfignore").write_text(".idea/\n.vscode/\n*.ipynb\n")
+
+    captured: dict = {}
+    _install_fake_hub(monkeypatch, mode="with_changes", capture_patterns=captured)
+
+    res = push_operation(str(ws), "user/repo", branch=DEFAULT_PUSH_BRANCH)
+    assert res["status"] == "success"
+    pats = captured.get("ignore_patterns") or []
+    # Ensure core defaults are present
+    defaults = {"**/.git/**", "**/.DS_Store", "**/__pycache__/**", "**/.venv/**", "**/venv/**", "**/*.pyc"}
+    assert defaults.issubset(set(pats))
+    # Ensure .hfignore additions are present
+    assert ".idea/" in pats and ".vscode/" in pats and "*.ipynb" in pats
+
+
+def test_push_retry_creates_branch_on_upload_revision_error(tmp_path, monkeypatch):
+    """If upload fails with a revision-not-found style error and --create is set,
+    the operation should create the branch and retry once, succeeding offline."""
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "file.txt").write_text("x")
+
+    class _ApiOk(_FakeHfApi):
+        instance = None  # type: ignore[var-annotated]
+
+        def __init__(self, token: str | None = None) -> None:  # type: ignore[override]
+            super().__init__(token)
+            self.created_branches: list[tuple[str, str]] = []
+            _ApiOk.instance = self
+
+        def create_branch(self, repo_id: str, repo_type: str, branch: str):  # type: ignore[override]
+            self.created_branches.append((repo_id, branch))
+            return {"ok": True}
+
+    state = {"attempt": 0}
+
+    def upload_folder(**kwargs):  # type: ignore
+        # First attempt fails with a hub-like error; second succeeds
+        if state["attempt"] == 0:
+            state["attempt"] += 1
+            raise _Errors.HfHubHTTPError("Invalid rev id: test-branch")
+        state["attempt"] += 1
+        return SimpleNamespace(commit_id="0123456789abcdef0123456789abcdef01234567")
+
+    fake = SimpleNamespace(HfApi=_ApiOk, upload_folder=upload_folder, errors=_Errors)
+    sys.modules["huggingface_hub"] = fake  # type: ignore
+    sys.modules["huggingface_hub.errors"] = _Errors  # type: ignore
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake)
+    monkeypatch.setitem(sys.modules, "huggingface_hub.errors", _Errors)
+
+    res = push_operation(str(ws), "user/repo", create=True, private=True, branch="test-branch")
+    assert res["status"] == "success"
+    # Ensure we retried exactly once (two attempts total)
+    assert state["attempt"] == 2
+    # Ensure branch creation was attempted once
+    assert _ApiOk.instance is not None
+    assert ("user/repo", "test-branch") in (_ApiOk.instance.created_branches if _ApiOk.instance else [])
+
+
+def test_push_apfs_warning_added_for_non_apfs_workspace(tmp_path, monkeypatch):
+    """Test that push adds APFS warning to message for non-APFS workspaces (ADR-007)."""
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "file.txt").write_text("test content")
+
+    # Mock APFS detection to return False (non-APFS workspace)
+    with monkeypatch.context() as m:
+        m.setattr("mlxk2.operations.push._is_apfs_filesystem", lambda path: False)
+
+        _install_fake_hub(monkeypatch, mode="with_changes")
+
+        res = push_operation(str(ws), "user/repo", branch=DEFAULT_PUSH_BRANCH)
+
+        assert res["status"] == "success"
+        assert "Clone operations require APFS filesystem" in res["data"]["message"]
+
+
+def test_push_no_apfs_warning_for_apfs_workspace(tmp_path, monkeypatch):
+    """Test that push does NOT add APFS warning for APFS workspaces (ADR-007)."""
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "file.txt").write_text("test content")
+
+    # Mock APFS detection to return True (APFS workspace)
+    with monkeypatch.context() as m:
+        m.setattr("mlxk2.operations.push._is_apfs_filesystem", lambda path: True)
+
+        _install_fake_hub(monkeypatch, mode="with_changes")
+
+        res = push_operation(str(ws), "user/repo", branch=DEFAULT_PUSH_BRANCH)
+
+        assert res["status"] == "success"
+        assert "Clone operations require APFS filesystem" not in res["data"]["message"]
diff --git a/tests_2.0/test_push_minimal.py b/tests_2.0/test_push_minimal.py
new file mode 100644
index 0000000..2b06fa7
--- /dev/null
+++ b/tests_2.0/test_push_minimal.py
@@ -0,0 +1,39 @@
+"""Minimal offline tests for experimental push operation (M0).
+
+These tests avoid any network access and only validate local preconditions
+and JSON envelope/fields.
+"""
+
+import os
+from pathlib import Path
+
+import pytest
+
+# Skip all tests if push is not enabled
+# Push tests now run by default (alpha features included in standard test suite)
+
+from mlxk2.operations.push import push_operation, DEFAULT_PUSH_BRANCH
+
+
+def test_push_requires_token(tmp_path, monkeypatch):
+    # Ensure no token present
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+
+    d: Path = tmp_path / "workspace"
+    d.mkdir()
+    (d / "README.md").write_text("hello")
+
+    res = push_operation(str(d), "org/model", branch=DEFAULT_PUSH_BRANCH)
+    assert res["command"] == "push"
+    assert res["status"] == "error"
+    assert res["error"]["type"] == "auth_error"
+    assert res["data"]["repo_id"] == "org/model"
+    assert res["data"]["branch"] == DEFAULT_PUSH_BRANCH
+
+
+def test_push_workspace_missing(monkeypatch, tmp_path):
+    monkeypatch.setenv("HF_TOKEN", "dummy")
+    missing = tmp_path / "nope"
+    res = push_operation(str(missing), "org/model", branch=DEFAULT_PUSH_BRANCH)
+    assert res["status"] == "error"
+    assert res["error"]["type"] == "workspace_not_found"
diff --git a/tests_2.0/test_push_workspace_check.py b/tests_2.0/test_push_workspace_check.py
new file mode 100644
index 0000000..5d7665c
--- /dev/null
+++ b/tests_2.0/test_push_workspace_check.py
@@ -0,0 +1,75 @@
+"""Offline tests for push --check-only (workspace health)."""
+
+from __future__ import annotations
+
+import json
+import os
+from pathlib import Path
+
+import pytest
+
+# Skip all tests if push is not enabled
+# Push tests now run by default (alpha features included in standard test suite)
+
+from mlxk2.operations.push import push_operation, DEFAULT_PUSH_BRANCH
+
+
+def test_check_only_minimal_invalid_config(tmp_path):
+    ws: Path = tmp_path / "ws"
+    ws.mkdir()
+    # Invalid JSON config
+    (ws / "config.json").write_text("{")
+    # A dummy weight file
+    (ws / "model.safetensors").write_text("data")
+
+    res = push_operation(str(ws), "org/model", branch=DEFAULT_PUSH_BRANCH, check_only=True)
+    assert res["status"] == "success"
+    diag = res["data"]["workspace_health"]
+    assert diag["config"]["exists"] is True
+    assert diag["config"]["valid_json"] is False
+    assert diag["healthy"] is False
+    assert any(a["code"] == "config_invalid_json" for a in diag["anomalies"])
+
+
+def test_check_only_index_missing_shard(tmp_path):
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "config.json").write_text('{"model_type": "base"}')
+    # Index references a missing shard
+    idx = {"weight_map": {"w0": "model-00001-of-00002.safetensors", "w1": "model-00002-of-00002.safetensors"}}
+    (ws / "model.safetensors.index.json").write_text(json.dumps(idx))
+    # Create only one shard
+    (ws / "model-00001-of-00002.safetensors").write_text("x")
+
+    res = push_operation(str(ws), "org/model", branch=DEFAULT_PUSH_BRANCH, check_only=True)
+    diag = res["data"]["workspace_health"]
+    assert diag["healthy"] is False
+    assert any(a["code"] == "index_missing_shard" for a in diag["anomalies"])
+
+
+def test_check_only_gguf_single_file_ok(tmp_path):
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "config.json").write_text('{"model_type": "base"}')
+    # Single GGUF file
+    (ws / "model.gguf").write_bytes(b"\x00\x01\x02")
+
+    res = push_operation(str(ws), "org/model", branch=DEFAULT_PUSH_BRANCH, check_only=True)
+    diag = res["data"]["workspace_health"]
+    assert diag["healthy"] is True
+    assert diag["weights"]["count"] == 1
+    assert "gguf" in diag["weights"]["formats"]
+
+
+def test_check_only_lfs_pointer_detected(tmp_path):
+    ws = tmp_path / "ws"
+    ws.mkdir()
+    (ws / "config.json").write_text("{}")
+    # Create a small LFS pointer file
+    lfs = (ws / "pytorch_model.bin")
+    lfs.write_text("version https://git-lfs.github.com/spec/v1\nOID sha256:abc\nsize 123\n")
+
+    res = push_operation(str(ws), "org/model", branch=DEFAULT_PUSH_BRANCH, check_only=True)
+    diag = res["data"]["workspace_health"]
+    assert diag["healthy"] is False
+    assert any(a["code"] == "lfs_pointer_detected" for a in diag["anomalies"])
diff --git a/tests_2.0/test_robustness.py b/tests_2.0/test_robustness.py
new file mode 100644
index 0000000..8bd7b3e
--- /dev/null
+++ b/tests_2.0/test_robustness.py
@@ -0,0 +1,233 @@
+"""Robustness tests for critical rm and pull operations.
+
+These tests ensure user-cache safety and robust error handling
+for operations that modify the user's model cache.
+"""
+
+import pytest
+import tempfile
+import shutil
+from pathlib import Path
+from unittest.mock import patch, MagicMock
+
+from mlxk2.operations.rm import rm_operation
+from mlxk2.operations.pull import pull_operation
+
+
+class TestRmOperationRobustness:
+    """Test rm operation robustness with user cache safety."""
+    
+    def test_rm_force_flag_skips_all_confirmations(self, mock_models, isolated_cache):
+        """Critical: Force flag must skip ALL confirmations (Issue #23 regression)."""
+        # Get a model from mock cache
+        from conftest import test_list_models
+        models = test_list_models(isolated_cache)["data"]["models"]
+        
+        # Filter out sentinel model and get a real mock model
+        real_models = [m for m in models if "TEST-CACHE-SENTINEL" not in m["name"]]
+        if not real_models:
+            pytest.skip("No real models in mock cache for force flag testing")
+        
+        target_model = real_models[0]["name"]
+        
+        # Force flag should work without any interactive prompts
+        with patch('builtins.input') as mock_input:
+            result = rm_operation(target_model, force=True)
+            
+            # Should never call input() when force=True
+            mock_input.assert_not_called()
+            
+            # Should either succeed or fail with clear reason (never prompt)
+            assert result["status"] in ["success", "error"]
+    
+    def test_rm_without_force_handles_nonexistent_gracefully(self, mock_models):
+        """Test rm without force flag handles nonexistent models gracefully."""
+        result = rm_operation("definitely-nonexistent-model-12345", force=False)
+        
+        assert result["status"] == "error"
+        assert "not found" in result["error"]["message"].lower() or "no models found" in result["error"]["message"].lower()
+    
+    def test_rm_permission_error_handling(self, mock_models, isolated_cache):
+        """Test rm handles permission errors gracefully."""
+        from conftest import atomic_cache_context, test_list_models
+        from mlxk2.operations.rm import rm_operation
+        
+        with atomic_cache_context(isolated_cache, "test"):
+            # Get models in test cache context
+            models = test_list_models(isolated_cache)["data"]["models"]
+            
+            # Filter out sentinel model and get a real mock model
+            real_models = [m for m in models if "TEST-CACHE-SENTINEL" not in m["name"]]  
+            if not real_models:
+                pytest.skip("No real models in mock cache for permission testing")
+            
+            target_model = real_models[0]["name"]
+            
+            # Mock permission error
+            with patch('shutil.rmtree', side_effect=PermissionError("Permission denied")):
+                result = rm_operation(target_model, force=True)
+                
+                assert result["status"] == "error"
+                assert "permission" in result["error"]["message"].lower()
+    
+    def test_rm_partial_deletion_recovery(self, mock_models, isolated_cache):
+        """Test rm handles interrupted deletion gracefully."""
+        from conftest import atomic_cache_context, test_list_models
+        from mlxk2.operations.rm import rm_operation
+        
+        with atomic_cache_context(isolated_cache, "test"):
+            # Get models in test cache context
+            models = test_list_models(isolated_cache)["data"]["models"]
+            
+            # Filter out sentinel model and get a real mock model
+            real_models = [m for m in models if "TEST-CACHE-SENTINEL" not in m["name"]]
+            if not real_models:
+                pytest.skip("No real models in mock cache for partial deletion testing")
+            
+            target_model = real_models[0]["name"]
+            
+            # Mock partial failure (some files deleted, then error)
+            call_count = 0
+            def mock_rmtree_partial_fail(path):
+                nonlocal call_count
+                call_count += 1
+                if call_count == 1:
+                    # First call succeeds (partial deletion)
+                    pass
+                else:
+                    # Second call fails
+                    raise OSError("Device busy")
+            
+            with patch('shutil.rmtree', side_effect=mock_rmtree_partial_fail):
+                result = rm_operation(target_model, force=True)
+                
+                # Should handle partial failure gracefully
+                assert result["status"] in ["success", "error"]
+                if result["status"] == "error":
+                    assert "error" in result["error"]["message"].lower()
+
+
+class TestPullOperationRobustness:
+    """Test pull operation robustness and error handling."""
+    
+    def test_pull_model_name_validation(self):
+        """Test pull validates model names before network operations."""
+        # Test 96 character limit
+        long_name = "a" * 100
+        result = pull_operation(long_name)
+        
+        assert result["status"] == "error"
+        # Should fail validation before attempting network operation
+        assert "name" in result["error"]["message"].lower() or "invalid" in result["error"]["message"].lower()
+    
+    def test_pull_network_timeout_handling(self, monkeypatch):
+        """Test pull handles network timeouts gracefully."""
+        # Set dummy token to pass preflight checks
+        monkeypatch.setenv("HF_TOKEN", "dummy-token")
+
+        # Mock preflight to succeed and pull to timeout
+        with patch('mlxk2.operations.pull.preflight_repo_access', return_value=(True, None)), \
+             patch('mlxk2.operations.pull.pull_model_with_huggingface_hub', side_effect=TimeoutError("Network timeout")):
+            result = pull_operation("test-model")
+
+            assert result["status"] == "error"
+            assert "timeout" in result["error"]["message"].lower() or "network" in result["error"]["message"].lower() or "error" in result["error"]["message"].lower()
+    
+    def test_pull_disk_space_validation(self, isolated_cache):
+        """Test pull checks available disk space before download."""
+        # Mock disk space check
+        with patch('shutil.disk_usage', return_value=(1000, 900, 100)):  # Only 100 bytes free
+            result = pull_operation("mlx-community/Phi-3-mini-4k-instruct-4bit")
+            
+            # Should either succeed (if no disk check implemented) or fail gracefully
+            assert result["status"] in ["success", "error"]
+            if result["status"] == "error":
+                # Error message should be helpful
+                assert len(result["error"]["message"]) > 0
+    
+    def test_pull_invalid_repo_early_validation(self):
+        """Test pull validates repo format before network calls."""
+        invalid_repos = [
+            "",  # Empty
+            "no-slash",  # No org/model format (might be valid short name though)
+            "org//model",  # Double slash
+            "/org/model",  # Leading slash
+            "org/model/",  # Trailing slash
+        ]
+        
+        for invalid_repo in invalid_repos:
+            if not invalid_repo.strip():  # Skip empty strings
+                result = pull_operation(invalid_repo)
+                assert result["status"] == "error"
+                assert len(result["error"]["message"]) > 0
+    
+    def test_pull_concurrent_download_prevention(self, mock_models):
+        """Test pull prevents concurrent downloads of same model."""
+        model_name = "test-concurrent-model"
+        
+        # Mock a long-running download
+        with patch('subprocess.run', side_effect=lambda *args, **kwargs: __import__('time').sleep(0.1)):
+            # Start first download (simulate in progress)
+            import threading
+            
+            first_result = [None]
+            def first_download():
+                first_result[0] = pull_operation(model_name)
+            
+            # Start first download in background
+            thread1 = threading.Thread(target=first_download)
+            thread1.start()
+            
+            # Try concurrent download (should detect ongoing download)
+            result2 = pull_operation(model_name)
+            
+            thread1.join(timeout=1.0)  # Wait for first to complete
+            
+            # At least one should complete successfully, and system should handle concurrent access
+            assert isinstance(result2, dict)
+            assert result2["status"] in ["success", "error"]
+
+
+class TestCacheIntegrityRobustness:
+    """Test cache integrity and corruption handling."""
+    
+    def test_operations_with_corrupted_cache_entries(self, create_corrupted_cache_entry):
+        """Test that operations handle corrupted cache entries gracefully."""
+        # Create corrupted entry
+        cache_path = create_corrupted_cache_entry("models--corrupted---entry").parent
+        
+        # List should not crash with corrupted entries
+        from conftest import test_list_models
+        result = test_list_models(cache_path)
+        
+        assert result["status"] == "success"
+        # Should include corrupted entry but mark it as such
+        corrupted_models = [m for m in result["data"]["models"] if "/-" in m["name"] or m["name"].startswith("-")]
+        assert len(corrupted_models) >= 1
+    
+    def test_cache_recovery_after_interruption(self, isolated_cache):
+        """Test system recovers gracefully from interrupted operations."""
+        # Create partial model directory (simulate interrupted download)
+        partial_model_dir = isolated_cache / "models--test--partial-model"
+        partial_model_dir.mkdir(parents=True)
+        
+        # Create snapshots dir but no content (interrupted state)
+        snapshots_dir = partial_model_dir / "snapshots"
+        snapshots_dir.mkdir()
+        
+        # Operations should handle partial state
+        from conftest import test_list_models
+        result = test_list_models(isolated_cache)
+        
+        assert result["status"] == "success"
+        # Should either exclude partial model or mark it as unhealthy
+        model_names = [m["name"] for m in result["data"]["models"]]
+        if "test/partial-model" in model_names:
+            # If included, should be marked somehow as problematic
+            partial_model = next(m for m in result["data"]["models"] if m["name"] == "test/partial-model")
+            # Could be marked with different framework or size indicating incomplete
+            assert partial_model is not None
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
\ No newline at end of file
diff --git a/tests_2.0/test_run_complete.py b/tests_2.0/test_run_complete.py
new file mode 100644
index 0000000..78047ab
--- /dev/null
+++ b/tests_2.0/test_run_complete.py
@@ -0,0 +1,463 @@
+"""
+Complete run command functionality tests for Step 1.1/1.2.
+Tests all run command scenarios as specified in 2.0-TEST-SPECIFICATIONS.md.
+"""
+
+import pytest
+import tempfile
+from unittest.mock import Mock, patch, call
+from pathlib import Path
+from io import StringIO
+import sys
+
+from mlxk2.operations.run import run_model, interactive_chat, single_shot_generation
+from mlxk2.core.runner import MLXRunner
+
+
+@pytest.fixture
+def mock_runner_complete():
+    """Complete mock runner for run command tests."""
+    with patch('mlxk2.operations.run.MLXRunner') as mock_runner_class:
+        mock_runner = Mock()
+        mock_runner_class.return_value.__enter__.return_value = mock_runner
+        mock_runner_class.return_value.__exit__.return_value = None
+        
+        # Mock generation methods
+        mock_runner.generate_streaming.return_value = iter(["Hello", " ", "world", "!"])
+        mock_runner.generate_batch.return_value = "Hello world!"
+        mock_runner._format_conversation.return_value = "Formatted conversation"
+        
+        yield mock_runner
+
+
+class TestRunBasic:
+    """Basic run command functionality tests."""
+    
+    def test_run_single_shot_streaming(self, mock_runner_complete):
+        """mlxk run model "prompt" - streaming mode"""
+        with patch('sys.stdout', new=StringIO()) as fake_out:
+            result = run_model(
+                model_spec="test-model",
+                prompt="test prompt",
+                stream=True,
+                json_output=False
+            )
+        
+        # Should have called generate_streaming
+        mock_runner_complete.generate_streaming.assert_called_once()
+        
+        # Should print streaming output
+        output = fake_out.getvalue()
+        assert "Hello world!" in output
+        
+        # Non-JSON mode returns None
+        assert result is None
+    
+    def test_run_single_shot_batch(self, mock_runner_complete):
+        """mlxk run model "prompt" --no-stream - batch mode"""
+        with patch('sys.stdout', new=StringIO()) as fake_out:
+            result = run_model(
+                model_spec="test-model",
+                prompt="test prompt",
+                stream=False,
+                json_output=False
+            )
+        
+        # Should have called generate_batch
+        mock_runner_complete.generate_batch.assert_called_once()
+        
+        # Should print batch output
+        output = fake_out.getvalue()
+        assert "Hello world!" in output
+        
+        # Non-JSON mode returns None
+        assert result is None
+    
+    def test_run_single_shot_json_output(self, mock_runner_complete):
+        """Test JSON output mode for single-shot"""
+        result = run_model(
+            model_spec="test-model",
+            prompt="test prompt",
+            stream=False,
+            json_output=True
+        )
+        
+        # Should return the generated text
+        assert result == "Hello world!"
+    
+    def test_run_interactive_streaming(self, mock_runner_complete):
+        """mlxk run model (no prompt) - interactive streaming mode"""
+        # Mock user input
+        with patch('builtins.input', side_effect=["hello", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                result = run_model(
+                    model_spec="test-model",
+                    prompt=None,  # Interactive mode
+                    stream=True,
+                    json_output=False
+                )
+        
+        # Should have called format_conversation and generate_streaming
+        mock_runner_complete._format_conversation.assert_called()
+        mock_runner_complete.generate_streaming.assert_called()
+        
+        # Should show interactive prompts
+        output = fake_out.getvalue()
+        assert "Starting interactive chat" in output
+        assert "You:" in output or "Assistant:" in output
+    
+    def test_run_interactive_batch(self, mock_runner_complete):
+        """mlxk run model --no-stream (no prompt) - interactive batch mode"""
+        # Mock user input
+        with patch('builtins.input', side_effect=["hello", "quit"]):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                result = run_model(
+                    model_spec="test-model",
+                    prompt=None,  # Interactive mode
+                    stream=False,
+                    json_output=False
+                )
+        
+        # Should have called format_conversation and generate_batch
+        mock_runner_complete._format_conversation.assert_called()
+        mock_runner_complete.generate_batch.assert_called()
+    
+    def test_run_interactive_json_incompatible(self, mock_runner_complete):
+        """Interactive mode should not work with JSON output"""
+        with patch('sys.stdout', new=StringIO()) as fake_out:
+            result = run_model(
+                model_spec="test-model",
+                prompt=None,  # Interactive mode
+                json_output=True
+            )
+        
+        output = fake_out.getvalue()
+        assert "not compatible with JSON output" in output
+        assert result is None
+
+
+class TestRunParameters:
+    """Test parameter passing and configuration."""
+    
+    def test_run_full_context_tokens(self, mock_runner_complete):
+        """Test that run command uses full model context by default"""
+        run_model(
+            model_spec="test-model",
+            prompt="test",
+            max_tokens=None  # Should use dynamic (full context)
+        )
+        
+        # Should call with None max_tokens (dynamic calculation)
+        call_args = mock_runner_complete.generate_streaming.call_args
+        assert call_args[1]['max_tokens'] is None
+    
+    def test_run_explicit_max_tokens(self, mock_runner_complete):
+        """Test that explicit max_tokens is respected"""
+        run_model(
+            model_spec="test-model",
+            prompt="test",
+            max_tokens=500
+        )
+        
+        # Should pass through explicit max_tokens
+        call_args = mock_runner_complete.generate_streaming.call_args
+        assert call_args[1]['max_tokens'] == 500
+    
+    def test_run_temperature_parameter(self, mock_runner_complete):
+        """Test temperature parameter passing"""
+        run_model(
+            model_spec="test-model",
+            prompt="test",
+            temperature=0.9
+        )
+        
+        call_args = mock_runner_complete.generate_streaming.call_args
+        assert call_args[1]['temperature'] == 0.9
+    
+    def test_run_top_p_parameter(self, mock_runner_complete):
+        """Test top_p parameter passing"""
+        run_model(
+            model_spec="test-model",
+            prompt="test",
+            top_p=0.95
+        )
+        
+        call_args = mock_runner_complete.generate_streaming.call_args
+        assert call_args[1]['top_p'] == 0.95
+    
+    def test_run_chat_template_control(self, mock_runner_complete):
+        """Test chat template enable/disable"""
+        # With chat template (default)
+        run_model(
+            model_spec="test-model",
+            prompt="test",
+            use_chat_template=True
+        )
+        
+        call_args = mock_runner_complete.generate_streaming.call_args
+        assert call_args[1]['use_chat_template'] is True
+        
+        # Without chat template
+        run_model(
+            model_spec="test-model",
+            prompt="test",
+            use_chat_template=False
+        )
+        
+        call_args = mock_runner_complete.generate_streaming.call_args
+        assert call_args[1]['use_chat_template'] is False
+
+
+class TestConversationHistory:
+    """Test conversation history tracking in interactive mode."""
+    
+    def test_conversation_history_accumulation(self, mock_runner_complete):
+        """Test that conversation history accumulates properly"""
+        conversation_calls = []
+        
+        def capture_conversation(messages):
+            conversation_calls.append(messages.copy())
+            return "Formatted conversation"
+        
+        mock_runner_complete._format_conversation.side_effect = capture_conversation
+        
+        # Simulate interactive conversation
+        with patch('builtins.input', side_effect=["first message", "second message", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                run_model(
+                    model_spec="test-model",
+                    prompt=None,  # Interactive mode
+                    stream=True
+                )
+        
+        # Should have multiple conversation calls with growing history
+        assert len(conversation_calls) >= 2
+        
+        # First call: one user message
+        assert len(conversation_calls[0]) == 1
+        assert conversation_calls[0][0]["role"] == "user"
+        assert conversation_calls[0][0]["content"] == "first message"
+        
+        # Second call: user + assistant + user
+        assert len(conversation_calls[1]) == 3
+        assert conversation_calls[1][0]["role"] == "user"
+        assert conversation_calls[1][1]["role"] == "assistant"
+        assert conversation_calls[1][2]["role"] == "user"
+        assert conversation_calls[1][2]["content"] == "second message"
+    
+    def test_empty_input_handling(self, mock_runner_complete):
+        """Test that empty input is ignored"""
+        with patch('builtins.input', side_effect=["", "  ", "actual message", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                run_model(
+                    model_spec="test-model",
+                    prompt=None,
+                    stream=True
+                )
+        
+        # Should only process the non-empty message
+        conversation_calls = mock_runner_complete._format_conversation.call_args_list
+        assert len(conversation_calls) == 1  # Only one actual message processed
+        
+        messages = conversation_calls[0][0][0]
+        assert len(messages) == 1
+        assert messages[0]["content"] == "actual message"
+
+
+class TestChatTemplate:
+    """Test chat template integration."""
+    
+    def test_chat_template_integration(self, mock_runner_complete):
+        """Test that chat template is used for conversation formatting"""
+        with patch('builtins.input', side_effect=["test message", "quit"]):
+            with patch('sys.stdout', new=StringIO()):
+                run_model(
+                    model_spec="test-model",
+                    prompt=None,
+                    stream=True
+                )
+        
+        # Should call _format_conversation with proper message structure
+        mock_runner_complete._format_conversation.assert_called()
+        call_args = mock_runner_complete._format_conversation.call_args[0][0]
+        
+        assert isinstance(call_args, list)
+        assert len(call_args) == 1
+        assert call_args[0]["role"] == "user"
+        assert call_args[0]["content"] == "test message"
+        
+        # Should call generate_streaming with use_chat_template=False
+        # (because template already applied in _format_conversation)
+        gen_call_args = mock_runner_complete.generate_streaming.call_args
+        assert gen_call_args[1]['use_chat_template'] is False
+
+
+class TestErrorHandling:
+    """Test error handling in run command."""
+    
+    def test_model_loading_error(self):
+        """Test handling of model loading failures"""
+        with patch('mlxk2.operations.run.MLXRunner') as mock_runner_class:
+            mock_runner_class.side_effect = FileNotFoundError("Model not found")
+            
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                result = run_model(
+                    model_spec="nonexistent-model",
+                    prompt="test",
+                    json_output=False
+                )
+            
+            output = fake_out.getvalue()
+            assert "Error:" in output
+            assert result is None
+    
+    def test_generation_error_json_mode(self):
+        """Test error handling in JSON mode"""
+        with patch('mlxk2.operations.run.MLXRunner') as mock_runner_class:
+            mock_runner_class.side_effect = RuntimeError("Generation failed")
+            
+            result = run_model(
+                model_spec="test-model",
+                prompt="test",
+                json_output=True
+            )
+            
+            assert "Error:" in result
+    
+    def test_keyboard_interrupt_handling(self, mock_runner_complete):
+        """Test Ctrl-C handling in interactive mode"""
+        def simulate_interrupt(*args, **kwargs):
+            raise KeyboardInterrupt()
+        
+        with patch('builtins.input', side_effect=simulate_interrupt):
+            with patch('sys.stdout', new=StringIO()) as fake_out:
+                result = run_model(
+                    model_spec="test-model",
+                    prompt=None,
+                    stream=True
+                )
+            
+            output = fake_out.getvalue()
+            assert "interrupted" in output.lower() or "goodbye" in output.lower()
+
+
+class TestStreamingVsBatch:
+    """Test consistency between streaming and batch modes."""
+    
+    def test_streaming_vs_batch_output_consistency(self, mock_runner_complete):
+        """Test that streaming and batch produce equivalent output"""
+        # Configure mocks to return same content
+        mock_runner_complete.generate_streaming.return_value = iter(["Hello", " ", "world"])
+        mock_runner_complete.generate_batch.return_value = "Hello world"
+        
+        # Test streaming mode
+        with patch('sys.stdout', new=StringIO()) as stream_out:
+            run_model(
+                model_spec="test-model",
+                prompt="test",
+                stream=True,
+                json_output=False
+            )
+        
+        # Test batch mode  
+        with patch('sys.stdout', new=StringIO()) as batch_out:
+            run_model(
+                model_spec="test-model",
+                prompt="test",
+                stream=False,
+                json_output=False
+            )
+        
+        # Output should be equivalent (modulo formatting)
+        stream_output = stream_out.getvalue().strip()
+        batch_output = batch_out.getvalue().strip()
+
+        # Both should contain the core content
+        assert "Hello world" in stream_output
+        assert "Hello world" in batch_output
+
+
+class TestPreflightCompatibilityCheck:
+    """Test runtime compatibility preflight checks in run command."""
+
+    def test_commit_pinned_incompatible_model_blocked(self, isolated_cache):
+        """Commit-pinned models must also pass compatibility check (regression test).
+
+        Regression: Beta.5 introduced preflight compatibility checks, but commit-pinned
+        models bypassed the check due to incorrect if/else scoping.
+
+        This test verifies that `mlxk run org/model@commit_hash` properly validates
+        framework compatibility before attempting to load the model.
+        """
+        import json
+        from unittest.mock import patch
+
+        # Create a PyTorch model in cache with specific commit hash
+        commit_hash = "abc123def456"
+        model_name = "test-org/pytorch-model"
+        cache_dir = isolated_cache / f"models--{model_name.replace('/', '--')}"
+        snapshot_dir = cache_dir / "snapshots" / commit_hash
+        snapshot_dir.mkdir(parents=True)
+
+        # Create valid config.json (healthy model)
+        config = {"model_type": "bert", "architectures": ["BertForSequenceClassification"]}
+        (snapshot_dir / "config.json").write_text(json.dumps(config))
+
+        # Create PyTorch weights (incompatible framework)
+        (snapshot_dir / "pytorch_model.bin").write_bytes(b"fake_pytorch_weights" * 100)
+
+        # Mock resolve_model_for_operation to return our commit hash
+        with patch('mlxk2.operations.run.resolve_model_for_operation') as mock_resolve:
+            mock_resolve.return_value = (model_name, commit_hash, None)
+
+            # Mock get_current_model_cache to use our isolated cache
+            with patch('mlxk2.operations.run.get_current_model_cache') as mock_cache:
+                mock_cache.return_value = isolated_cache
+
+                # Attempt to run with commit-pinned spec
+                result = run_model(
+                    model_spec=f"{model_name}@{commit_hash}",
+                    prompt="test prompt",
+                    json_output=True
+                )
+
+        # Should be blocked by preflight check
+        assert result is not None
+        assert "Error:" in result
+        assert "not compatible" in result or "Incompatible" in result
+
+    def test_latest_snapshot_incompatible_model_blocked(self, isolated_cache):
+        """Non-pinned models should also be blocked by compatibility check."""
+        import json
+        from unittest.mock import patch
+
+        # Create a PyTorch model in cache (latest snapshot)
+        model_name = "test-org/another-pytorch"
+        cache_dir = isolated_cache / f"models--{model_name.replace('/', '--')}"
+        snapshot_dir = cache_dir / "snapshots" / "latest_snapshot"
+        snapshot_dir.mkdir(parents=True)
+
+        # Create valid config.json (healthy model)
+        config = {"model_type": "gpt2", "architectures": ["GPT2LMHeadModel"]}
+        (snapshot_dir / "config.json").write_text(json.dumps(config))
+
+        # Create PyTorch weights (incompatible framework)
+        (snapshot_dir / "pytorch_model.bin").write_bytes(b"fake_weights" * 100)
+
+        # Mock resolve_model_for_operation (no commit hash)
+        with patch('mlxk2.operations.run.resolve_model_for_operation') as mock_resolve:
+            mock_resolve.return_value = (model_name, None, None)
+
+            with patch('mlxk2.operations.run.get_current_model_cache') as mock_cache:
+                mock_cache.return_value = isolated_cache
+
+                result = run_model(
+                    model_spec=model_name,
+                    prompt="test prompt",
+                    json_output=True
+                )
+
+        # Should be blocked by preflight check
+        assert result is not None
+        assert "Error:" in result
+        assert "not compatible" in result or "Incompatible" in result
\ No newline at end of file
diff --git a/tests_2.0/test_runner_core.py b/tests_2.0/test_runner_core.py
new file mode 100644
index 0000000..094f66d
--- /dev/null
+++ b/tests_2.0/test_runner_core.py
@@ -0,0 +1,402 @@
+"""
+Core MLXRunner tests for 2.0 implementation.
+Tests the core model execution engine ported from 1.x.
+"""
+
+import pytest
+import tempfile
+from unittest.mock import Mock, patch
+from pathlib import Path
+from contextlib import contextmanager
+
+import mlx.core as mx
+from mlxk2.core.runner import MLXRunner
+
+
+@contextmanager
+def mock_runner_environment(temp_cache_dir, model_name="test-model"):
+    """Mock the environment needed for MLXRunner tests."""
+    with patch('mlxk2.core.runner.load') as mock_load, \
+         patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve, \
+         patch('mlxk2.core.cache.get_current_model_cache') as mock_cache, \
+         patch('mlxk2.core.cache.hf_to_cache_dir') as mock_hf_to_cache, \
+         patch('mlxk2.core.runner.get_model_context_length') as mock_context:
+        
+        # Mock successful model resolution
+        mock_resolve.return_value = (model_name, None, None)
+        mock_cache.return_value = temp_cache_dir
+        mock_hf_to_cache.return_value = f"models--{model_name}"
+        mock_context.return_value = 8192
+        
+        # Create mock snapshots directory
+        snapshots_dir = temp_cache_dir / f"models--{model_name}" / "snapshots" / "abc123"
+        snapshots_dir.mkdir(parents=True)
+        
+        # Mock model and tokenizer
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.pad_token = None
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.chat_template = None
+        mock_tokenizer.name_or_path = f"mock-{model_name}"
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        yield {
+            'mock_load': mock_load,
+            'mock_model': mock_model,
+            'mock_tokenizer': mock_tokenizer,
+            'mock_resolve': mock_resolve
+        }
+
+
+class TestMLXRunnerBasic:
+    """Basic MLXRunner functionality tests"""
+    
+    def test_runner_context_manager(self, temp_cache_dir):
+        """Test context manager pattern for memory safety"""
+        model_name = "test-model"
+        
+        with mock_runner_environment(temp_cache_dir) as mocks:
+            with MLXRunner(model_name) as runner:
+                assert runner is not None
+                # Should have loaded model
+                mocks['mock_load'].assert_called_once()
+            
+            # Should cleanup on exit (tested via mock verification)
+    
+    def test_runner_cleanup_on_exception(self, temp_cache_dir):
+        """Test that cleanup happens even on exception"""
+        model_name = "test-model"
+        
+        with mock_runner_environment(temp_cache_dir) as mocks:
+            try:
+                with MLXRunner(model_name) as runner:
+                    # Force an exception
+                    raise ValueError("Test exception")
+            except ValueError:
+                pass
+            
+            # Should still have called load and cleanup
+            mocks['mock_load'].assert_called_once()
+    
+    def test_generate_streaming_basic(self, temp_cache_dir):
+        """Test basic streaming generation"""
+        model_name = "test-model"
+        
+        with mock_runner_environment(temp_cache_dir, model_name) as mocks:
+            # Mock generate_step to yield tokens
+            with patch('mlxk2.core.runner.generate_step') as mock_gen:
+                # generate_step yields (token, logits) tuples  
+                mock_gen.return_value = [
+                    (mx.array([1]), mx.zeros(1)),  # Token IDs as mx.array
+                    (mx.array([2]), mx.zeros(1)), 
+                ]
+                
+                # Mock tokenizer methods
+                mocks['mock_tokenizer'].encode.return_value = [100, 101]  # Prompt tokens
+                mocks['mock_tokenizer'].eos_token_id = 999  # Don't trigger EOS
+                mocks['mock_tokenizer'].eos_token_ids = {mocks['mock_tokenizer'].eos_token_id}
+                mocks['mock_tokenizer'].chat_template = None  # Disable chat template
+                
+                # Mock decode to return consistent strings based on token list length/content
+                def mock_decode(tokens):
+                    if tokens == [1]:
+                        return "Hello"
+                    elif tokens == [1, 2]:
+                        return "Hello world"
+                    elif tokens == [2]:
+                        return " world"
+                    else:
+                        return "unknown"
+                
+                mocks['mock_tokenizer'].decode.side_effect = mock_decode
+                
+                with MLXRunner(model_name) as runner:
+                    tokens = list(runner.generate_streaming("test prompt", max_tokens=2))
+                    
+                # Should yield incremental tokens
+                assert len(tokens) >= 1
+                assert any("Hello" in token for token in tokens)
+    
+    def test_generate_batch(self, temp_cache_dir):
+        """Test batch generation (complete output at once)"""
+        model_name = "test-model"
+        
+        with mock_runner_environment(temp_cache_dir, model_name) as mocks:
+            with patch('mlxk2.core.runner.generate_step') as mock_gen:
+                mock_gen.return_value = [
+                    (mx.array([1]), mx.zeros(1)),
+                    (mx.array([2]), mx.zeros(1)),
+                    (mx.array([3]), mx.zeros(1))
+                ]
+                
+                # Mock tokenizer for batch mode
+                mocks['mock_tokenizer'].encode.return_value = [100, 101]  # Prompt
+                mocks['mock_tokenizer'].decode.side_effect = lambda tokens: " ".join([f"token{t}" for t in tokens])
+                mocks['mock_tokenizer'].eos_token_id = 999  # Don't trigger EOS
+                mocks['mock_tokenizer'].eos_token_ids = {mocks['mock_tokenizer'].eos_token_id}
+                mocks['mock_tokenizer'].chat_template = None
+                
+                with MLXRunner(model_name) as runner:
+                    result = runner.generate_batch("test prompt", max_tokens=3)
+                    
+                # Should return a single string (complete response)
+                assert isinstance(result, str)
+                assert len(result) > 0
+
+
+class TestMLXRunnerStopTokens:
+    """Test stop token filtering functionality"""
+    
+    def test_chat_stop_tokens_filtered_when_enabled(self, temp_cache_dir):
+        """Chat stop tokens are filtered only when explicitly enabled"""
+        model_name = "test-model"
+        
+        with mock_runner_environment(temp_cache_dir, model_name) as mocks:
+            with patch('mlxk2.core.runner.generate_step') as mock_gen:
+                mock_gen.return_value = [
+                    (1, 0),
+                    (2, 0),
+                    (3, 0)
+                ]
+                # Encode returns prompt tokens
+                mocks['mock_tokenizer'].encode.return_value = [100]
+                # Decode returns full generated text when decoding generated tokens
+                def mock_decode(tokens):
+                    if tokens == [1]:
+                        return "Response"
+                    if tokens == [1, 2]:
+                        return "Response\nHuman:"
+                    if tokens == [1, 2, 3]:
+                        return "Response\nHuman: filtered"
+                    # Fallback for other cases
+                    return ""
+                mocks['mock_tokenizer'].decode.side_effect = mock_decode
+
+                with MLXRunner(model_name) as runner:
+                    result = runner.generate_batch("test prompt", use_chat_stop_tokens=True)
+                    
+                # Should stop at chat stop token
+                assert "\nHuman:" not in result
+                assert result == "Response"
+
+    def test_chat_stop_tokens_not_filtered_by_default(self, temp_cache_dir):
+        """By default, batch mode does not strip chat stop tokens"""
+        model_name = "test-model"
+
+        with mock_runner_environment(temp_cache_dir, model_name) as mocks:
+            with patch('mlxk2.core.runner.generate_step') as mock_gen:
+                mock_gen.return_value = [
+                    (1, 0),
+                    (2, 0),
+                    (3, 0)
+                ]
+                mocks['mock_tokenizer'].encode.return_value = [100]
+                def mock_decode(tokens):
+                    if tokens == [1]:
+                        return "Response"
+                    if tokens == [1, 2]:
+                        return "Response\nHuman:"
+                    if tokens == [1, 2, 3]:
+                        return "Response\nHuman: rest"
+                    return ""
+                mocks['mock_tokenizer'].decode.side_effect = mock_decode
+
+                with MLXRunner(model_name) as runner:
+                    result = runner.generate_batch("test prompt")
+
+                # Default behavior: token remains unless explicitly enabled
+                assert "\nHuman:" in result
+    
+    def test_streaming_vs_batch_consistency(self, temp_cache_dir):
+        """Test that streaming and batch modes produce identical output"""
+        model_name = "test-model"
+        
+        with mock_runner_environment(temp_cache_dir, model_name) as mocks:
+            # Same mock sequence for both tests
+            def mock_generation():
+                return [
+                    (1, 0),
+                    (2, 0),
+                    (3, 0)
+                ]
+
+            mocks['mock_tokenizer'].encode.return_value = [100]
+            def mock_decode(tokens):
+                if tokens == [1]:
+                    return "Hello"
+                if tokens == [2]:
+                    return " world"
+                if tokens == [3]:
+                    return "!"
+                if tokens == [1, 2]:
+                    return "Hello world"
+                if tokens == [2, 3]:
+                    return " world!"
+                if tokens == [1, 2, 3]:
+                    return "Hello world!"
+                return ""
+            mocks['mock_tokenizer'].decode.side_effect = mock_decode
+
+            with MLXRunner(model_name) as runner:
+                # Test streaming
+                with patch('mlxk2.core.runner.generate_step', return_value=mock_generation()):
+                    streaming_result = "".join(runner.generate_streaming("test"))
+                
+                # Test batch
+                with patch('mlxk2.core.runner.generate_step', return_value=mock_generation()):
+                    batch_result = runner.generate_batch("test")
+                
+                assert streaming_result == batch_result
+
+
+class TestMLXRunnerMemorySafety:
+    """Test memory management and cleanup"""
+    
+    def test_model_cleanup_on_context_exit(self, temp_cache_dir):
+        """Test that model is properly cleaned up"""
+        model_name = "test-model"
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_model = Mock()
+            mock_tokenizer = Mock()
+            mock_load.return_value = (mock_model, mock_tokenizer)
+            
+            runner = None
+            with MLXRunner(model_name) as r:
+                runner = r
+                assert runner.model is not None
+                assert runner.tokenizer is not None
+            
+            # After context exit, model should be cleaned up
+            assert runner.model is None
+            assert runner.tokenizer is None
+    
+    def test_multiple_context_managers(self, temp_cache_dir):
+        """Test that multiple runners can be used sequentially"""
+        model_name = "test-model"
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_model = Mock()
+            mock_tokenizer = Mock()
+            mock_tokenizer.encode.return_value = [1]
+            mock_tokenizer.decode.return_value = "ok"
+            mock_tokenizer.eos_token_id = 2
+            mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+            mock_tokenizer.additional_special_tokens = []
+            mock_tokenizer.added_tokens_decoder = {}
+            mock_load.return_value = (mock_model, mock_tokenizer)
+            
+            # First runner
+            with MLXRunner(model_name) as runner1:
+                assert runner1 is not None
+            
+            # Second runner should work independently
+            with MLXRunner(model_name) as runner2:
+                assert runner2 is not None
+            
+            # Should have loaded model twice
+            assert mock_load.call_count == 2
+
+
+class TestMLXRunnerDynamicTokens:
+    """Test dynamic token limit functionality"""
+    
+    def test_no_max_tokens_uses_dynamic(self, temp_cache_dir):
+        """Test that None max_tokens uses dynamic limit based on model context"""
+        model_name = "test-model"
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            # Mock config reading for context length
+            with patch('mlxk2.core.runner.get_model_context_length', return_value=8192):
+                with MLXRunner(model_name) as runner:
+                    # Should calculate dynamic limit from context length
+                    dynamic_limit = runner._calculate_dynamic_max_tokens()
+                    
+                    # Should be a reasonable fraction of context (server-mode default)
+                    # Accept half-context on 8K models as reasonable
+                    assert 1000 <= dynamic_limit <= 4096
+    
+    def test_respects_explicit_max_tokens(self, temp_cache_dir):
+        """Test that explicit max_tokens is respected"""
+        model_name = "test-model"
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_model = Mock()
+            mock_tokenizer = Mock()
+            mock_tokenizer.encode.return_value = [1]
+            mock_tokenizer.decode.return_value = "ok"
+            mock_tokenizer.eos_token_id = 2
+            mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+            mock_tokenizer.additional_special_tokens = []
+            mock_tokenizer.added_tokens_decoder = {}
+            mock_load.return_value = (mock_model, mock_tokenizer)
+            
+            with MLXRunner(model_name) as runner:
+                # When max_tokens is explicitly set, should respect it
+                with patch('mlxk2.core.runner.generate_step') as mock_gen:
+                    mock_gen.return_value = iter([(mx.array([1]), mx.zeros(1))])
+                    
+                    # Mock to check that max_tokens is passed through
+                    result = runner.generate_batch("test", max_tokens=100)
+                    
+                    # Should have respected the explicit limit
+                    # (Details depend on implementation)
+
+
+class TestMLXRunnerErrorHandling:
+    """Test error handling and edge cases"""
+    
+    def test_model_loading_failure(self, temp_cache_dir):
+        """Test handling of model loading failures"""
+        model_path = "nonexistent-model"
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.side_effect = FileNotFoundError("Model not found")
+            
+            with pytest.raises(FileNotFoundError):
+                with MLXRunner(model_path):
+                    pass
+    
+    def test_generation_interruption(self, temp_cache_dir):
+        """Test Ctrl-C interruption handling"""
+        model_name = "test-model"
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_model, mock_tokenizer = Mock(), Mock()
+            # Minimal tokenizer stubs to satisfy runner
+            mock_tokenizer.encode.return_value = [1]
+            mock_tokenizer.decode.return_value = "ok"
+            mock_tokenizer.eos_token_id = 2
+            mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+            mock_tokenizer.additional_special_tokens = []
+            mock_tokenizer.added_tokens_decoder = {}
+            mock_load.return_value = (mock_model, mock_tokenizer)
+
+            # With new recovery semantics, a pre-existing interruption flag
+            # is cleared at the start of a new generation.
+            with MLXRunner(model_name) as runner:
+                runner._interrupted = True
+                tokens = list(runner.generate_streaming("test"))
+                # Should not yield an interruption message at start
+                assert not any(isinstance(t, str) and "interrupted" in t.lower() for t in tokens)
+
+
+# Test fixtures for integration with existing test infrastructure
+@pytest.fixture
+def mock_tiny_model():
+    """Minimal model for fast tests"""
+    return "hf-internal-testing/tiny-random-gpt2"
+
+
+@pytest.fixture  
+def temp_cache_dir():
+    """Isolated cache directory for testing"""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        yield Path(tmpdir)
diff --git a/tests_2.0/test_runtime_compatibility_reason_chain.py b/tests_2.0/test_runtime_compatibility_reason_chain.py
new file mode 100644
index 0000000..f164eba
--- /dev/null
+++ b/tests_2.0/test_runtime_compatibility_reason_chain.py
@@ -0,0 +1,379 @@
+"""Tests for runtime compatibility reason field decision chain (Issue #36).
+
+Critical requirement: The reason field must reflect the FIRST problem encountered:
+1. Health check failures take precedence over runtime failures
+2. Gate 1 (framework) failures take precedence over Gate 2 (legacy format)
+3. Gate 2 (legacy format) failures take precedence over Gate 3 (model_type)
+4. Only when ALL checks pass should reason be None
+
+This ensures users see the most actionable error message first.
+"""
+
+import json
+from pathlib import Path
+import pytest
+import sys
+
+# Check if mlx-lm is available for Gate 3 tests
+# Note: Gate 3 tests require a working mlx-lm installation with _get_classes API
+# Current implementation has compatibility issues with mlx-lm 0.28.x API changes
+try:
+    import mlx_lm
+    # Try to import the function we actually need
+    try:
+        from mlx_lm.models.base import _get_classes
+        HAS_WORKING_MLX_LM = True
+    except ImportError:
+        # Try old API
+        try:
+            from mlx_lm.utils import _get_classes
+            HAS_WORKING_MLX_LM = True
+        except ImportError:
+            HAS_WORKING_MLX_LM = False
+except ImportError:
+    HAS_WORKING_MLX_LM = False
+
+requires_mlx_lm = pytest.mark.skipif(
+    not HAS_WORKING_MLX_LM,
+    reason="mlx-lm not available or _get_classes API not found (required for Gate 3)"
+)
+
+
+# ============================================================================
+# Test Helpers
+# ============================================================================
+
+def _create_config(snap: Path, model_type: str = "llama"):
+    """Create a minimal valid config.json."""
+    config = {
+        "model_type": model_type,
+        "architectures": ["LlamaForCausalLM"]
+    }
+    (snap / "config.json").write_text(json.dumps(config))
+
+
+def _create_healthy_mlx_model(cache: Path, name: str, weights_pattern: str = "model.safetensors", model_type: str = "llama"):
+    """Create a healthy MLX model with specified weight file pattern."""
+    snap = cache / f"models--{name.replace('/', '--')}" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type)
+    (snap / weights_pattern).write_bytes(b"fake_weights" * 100)
+    return snap
+
+
+# ============================================================================
+# Reason Chain Tests: Health Check Precedence (Highest Priority)
+# ============================================================================
+
+def test_reason_chain_health_failure_overrides_framework_failure(isolated_cache):
+    """Health failure reason should take precedence over framework incompatibility.
+
+    Scenario: GGUF model (would fail Gate 1) but missing config.json (health failure)
+    Expected: reason = "config.json missing" (NOT "Incompatible: GGUF")
+    """
+    snap = isolated_cache / "models--test--broken-gguf" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    # Missing config.json → unhealthy
+    (snap / "model.gguf").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.common import build_model_object
+    model_obj = build_model_object("test/broken-gguf", snap.parent.parent, snap)
+
+    assert model_obj["health"] == "unhealthy"
+    assert model_obj["runtime_compatible"] is False
+    assert "config.json" in model_obj["reason"]
+    assert "GGUF" not in model_obj["reason"], "Framework reason should not appear when health check fails"
+
+
+def test_reason_chain_health_failure_overrides_legacy_format(isolated_cache):
+    """Health failure should take precedence over legacy format detection.
+
+    Scenario: Legacy weights but missing config.json
+    Expected: reason = "config.json missing" (NOT "Legacy format")
+    """
+    snap = isolated_cache / "models--test--broken-legacy" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    # Missing config.json → unhealthy
+    (snap / "weights.00.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.common import build_model_object
+    model_obj = build_model_object("test/broken-legacy", snap.parent.parent, snap)
+
+    assert model_obj["health"] == "unhealthy"
+    assert model_obj["runtime_compatible"] is False
+    assert "config.json" in model_obj["reason"]
+    assert "Legacy" not in model_obj["reason"], "Legacy format reason should not appear when health check fails"
+
+
+# ============================================================================
+# Reason Chain Tests: Gate 1 (Framework) Precedence
+# ============================================================================
+
+def test_reason_chain_gate1_framework_check_gguf(isolated_cache):
+    """Gate 1 failure (GGUF) should be reported even if model has legacy weights.
+
+    Scenario: Healthy GGUF model with legacy-named files
+    Expected: reason = "Incompatible: GGUF" (Gate 1 fails, Gate 2 never runs)
+    """
+    snap = _create_healthy_mlx_model(isolated_cache, "test/gguf-legacy", "weights.00.gguf")
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "GGUF")
+
+    assert compatible is False
+    assert reason == "Incompatible: GGUF"
+    # Gate 1 fails early, so legacy format detection (Gate 2) never runs
+
+
+def test_reason_chain_gate1_framework_check_pytorch(isolated_cache):
+    """Gate 1 failure (PyTorch) should take precedence."""
+    snap = _create_healthy_mlx_model(isolated_cache, "test/pytorch", "model.safetensors")
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "PyTorch")
+
+    assert compatible is False
+    assert reason == "Incompatible: PyTorch"
+
+
+def test_reason_chain_gate1_mlx_framework_passes(isolated_cache):
+    """MLX framework should pass Gate 1 (may fail at later gates)."""
+    snap = _create_healthy_mlx_model(isolated_cache, "test/mlx", "weights.00.safetensors")  # Legacy format
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    # Should pass Gate 1 but fail at Gate 2 (legacy format)
+    assert compatible is False
+    assert "Incompatible: MLX" not in reason, "Should not fail at Gate 1 for MLX framework"
+    assert "Legacy format" in reason, "Should fail at Gate 2 for legacy weights"
+
+
+# ============================================================================
+# Reason Chain Tests: Gate 2 (Legacy Format) Precedence
+# ============================================================================
+
+def test_reason_chain_gate2_legacy_weights_numeric(isolated_cache):
+    """Gate 2 should detect weights.NN.safetensors legacy format.
+
+    Even if model_type might be unsupported (Gate 3), legacy format (Gate 2) is reported first.
+    """
+    snap = isolated_cache / "models--test--legacy-weights" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="some_fake_unsupported_type_xyz")  # Would fail Gate 3
+    (snap / "weights.00.safetensors").write_bytes(b"fake" * 100)
+    (snap / "weights.01.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    assert compatible is False
+    assert "Legacy format not supported by mlx-lm" in reason
+    # Gate 2 fails, so Gate 3 (model_type check) never runs
+
+
+def test_reason_chain_gate2_legacy_pytorch_model_numeric(isolated_cache):
+    """Gate 2 should detect pytorch_model-NNNNN.safetensors legacy format."""
+    snap = isolated_cache / "models--test--legacy-pytorch" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="some_fake_type")  # Would fail Gate 3
+    (snap / "pytorch_model-00001.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    assert compatible is False
+    assert "Legacy format not supported by mlx-lm" in reason
+
+
+def test_reason_chain_gate2_modern_format_passes(isolated_cache):
+    """Modern model.safetensors should pass Gate 2."""
+    snap = isolated_cache / "models--test--modern" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="llama")
+    (snap / "model.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    # Should pass Gates 1 and 2, outcome depends on Gate 3 (model_type check)
+    # If mlx-lm supports llama, should be compatible
+    # If not compatible, reason should NOT be about legacy format
+    if not compatible:
+        assert "Legacy format" not in reason, "Modern format should not trigger legacy format error"
+
+
+def test_reason_chain_gate2_sharded_modern_format_passes(isolated_cache):
+    """Modern sharded model-XXXXX-of-YYYYY.safetensors should pass Gate 2."""
+    snap = isolated_cache / "models--test--sharded" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="llama")
+    (snap / "model-00001-of-00002.safetensors").write_bytes(b"fake" * 100)
+    (snap / "model-00002-of-00002.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    # Should pass Gate 2
+    if not compatible:
+        assert "Legacy format" not in reason, "Modern sharded format should not trigger legacy format error"
+
+
+# ============================================================================
+# Reason Chain Tests: Gate 3 (model_type Support)
+# ============================================================================
+
+@requires_mlx_lm
+def test_reason_chain_gate3_unsupported_model_type(isolated_cache):
+    """Gate 3 should only run if Gates 1 and 2 pass.
+
+    This test uses a clearly fake model_type that mlx-lm won't support.
+    The error should be about model_type, not about framework or legacy format.
+    """
+    snap = isolated_cache / "models--test--unsupported-arch" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="definitely_not_a_real_architecture_xyz123")
+    (snap / "model.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    # Should fail at Gate 3
+    assert compatible is False
+    assert reason is not None
+    # Reason should be about model_type, not framework or legacy format
+    assert "Incompatible:" not in reason, "Should not fail at Gate 1"
+    assert "Legacy format" not in reason, "Should not fail at Gate 2"
+    # Should mention model_type or architecture
+    assert ("model_type" in reason.lower() or
+            "not supported" in reason.lower() or
+            "architecture" in reason.lower()), f"Gate 3 should report model_type issue, got: {reason}"
+
+
+@requires_mlx_lm
+def test_reason_chain_gate3_supported_model_type_llama(isolated_cache):
+    """Well-known supported model_type (llama) should pass all gates.
+
+    This is the happy path: MLX framework, modern format, supported architecture.
+    """
+    snap = isolated_cache / "models--test--llama-supported" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="llama")  # Well-known supported type
+    (snap / "model.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    # Should pass all gates (assuming mlx-lm supports llama)
+    assert compatible is True, f"llama model_type should be supported, got reason: {reason}"
+    assert reason is None, "Fully compatible models should have reason=None"
+
+
+# ============================================================================
+# Integration Tests: build_model_object() Reason Field
+# ============================================================================
+
+def test_integration_reason_field_unhealthy_model(isolated_cache):
+    """Integration: Unhealthy model should show health reason in model object."""
+    snap = isolated_cache / "models--test--no-config" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    # Missing config.json → unhealthy
+    (snap / "model.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.common import build_model_object
+    model_obj = build_model_object("test/no-config", snap.parent.parent, snap)
+
+    assert model_obj["health"] == "unhealthy"
+    assert model_obj["runtime_compatible"] is False
+    assert "config.json" in model_obj["reason"]
+
+
+def test_integration_reason_field_gguf_model(isolated_cache):
+    """Integration: Healthy GGUF model should show framework incompatibility."""
+    snap = isolated_cache / "models--test--gguf-healthy" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap)
+    (snap / "model.gguf").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.common import build_model_object
+    model_obj = build_model_object("test/gguf-healthy", snap.parent.parent, snap)
+
+    assert model_obj["health"] == "healthy"
+    assert model_obj["runtime_compatible"] is False
+    assert "Incompatible: GGUF" in model_obj["reason"] or "GGUF" in model_obj["reason"]
+
+
+def test_integration_reason_field_legacy_mlx_model(isolated_cache):
+    """Integration: Healthy MLX model with legacy weights should show legacy format reason.
+
+    Important: Model must be recognized as MLX framework (via mlx-community prefix)
+    so it passes Gate 1 and reaches Gate 2 (legacy format check).
+    """
+    snap = isolated_cache / "models--mlx-community--legacy-test" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap)
+    (snap / "weights.00.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.common import build_model_object
+    model_obj = build_model_object("mlx-community/legacy-test", snap.parent.parent, snap)
+
+    assert model_obj["health"] == "healthy"
+    assert model_obj["framework"] == "MLX", "Model should be detected as MLX framework"
+    assert model_obj["runtime_compatible"] is False
+    assert "Legacy format" in model_obj["reason"]
+
+
+def test_integration_reason_field_compatible_mlx_model(isolated_cache):
+    """Integration: Fully compatible MLX model should have reason=None."""
+    snap = isolated_cache / "models--test--mlx-compatible" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="llama")  # Well-known supported type
+    (snap / "model.safetensors").write_bytes(b"fake" * 100)
+
+    from mlxk2.operations.common import build_model_object
+    model_obj = build_model_object("test/mlx-compatible", snap.parent.parent, snap)
+
+    assert model_obj["health"] == "healthy"
+    # Should be compatible (assuming mlx-lm supports llama)
+    if model_obj["runtime_compatible"]:
+        assert model_obj["reason"] is None, "Fully compatible models must have reason=None"
+
+
+# ============================================================================
+# Edge Cases
+# ============================================================================
+
+def test_reason_chain_mixed_legacy_and_modern_weights(isolated_cache):
+    """Model with BOTH legacy and modern weights should pass Gate 2.
+
+    Gate 2 logic: `if has_legacy and not has_valid` → fail
+    If has_valid=True (modern weights exist), should NOT fail at Gate 2.
+    """
+    snap = isolated_cache / "models--test--mixed-weights" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap, model_type="llama")
+    # Both legacy and modern formats present
+    (snap / "weights.00.safetensors").write_bytes(b"fake" * 100)  # Legacy
+    (snap / "model.safetensors").write_bytes(b"fake" * 100)  # Modern
+
+    from mlxk2.operations.health import check_runtime_compatibility
+    compatible, reason = check_runtime_compatibility(snap, "MLX")
+
+    # Should pass Gate 2 (has valid modern weights)
+    if not compatible:
+        assert "Legacy format" not in reason, "Should not fail at Gate 2 when modern weights exist"
+
+
+def test_reason_chain_no_weights_at_all(isolated_cache):
+    """Model with config but no weights should fail health check, not runtime check."""
+    snap = isolated_cache / "models--test--no-weights" / "snapshots" / "main"
+    snap.mkdir(parents=True)
+    _create_config(snap)
+    # No weight files at all
+
+    from mlxk2.operations.health import _check_snapshot_health
+    healthy, reason = _check_snapshot_health(snap)
+
+    # Should fail health check
+    assert healthy is False
+    assert "weights" in reason.lower() or "No model weights" in reason
diff --git a/tests_2.0/test_server_api.py.disabled b/tests_2.0/test_server_api.py.disabled
new file mode 100644
index 0000000..75b5c8f
--- /dev/null
+++ b/tests_2.0/test_server_api.py.disabled
@@ -0,0 +1,263 @@
+"""
+Test server API endpoints for 2.0 implementation.
+"""
+
+import json
+import pytest
+from fastapi.testclient import TestClient
+from unittest.mock import Mock, patch
+
+from mlxk2.core.server_base import app
+from mlxk2.core.runner import MLXRunner
+
+
+class MockMLXRunner:
+    """Mock MLXRunner for testing."""
+    
+    def __init__(self, model_path, verbose=False):
+        self.model_spec = model_path
+        self.verbose = verbose
+        self._context_length = 4096
+        
+    def load_model(self):
+        pass
+        
+    def cleanup(self):
+        pass
+        
+    def _calculate_dynamic_max_tokens(self, server_mode=False):
+        if server_mode:
+            return self._context_length // 2  # Half context for server
+        else:
+            return self._context_length  # Full context for run
+            
+    def generate_streaming(self, prompt, max_tokens=None, temperature=0.7, 
+                         top_p=0.9, repetition_penalty=1.1, use_chat_template=True,
+                         use_chat_stop_tokens=False):
+        """Mock streaming generation."""
+        yield "Hello"
+        yield " "
+        yield "world"
+        yield "!"
+        
+    def generate_batch(self, prompt, max_tokens=None, temperature=0.7,
+                      top_p=0.9, repetition_penalty=1.1, use_chat_template=True,
+                      use_chat_stop_tokens=False):
+        """Mock batch generation."""
+        return "Hello world!"
+        
+    def _format_conversation(self, messages):
+        """Mock conversation formatting."""
+        formatted_parts = []
+        for msg in messages:
+            role = msg["role"]
+            content = msg["content"]
+            if role == "system":
+                formatted_parts.append(f"System: {content}")
+            elif role == "user":
+                formatted_parts.append(f"Human: {content}")
+            elif role == "assistant":
+                formatted_parts.append(f"Assistant: {content}")
+        
+        return "\n\n".join(formatted_parts) + "\n\nAssistant: "
+
+
+@pytest.fixture
+def client():
+    """Create test client."""
+    with TestClient(app) as client:
+        yield client
+
+
+@pytest.fixture
+def mock_runner():
+    """Create mock runner."""
+    return MockMLXRunner("test-model")
+
+
+def test_health_endpoint(client):
+    """Test health check endpoint."""
+    response = client.get("/health")
+    assert response.status_code == 200
+    data = response.json()
+    assert data["status"] == "healthy"
+    assert data["service"] == "mlx-knife-server-2.0"
+
+
+def test_models_endpoint(client):
+    """Test models listing endpoint."""
+    # Mock the model cache and health check
+    with patch('mlxk2.core.server_base.get_current_model_cache') as mock_cache, \
+         patch('mlxk2.core.server_base.cache_dir_to_hf') as mock_cache_to_hf, \
+         patch('mlxk2.core.server_base.detect_framework') as mock_framework, \
+         patch('mlxk2.core.server_base.is_model_healthy') as mock_healthy:
+        
+        # Setup mocks
+        mock_cache_dir = Mock()
+        mock_cache_dir.name = "models--test--model"
+        mock_cache_dir.iterdir.return_value = [mock_cache_dir]
+        
+        mock_cache.return_value.iterdir.return_value = [mock_cache_dir]
+        mock_cache_to_hf.return_value = "test/model"
+        mock_framework.return_value = "MLX"
+        mock_healthy.return_value = (True, None)
+        
+        # Mock snapshots directory
+        mock_snapshots_dir = Mock()
+        mock_snapshots_dir.exists.return_value = True
+        mock_snapshot = Mock()
+        mock_snapshot.is_dir.return_value = True
+        mock_snapshots_dir.iterdir.return_value = [mock_snapshot]
+        mock_cache_dir.__truediv__.return_value = mock_snapshots_dir
+        
+        response = client.get("/v1/models")
+        assert response.status_code == 200
+        data = response.json()
+        assert "data" in data
+        assert data["object"] == "list"
+
+
+@patch('mlxk2.core.server_base.get_or_load_model')
+def test_completions_endpoint(mock_get_model, client, mock_runner):
+    """Test completions endpoint."""
+    mock_get_model.return_value = mock_runner
+    
+    request_data = {
+        "model": "test/model",
+        "prompt": "Hello",
+        "max_tokens": 10,
+        "temperature": 0.7
+    }
+    
+    response = client.post("/v1/completions", json=request_data)
+    assert response.status_code == 200
+    
+    data = response.json()
+    assert data["object"] == "text_completion"
+    assert "choices" in data
+    assert len(data["choices"]) == 1
+    assert data["choices"][0]["text"] == "Hello world!"
+
+
+@patch('mlxk2.core.server_base.get_or_load_model')
+def test_chat_completions_endpoint(mock_get_model, client, mock_runner):
+    """Test chat completions endpoint."""
+    mock_get_model.return_value = mock_runner
+    
+    request_data = {
+        "model": "test/model",
+        "messages": [
+            {"role": "user", "content": "Hello"}
+        ],
+        "max_tokens": 10,
+        "temperature": 0.7
+    }
+    
+    response = client.post("/v1/chat/completions", json=request_data)
+    assert response.status_code == 200
+    
+    data = response.json()
+    assert data["object"] == "chat.completion"
+    assert "choices" in data
+    assert len(data["choices"]) == 1
+    assert data["choices"][0]["message"]["role"] == "assistant"
+    assert data["choices"][0]["message"]["content"] == "Hello world!"
+
+
+@patch('mlxk2.core.server_base.get_or_load_model')
+def test_streaming_completions(mock_get_model, client, mock_runner):
+    """Test streaming completions."""
+    mock_get_model.return_value = mock_runner
+    
+    request_data = {
+        "model": "test/model",
+        "prompt": "Hello",
+        "stream": True,
+        "max_tokens": 10
+    }
+    
+    response = client.post("/v1/completions", json=request_data)
+    assert response.status_code == 200
+    assert response.headers["content-type"] == "text/plain; charset=utf-8"
+
+
+@patch('mlxk2.core.server_base.get_or_load_model')
+def test_streaming_chat_completions(mock_get_model, client, mock_runner):
+    """Test streaming chat completions."""
+    mock_get_model.return_value = mock_runner
+    
+    request_data = {
+        "model": "test/model",
+        "messages": [
+            {"role": "user", "content": "Hello"}
+        ],
+        "stream": True,
+        "max_tokens": 10
+    }
+    
+    response = client.post("/v1/chat/completions", json=request_data)
+    assert response.status_code == 200
+    assert response.headers["content-type"] == "text/plain; charset=utf-8"
+
+
+def test_model_hot_swapping(client):
+    """Test that model hot-swapping clears previous models."""
+    with patch('mlxk2.core.server_base.resolve_model_for_operation') as mock_resolve, \
+         patch('mlxk2.core.server_base.get_current_model_cache') as mock_cache, \
+         patch('mlxk2.core.server_base.MLXRunner') as mock_runner_class:
+        
+        # Setup for first model
+        mock_resolve.return_value = ("test/model1", None, None)
+        mock_cache_dir = Mock()
+        mock_cache_dir.__truediv__.return_value.exists.return_value = True
+        mock_cache.return_value = mock_cache_dir
+        
+        mock_runner1 = Mock()
+        mock_runner1.load_model = Mock()
+        mock_runner1.cleanup = Mock()
+        mock_runner_class.return_value = mock_runner1
+        
+        # Load first model
+        from mlxk2.core.server_base import get_or_load_model
+        runner1 = get_or_load_model("test/model1")
+        
+        # Setup for second model
+        mock_resolve.return_value = ("test/model2", None, None)
+        mock_runner2 = Mock()
+        mock_runner2.load_model = Mock()
+        mock_runner2.cleanup = Mock()
+        mock_runner_class.return_value = mock_runner2
+        
+        # Load second model - should cleanup first
+        runner2 = get_or_load_model("test/model2")
+        
+        # Verify cleanup was called on first runner
+        mock_runner1.cleanup.assert_called_once()
+
+
+def test_server_mode_token_limits():
+    """Test that server mode uses half context for DoS protection."""
+    runner = MockMLXRunner("test-model")
+    
+    # Server mode should use half context
+    server_tokens = runner._calculate_dynamic_max_tokens(server_mode=True)
+    assert server_tokens == 2048  # Half of 4096
+    
+    # Run mode should use full context
+    run_tokens = runner._calculate_dynamic_max_tokens(server_mode=False)
+    assert run_tokens == 4096  # Full context
+
+
+@patch('mlxk2.core.server_base.get_or_load_model')
+def test_error_handling(mock_get_model, client):
+    """Test error handling in API endpoints."""
+    # Test model not found
+    mock_get_model.side_effect = Exception("Model not found")
+    
+    request_data = {
+        "model": "nonexistent/model",
+        "prompt": "Hello"
+    }
+    
+    response = client.post("/v1/completions", json=request_data)
+    assert response.status_code == 500
diff --git a/tests_2.0/test_server_api_minimal.py b/tests_2.0/test_server_api_minimal.py
new file mode 100644
index 0000000..4ab917b
--- /dev/null
+++ b/tests_2.0/test_server_api_minimal.py
@@ -0,0 +1,32 @@
+"""
+Minimal server API tests to keep suite aligned with current code.
+Focus: non-streaming chat completions use chat stop tokens in batch path.
+"""
+
+from unittest.mock import Mock, patch
+from fastapi.testclient import TestClient
+
+from mlxk2.core.server_base import app
+
+
+def test_chat_completions_batch_uses_chat_stop_tokens_flag():
+    client = TestClient(app)
+
+    mock_runner = Mock()
+    mock_runner.generate_batch.return_value = "Assistant: Hello"
+    mock_runner._format_conversation.return_value = "Human: Hi\n\nAssistant:"
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=mock_runner):
+        payload = {
+            "model": "test/model",
+            "messages": [{"role": "user", "content": "Hi"}],
+            "stream": False,
+        }
+        resp = client.post("/v1/chat/completions", json=payload)
+        assert resp.status_code == 200
+
+        # Ensure server passed use_chat_stop_tokens=True to batch generator
+        assert mock_runner.generate_batch.called
+        kwargs = mock_runner.generate_batch.call_args.kwargs
+        assert kwargs.get("use_chat_stop_tokens") is True
+
diff --git a/tests_2.0/test_server_models_and_errors.py b/tests_2.0/test_server_models_and_errors.py
new file mode 100644
index 0000000..93bc6a4
--- /dev/null
+++ b/tests_2.0/test_server_models_and_errors.py
@@ -0,0 +1,151 @@
+"""
+Minimal server tests for /v1/models and error mappings (404/503).
+
+Keeps scope small and deterministic by mocking model/cache access.
+"""
+
+from unittest.mock import Mock, MagicMock, patch
+
+from fastapi.testclient import TestClient
+
+from mlxk2.core.server_base import app
+
+
+def test_models_endpoint_minimal_structure():
+    """/v1/models returns list object with model entries and context_length field."""
+    client = TestClient(app)
+
+    # Note: cache_dir_to_hf/detect_framework/is_model_healthy are imported inside
+    # the endpoint function, so patch their origin modules, not server_base.
+    with patch('mlxk2.core.server_base.get_current_model_cache') as mock_cache, \
+         patch('mlxk2.core.cache.cache_dir_to_hf') as mock_cache_to_hf, \
+         patch('mlxk2.operations.common.detect_framework') as mock_framework, \
+         patch('mlxk2.operations.health.is_model_healthy') as mock_healthy:
+
+        # Simulate a single cached model directory
+        mock_cache_dir = MagicMock()
+        mock_cache_dir.name = "models--org--model"
+        mock_cache.return_value.iterdir.return_value = [mock_cache_dir]
+
+        # Map cache dir -> external id and mark as MLX + healthy
+        mock_cache_to_hf.return_value = "org/model"
+        mock_framework.return_value = "MLX"
+        mock_healthy.return_value = (True, None)
+
+        # Provide a snapshots directory with one folder to allow context_length probing
+        mock_snapshots_dir = MagicMock()
+        mock_snapshots_dir.exists.return_value = True
+        mock_snapshot = MagicMock()
+        mock_snapshot.is_dir.return_value = True
+        mock_snapshots_dir.iterdir.return_value = [mock_snapshot]
+        mock_cache_dir.__truediv__.return_value = mock_snapshots_dir
+
+        resp = client.get("/v1/models")
+        assert resp.status_code == 200
+        data = resp.json()
+        assert data.get("object") == "list"
+        assert isinstance(data.get("data"), list)
+        # Verify minimal shape of first entry
+        assert data["data"], "Expected at least one model in mocked list"
+        entry = data["data"][0]
+        assert entry.get("id") == "org/model"
+        assert entry.get("object") == "model"
+        assert "context_length" in entry  # may be None if probing fails
+
+
+def test_unknown_model_maps_to_404():
+    """Unknown/invalid model should map to 404 from inner helper."""
+    from fastapi import HTTPException
+
+    client = TestClient(app)
+
+    with patch('mlxk2.core.server_base.get_or_load_model') as mock_get:
+        mock_get.side_effect = HTTPException(status_code=404, detail="not found")
+
+        payload = {"model": "does/not-exist", "prompt": "hi"}
+        resp = client.post("/v1/completions", json=payload)
+        assert resp.status_code == 404
+
+
+def test_models_endpoint_filters_non_mlx_and_unhealthy():
+    """Ensure /v1/models excludes non-MLX and unhealthy entries."""
+    client = TestClient(app)
+
+    with patch('mlxk2.core.server_base.get_current_model_cache') as mock_cache, \
+         patch('mlxk2.core.cache.cache_dir_to_hf') as mock_cache_to_hf, \
+         patch('mlxk2.operations.common.detect_framework') as mock_framework, \
+         patch('mlxk2.operations.health.is_model_healthy') as mock_healthy:
+
+        # Two cached dirs
+        d1 = MagicMock(); d1.name = "models--org--mlx"
+        d2 = MagicMock(); d2.name = "models--org--pt"
+        mock_cache.return_value.iterdir.return_value = [d1, d2]
+
+        # Map names
+        def map_name(n):
+            if n == "models--org--mlx":
+                return "org/mlx"
+            return "org/pt"
+
+        mock_cache_to_hf.side_effect = map_name
+
+        # Framework detection: d1 is MLX, d2 is not
+        def detect_fw(model_name, *_args, **_kwargs):
+            return "MLX" if model_name.endswith("/mlx") else "PyTorch"
+
+        mock_framework.side_effect = detect_fw
+
+        # Health: return False for the MLX one to ensure it is filtered, too
+        def health(model_name):
+            return (False, None) if model_name.endswith("/mlx") else (True, None)
+
+        mock_healthy.side_effect = health
+
+        resp = client.get("/v1/models")
+        assert resp.status_code == 200
+        data = resp.json()
+        # Both should be filtered: one not MLX, one unhealthy
+        assert data.get("data") == []
+
+
+def test_chat_unknown_model_maps_to_404():
+    from fastapi import HTTPException
+
+    client = TestClient(app)
+
+    with patch('mlxk2.core.server_base.get_or_load_model') as mock_get:
+        mock_get.side_effect = HTTPException(status_code=404, detail="not found")
+
+        payload = {"model": "does/not-exist", "messages": [{"role": "user", "content": "hi"}], "stream": False}
+        resp = client.post("/v1/chat/completions", json=payload)
+        assert resp.status_code == 404
+
+
+def test_chat_shutdown_event_maps_to_503_and_is_cleared():
+    from mlxk2.core import server_base
+
+    client = TestClient(app)
+
+    try:
+        server_base._shutdown_event.set()
+        payload = {"model": "any/model", "messages": [{"role": "user", "content": "hi"}], "stream": False}
+        resp = client.post("/v1/chat/completions", json=payload)
+        assert resp.status_code == 503
+    finally:
+        server_base._shutdown_event.clear()
+
+
+def test_shutdown_event_maps_to_503_and_is_cleared():
+    """When shutdown flag is set, endpoints respond 503; then clear for isolation."""
+    from mlxk2.core import server_base
+
+    client = TestClient(app)
+
+    try:
+        server_base._shutdown_event.set()
+        payload = {"model": "any/model", "prompt": "hi"}
+        resp = client.post("/v1/completions", json=payload)
+        assert resp.status_code == 503
+    finally:
+        # Ensure we don't leak shutdown state to other tests
+        server_base._shutdown_event.clear()
diff --git a/tests_2.0/test_server_streaming_minimal.py b/tests_2.0/test_server_streaming_minimal.py
new file mode 100644
index 0000000..410c837
--- /dev/null
+++ b/tests_2.0/test_server_streaming_minimal.py
@@ -0,0 +1,113 @@
+"""
+Streaming SSE minimal tests for 2.0 server.
+
+Covers:
+- Happy-path SSE for /v1/completions with a few chunks
+- Interrupt path yields an interrupt marker chunk
+- Chat streaming passes use_chat_stop_tokens=True to the runner
+"""
+
+import json
+from typing import Iterator
+from unittest.mock import patch
+
+from fastapi.testclient import TestClient
+
+from mlxk2.core.server_base import app
+
+
+def _iter_sse_lines(resp) -> Iterator[str]:
+    """Iterate non-empty SSE lines as strings from a streaming response."""
+    for raw in resp.iter_lines():
+        if not raw:
+            continue
+        if isinstance(raw, bytes):
+            line = raw.decode("utf-8", errors="ignore")
+        else:
+            line = raw
+        if line.strip():
+            yield line
+
+
+def test_streaming_completions_happy_path_sse():
+    client = TestClient(app)
+
+    class DummyRunner:
+        def _calculate_dynamic_max_tokens(self, server_mode: bool = True):
+            return 16
+        def generate_streaming(self, **kwargs):
+            yield "Hello"
+            yield " world"
+            yield "!"
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=DummyRunner()):
+        payload = {"model": "org/model", "prompt": "Hi", "stream": True}
+        with client.stream("POST", "/v1/completions", json=payload) as resp:
+            assert resp.status_code == 200
+            # Content type can vary under TestClient; just ensure header exists
+            assert "content-type" in resp.headers
+
+            lines = list(_iter_sse_lines(resp))
+            # Expect at least initial data + a few chunks + final [DONE]
+            assert any(l.startswith("data: ") for l in lines)
+            assert any(l.strip() == "data: [DONE]" for l in lines)
+
+
+def test_streaming_completions_interrupt_marker():
+    client = TestClient(app)
+
+    class InterruptingRunner:
+        def _calculate_dynamic_max_tokens(self, server_mode: bool = True):
+            return 16
+        def generate_streaming(self, **kwargs):
+            yield "Hello"
+            raise KeyboardInterrupt()
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=InterruptingRunner()):
+        payload = {"model": "org/model", "prompt": "Hi", "stream": True}
+        with client.stream("POST", "/v1/completions", json=payload) as resp:
+            assert resp.status_code == 200
+            lines = [l for l in _iter_sse_lines(resp) if l.startswith("data: ")]
+            # Find JSON chunks (skip [DONE])
+            json_chunks = []
+            for l in lines:
+                if l.strip() == "data: [DONE]":
+                    continue
+                try:
+                    json_chunks.append(json.loads(l[len("data: "):]))
+                except Exception:
+                    pass
+            # One of the chunks should contain the interrupt marker text
+            assert any("interrupted" in (c.get("choices", [{}])[0].get("text", "").lower()) for c in json_chunks)
+
+
+def test_chat_streaming_uses_chat_stop_tokens_flag():
+    client = TestClient(app)
+
+    captured = {}
+
+    class CapturingRunner:
+        def _calculate_dynamic_max_tokens(self, server_mode: bool = True):
+            return 16
+        def _format_conversation(self, messages):
+            return "prompt"
+
+        def generate_streaming(self, **kwargs):
+            captured.update(kwargs)
+            yield "Hi"
+            yield " there"
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=CapturingRunner()):
+        payload = {
+            "model": "org/model",
+            "messages": [{"role": "user", "content": "Hi"}],
+            "stream": True,
+        }
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            assert resp.status_code == 200
+            # Consume stream to ensure generator ran and captured kwargs
+            for _ in _iter_sse_lines(resp):
+                pass
+
+    assert captured.get("use_chat_stop_tokens") is True
+    assert captured.get("use_chat_template") is False
diff --git a/tests_2.0/test_server_token_limits_api.py b/tests_2.0/test_server_token_limits_api.py
new file mode 100644
index 0000000..a898061
--- /dev/null
+++ b/tests_2.0/test_server_token_limits_api.py
@@ -0,0 +1,115 @@
+"""
+Server-level token limit tests (edge cases without changing core behavior).
+
+Focus: ensure endpoints pass effective max_tokens correctly:
+- When request.max_tokens is None -> use runner._calculate_dynamic_max_tokens(server_mode=True)
+- When request.max_tokens is set -> pass through unchanged
+"""
+
+from unittest.mock import patch
+
+from fastapi.testclient import TestClient
+
+from mlxk2.core.server_base import app
+
+
+def test_server_completions_uses_dynamic_when_none():
+    client = TestClient(app)
+
+    class Runner:
+        def _calculate_dynamic_max_tokens(self, server_mode=True):
+            assert server_mode is True
+            return 123
+
+        def generate_batch(self, **kwargs):
+            # Assert server passes the dynamic value
+            assert kwargs.get("max_tokens") == 123
+            return "ok"
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=Runner()):
+        payload = {"model": "org/model", "prompt": "Hi"}  # max_tokens omitted
+        resp = client.post("/v1/completions", json=payload)
+        assert resp.status_code == 200
+
+
+def test_server_completions_respects_explicit_max_tokens():
+    client = TestClient(app)
+
+    seen = {}
+
+    class Runner:
+        def _calculate_dynamic_max_tokens(self, server_mode=True):
+            return 999  # should be ignored when explicit max_tokens provided
+
+        def generate_batch(self, **kwargs):
+            seen.update(kwargs)
+            return "ok"
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=Runner()):
+        payload = {"model": "org/model", "prompt": "Hi", "max_tokens": 7}
+        resp = client.post("/v1/completions", json=payload)
+        assert resp.status_code == 200
+        assert seen.get("max_tokens") == 7
+
+
+def test_server_chat_streaming_uses_dynamic_when_none():
+    client = TestClient(app)
+
+    captured = {}
+
+    class Runner:
+        def _calculate_dynamic_max_tokens(self, server_mode=True):
+            assert server_mode is True
+            return 42
+
+        def _format_conversation(self, messages):
+            return "prompt"
+
+        def generate_streaming(self, **kwargs):
+            captured.update(kwargs)
+            yield "A"
+            yield "B"
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=Runner()):
+        payload = {
+            "model": "org/model",
+            "messages": [{"role": "user", "content": "Hi"}],
+            "stream": True,
+        }
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            assert resp.status_code == 200
+            for _ in resp.iter_lines():
+                pass
+
+    assert captured.get("max_tokens") == 42
+    assert captured.get("use_chat_stop_tokens") is True
+    assert captured.get("use_chat_template") is False
+
+
+def test_server_chat_non_streaming_respects_explicit_max_tokens():
+    client = TestClient(app)
+
+    seen = {}
+
+    class Runner:
+        def _calculate_dynamic_max_tokens(self, server_mode=True):
+            return 111
+
+        def _format_conversation(self, messages):
+            return "prompt"
+
+        def generate_batch(self, **kwargs):
+            seen.update(kwargs)
+            return "ok"
+
+    with patch('mlxk2.core.server_base.get_or_load_model', return_value=Runner()):
+        payload = {
+            "model": "org/model",
+            "messages": [{"role": "user", "content": "Hi"}],
+            "stream": False,
+            "max_tokens": 5,
+        }
+        resp = client.post("/v1/chat/completions", json=payload)
+        assert resp.status_code == 200
+        assert seen.get("max_tokens") == 5
+
diff --git a/tests_2.0/test_stop_tokens_live.py b/tests_2.0/test_stop_tokens_live.py
new file mode 100644
index 0000000..bfd03e6
--- /dev/null
+++ b/tests_2.0/test_stop_tokens_live.py
@@ -0,0 +1,467 @@
+"""Real-model stop token detection tests for Issue #32 (ADR-009).
+
+This test suite validates stop token handling with real models that exhibit
+known issues:
+- MXFP4: Visible `<|end|>` tokens in output
+- Qwen 2.5: Self-conversation (chat template role markers)
+- Llama 3.2: Control baseline (should work correctly)
+
+Test Strategy (ADR-009):
+1. Phase 1: Baseline measurement (document broken behavior)
+2. Phase 2: Fix validation (verify 2-LOC fix works)
+3. Phase 3: Empirical mapping (document tokenizer configs)
+
+Opt-in via: pytest -m live_stop_tokens
+Requires: HF_HOME set to SSD cache (CoW same-volume requirement, ADR-007)
+
+RAM Safety:
+- Tests automatically skip models that exceed available RAM
+- Progressive budget scaling: 40% (16GB), 50% (32GB), 60% (64GB), 70% (96GB+)
+- Larger systems have lower relative overhead, enabling better RAM utilization
+- See TESTING.md: "RAM-Aware Model Selection Strategy"
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+import pytest
+import json
+import subprocess
+from pathlib import Path
+from typing import Dict, Any, Optional
+import importlib
+import importlib.util
+
+# Opt-in marker for live tests
+pytestmark = [pytest.mark.live_stop_tokens, pytest.mark.slow]
+
+
+@pytest.fixture(scope="module", autouse=True)
+def _use_real_mlx_modules():
+    """Ensure live tests use real mlx / mlx-lm without polluting the rest of the suite."""
+    stub_path = Path(__file__).parent / "stubs"
+    stub_path_str = str(stub_path)
+
+    # Remove stub path from sys.path (if present) and remember to restore it later
+    path_removed = False
+    if stub_path_str in sys.path:
+        sys.path = [p for p in sys.path if p != stub_path_str]
+        path_removed = True
+
+    # Remove stub modules from sys.modules so real modules can be imported
+    removed_modules: Dict[str, Any] = {}
+    for module_name, module in list(sys.modules.items()):
+        module_file = getattr(module, "__file__", "") or ""
+        if module_file and stub_path_str in module_file:
+            removed_modules[module_name] = module
+            sys.modules.pop(module_name, None)
+    # Also clear any previously installed huggingface_hub shims
+    removed_hf_modules: Dict[str, Any] = {}
+    for module_name, module in list(sys.modules.items()):
+        if module_name == "huggingface_hub" or module_name.startswith("huggingface_hub."):
+            removed_hf_modules[module_name] = module
+            sys.modules.pop(module_name, None)
+
+    # Require real mlx / mlx-lm; skip entire module if not available
+    missing_runtime = False
+    if (
+        importlib.util.find_spec("mlx.core") is None
+        or importlib.util.find_spec("mlx_lm") is None
+    ):
+        missing_runtime = True
+    else:
+        try:
+            huggingface_hub = importlib.import_module("huggingface_hub")
+        except ImportError:
+            missing_runtime = True
+        else:
+            if not hasattr(huggingface_hub, "snapshot_download"):
+                for name, mod in removed_modules.items():
+                    sys.modules[name] = mod
+                for name, mod in removed_hf_modules.items():
+                    sys.modules[name] = mod
+                if path_removed and stub_path_str not in sys.path:
+                    sys.path.insert(0, stub_path_str)
+                pytest.skip(
+                    "requires huggingface_hub.snapshot_download (install latest huggingface-hub)",
+                    allow_module_level=True,
+                )
+    if missing_runtime:
+        # Restore previous state before skipping so rest of suite still uses stubs
+        sys.modules.update({name: mod for name, mod in removed_modules.items()
+                            if name not in sys.modules})
+        sys.modules.update({name: mod for name, mod in removed_hf_modules.items()
+                            if name not in sys.modules})
+        if path_removed and stub_path_str not in sys.path:
+            sys.path.insert(0, stub_path_str)
+        pytest.skip(
+            "requires mlx / mlx-lm native runtime (Apple Silicon)",
+            allow_module_level=True,
+        )
+
+    try:
+        yield
+    finally:
+        # Restore stub modules for the remainder of the test run
+        for name, module in removed_modules.items():
+            sys.modules[name] = module
+        for name, module in removed_hf_modules.items():
+            sys.modules[name] = module
+
+        # Ensure stub path is back at the front for unit tests
+        if path_removed and stub_path_str not in sys.path:
+            sys.path.insert(0, stub_path_str)
+
+# Skip if HF_HOME not set (required for CoW same-volume, ADR-007)
+_HF_HOME = os.environ.get("HF_HOME")
+
+requires_hf_home = pytest.mark.skipif(
+    not _HF_HOME,
+    reason="requires HF_HOME set to SSD cache for CoW same-volume (ADR-007)"
+)
+
+
+def get_system_ram_gb() -> float:
+    """Detect system RAM in GB (macOS portable)."""
+    try:
+        result = subprocess.run(
+            ["sysctl", "hw.memsize"],
+            capture_output=True,
+            text=True,
+            check=True
+        )
+        # Output: "hw.memsize: 68719476736"
+        memsize_bytes = int(result.stdout.strip().split(":")[1].strip())
+        return memsize_bytes / (1024**3)  # Convert to GB
+    except Exception:
+        # Fallback: assume minimum safe config (16GB)
+        return 16.0
+
+
+def get_safe_ram_budget_gb() -> float:
+    """Get safe RAM budget for model loading (progressive scaling).
+
+    Progressive budget strategy (relative overhead decreases with larger systems):
+    - 16GB System: 40% budget (6.4GB) - high relative OS overhead
+    - 32GB System: 50% budget (16GB) - moderate overhead
+    - 64GB System: 60% budget (38.4GB) - low overhead
+    - 96GB+ System: 70% budget (67GB+) - minimal overhead
+
+    Rationale:
+    - OS/System baseline overhead is ~4-6GB (relatively constant)
+    - Larger systems have more headroom after OS overhead
+    - Progressive scaling allows better utilization of high-RAM systems
+    """
+    system_ram = get_system_ram_gb()
+
+    # Progressive budget scaling
+    if system_ram >= 96:
+        budget_ratio = 0.70  # 70% for 96GB+ systems
+    elif system_ram >= 64:
+        budget_ratio = 0.60  # 60% for 64GB systems
+    elif system_ram >= 32:
+        budget_ratio = 0.50  # 50% for 32GB systems
+    else:
+        budget_ratio = 0.40  # 40% for 16GB systems (conservative)
+
+    safe_budget = system_ram * budget_ratio
+    return safe_budget
+
+
+# Test models from ADR-009 with RAM requirements
+# RAM estimates from TESTING.md: "RAM-Aware Model Selection Strategy"
+TEST_MODELS = {
+    "mxfp4": {
+        "id": "mlx-community/gpt-oss-20b-MXFP4-Q8",
+        "expected_issue": "visible_end_token",
+        "description": "MXFP4 format with visible <|end|> in output",
+        "ram_needed_gb": 12.0  # 20B MXFP4 (~12GB empirical)
+    },
+    "qwen25": {
+        "id": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+        "expected_issue": "self_conversation",
+        "description": "Qwen 2.5 generates chat template markers",
+        "ram_needed_gb": 1.0  # 0.5B 4-bit (~1GB)
+    },
+    "llama32": {
+        "id": "mlx-community/Llama-3.2-3B-Instruct-4bit",
+        "expected_issue": None,
+        "description": "Control baseline (should work correctly)",
+        "ram_needed_gb": 4.0  # 3B 4-bit (~4GB)
+    }
+}
+
+
+def should_skip_model(model_key: str) -> tuple[bool, str]:
+    """Check if model should be skipped due to insufficient RAM.
+
+    Returns:
+        (should_skip, reason)
+    """
+    model_info = TEST_MODELS[model_key]
+    ram_needed = model_info["ram_needed_gb"]
+    ram_budget = get_safe_ram_budget_gb()
+    system_ram = get_system_ram_gb()
+
+    if ram_needed > ram_budget:
+        budget_pct = int((ram_budget / system_ram * 100) if system_ram > 0 else 40)
+        return (
+            True,
+            f"Model requires {ram_needed}GB but only {ram_budget:.1f}GB available "
+            f"({budget_pct}% of {system_ram:.0f}GB system RAM). See TESTING.md RAM-Aware Model Selection."
+        )
+    return (False, "")
+
+# Standard test prompt (simple, predictable)
+TEST_PROMPT = "Write one sentence about cats."
+MAX_TOKENS = 50
+
+
+class TestStopTokensValidation:
+    """Validation: Verify stop token handling works correctly (Issue #32, ADR-009)."""
+
+    @requires_hf_home
+    def test_mxfp4_stop_token_filtering(self):
+        """MXFP4: Stop tokens should be filtered correctly.
+
+        After ADR-009 2-LOC fix (eos_token_id → eos_token_ids):
+        - Model should stop cleanly without visible stop tokens
+        - No `<|end|>` or `<|return|>` in output
+
+        Background (Issue #32):
+        - MXFP4 previously showed visible `<|end|>` tokens
+        - Root cause: Runner only checked singular eos_token_id
+        - Fix: Use eos_token_ids Set to handle multiple EOS tokens
+        """
+        # RAM Safety Check
+        should_skip, reason = should_skip_model("mxfp4")
+        if should_skip:
+            pytest.skip(reason)
+
+        from mlxk2.core.runner import MLXRunner
+
+        model_id = TEST_MODELS["mxfp4"]["id"]
+
+        # Run inference
+        with MLXRunner(model_id) as runner:
+            output = runner.generate_batch(
+                prompt=TEST_PROMPT,
+                max_tokens=MAX_TOKENS
+            )
+
+        # Validate clean output
+        print(f"\n{'='*60}")
+        print(f"VALIDATION: MXFP4")
+        print(f"{'='*60}")
+        print(f"Model: {model_id}")
+        print(f"Prompt: {TEST_PROMPT}")
+        print(f"Output: {output!r}")
+
+        # Assert no visible stop tokens
+        assert "<|end|>" not in output, "MXFP4 should filter <|end|> token"
+        assert "<|return|>" not in output, "MXFP4 should filter <|return|> token"
+
+        print("✓ MXFP4: Stop tokens correctly filtered")
+
+    @requires_hf_home
+    def test_qwen25_no_self_conversation(self):
+        """Qwen 2.5: Should not generate chat template role markers (self-conversation).
+
+        Self-Conversation Definition (ADR-009):
+        - Model generates chat template role markers (User:, Assistant:, etc.)
+        - Common patterns: '\nUser:', '\nAssistant:', '<|im_start|>user', '<|im_start|>assistant'
+        - Specific to Qwen: '<|im_start|>', '<|im_end|>' markers
+
+        Expected Behavior:
+        - Model stops cleanly after its response
+        - No chat template markers in output
+        """
+        # RAM Safety Check
+        should_skip, reason = should_skip_model("qwen25")
+        if should_skip:
+            pytest.skip(reason)
+
+        from mlxk2.core.runner import MLXRunner
+
+        model_id = TEST_MODELS["qwen25"]["id"]
+
+        # Run inference
+        with MLXRunner(model_id) as runner:
+            output = runner.generate_batch(
+                prompt=TEST_PROMPT,
+                max_tokens=MAX_TOKENS
+            )
+
+        # Validate clean output
+        print(f"\n{'='*60}")
+        print(f"VALIDATION: Qwen 2.5")
+        print(f"{'='*60}")
+        print(f"Model: {model_id}")
+        print(f"Prompt: {TEST_PROMPT}")
+        print(f"Output: {output!r}")
+
+        # Check for self-conversation patterns
+        generic_markers = ["\nUser:", "\nAssistant:", "\nHuman:", "\nAI:"]
+        qwen_markers = ["<|im_start|>user", "<|im_start|>assistant", "<|im_start|>", "<|im_end|>"]
+
+        found_generic = [m for m in generic_markers if m in output]
+        found_qwen = [m for m in qwen_markers if m in output]
+
+        print(f"Generic markers found: {found_generic}")
+        print(f"Qwen markers found: {found_qwen}")
+
+        # Assert no self-conversation
+        assert not found_generic, f"Qwen 2.5 should not generate generic chat markers. Found: {found_generic}"
+        assert not found_qwen, f"Qwen 2.5 should not generate Qwen-specific markers. Found: {found_qwen}"
+
+        print("✓ Qwen 2.5: No self-conversation")
+
+    @requires_hf_home
+    def test_llama32_regression_control(self):
+        """Llama 3.2: Regression control (should work correctly).
+
+        Llama 3.2 has 3 eos_token_ids: [128008, 128001, 128009]
+        This validates that the 2-LOC fix correctly handles multi-EOS models.
+
+        Expected Behavior:
+        - Clean output without visible stop tokens
+        - No self-conversation
+        - Serves as regression baseline
+        """
+        # RAM Safety Check
+        should_skip, reason = should_skip_model("llama32")
+        if should_skip:
+            pytest.skip(reason)
+
+        from mlxk2.core.runner import MLXRunner
+
+        model_id = TEST_MODELS["llama32"]["id"]
+
+        # Run inference
+        with MLXRunner(model_id) as runner:
+            output = runner.generate_batch(
+                prompt=TEST_PROMPT,
+                max_tokens=MAX_TOKENS
+            )
+
+        # Validate clean output
+        print(f"\n{'='*60}")
+        print(f"VALIDATION: Llama 3.2 (Regression Control)")
+        print(f"{'='*60}")
+        print(f"Model: {model_id}")
+        print(f"Prompt: {TEST_PROMPT}")
+        print(f"Output: {output!r}")
+
+        # Llama 3.2 stop tokens
+        llama_stop_tokens = ["<|eot_id|>", "</s>", "<|end_of_text|>"]
+        found_stop = [t for t in llama_stop_tokens if t in output]
+        assert not found_stop, f"Llama 3.2 should filter stop tokens. Found: {found_stop}"
+
+        # No generic chat markers
+        generic_markers = ["\nUser:", "\nAssistant:", "\nHuman:", "\nAI:"]
+        found_markers = [m for m in generic_markers if m in output]
+        assert not found_markers, f"Llama 3.2 should not self-converse. Found: {found_markers}"
+
+        print("✓ Llama 3.2: Clean output (regression control passed)")
+
+
+class TestStopTokensEmpiricalMapping:
+    """Phase 3: Empirical mapping - document tokenizer configs and observed tokens."""
+
+    @requires_hf_home
+    def test_empirical_mapping_all_models(self):
+        """Document tokenizer configs and empirically observed stop tokens.
+
+        Generates report: stop_token_config_report.json
+
+        Report Format (ADR-009):
+        {
+          "model": "gpt-oss",
+          "configured_eos": ["<|return|>"],     # From tokenizer.eos_token
+          "configured_eos_ids": [50256, ...],   # From tokenizer.eos_token_ids
+          "generated_tokens": ["<|end|>", ...], # Empirically observed
+          "workaround_needed": True/False
+        }
+        """
+        from mlxk2.core.runner import MLXRunner
+
+        report = {}
+        system_ram = get_system_ram_gb()
+        ram_budget = get_safe_ram_budget_gb()
+
+        # Calculate actual budget ratio used
+        budget_ratio = ram_budget / system_ram if system_ram > 0 else 0.40
+
+        # Add system info to report
+        report["_system_info"] = {
+            "system_ram_gb": round(system_ram, 1),
+            "ram_budget_gb": round(ram_budget, 1),
+            "budget_ratio": round(budget_ratio, 2)
+        }
+
+        for model_key, model_info in TEST_MODELS.items():
+            model_id = model_info["id"]
+
+            # Skip models that exceed RAM budget
+            should_skip, skip_reason = should_skip_model(model_key)
+            if should_skip:
+                print(f"\nSkipping {model_key}: {skip_reason}")
+                report[model_key] = {
+                    "model_id": model_id,
+                    "skipped": True,
+                    "skip_reason": skip_reason
+                }
+                continue
+
+            with MLXRunner(model_id) as runner:
+                # Get tokenizer config
+                tokenizer = runner.tokenizer
+
+                # Extract configured stop tokens
+                eos_token = getattr(tokenizer, "eos_token", None)
+                eos_token_id = getattr(tokenizer, "eos_token_id", None)
+
+                # Try to get eos_token_ids (Set or List)
+                eos_token_ids = None
+                if hasattr(tokenizer, "eos_token_ids"):
+                    eos_token_ids = tokenizer.eos_token_ids
+                    if hasattr(eos_token_ids, "__iter__"):
+                        eos_token_ids = list(eos_token_ids)
+
+                # Run inference to observe actual behavior
+                output = runner.generate_batch(
+                    prompt=TEST_PROMPT,
+                    max_tokens=MAX_TOKENS
+                )
+
+                # Detect visible stop tokens
+                potential_stop_tokens = ["<|end|>", "<|eot_id|>", "<|im_end|>", "<|endoftext|>"]
+                found_stop_tokens = [t for t in potential_stop_tokens if t in output]
+
+                report[model_key] = {
+                    "model_id": model_id,
+                    "configured_eos_token": eos_token,
+                    "configured_eos_token_id": eos_token_id,
+                    "configured_eos_token_ids": eos_token_ids,
+                    "generated_output": output[:100],  # First 100 chars for reference
+                    "visible_stop_tokens": found_stop_tokens,
+                    "workaround_needed": bool(found_stop_tokens)
+                }
+
+        # Write report
+        report_path = Path("stop_token_config_report.json")
+        report_path.write_text(json.dumps(report, indent=2))
+
+        print(f"\n{'='*60}")
+        print(f"EMPIRICAL MAPPING REPORT")
+        print(f"{'='*60}")
+        print(json.dumps(report, indent=2))
+        print(f"\nReport saved to: {report_path.absolute()}")
+
+        # Summary
+        models_needing_fix = [
+            k for k, v in report.items()
+            if isinstance(v, dict) and v.get("workaround_needed")
+        ]
+        print(f"\nModels needing fix: {models_needing_fix}")
diff --git a/tests_2.0/test_token_limits.py b/tests_2.0/test_token_limits.py
new file mode 100644
index 0000000..29e2347
--- /dev/null
+++ b/tests_2.0/test_token_limits.py
@@ -0,0 +1,390 @@
+"""
+Token limit tests for Step 1.1/1.2.
+Tests dynamic token calculation and server vs run mode differences.
+"""
+
+import pytest
+from unittest.mock import Mock, patch
+from pathlib import Path
+
+from mlxk2.core.runner import MLXRunner, get_model_context_length
+from conftest_runner import mock_mlx_runner_environment
+
+
+class TestDynamicTokenLimits:
+    """Test dynamic token limit calculation based on model context."""
+    
+    def test_context_length_detection(self):
+        """Test that context length is properly extracted from config"""
+        # Test various config key patterns
+        configs = [
+            {"max_position_embeddings": 8192},
+            {"n_positions": 4096},
+            {"context_length": 16384},
+            {"max_sequence_length": 32768},
+            {"seq_len": 2048}
+        ]
+        
+        expected_lengths = [8192, 4096, 16384, 32768, 2048]
+        
+        for config, expected in zip(configs, expected_lengths):
+            with patch('builtins.open') as mock_open:
+                mock_open.return_value.__enter__.return_value.read.return_value = str(config).replace("'", '"')
+                
+                result = get_model_context_length("/fake/path")
+                assert result == expected
+    
+    def test_context_length_fallback(self):
+        """Test fallback to default when config unavailable"""
+        # Missing file
+        with patch('builtins.open', side_effect=FileNotFoundError()):
+            result = get_model_context_length("/nonexistent/path")
+            assert result == 4096
+        
+        # Invalid JSON
+        with patch('builtins.open') as mock_open:
+            mock_open.return_value.__enter__.return_value.read.return_value = "invalid json"
+            result = get_model_context_length("/fake/path")
+            assert result == 4096
+        
+        # Missing keys
+        with patch('builtins.open') as mock_open:
+            mock_open.return_value.__enter__.return_value.read.return_value = '{"other_key": 1234}'
+            result = get_model_context_length("/fake/path")
+            assert result == 4096
+    
+    @patch('mlxk2.core.runner.get_model_context_length')
+    def test_runner_dynamic_calculation_run_mode(self, mock_context_length):
+        """Test dynamic token calculation for run command (full context)"""
+        mock_context_length.return_value = 8192
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("test-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    # Create runner and test calculation
+                    runner = MLXRunner("test-model")
+                    runner._context_length = 8192
+                    
+                    # Run mode: should use full context
+                    limit = runner._calculate_dynamic_max_tokens(server_mode=False)
+                    assert limit == 8192
+    
+    @patch('mlxk2.core.runner.get_model_context_length')
+    def test_runner_dynamic_calculation_server_mode(self, mock_context_length):
+        """Test dynamic token calculation for server (half context for DoS protection)"""
+        mock_context_length.return_value = 8192
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("test-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    # Create runner and test calculation
+                    runner = MLXRunner("test-model")
+                    runner._context_length = 8192
+                    
+                    # Server mode: should use half context
+                    limit = runner._calculate_dynamic_max_tokens(server_mode=True)
+                    assert limit == 4096
+    
+    def test_no_context_length_fallback(self):
+        """Test behavior when context length is unavailable"""
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("test-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    # Create runner with no context length
+                    runner = MLXRunner("test-model")
+                    runner._context_length = None
+                    
+                    # Should fallback to default
+                    limit = runner._calculate_dynamic_max_tokens(server_mode=False)
+                    assert limit == 2048
+                    
+                    limit = runner._calculate_dynamic_max_tokens(server_mode=True)
+                    assert limit == 2048
+
+
+class TestTokenLimitApplication:
+    """Test that token limits are properly applied during generation."""
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    @patch('mlxk2.core.runner.get_model_context_length')
+    def test_generate_streaming_uses_dynamic_limits(self, mock_context, mock_cache, mock_resolve, mock_load):
+        """Test that generate_streaming uses dynamic limits when max_tokens=None"""
+        # Setup mocks
+        mock_context.return_value = 8192
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.encode.return_value = [1, 2, 3]
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        with patch('mlxk2.core.runner.generate_step') as mock_gen:
+            mock_gen.return_value = iter([])  # Empty generation
+            
+            with MLXRunner("test-model") as runner:
+                # Call with max_tokens=None
+                list(runner.generate_streaming("test", max_tokens=None))
+                
+                # Should call generate_step with dynamic limit (full context for run mode)
+                mock_gen.assert_called_once()
+                call_kwargs = mock_gen.call_args[1]
+                assert call_kwargs['max_tokens'] == 8192  # Full context
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    @patch('mlxk2.core.runner.get_model_context_length')
+    def test_generate_streaming_respects_explicit_limits(self, mock_context, mock_cache, mock_resolve, mock_load):
+        """Test that explicit max_tokens is respected"""
+        # Setup mocks
+        mock_context.return_value = 8192
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.encode.return_value = [1, 2, 3]
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        with patch('mlxk2.core.runner.generate_step') as mock_gen:
+            mock_gen.return_value = iter([])  # Empty generation
+            
+            with MLXRunner("test-model") as runner:
+                # Call with explicit max_tokens
+                list(runner.generate_streaming("test", max_tokens=500))
+                
+                # Should use explicit limit, not dynamic
+                mock_gen.assert_called_once()
+                call_kwargs = mock_gen.call_args[1]
+                assert call_kwargs['max_tokens'] == 500
+    
+    @patch('mlxk2.core.runner.load')
+    @patch('mlxk2.core.runner.resolve_model_for_operation')
+    @patch('mlxk2.core.cache.get_current_model_cache')
+    @patch('mlxk2.core.runner.get_model_context_length')
+    def test_generate_batch_uses_dynamic_limits(self, mock_context, mock_cache, mock_resolve, mock_load):
+        """Test that generate_batch also uses dynamic limits"""
+        # Setup mocks
+        mock_context.return_value = 16384
+        mock_resolve.return_value = ("test-model", None, None)
+        mock_cache.return_value = Mock()
+        
+        mock_model = Mock()
+        mock_tokenizer = Mock()
+        mock_tokenizer.eos_token = "</s>"
+        mock_tokenizer.eos_token_id = 2
+        mock_tokenizer.eos_token_ids = {mock_tokenizer.eos_token_id}
+        mock_tokenizer.additional_special_tokens = []
+        mock_tokenizer.added_tokens_decoder = {}
+        mock_tokenizer.encode.return_value = [1, 2, 3]
+        mock_tokenizer.decode.return_value = "test response"
+        mock_load.return_value = (mock_model, mock_tokenizer)
+        
+        with patch('mlxk2.core.runner.generate_step') as mock_gen:
+            mock_gen.return_value = iter([])  # Empty generation
+            
+            with MLXRunner("test-model") as runner:
+                # Call with max_tokens=None
+                runner.generate_batch("test", max_tokens=None)
+                
+                # Should use dynamic limit
+                mock_gen.assert_called_once()
+                call_kwargs = mock_gen.call_args[1]
+                assert call_kwargs['max_tokens'] == 16384  # Full context
+
+
+class TestLargeContextModels:
+    """Test behavior with large context models."""
+    
+    @patch('mlxk2.core.runner.get_model_context_length')
+    def test_large_context_model_limits(self, mock_context_length):
+        """Test dynamic limits for large context models"""
+        mock_context_length.return_value = 32768  # 32K context
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("large-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    runner = MLXRunner("large-model")
+                    runner._context_length = 32768
+                    
+                    # Run mode: full context
+                    run_limit = runner._calculate_dynamic_max_tokens(server_mode=False)
+                    assert run_limit == 32768
+                    
+                    # Server mode: half context
+                    server_limit = runner._calculate_dynamic_max_tokens(server_mode=True)
+                    assert server_limit == 16384
+    
+    @patch('mlxk2.core.runner.get_model_context_length')
+    def test_very_large_context_handling(self, mock_context_length):
+        """Test handling of very large context models (128K+)"""
+        mock_context_length.return_value = 131072  # 128K context
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("huge-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    runner = MLXRunner("huge-model")
+                    runner._context_length = 131072
+                    
+                    # Should handle very large contexts
+                    run_limit = runner._calculate_dynamic_max_tokens(server_mode=False)
+                    assert run_limit == 131072
+                    
+                    server_limit = runner._calculate_dynamic_max_tokens(server_mode=True)
+                    assert server_limit == 65536
+
+
+class TestTokenLimitEdgeCases:
+    """Test edge cases in token limit calculation."""
+    
+    def test_zero_context_length(self):
+        """Test handling of zero context length"""
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("test-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    runner = MLXRunner("test-model")
+                    runner._context_length = 0
+                    
+                    # Should fallback to default
+                    limit = runner._calculate_dynamic_max_tokens(server_mode=False)
+                    assert limit == 2048
+    
+    def test_negative_context_length(self):
+        """Test handling of negative context length"""
+        runner = MLXRunner.__new__(MLXRunner)  # Create without __init__
+        runner._context_length = -1000
+        
+        # Should fallback to default for negative values
+        limit = runner._calculate_dynamic_max_tokens(server_mode=False)
+        assert limit == 2048
+    
+    def test_odd_context_length_division(self):
+        """Test server mode with odd context lengths"""
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("test-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    runner = MLXRunner("test-model")
+                    runner._context_length = 8193  # Odd number
+                    
+                    # Server mode should handle integer division
+                    limit = runner._calculate_dynamic_max_tokens(server_mode=True)
+                    assert limit == 4096  # 8193 // 2
+
+
+class TestServerVsRunDifferences:
+    """Test the key difference between server and run mode token policies."""
+    
+    def test_run_vs_server_mode_policy_difference(self):
+        """Test the fundamental difference: run uses full, server uses half"""
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("test-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    runner = MLXRunner("test-model")
+                    runner._context_length = 8192
+                    
+                    # Run command: full context (user's own machine, be generous)
+                    run_limit = runner._calculate_dynamic_max_tokens(server_mode=False)
+                    
+                    # Server: half context (DoS protection)
+                    server_limit = runner._calculate_dynamic_max_tokens(server_mode=True)
+                    
+                    # Should be exactly 2:1 ratio
+                    assert run_limit == 8192
+                    assert server_limit == 4096
+                    assert run_limit == 2 * server_limit
+    
+    def test_rationale_for_different_policies(self):
+        """Document the rationale for different token policies"""
+        # This test serves as documentation
+        
+        # Run command rationale:
+        # - User's own machine and models
+        # - User has full control over resource usage
+        # - No DoS concerns (single user)
+        # - Be generous with token limits
+        
+        # Server rationale:
+        # - Potentially multiple concurrent requests
+        # - DoS protection needed
+        # - Resource sharing concerns
+        # - Conservative token limits
+        
+        with patch('mlxk2.core.runner.load') as mock_load:
+            mock_load.return_value = (Mock(), Mock())
+            
+            with patch('mlxk2.core.runner.resolve_model_for_operation') as mock_resolve:
+                mock_resolve.return_value = ("test-model", None, None)
+                
+                with patch('mlxk2.core.cache.get_current_model_cache') as mock_cache:
+                    mock_cache.return_value = Mock()
+                    
+                    runner = MLXRunner("test-model")
+                    runner._context_length = 8192
+                    
+                    # These policies should be clearly different
+                    run_policy = runner._calculate_dynamic_max_tokens(server_mode=False)
+                    server_policy = runner._calculate_dynamic_max_tokens(server_mode=True)
+                    
+                    assert run_policy > server_policy
+                    assert run_policy / server_policy == 2.0  # Exactly 2x difference