mirror of
https://github.com/cloudstack-llc/mlx-knife.git
synced 2026-07-01 20:44:14 -04:00
57bf6d86be
Major Features Added: • Complete run command implementation with interactive/single-shot modes • MLXRunner core engine ported from 1.x with modular architecture • OpenAI-compatible server with SIGINT-robust supervisor mode • Experimental push feature properly isolated behind environment variable Key Improvements: - Full feature parity with 1.1.1 stable releases - Enhanced human output formatting across all commands - Clean separation of stable (184 tests) vs experimental features - Updated demo GIF showcasing improved 2.0 interface Fixes: - Pull operation cache pollution (Issue #30) with preflight access checks - Test stability improvements across all environments Architecture: - Modular runner design with focused helper modules - Thread-safe model loading and memory management - stable testing across Python 3.9-3.13 Ready for use as comprehensive 1.x alternative.
9.7 KiB
9.7 KiB
2.0 Server/Run Test Specifications
Purpose: Abstract test specifications extracted from 1.x for implementation in 2.0
Created: 2025-09-10
For: Sonnet implementation sessions
Open Issues to Address
Issue #30: Gated Models Preflight
- Test: Mock 403 response → Verify NO cache writes
- Test: Clear error message with actionable guidance
- Test: Successful auth → Normal pull flow
Ctrl-C Interruption Support
- Test: Long generation → Ctrl-C → Clean interruption
- Test: Server request → Ctrl-C → Graceful shutdown
- Test: No zombie processes after interrupt
Core Principles
- Test-First: Write failing tests before implementation
- Isolated Caches: Use temp_cache_dir fixtures, never touch user cache
- Abstract Contracts: Test behaviors, not implementations
- Model-Agnostic: Use tiny test models where possible
Server API Contract Tests
1. OpenAI Compatibility (test_server_api_compliance.py)
class TestOpenAICompliance:
"""Verify OpenAI API compatibility"""
def test_models_endpoint(self):
# GET /v1/models
# Returns: {"data": [{"id": "model-name", "object": "model", ...}]}
def test_chat_completions_basic(self):
# POST /v1/chat/completions
# Body: {"model": "...", "messages": [...], "stream": false}
# Returns: {"choices": [{"message": {"content": "..."}}]}
def test_chat_completions_streaming(self):
# POST /v1/chat/completions with stream=true
# Returns: SSE stream with data: prefixed chunks
# Final: data: [DONE]
def test_completions_endpoint(self):
# POST /v1/completions
# Body: {"model": "...", "prompt": "...", "stream": false}
# Returns: {"choices": [{"text": "..."}]}
2. Dynamic Token Management (test_server_token_limits.py)
class TestDynamicTokens:
"""Test model-aware token limits (Issue #15/16)"""
def test_no_max_tokens_uses_dynamic(self):
# Given: Model with 8K context
# When: max_tokens=None in request
# Then: Server uses appropriate dynamic limit (~2000-4000)
def test_respects_explicit_max_tokens(self):
# Given: Any model
# When: max_tokens=500 in request
# Then: Server respects explicit limit
def test_large_context_models(self):
# Given: 30K+ context model
# When: max_tokens=None
# Then: Larger dynamic limit applied
3. Model Hot-Swapping (test_server_model_switching.py)
class TestModelSwitching:
"""Test model switching without restart"""
def test_switch_between_models(self):
# Given: Server running with model A
# When: Request specifies model B
# Then: Model B loads, A unloads, response from B
def test_concurrent_model_requests(self):
# Given: Multiple requests with different models
# Then: Proper queueing/switching without crashes
4. Stop Token Filtering (test_server_stop_tokens.py)
class TestStopTokens:
"""Test stop token handling (Issue #14, #20)"""
def test_chat_stop_tokens_filtered(self):
# Given: Chat mode
# Then: "\nHuman:", "\nAssistant:" never in output
def test_streaming_vs_batch_consistency(self):
# Given: Same prompt
# When: stream=true vs stream=false
# Then: Identical output (no extra tokens)
Run Command Contract Tests
1. Complete Run Command (test_run_complete.py)
class TestRunBasic:
"""Basic run command functionality"""
def test_run_single_shot_streaming(self):
# mlxk run model "prompt"
# Returns: Generated text to stdout, token-by-token
def test_run_single_shot_batch(self):
# mlxk run model "prompt" --no-stream
# Returns: Complete output at once
def test_run_interactive_streaming(self):
# mlxk run model (no prompt)
# Triggers: Interactive chat mode with streaming responses
def test_run_interactive_batch(self):
# mlxk run model --no-stream (no prompt)
# Triggers: Interactive chat mode with batch responses
def test_run_full_context_tokens(self):
# mlxk run model "prompt"
# Uses: Full model context length (no DoS protection)
# Verify: max_tokens defaults to model's full context
def test_conversation_history_tracking(self):
# Interactive mode maintains conversation context
# Each new input includes previous conversation
def test_chat_template_integration(self):
# Uses model's native chat template for conversation formatting
# Falls back to Human:/Assistant: if no template available
2. Server Token Management (test_server_tokens.py)
class TestServerTokens:
"""Server-specific token limit behavior"""
def test_server_half_context_protection(self):
# Server mode uses half model context for DoS protection
# Given: Model with 8K context
# Server: Uses max 4K tokens by default
# Run: Uses full 8K tokens by default
def test_server_vs_run_token_limits(self):
# Verify different token policies:
# Run command: Full context (generous)
# Server API: Half context (defensive)
3. Reasoning Models (test_reasoning_models.py)
class TestReasoningModels:
"""GPT-OSS/MXFP4 reasoning support"""
def test_gpt_oss_reasoning_detection(self):
# Model name contains "gpt-oss" or "mxfp4"
# Automatic reasoning extraction
def test_reasoning_formatting(self):
# Output: **[Reasoning]** ... **[Answer]** ...
def test_hide_reasoning_flag(self):
# mlxk run model "prompt" --hide-reasoning
# Shows only answer, no reasoning
4. Memory Management (test_memory_safety.py)
class TestMemorySafety:
"""Context manager and cleanup"""
def test_context_manager_cleanup(self):
# Model loaded in context
# Automatic cleanup on exit/exception
def test_exception_safety(self):
# Exception during generation
# Resources still cleaned up
Show Command Enhancements
Quantization Display (test_show_quantization.py)
class TestShowQuantization:
"""Enhanced quantization info (beta.3)"""
def test_mxfp4_detection(self):
# Config has quantization.mode = "mxfp4"
# Shows: "Advanced mode 'mxfp4' (requires MLX ≥0.29.0)"
def test_gguf_variants(self):
# Multiple .gguf files
# Lists all variants with sizes
def test_precision_display(self):
# Shows: int4, int8, gguf, etc.
Test Data Requirements
⚠️ CRITICAL: Test Model Strategy
NIEMALS user cache für Tests verwenden! Immer temp_cache_dir fixture!
Minimal Test Models
tiny-models:
- hf-internal-testing/tiny-random-gpt2 # 12MB, for basic tests
- local-mock-models/fake-mxfp4-model # Mock config.json only
- local-mock-models/fake-reasoning-model # Mock with reasoning markers
real-models-optional: # For @pytest.mark.server tests only
- mlx-community/Phi-3-mini-4k-instruct-4bit
- gpt-oss-20b-MXFP4-Q8 # For reasoning tests
Implementation Priority
Priority A: Beta.1 - Complete Run Command (CRITICAL - Must Have)
mlxk2/core/runner.py- MLX execution engine ✅- Single-shot run:
mlxk2 run model "prompt"✅ - Interactive run:
mlxk2 run model(no prompt) - Streaming and batch modes for both
- Full context token limits (no DoS protection)
- Conversation history tracking
- Chat template integration
- Ctrl-C handling
Priority B: Beta.2 - Server Implementation (HIGH - Should Have)
- OpenAI-compatible API server
- Half context token limits for server (DoS protection)
- Model hot-swapping support
- SSE streaming in server endpoints
- Reasoning model support
- System prompt support
Priority C: Beta.3 - Advanced Features (MEDIUM - Could Have)
- Performance optimizations
- Enhanced error handling
- Advanced reasoning features
- Issue #30: Gated models preflight
Critical Implementation Notes
1. Streaming Architecture
# 1.x uses generator pattern - PRESERVE THIS
def generate_streaming(prompt, **kwargs):
for token in model.generate(...):
yield token
# Server SSE format - MUST MATCH
data: {"choices": [{"delta": {"content": "token"}}]}
data: [DONE]
2. Stop Token Management
# Priority order (from 1.x mlx_runner.py)
CHAT_STOP_TOKENS = ["\nHuman:", "\nAssistant:", "\nUser:", "\nYou:"]
# 1. Check model's native stop tokens first
# 2. Add chat stop tokens as fallback
# 3. Filter from output in both streaming and batch
3. Model Loading Pattern
# Context manager pattern from 1.x - CRITICAL
class MLXRunner:
def __enter__(self):
self.load_model()
return self
def __exit__(self, ...):
self.cleanup() # MUST cleanup even on exception
Version Strategy
Local Git Tags (Not Published)
2.0.0-beta.1-local- Basic server/run port2.0.0-beta.2-local- Full reasoning support
Public Release
2.0.0-beta.3- First public beta (fully tested)
Gotchas for Sonnet Sessions
- Don't forget MLX version checks: MXFP4 requires MLX ≥0.29.0
- Test with isolated caches: Never assume user has models
- Preserve 1.x CLI interface: Same commands, same flags
- Keep modular boundaries: Core vs Operations vs Output
- Test streaming separately: Different code paths
References
- 1.x source:
git show main:mlx_knife/server.py - 1.x tests:
git show main:tests/integration/test_server_functionality.py - Test patterns:
tests_2.0/conftest.pyfor fixtures