mirror of https://github.com/cloudstack-llc/mlx-knife.git synced 2026-07-01 20:44:14 -04:00

The BROKE Cluster Team 57bf6d86be 2.0.0-beta.3: Feature Complete - Full 1.1.1 Parity Achieved

Major Features Added:
  • Complete run command implementation with interactive/single-shot modes
  • MLXRunner core engine ported from 1.x with modular architecture
  • OpenAI-compatible server with SIGINT-robust supervisor mode
  • Experimental push feature properly isolated behind environment variable

  Key Improvements:
  - Full feature parity with 1.1.1 stable releases
  - Enhanced human output formatting across all commands
  - Clean separation of stable (184 tests) vs experimental features
  - Updated demo GIF showcasing improved 2.0 interface

  Fixes:
  - Pull operation cache pollution (Issue #30) with preflight access checks
  - Test stability improvements across all environments

  Architecture:
  - Modular runner design with focused helper modules
  - Thread-safe model loading and memory management
  - Stable testing across Python 3.9-3.13

  Ready for use as comprehensive 1.x alternative.

2025-09-14 18:04:18 +02:00

9.7 KiB

2.0 Server/Run Test Specifications

Purpose: Abstract test specifications extracted from 1.x for implementation in 2.0
Created: 2025-09-10
For: Sonnet implementation sessions

Open Issues to Address

Issue #30: Gated Models Preflight

Test: Mock 403 response → Verify NO cache writes
Test: Clear error message with actionable guidance
Test: Successful auth → Normal pull flow
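
The three preflight tests above can be exercised without network access if the access check and the download step are injected as callables. The sketch below is only an illustration under that assumption: `pull_with_preflight`, `GatedModelError`, `check_access`, and `download` are hypothetical names, not the project's actual API.

```python
import tempfile
from pathlib import Path


class GatedModelError(Exception):
    """Raised when access is denied before any download starts (hypothetical)."""


def pull_with_preflight(model_id, cache_dir, check_access, download):
    """Run the access check first; only touch the cache if it passes.

    check_access(model_id) returns an HTTP status code, and
    download(model_id, target_dir) fetches files. Both are injected so a
    test can fake a 403 response.
    """
    status = check_access(model_id)
    if status in (401, 403):
        raise GatedModelError(
            f"Access denied for '{model_id}' (HTTP {status}). "
            "Accept the model license on the hub, set a valid access token, "
            "and retry."
        )
    target = Path(cache_dir) / model_id.replace("/", "--")
    target.mkdir(parents=True, exist_ok=True)
    download(model_id, target)
    return target


def cache_is_empty(cache_dir):
    """True if nothing at all was written below cache_dir."""
    return not any(Path(cache_dir).iterdir())
```

Because the check runs before `mkdir`, a denied request leaves no directory behind, which is the "NO cache writes" condition from the first test.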

Ctrl-C Interruption Support

Test: Long generation → Ctrl-C → Clean interruption
Test: Server request → Ctrl-C → Graceful shutdown
Test: No zombie processes after interrupt
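
One way to make "clean interruption" testable is to simulate Ctrl-C by raising `KeyboardInterrupt` from a fake token source. This is a minimal sketch, assuming generation is exposed as an iterable of text tokens; `generate_interruptible` and `on_cleanup` are hypothetical names.

```python
def generate_interruptible(token_source, on_cleanup):
    """Collect tokens until the source is exhausted or Ctrl-C arrives.

    Returns (text_so_far, interrupted). on_cleanup always runs, so resources
    are released whether generation finished, failed, or was interrupted.
    """
    pieces = []
    interrupted = False
    try:
        for token in token_source:
            pieces.append(token)
    except KeyboardInterrupt:
        interrupted = True  # keep the partial output instead of crashing
    finally:
        on_cleanup()
    return "".join(pieces), interrupted
```

A test can then pass a generator that yields a few tokens and raises `KeyboardInterrupt`, and assert that the partial text is returned and cleanup ran once.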

Core Principles

Test-First: Write failing tests before implementation
Isolated Caches: Use temp_cache_dir fixtures, never touch user cache
Abstract Contracts: Test behaviors, not implementations
Model-Agnostic: Use tiny test models where possible
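
The isolated-cache principle could look roughly like the helper below. It is a sketch only: the real `temp_cache_dir` fixture lives in the project's conftest and may differ, and this version assumes the cache location is controlled by the `HF_HOME` environment variable. In a pytest suite, the context manager would be wrapped in a fixture.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_cache_dir(env_var="HF_HOME"):
    """Point the cache env var at a throwaway directory, then restore it.

    Assumption: the code under test resolves its cache from env_var.
    """
    previous = os.environ.get(env_var)
    with tempfile.TemporaryDirectory(prefix="mlxk-test-cache-") as tmp:
        os.environ[env_var] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore the caller's environment exactly as it was.
            if previous is None:
                os.environ.pop(env_var, None)
            else:
                os.environ[env_var] = previous
```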

Server API Contract Tests

1. OpenAI Compatibility (`test_server_api_compliance.py`)

class TestOpenAICompliance:
    """Verify OpenAI API compatibility"""

    def test_models_endpoint(self):
        # GET /v1/models
        # Returns: {"data": [{"id": "model-name", "object": "model", ...}]}
        ...

    def test_chat_completions_basic(self):
        # POST /v1/chat/completions
        # Body: {"model": "...", "messages": [...], "stream": false}
        # Returns: {"choices": [{"message": {"content": "..."}}]}
        ...

    def test_chat_completions_streaming(self):
        # POST /v1/chat/completions with stream=true
        # Returns: SSE stream with data: prefixed chunks
        # Final: data: [DONE]
        ...

    def test_completions_endpoint(self):
        # POST /v1/completions
        # Body: {"model": "...", "prompt": "...", "stream": false}
        # Returns: {"choices": [{"text": "..."}]}
        ...
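
The response shapes listed above can be checked with small pure functions, which keeps the contract separate from how the server is started. The helpers below are a hypothetical sketch (the names `check_models_list` and `check_chat_completion` are not from the project) and cover only the fields the spec names.

```python
def check_models_list(body):
    """Return contract violations for a GET /v1/models response body."""
    data = body.get("data")
    if not isinstance(data, list):
        return ["'data' must be a list"]
    problems = []
    for i, entry in enumerate(data):
        if not isinstance(entry, dict) or not isinstance(entry.get("id"), str):
            problems.append(f"data[{i}].id must be a string")
        elif entry.get("object") != "model":
            problems.append(f"data[{i}].object must be 'model'")
    return problems


def check_chat_completion(body):
    """Return contract violations for a non-streaming chat completion body."""
    choices = body.get("choices")
    if not isinstance(choices, list) or not choices:
        return ["'choices' must be a non-empty list"]
    first = choices[0]
    message = first.get("message") if isinstance(first, dict) else None
    if not isinstance(message, dict):
        return ["choices[0].message must be an object"]
    if not isinstance(message.get("content"), str):
        return ["choices[0].message.content must be a string"]
    return []
```

A test would decode the HTTP response as JSON and assert that the returned list is empty.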

2. Dynamic Token Management (`test_server_token_limits.py`)

class TestDynamicTokens:
    """Test model-aware token limits (Issue #15/16)"""

    def test_no_max_tokens_uses_dynamic(self):
        # Given: Model with 8K context
        # When: max_tokens=None in request
        # Then: Server uses appropriate dynamic limit (~2000-4000)
        ...

    def test_respects_explicit_max_tokens(self):
        # Given: Any model
        # When: max_tokens=500 in request
        # Then: Server respects explicit limit
        ...

    def test_large_context_models(self):
        # Given: 30K+ context model
        # When: max_tokens=None
        # Then: Larger dynamic limit applied
        ...

3. Model Hot-Swapping (`test_server_model_switching.py`)

class TestModelSwitching:
    """Test model switching without restart"""

    def test_switch_between_models(self):
        # Given: Server running with model A
        # When: Request specifies model B
        # Then: Model B loads, A unloads, response from B
        ...

    def test_concurrent_model_requests(self):
        # Given: Multiple requests with different models
        # Then: Proper queueing/switching without crashes
        ...
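
A simple way to get "proper queueing" is to serialize both switching and generation behind one lock, so a request never sees a model other than the one it asked for. The sketch below assumes a single loaded model at a time; `ModelSlot` and its `loader` callable are hypothetical, not the server's actual design.

```python
import threading


class ModelSlot:
    """Holds at most one loaded model; switching is serialized by a lock."""

    def __init__(self, loader):
        self._loader = loader          # hypothetical callable: name -> model
        self._lock = threading.Lock()
        self._name = None
        self._model = None
        self.loads = 0                 # number of (re)loads, handy in tests

    def run(self, name, fn):
        """Ensure `name` is loaded, then call fn(model) while holding the lock."""
        with self._lock:
            if name != self._name:
                self._model = None     # drop model A before loading model B
                self._model = self._loader(name)
                self._name = name
                self.loads += 1
            return fn(self._model)
```

Holding the lock during `fn` trades throughput for safety: concurrent requests for different models queue up instead of unloading a model that is still generating.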

4. Stop Token Filtering (`test_server_stop_tokens.py`)

class TestStopTokens:
    """Test stop token handling (Issue #14, #20)"""

    def test_chat_stop_tokens_filtered(self):
        # Given: Chat mode
        # Then: "\nHuman:", "\nAssistant:" never in output
        ...

    def test_streaming_vs_batch_consistency(self):
        # Given: Same prompt
        # When: stream=true vs stream=false
        # Then: Identical output (no extra tokens)
        ...

Run Command Contract Tests

1. Complete Run Command (`test_run_complete.py`)

class TestRunBasic:
    """Basic run command functionality"""

    def test_run_single_shot_streaming(self):
        # mlxk run model "prompt"
        # Returns: Generated text to stdout, token-by-token
        ...

    def test_run_single_shot_batch(self):
        # mlxk run model "prompt" --no-stream
        # Returns: Complete output at once
        ...

    def test_run_interactive_streaming(self):
        # mlxk run model (no prompt)
        # Triggers: Interactive chat mode with streaming responses
        ...

    def test_run_interactive_batch(self):
        # mlxk run model --no-stream (no prompt)
        # Triggers: Interactive chat mode with batch responses
        ...

    def test_run_full_context_tokens(self):
        # mlxk run model "prompt"
        # Uses: Full model context length (no DoS protection)
        # Verify: max_tokens defaults to model's full context
        ...

    def test_conversation_history_tracking(self):
        # Interactive mode maintains conversation context
        # Each new input includes previous conversation
        ...

    def test_chat_template_integration(self):
        # Uses model's native chat template for conversation formatting
        # Falls back to Human:/Assistant: if no template available
        ...
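
The template-or-fallback rule from the last test can be sketched as follows. The sketch assumes a Hugging Face style tokenizer exposing `chat_template` and `apply_chat_template`; the exact fallback layout (one `Human:`/`Assistant:` line per turn, ending with an open `Assistant:` line) and the function names are assumptions, not the 1.x behavior verbatim.

```python
def format_fallback_prompt(messages):
    """Format messages as Human:/Assistant: turns (assumed fallback layout)."""
    labels = {"user": "Human", "assistant": "Assistant"}
    lines = []
    for message in messages:
        label = labels.get(message["role"])
        if label is None:
            # e.g. a system message: emit the content without a speaker label
            lines.append(message["content"])
        else:
            lines.append(f"{label}: {message['content']}")
    lines.append("Assistant:")  # cue the model to answer next
    return "\n".join(lines)


def build_prompt(tokenizer, messages):
    """Prefer the model's native chat template, else use the fallback."""
    if getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    return format_fallback_prompt(messages)
```

Passing the full `messages` list on every turn is also what makes conversation history tracking observable in a test: the prompt for turn N must contain turns 1 to N-1.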

2. Server Token Management (`test_server_tokens.py`)

class TestServerTokens:
    """Server-specific token limit behavior"""

    def test_server_half_context_protection(self):
        # Server mode uses half model context for DoS protection
        # Given: Model with 8K context
        # Server: Uses max 4K tokens by default
        # Run: Uses full 8K tokens by default
        ...

    def test_server_vs_run_token_limits(self):
        # Verify different token policies:
        # Run command: Full context (generous)
        # Server API: Half context (defensive)
        ...
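
The two token policies reduce to a small pure function, which makes them easy to test without loading a model. This is a sketch under the rules stated above (explicit `max_tokens` wins, server defaults to half the context, run defaults to the full context); `effective_max_tokens` is a hypothetical name, and "8K" is taken to mean 8192 tokens.

```python
def effective_max_tokens(context_length, requested=None, server_mode=False):
    """Resolve the generation limit for a request.

    requested: explicit max_tokens from the caller, or None.
    server_mode: True for the API server (defensive), False for `run`.
    """
    if requested is not None:
        return requested            # explicit limit is respected as given
    if server_mode:
        return context_length // 2  # half context as DoS protection
    return context_length           # run command: full context
```

For a model with an 8192-token context this gives 4096 in server mode and 8192 for the run command, while `requested=500` yields 500 in both.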

3. Reasoning Models (`test_reasoning_models.py`)

class TestReasoningModels:
    """GPT-OSS/MXFP4 reasoning support"""

    def test_gpt_oss_reasoning_detection(self):
        # Model name contains "gpt-oss" or "mxfp4"
        # Automatic reasoning extraction
        ...

    def test_reasoning_formatting(self):
        # Output: **[Reasoning]** ... **[Answer]** ...
        ...

    def test_hide_reasoning_flag(self):
        # mlxk run model "prompt" --hide-reasoning
        # Shows only answer, no reasoning
        ...
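
Detection and formatting can be separated from the extraction step, which depends on model-specific markers not described here. The sketch below covers only the two parts the spec states: name-based detection and the `**[Reasoning]**` / `**[Answer]**` layout. Function names, the case-insensitive match, and the exact line breaks are assumptions.

```python
def is_reasoning_model(model_name):
    """Name-based detection: contains "gpt-oss" or "mxfp4" (case-insensitive)."""
    name = model_name.lower()
    return "gpt-oss" in name or "mxfp4" in name


def format_reasoning_output(reasoning, answer, hide_reasoning=False):
    """Render already-extracted reasoning and answer parts for display.

    With hide_reasoning=True (the --hide-reasoning flag), only the answer
    is returned.
    """
    if hide_reasoning or not reasoning:
        return answer
    return f"**[Reasoning]**\n{reasoning}\n\n**[Answer]**\n{answer}"
```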

4. Memory Management (`test_memory_safety.py`)

class TestMemorySafety:
    """Context manager and cleanup"""

    def test_context_manager_cleanup(self):
        # Model loaded in context
        # Automatic cleanup on exit/exception
        ...

    def test_exception_safety(self):
        # Exception during generation
        # Resources still cleaned up
        ...
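
Both memory-safety tests can be written against a stand-in object before the real runner exists. The sketch below uses a hypothetical `FakeRunner` that only records whether load and cleanup happened; it illustrates the behavior to assert, not the runner's internals.

```python
class FakeRunner:
    """Stand-in for the real runner (hypothetical); tracks load and cleanup."""

    def __init__(self):
        self.loaded = False
        self.cleaned_up = False

    def __enter__(self):
        self.loaded = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.loaded = False
        self.cleaned_up = True
        return False  # never swallow the exception


def run_and_fail(runner):
    """Simulate an error in the middle of generation."""
    with runner:
        raise RuntimeError("generation failed")
```

The exception-safety test then expects `RuntimeError` to propagate and asserts `runner.cleaned_up` afterwards.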

Show Command Enhancements

Quantization Display (`test_show_quantization.py`)

class TestShowQuantization:
    """Enhanced quantization info (beta.3)"""

    def test_mxfp4_detection(self):
        # Config has quantization.mode = "mxfp4"
        # Shows: "Advanced mode 'mxfp4' (requires MLX ≥0.29.0)"
        ...

    def test_gguf_variants(self):
        # Multiple .gguf files
        # Lists all variants with sizes
        ...

    def test_precision_display(self):
        # Shows: int4, int8, gguf, etc.
        ...
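
The MXFP4 and precision cases can be tested against a plain dict standing in for a parsed config.json, which is what the mock models are for. The sketch below assumes the config carries a `quantization` object with optional `mode` and `bits` keys; `describe_quantization` and the `"unknown"` fallback are hypothetical.

```python
def describe_quantization(config):
    """Build a one-line quantization summary from a parsed config dict."""
    quant = config.get("quantization") or {}
    if quant.get("mode") == "mxfp4":
        return "Advanced mode 'mxfp4' (requires MLX ≥0.29.0)"
    bits = quant.get("bits")
    if bits is not None:
        return f"int{bits}"  # e.g. int4, int8
    return "unknown"         # assumed fallback when nothing is declared
```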

Test Data Requirements

⚠️ CRITICAL: Test Model Strategy

NEVER use the user cache for tests! Always use the temp_cache_dir fixture!

Minimal Test Models

tiny-models:
  - hf-internal-testing/tiny-random-gpt2  # 12MB, for basic tests
  - local-mock-models/fake-mxfp4-model     # Mock config.json only
  - local-mock-models/fake-reasoning-model # Mock with reasoning markers

real-models-optional:  # For @pytest.mark.server tests only
  - mlx-community/Phi-3-mini-4k-instruct-4bit
  - gpt-oss-20b-MXFP4-Q8  # For reasoning tests

Implementation Priority

Priority A: Beta.1 - Complete Run Command (CRITICAL - Must Have)

mlxk2/core/runner.py - MLX execution engine ✅
Single-shot run: mlxk2 run model "prompt" ✅
Interactive run: mlxk2 run model (no prompt)
Streaming and batch modes for both
Full context token limits (no DoS protection)
Conversation history tracking
Chat template integration
Ctrl-C handling

Priority B: Beta.2 - Server Implementation (HIGH - Should Have)

OpenAI-compatible API server
Half context token limits for server (DoS protection)
Model hot-swapping support
SSE streaming in server endpoints
Reasoning model support
System prompt support

Priority C: Beta.3 - Advanced Features (MEDIUM - Could Have)

Performance optimizations
Enhanced error handling
Advanced reasoning features
Issue #30: Gated models preflight

Critical Implementation Notes

1. Streaming Architecture

# 1.x uses generator pattern - PRESERVE THIS
def generate_streaming(prompt, **kwargs):
    for token in model.generate(...):
        yield token

# Server SSE format - MUST MATCH
# data: {"choices": [{"delta": {"content": "token"}}]}
# data: [DONE]
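
To connect the generator pattern to the SSE wire format, a thin adapter can turn each yielded token into one `data:` event and append the `[DONE]` sentinel. This is a sketch, not the server's code: `sse_events` is a hypothetical name, and only the fields shown in the format above are emitted. Each event is terminated by a blank line, as the SSE format requires.

```python
import json


def sse_events(tokens):
    """Wrap a token iterator as Server-Sent Events text chunks."""
    for token in tokens:
        chunk = {"choices": [{"delta": {"content": token}}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```

Because the adapter is itself a generator, the server can stream it directly while tests collect it with `list(...)` and compare the chunks.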

2. Stop Token Management

# Priority order (from 1.x mlx_runner.py)
CHAT_STOP_TOKENS = ["\nHuman:", "\nAssistant:", "\nUser:", "\nYou:"]

# 1. Check model's native stop tokens first
# 2. Add chat stop tokens as fallback
# 3. Filter from output in both streaming and batch
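
Step 3 is the subtle part in streaming mode, because a stop token can be split across several generated tokens. One approach, sketched below, holds back the last few characters until they can no longer be the start of a stop token. The function names are hypothetical and the sketch covers only the fallback chat stop tokens, not a model's native ones.

```python
CHAT_STOP_TOKENS = ["\nHuman:", "\nAssistant:", "\nUser:", "\nYou:"]


def truncate_at_stop(text, stops=CHAT_STOP_TOKENS):
    """Batch mode: cut the text at the earliest stop token, if any."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


def filter_stream(tokens, stops=CHAT_STOP_TOKENS):
    """Streaming mode: yield text, never a stop token or anything after it."""
    hold = max(len(stop) for stop in stops) - 1  # chars that may begin a stop
    buffer = ""
    for token in tokens:
        buffer += token
        clean = truncate_at_stop(buffer, stops)
        if len(clean) < len(buffer):   # stop token found: flush and finish
            if clean:
                yield clean
            return
        safe = len(buffer) - hold      # everything before this is safe to emit
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]
    if buffer:
        yield buffer                   # no stop token seen: flush the rest
```

Joining the streamed chunks gives the same string as `truncate_at_stop` on the full text, which is the streaming-versus-batch consistency property the stop-token tests ask for.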

3. Model Loading Pattern

# Context manager pattern from 1.x - CRITICAL
class MLXRunner:
    def __enter__(self):
        self.load_model()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.cleanup()  # MUST cleanup even on exception
        return False    # let any exception propagate to the caller

Version Strategy

Local Git Tags (Not Published)

2.0.0-beta.1-local - Basic server/run port
2.0.0-beta.2-local - Full reasoning support

Public Release

2.0.0-beta.3 - First public beta (fully tested)

Gotchas for Sonnet Sessions

Don't forget MLX version checks: MXFP4 requires MLX ≥0.29.0
Test with isolated caches: Never assume user has models
Preserve 1.x CLI interface: Same commands, same flags
Keep modular boundaries: Core vs Operations vs Output
Test streaming separately: Different code paths
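
The MLX version gotcha can be guarded with a small comparison helper. This is a sketch assuming plain numeric `X.Y.Z` version strings; pre-release suffixes such as `rc1` are not handled and would raise `ValueError`. The name `meets_minimum` is hypothetical.

```python
def meets_minimum(version, minimum=(0, 29, 0)):
    """Compare a plain 'X.Y.Z' version string against a minimum tuple."""
    parts = [int(part) for part in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))  # pad "0.29" to (0, 29, 0)
    return tuple(parts) >= minimum
```

A caller would pass the installed MLX version string and refuse to load an MXFP4 model, with a clear message, when the check fails.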

References

1.x source: git show main:mlx_knife/server.py
1.x tests: git show main:tests/integration/test_server_functionality.py
Test patterns: tests_2.0/conftest.py for fixtures
