Release MLX Knife 1.1.0-beta2 - Critical Bug Fixes & Test Stability

Major fixes: - Issue #19: Server response truncation resolved - large context models work at full capacity - Issue #20: End-Token filtering in non-streaming mode - clean professional output - Test stability: Fixed flaky server tests, improved lifecycle management Technical changes: - Server: Dynamic token limits by default (--max-tokens None) - MLXRunner: Added _filter_end_tokens_from_response() for batch consistency - Tests: 132/132 passing + 48 comprehensive server tests - Documentation: Updated CHANGELOG.md, README.md, TESTING.md
2026-07-01 20:44:14 -04:00 · 2025-08-22 23:16:50 +02:00
parent 74239c4e43
commit 1aad374d08
11 changed files with 806 additions and 24 deletions
@@ -2,7 +2,7 @@

 ## Current Status

-✅ **131/131 tests passing** (August 2025)  
+✅ **132/132 tests passing** (August 2025)  
 ✅ **Apple Silicon verified** (M1/M2/M3)  
 ✅ **Python 3.9-3.13 compatible**  
 ✅ **Beta ready** - comprehensive testing with real model execution
@@ -42,8 +42,9 @@ This approach ensures our tests reflect real-world usage, not mocked behavior.
 ```
 tests/
 ├── conftest.py                     # Shared fixtures and utilities
-├── integration/                    # System-level integration tests (85+ tests)
+├── integration/                    # System-level integration tests (84+ tests)
 │   ├── test_core_functionality.py      # Basic CLI operations
+│   ├── test_end_token_issue.py         # Issue #20: End-token filtering consistency
 │   ├── test_health_checks.py           # Model corruption detection  
 │   ├── test_issue_14.py               # Issue #14: Chat self-conversation fix
 │   ├── test_issue_15_16.py            # Issues #15/#16: Dynamic token limits
@@ -115,6 +116,9 @@ pytest tests/integration/test_health_checks.py -v
 # Core functionality (basic CLI commands)
 pytest tests/integration/test_core_functionality.py -v

+# Issue #20: End-token filtering consistency (new in 1.1.0-beta2)
+pytest tests/integration/test_end_token_issue.py -v
+
 # Advanced run command tests
 pytest tests/integration/test_run_command_advanced.py -v

@@ -128,8 +132,8 @@ pytest tests/integration/test_server_functionality.py -v
 # Run only basic operations tests
 pytest -k "TestBasicOperations" -v

-# Server tests are automatically excluded by default
-# (no command needed - this is the default behavior)
+# Server tests are excluded by default (marked with @pytest.mark.server)
+# They require significant RAM and time (48 tests × multiple models)

 # Skip tests requiring actual models
 pytest -k "not requires_model" -v
@@ -154,17 +158,42 @@ pytest --durations=10
 pytest -n auto
 ```

+### Server Tests (Advanced)
+
+**⚠️ Warning**: Server tests require significant system resources and time.
+
+```bash
+# Run comprehensive Issue #20 server tests (48 tests, ~30 minutes)
+pytest tests/integration/test_end_token_issue.py -m server -v
+
+# All server-marked tests (includes above + server functionality)
+pytest -m server -v
+
+# Quick server functionality test only
+pytest tests/integration/test_server_functionality.py -v
+
+# Server tests are RAM-aware - automatically skip models that don't fit
+```
+
+**Server Test Requirements:**
+- **RAM**: 8GB+ recommended (16GB+ for large models)  
+- **Time**: 20-40 minutes for full suite
+- **Models**: Multiple 4-bit quantized models (1B-30B parameters)
+- **Coverage**: Streaming vs non-streaming consistency, token limits, API compliance
+
 ## Python Version Compatibility

 ### Verification Results (August 2025)

+**✅ 132/132 tests passing** - All standard tests validated on Apple Silicon
+
 | Python Version | Status | Tests Passing |
 |----------------|--------|---------------|
-| 3.9.6 (macOS)  | ✅ Verified | 131/131 |
-| 3.10.x         | ✅ Verified | 131/131 |
-| 3.11.x         | ✅ Verified | 131/131 |
-| 3.12.x         | ✅ Verified | 131/131 |
-| 3.13.x         | ✅ Verified | 131/131 |
+| 3.9.6 (macOS)  | ✅ Verified | 132/132 |
+| 3.10.x         | ✅ Verified | 132/132 |
+| 3.11.x         | ✅ Verified | 132/132 |
+| 3.12.x         | ✅ Verified | 132/132 |
+| 3.13.x         | ✅ Verified | 132/132 |

 All versions tested with real MLX model execution (Phi-3-mini-4k-instruct-4bit).

@@ -394,16 +423,16 @@ tests/integration/test_issue_14.py::test_issue_14_self_conversation_regression_o
 ========== 7 passed in 45.23s ==========
 ```

-### Future Server Tests (Planned)
+### Additional Server Tests

-**Issue #15** - Token Limit vs Stop Token Race Condition:
+**Issues #15 & #16** - Dynamic Token Limits (Implemented in 1.1.0-beta1):
 ```bash
-pytest tests/integration/test_issue_15.py -m server -v
+pytest tests/integration/test_issue_15_16.py -v
 ```

-**Issue #16** - Interactive vs Server Token Policies:  
+**Issue #20** - End-Token Filtering (Implemented in 1.1.0-beta2):
 ```bash
-pytest tests/integration/test_issue_16.py -m server -v
+pytest tests/integration/test_end_token_issue.py -m server -v
 ```

 ### Troubleshooting Server Tests