# Open WebUI Benchmark Suite
A comprehensive benchmarking framework for testing Open WebUI performance under various load conditions.
## Overview
This benchmark suite is designed to:
1. **Measure concurrent user capacity** - Test how many users can simultaneously use features like Channels
2. **Identify performance limits** - Find the point where response times degrade
3. **Compare compute profiles** - Test performance across different resource configurations
4. **Generate actionable reports** - Provide detailed metrics and recommendations
## Quick Start
### Prerequisites
- Python 3.11+
- Docker and Docker Compose
- A running Open WebUI instance (or use the provided Docker setup)
- Chromium browser (installed automatically via Playwright for UI benchmarks)
### Installation
```bash
cd benchmark
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .
# Install Playwright browsers (required for UI benchmarks)
playwright install chromium
```
### Configuration
Copy the example environment file and configure your admin credentials:
```bash
cp .env.example .env
```
Edit `.env` with your Open WebUI admin credentials:
```dotenv
OPEN_WEBUI_URL=http://localhost:8080
ADMIN_USER_EMAIL=your-admin@example.com
ADMIN_USER_PASSWORD=your-password
```
### Running Benchmarks
1. **Start Open WebUI with benchmark configuration:**
```bash
cd docker
./run.sh default # Use the default compute profile (2 CPU, 8GB RAM)
```
2. **Run the benchmark:**
```bash
# Run the default benchmark (chat-ui in auto-scale mode).
# Auto-scaling finds the max sustainable user count based on P95 response time.
owb run
# Set a custom response time threshold (default: 1000ms)
owb run --response-threshold 2000
# Run with a fixed number of users (disables auto-scaling)
owb run -m 50
# Run with visible browsers for debugging
owb run --headed
owb run --headed --slow-mo 500 # Slow down for visual inspection
```
3. **List available benchmarks:**
```bash
owb list
```
4. **Run other benchmarks:**
```bash
# API-based chat benchmark (no browser)
owb run chat-api -m 50
# Channel API concurrency
owb run channels-api -m 50
# Channel WebSocket benchmark
owb run channels-ws -m 50
# Run all benchmarks
owb run all
```
5. **View results:**
Results are organized by benchmark name and timestamp:
```
results/
└── chat_ui_concurrency/
    └── 20260126_014205/
        ├── result.json    # Detailed benchmark data
        ├── results.csv    # Tabular results
        └── summary.txt    # Human-readable summary
```
## Compute Profiles
Compute profiles define the resource constraints for the Open WebUI container:
| Profile | CPUs | Memory | Use Case |
|---------|------|--------|----------|
| `default` | 2 | 8GB | Local MacBook testing |
| `minimal` | 1 | 4GB | Testing lower bounds |
| `cloud_small` | 2 | 4GB | Small cloud VM |
| `cloud_medium` | 4 | 8GB | Medium cloud VM |
| `cloud_large` | 8 | 16GB | Large cloud VM |
List available profiles:
```bash
owb profiles
```
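The profile table can be related to the `CPU_LIMIT` and `MEMORY_LIMIT` variables listed under Environment Variables. The sketch below is only an illustration: `PROFILES` and `profile_to_env` are hypothetical names, and whether `run.sh` derives the limits this way is an assumption. The value formats (`2.0`, `8g`) follow the documented defaults.

```python
# Hypothetical mirror of the profile table above (not the suite's own data structure).
PROFILES = {
    "default": {"cpus": 2, "memory_gb": 8},
    "minimal": {"cpus": 1, "memory_gb": 4},
    "cloud_small": {"cpus": 2, "memory_gb": 4},
    "cloud_medium": {"cpus": 4, "memory_gb": 8},
    "cloud_large": {"cpus": 8, "memory_gb": 16},
}


def profile_to_env(name: str) -> dict[str, str]:
    """Render a profile as CPU_LIMIT / MEMORY_LIMIT strings."""
    profile = PROFILES[name]
    return {
        "CPU_LIMIT": f"{float(profile['cpus']):.1f}",
        "MEMORY_LIMIT": f"{profile['memory_gb']}g",
    }


print(profile_to_env("default"))  # {'CPU_LIMIT': '2.0', 'MEMORY_LIMIT': '8g'}
```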
## Available Benchmarks
### Chat API Concurrency (`chat-api`)
Tests concurrent AI chat performance via the OpenAI-compatible API:
- Creates test users and makes a model publicly available
- Each user sends chat requests via the `/api/chat` endpoint
- Measures response times, throughput, and error rates
- Tests the backend's ability to handle concurrent LLM requests
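The fan-out pattern in the list above can be sketched with `asyncio`. This is a minimal illustration, not the suite's implementation: `run_concurrent_users` and the `send_chat` callable are hypothetical names, and a real run would issue HTTP requests where the caller-supplied coroutine is invoked.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_concurrent_users(
    send_chat: Callable[[int], Awaitable[None]],
    user_count: int,
    requests_per_user: int = 1,
) -> dict:
    """Run `user_count` simulated users concurrently and summarize the outcome.

    `send_chat` is a hypothetical coroutine that performs one chat request
    for the given user index and raises an exception on failure.
    """
    durations_ms: list[float] = []
    errors = 0

    async def one_user(user_index: int) -> None:
        nonlocal errors
        for _ in range(requests_per_user):
            started = time.perf_counter()
            try:
                await send_chat(user_index)
            except Exception:
                errors += 1  # Count the failure and keep going
            else:
                durations_ms.append((time.perf_counter() - started) * 1000)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_user(i) for i in range(user_count)))
    elapsed_s = time.perf_counter() - wall_start

    total = user_count * requests_per_user
    return {
        "total_requests": total,
        "successful_requests": len(durations_ms),
        "error_rate_percent": 100.0 * errors / total if total else 0.0,
        "requests_per_second": total / elapsed_s if elapsed_s > 0 else 0.0,
        "avg_response_time_ms": (
            sum(durations_ms) / len(durations_ms) if durations_ms else 0.0
        ),
    }
```

Passing the request function as a parameter keeps the timing logic independent of any particular HTTP client, so the same loop can be exercised with a stub during development.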
**Usage:**
```bash
owb run chat-api -m 50 --model gpt-4o-mini
```
### Chat UI Concurrency (`chat-ui`) - Default
Tests concurrent AI chat performance through the actual browser UI using Playwright. **This is the default benchmark**, and it runs in auto-scale mode unless a fixed user count is given.
**Auto-scale mode (default):**
- Progressively adds users until P95 response time exceeds threshold
- Automatically finds maximum sustainable concurrent users
- Reports performance at each level tested
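The auto-scale procedure above amounts to a simple search loop. The following is a minimal sketch under stated assumptions: `find_max_sustainable_users` and `measure_p95_ms` are hypothetical names, and the starting level, step, and upper bound are illustrative parameters rather than the suite's actual defaults.

```python
from typing import Callable


def find_max_sustainable_users(
    measure_p95_ms: Callable[[int], float],
    threshold_ms: float = 1000.0,
    start_users: int = 10,
    step: int = 10,
    max_users: int = 100,
) -> tuple[int, list[tuple[int, float]]]:
    """Increase the user count until P95 exceeds the threshold.

    Returns the last passing user count (0 if even the first level fails)
    and the (users, p95_ms) pairs for every level that was tested.
    """
    levels: list[tuple[int, float]] = []
    max_sustainable = 0
    users = start_users
    while users <= max_users:
        p95 = measure_p95_ms(users)  # Run one load level and measure it
        levels.append((users, p95))
        if p95 > threshold_ms:
            break  # Stop at the first level that violates the threshold
        max_sustainable = users
        users += step
    return max_sustainable, levels
```

The function stops at the first failing level instead of probing further, which keeps the system under test from being pushed far past its limit.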
**Fixed mode:**
- Test with a specific number of concurrent users
- Enabled by specifying `--max-users` / `-m`
**How it works:**
- Launches real Chromium browser instances (or contexts)
- Each browser logs in as a different user
- Sends chat messages and waits for streaming responses
- Measures actual user-experienced response times including rendering
- Tests full stack performance: UI, backend, and LLM together
**Usage:**
```bash
# Auto-scale mode (default) - finds max sustainable users
owb run
owb run --response-threshold 2000 # Custom threshold (default: 1000ms)
# Fixed mode - test specific user count
owb run -m 50
owb run -m 50 --model gpt-4o-mini
# Debugging options
owb run --headed # Visible browsers
owb run --headed --slow-mo 500 # Slow down for inspection
```
**Configuration:**
```yaml
chat_ui:
  headless: true                # Run browsers in headless mode
  slow_mo: 0                    # Slow down operations by ms (debugging)
  viewport_width: 1280          # Browser viewport width
  viewport_height: 720          # Browser viewport height
  browser_timeout: 30000        # Default timeout in ms
  screenshot_on_error: true     # Capture screenshots on failure
  use_isolated_browsers: false  # Use separate browser instances vs contexts
```
**Notes:**
- Browser benchmarks require more resources than API benchmarks
- For high concurrency (50+), use headless mode and browser contexts
- Headed mode is useful for debugging UI issues
- The benchmark measures actual streaming response detection
### Channel Concurrency (`channels-api`)
Tests concurrent user capacity in Open WebUI Channels:
- Creates a test channel
- Progressively adds users (10, 20, 30, ... up to max)
- Each user sends messages at a configured rate
- Measures response times and error rates
- Identifies the maximum sustainable user count
**Configuration options:**
```yaml
channels:
  max_concurrent_users: 100  # Maximum users to test
  user_step_size: 10         # Increment users by this amount
  sustain_time: 30           # Seconds to run at each level
  message_frequency: 0.5     # Messages per second per user
```
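To see what these settings imply for a run, the sketch below expands them into a list of load levels. `plan_channel_levels` is a hypothetical helper, and it assumes the first level equals the step size, matching the "10, 20, 30, ..." progression described above.

```python
def plan_channel_levels(
    max_concurrent_users: int = 100,
    user_step_size: int = 10,
    sustain_time: int = 30,
    message_frequency: float = 0.5,
) -> list[dict]:
    """Expand the channel settings into one planned load level per step."""
    levels = []
    for users in range(user_step_size, max_concurrent_users + 1, user_step_size):
        levels.append(
            {
                "users": users,
                # Each user waits this long between messages
                "seconds_between_messages": 1.0 / message_frequency,
                # Messages expected from all users over the whole level
                "expected_messages": round(users * message_frequency * sustain_time),
            }
        )
    return levels


plan = plan_channel_levels()
print(len(plan))  # 10
print(plan[0])    # {'users': 10, 'seconds_between_messages': 2.0, 'expected_messages': 150}
```

With the values shown, a full run covers 10 levels of 30 seconds each, so the load phase alone takes about 5 minutes.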
### Channel WebSocket (`channels-ws`)
Tests WebSocket scalability for real-time message delivery in Channels:
- Establishes WebSocket connections for multiple users
- Tests real-time message broadcasting
- Measures message delivery latency
- Identifies WebSocket connection limits
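Delivery latency is the time between sending a message and each connected client receiving it. The sketch below illustrates the measurement only: in-process `asyncio` queues stand in for real WebSocket connections, and `measure_broadcast_latency` is a hypothetical name, not part of the suite.

```python
import asyncio
import time


async def measure_broadcast_latency(receiver_count: int) -> list[float]:
    """Broadcast one timestamped message and return per-receiver latency in ms.

    In-process queues stand in for WebSocket connections here.
    """
    queues = [asyncio.Queue() for _ in range(receiver_count)]

    async def receiver(queue: asyncio.Queue) -> float:
        sent_at = await queue.get()  # Wait for the "message" to arrive
        return (time.perf_counter() - sent_at) * 1000

    tasks = [asyncio.create_task(receiver(q)) for q in queues]
    sent_at = time.perf_counter()
    for q in queues:
        q.put_nowait(sent_at)  # "Broadcast" the same send timestamp to everyone
    return list(await asyncio.gather(*tasks))
```

Because sender and receivers share one process and one clock here, the timestamps are directly comparable; against a real server, the send time would need to be recorded by the same machine that observes delivery.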
## Configuration
Configuration files are located in `config/`:
- `benchmark_config.yaml` - Main benchmark settings
- `compute_profiles.yaml` - Resource profiles for Docker containers
### Environment Variables
All configuration can be set via environment variables (loaded from the `.env` file):
| Variable | Description | Default |
|----------|-------------|---------|
| `OPEN_WEBUI_URL` | Open WebUI URL for benchmarking | `http://localhost:8080` |
| `OLLAMA_BASE_URL` | Ollama API URL | `http://host.docker.internal:11434` |
| `ENABLE_CHANNELS` | Enable Channels feature | `true` |
| `ADMIN_USER_EMAIL` | Admin email | - |
| `ADMIN_USER_PASSWORD` | Admin password | - |
| `MAX_CONCURRENT_USERS` | Max concurrent users | `50` |
| `USER_STEP_SIZE` | User increment step | `10` |
| `SUSTAIN_TIME_SECONDS` | Test duration per level | `30` |
| `MESSAGE_FREQUENCY` | Messages/sec per user | `0.5` |
| `OPEN_WEBUI_PORT` | Container port | `8080` |
| `CPU_LIMIT` | CPU limit | `2.0` |
| `MEMORY_LIMIT` | Memory limit | `8g` |
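
As an illustration of how the load-related variables and their defaults could be read, here is a minimal sketch. `LoadSettings` and `load_settings` are hypothetical names, and this is not the suite's `config.py`; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LoadSettings:
    """Hypothetical container for the load-related variables in the table."""

    open_webui_url: str
    max_concurrent_users: int
    user_step_size: int
    sustain_time_seconds: int
    message_frequency: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> LoadSettings:
    """Read settings from `env` (defaults to os.environ), using the table defaults."""
    env = os.environ if env is None else env
    return LoadSettings(
        open_webui_url=env.get("OPEN_WEBUI_URL", "http://localhost:8080"),
        max_concurrent_users=int(env.get("MAX_CONCURRENT_USERS", "50")),
        user_step_size=int(env.get("USER_STEP_SIZE", "10")),
        sustain_time_seconds=int(env.get("SUSTAIN_TIME_SECONDS", "30")),
        message_frequency=float(env.get("MESSAGE_FREQUENCY", "0.5")),
    )
```

Accepting the environment as a parameter makes the defaults easy to check without touching the real process environment, for example `load_settings({"MAX_CONCURRENT_USERS": "80"})`.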
## Extending the Benchmark Suite
### Adding a New Benchmark
1. Create a new file in `benchmark/scenarios/`:
```python
from benchmark.core.base import BaseBenchmark
from benchmark.core.metrics import BenchmarkResult


class MyNewBenchmark(BaseBenchmark):
    name = "My New Benchmark"
    description = "Tests something new"
    version = "1.0.0"

    async def setup(self) -> None:
        # Set up test environment
        pass

    async def run(self) -> BenchmarkResult:
        # Execute the benchmark
        # Use self.metrics to record timings
        return self.metrics.get_result(self.name)

    async def teardown(self) -> None:
        # Clean up
        pass
```
2. Register the benchmark in `benchmark/cli.py`
3. Add configuration options if needed in `config/benchmark_config.yaml`
### Custom Metrics Collection
```python
from benchmark.core.metrics import MetricsCollector
metrics = MetricsCollector()
metrics.start()
# Time individual operations
with metrics.time_operation("my_operation"):
    await do_something()  # Placeholder coroutine; must run inside an async function

# Or record manually
metrics.record_timing(
    operation="api_call",
    duration_ms=150.5,
    success=True,
)
metrics.stop()
result = metrics.get_result("My Benchmark")
```
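For readers curious how a timing context manager of this kind can work, here is a self-contained sketch. `TimingRecorder` is a hypothetical stand-in written for illustration; it is not the suite's `MetricsCollector` and makes no claim about its internals beyond the two method names used above.

```python
import time
from contextlib import contextmanager


class TimingRecorder:
    """Hypothetical stand-in for illustration; not the suite's MetricsCollector."""

    def __init__(self) -> None:
        self.timings: list[dict] = []

    def record_timing(self, operation: str, duration_ms: float, success: bool) -> None:
        """Store one timing sample."""
        self.timings.append(
            {"operation": operation, "duration_ms": duration_ms, "success": success}
        )

    @contextmanager
    def time_operation(self, operation: str):
        """Time the enclosed block and record whether it raised."""
        started = time.perf_counter()
        try:
            yield
        except Exception:
            self.record_timing(operation, (time.perf_counter() - started) * 1000, False)
            raise  # Re-raise so callers still see the failure
        else:
            self.record_timing(operation, (time.perf_counter() - started) * 1000, True)
```

Recording failed operations as well as successful ones is what makes an error rate computable later.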
## Understanding Results
### Key Metrics
| Metric | Description | Good Threshold |
|--------|-------------|----------------|
| `avg_response_time_ms` | Average response time | < 2000ms |
| `p95_response_time_ms` | 95th percentile response time | < 3000ms |
| `error_rate_percent` | Percentage of failed requests | < 1% |
| `requests_per_second` | Throughput | > 10 |
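
The four metrics can be derived from raw request timings as sketched below. `summarize` is a hypothetical helper, and P95 is computed with the nearest-rank method, which is an assumption: the suite may use a different percentile definition and report slightly different values.

```python
def summarize(durations_ms: list[float], failures: int, elapsed_s: float) -> dict:
    """Summarize successful request durations plus a failure count.

    P95 uses the nearest-rank method; other percentile definitions
    can give slightly different values.
    """
    total = len(durations_ms) + failures
    ordered = sorted(durations_ms)
    if ordered:
        # 1-based nearest rank: ceil(0.95 * n), done in integer arithmetic
        rank = (95 * len(ordered) + 99) // 100
        p95 = ordered[rank - 1]
        avg = sum(ordered) / len(ordered)
    else:
        p95 = avg = 0.0
    return {
        "avg_response_time_ms": avg,
        "p95_response_time_ms": p95,
        "error_rate_percent": 100.0 * failures / total if total else 0.0,
        "requests_per_second": total / elapsed_s if elapsed_s > 0 else 0.0,
    }
```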
### Result Files
- `result.json` - Detailed results for the benchmark run
- `results.csv` - Tabular results in CSV format
- `summary.txt` - Human-readable summary
### Interpreting Chat UI Benchmark Results
The chat-ui benchmark in auto-scale mode reports:
- **max_sustainable_users**: Maximum users where P95 stays under threshold
- **levels_tested**: Performance data at each user count level
- **% of Threshold**: How close P95 is to the configured limit
Example auto-scale result:
```
Auto-Scale Results
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Users ┃ P95 (ms) ┃ Avg (ms) ┃ % of Threshold ┃ Errors ┃
┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 10 │ 731 │ 662 │ 37% │ 0.0% │
│ 30 │ 881 │ 748 │ 44% │ 0.0% │
│ 50 │ 1178 │ 1064 │ 59% │ 0.0% │
│ 70 │ 2133 │ 1854 │ 107% │ 0.8% │
└───────┴──────────┴──────────┴────────────────┴────────┘
P95 Threshold: 2000ms
Maximum Sustainable Users: 50
```
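The "% of Threshold" column is the P95 value divided by the configured threshold. A short check against the rows in the table above, with `percent_of_threshold` as a hypothetical helper name:

```python
def percent_of_threshold(p95_ms: float, threshold_ms: float) -> int:
    """Return P95 as a rounded percentage of the configured threshold."""
    return round(100 * p95_ms / threshold_ms)


# (users, p95_ms) rows from the example table, with a 2000 ms threshold
rows = [(10, 731), (30, 881), (50, 1178), (70, 2133)]
print([percent_of_threshold(p95, 2000) for _, p95 in rows])  # [37, 44, 59, 107]
```

The 70-user level is the first to exceed 100%, which is why 50 is reported as the maximum sustainable user count.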
### Interpreting Channel Benchmark Results
The channel benchmark reports:
- **max_sustainable_users**: Maximum users where performance thresholds are met
- **results_by_level**: Performance at each user count level
- **tested_levels**: All user counts that were tested
Example result analysis:
```
Users: 10 | P95: 150ms | Errors: 0% | ✓ PASS
Users: 20 | P95: 280ms | Errors: 0.1% | ✓ PASS
Users: 30 | P95: 520ms | Errors: 0.3% | ✓ PASS
Users: 40 | P95: 1200ms | Errors: 0.8% | ✓ PASS
Users: 50 | P95: 3500ms | Errors: 2.1% | ✗ FAIL
Maximum sustainable users: 40
```
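The pass/fail decision in the example can be reproduced with a few lines. This sketch assumes the limits from the Key Metrics table (P95 under 3000 ms, error rate under 1%); whether the channel benchmark applies exactly these limits is an assumption, and `max_sustainable_users` is a hypothetical helper name.

```python
def max_sustainable_users(
    levels: list[tuple[int, float, float]],
    p95_limit_ms: float = 3000.0,
    error_limit_percent: float = 1.0,
) -> int:
    """Return the highest user count before the first failing level.

    Each level is (users, p95_ms, error_rate_percent), in increasing user order.
    """
    best = 0
    for users, p95_ms, error_percent in levels:
        if p95_ms >= p95_limit_ms or error_percent >= error_limit_percent:
            break  # First failing level ends the search
        best = users
    return best


example = [
    (10, 150, 0.0),
    (20, 280, 0.1),
    (30, 520, 0.3),
    (40, 1200, 0.8),
    (50, 3500, 2.1),
]
print(max_sustainable_users(example))  # 40
```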
## Architecture
```
benchmark/
├── benchmark/
│ ├── core/ # Core framework
│ │ ├── base.py # Base benchmark class
│ │ ├── config.py # Configuration management
│ │ ├── metrics.py # Metrics collection
│ │ └── runner.py # Benchmark orchestration
│ ├── clients/ # API clients
│ │ ├── http_client.py # HTTP/REST client
│ │ ├── websocket_client.py # WebSocket client
│ │ └── browser_client.py # Playwright browser automation
│ ├── scenarios/ # Benchmark implementations
│ │ ├── channels.py # Channel benchmarks
│ │ └── chat_ui.py # Browser-based chat benchmark
│ ├── utils/ # Utilities
│ │ └── docker.py # Docker management
│ └── cli.py # Command-line interface
├── config/ # Configuration files
├── docker/ # Docker Compose for benchmarking
└── results/ # Benchmark output organized by {benchmark}/{timestamp}/
```
## Dependencies
The benchmark suite reuses Open WebUI dependencies where possible:
**From Open WebUI:**
- `httpx` - HTTP client
- `aiohttp` - Async HTTP
- `python-socketio` - WebSocket client
- `pydantic` - Data validation
- `pandas` - Data analysis
**Benchmark-specific:**
- `playwright` - Browser automation for UI testing
- `locust` - Load testing (optional, for advanced scenarios)
- `rich` - Terminal output
- `docker` - Docker SDK
- `matplotlib` - Plotting results
## Troubleshooting
### Common Issues
1. **Connection refused**: Ensure Open WebUI is running and accessible
2. **Authentication errors**: Check the admin credentials (`ADMIN_USER_EMAIL`, `ADMIN_USER_PASSWORD`) in `.env`
3. **Docker resource errors**: Ensure Docker has enough resources allocated
4. **WebSocket timeout**: Increase `websocket_timeout` in config
5. **Browser launch failures**: Run `playwright install chromium` to install browsers
6. **Login timeout in browser tests**: Check that `.env` has correct `ADMIN_USER_NAME` (with quotes if it contains spaces)
7. **High browser concurrency fails**: Run in headless mode (the default; omit `--headed`) and ensure sufficient system resources
### Debug Mode
Set logging level to DEBUG:
```bash
export BENCHMARK_LOG_LEVEL=DEBUG
owb run channels-api
```
## Contributing
When adding new benchmarks:
1. Follow the `BaseBenchmark` interface
2. Add tests for the new benchmark
3. Update configuration schema if needed
4. Add documentation to this README
## License
MIT License - See LICENSE file