# Open WebUI Benchmark Suite

A comprehensive benchmarking framework for testing Open WebUI performance under various load conditions.

## Overview

This benchmark suite is designed to:

1. **Measure concurrent user capacity** - Test how many users can simultaneously use features like Channels
2. **Identify performance limits** - Find the point where response times degrade
3. **Compare compute profiles** - Test performance across different resource configurations
4. **Generate actionable reports** - Provide detailed metrics and recommendations

## Quick Start

### Prerequisites

- Python 3.11+
- Docker and Docker Compose
- A running Open WebUI instance (or use the provided Docker setup)
- Chromium browser (installed automatically via Playwright for UI benchmarks)

### Installation

```bash
cd benchmark
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

# Install Playwright browsers (required for UI benchmarks)
playwright install chromium
```

### Configuration

Copy the example environment file and configure your admin credentials:

```bash
cp .env.example .env
```

Edit `.env` with your Open WebUI admin credentials:

```dotenv
OPEN_WEBUI_URL=http://localhost:8080
ADMIN_USER_EMAIL=your-admin@example.com
ADMIN_USER_PASSWORD=your-password
```

### Running Benchmarks

1. **Start Open WebUI with benchmark configuration:**

```bash
cd docker
./run.sh default  # Use the default compute profile (2 CPU, 8GB RAM)
```

2. **Run the benchmark:**

```bash
# Run the default benchmark (chat-ui in auto-scale mode).
# Auto-scaling finds the maximum sustainable user count based on P95 response time.
owb run

# Set a custom response time threshold (default: 1000ms)
owb run --response-threshold 2000

# Run with a fixed number of users (disables auto-scaling)
owb run -m 50

# Run with visible browsers for debugging
owb run --headed
owb run --headed --slow-mo 500  # Slow down for visual inspection
```

3. **List available benchmarks:**

```bash
owb list
```

4. **Run other benchmarks:**

```bash
# API-based chat benchmark (no browser)
owb run chat-api -m 50

# Channel API concurrency
owb run channels-api -m 50

# Channel WebSocket benchmark
owb run channels-ws -m 50

# Run all benchmarks
owb run all
```

5. **View results:**

Results are organized by benchmark name and timestamp:

```
results/
└── chat_ui_concurrency/
    └── 20260126_014205/
        ├── result.json    # Detailed benchmark data
        ├── results.csv    # Tabular results
        └── summary.txt    # Human-readable summary
```

## Compute Profiles

Compute profiles define the resource constraints for the Open WebUI container:

| Profile | CPUs | Memory | Use Case |
|---------|------|--------|----------|
| `default` | 2 | 8GB | Local MacBook testing |
| `minimal` | 1 | 4GB | Testing lower bounds |
| `cloud_small` | 2 | 4GB | Small cloud VM |
| `cloud_medium` | 4 | 8GB | Medium cloud VM |
| `cloud_large` | 8 | 16GB | Large cloud VM |

List available profiles:

```bash
owb profiles
```
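
As a rough illustration, the profile table can be modelled as plain data and mapped onto the `CPU_LIMIT` and `MEMORY_LIMIT` variables listed under Environment Variables. This is a sketch under assumptions: the `ComputeProfile` class and `profile_to_env` helper are hypothetical, and how `run.sh` and `compute_profiles.yaml` actually apply a profile may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeProfile:
    cpus: float
    memory_gb: int
    use_case: str


# Values copied from the profile table in this README.
PROFILES = {
    "default": ComputeProfile(2, 8, "Local MacBook testing"),
    "minimal": ComputeProfile(1, 4, "Testing lower bounds"),
    "cloud_small": ComputeProfile(2, 4, "Small cloud VM"),
    "cloud_medium": ComputeProfile(4, 8, "Medium cloud VM"),
    "cloud_large": ComputeProfile(8, 16, "Large cloud VM"),
}


def profile_to_env(name: str) -> dict[str, str]:
    """Translate a profile into container limit variables (assumed mapping)."""
    profile = PROFILES[name]
    return {
        "CPU_LIMIT": f"{float(profile.cpus):.1f}",
        "MEMORY_LIMIT": f"{profile.memory_gb}g",
    }
```

With this mapping, the `default` profile yields `CPU_LIMIT=2.0` and `MEMORY_LIMIT=8g`, which matches the defaults in the environment variable table.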

## Available Benchmarks

### Chat API Concurrency (`chat-api`)

Tests concurrent AI chat performance via the OpenAI-compatible API:

- Creates test users and makes a model publicly available
- Each user sends chat requests via the `/api/chat` endpoint
- Measures response times, throughput, and error rates
- Tests the backend's ability to handle concurrent LLM requests
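
The general pattern of firing one request per user concurrently and timing each one can be sketched with `asyncio`. This is not the suite's implementation: `fake_chat_request` is a hypothetical stand-in that sleeps instead of calling the real API, and a real run would replace it with an HTTP call.

```python
import asyncio
import time


async def fake_chat_request(user_id: int) -> bool:
    """Hypothetical stand-in for one chat request; sleeps instead of calling the API."""
    await asyncio.sleep(0.01)
    return True


async def timed_request(user_id: int) -> tuple[float, bool]:
    """Run one request and return (duration in ms, success flag)."""
    start = time.perf_counter()
    try:
        ok = await fake_chat_request(user_id)
    except Exception:
        ok = False
    duration_ms = (time.perf_counter() - start) * 1000
    return duration_ms, ok


async def run_concurrent(users: int) -> list[tuple[float, bool]]:
    """Issue one request per simulated user, all at the same time."""
    return await asyncio.gather(*(timed_request(i) for i in range(users)))


results = asyncio.run(run_concurrent(users=5))
```

Each entry in `results` pairs a duration with a success flag, which is the raw material for the response time and error rate metrics described later.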

**Usage:**

```bash
owb run chat-api -m 50 --model gpt-4o-mini
```

### Chat UI Concurrency (`chat-ui`) - Default

Tests concurrent AI chat performance through the actual browser UI using Playwright. **This is the default benchmark**, and it runs in auto-scale mode unless a fixed user count is given.

**Auto-scale mode (default):**
- Progressively adds users until P95 response time exceeds threshold
- Automatically finds maximum sustainable concurrent users
- Reports performance at each level tested
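
The auto-scale search can be pictured as a simple loop. The sketch below is an assumption about the general shape of the algorithm, not the suite's code: `find_max_sustainable_users`, its parameters, and the starting level and step of 10 and 20 users are illustrative, and `measure_p95_ms` stands in for a full benchmark run at a given user count.

```python
from typing import Callable


def find_max_sustainable_users(
    measure_p95_ms: Callable[[int], float],
    threshold_ms: float = 1000.0,
    start_users: int = 10,
    step: int = 20,
    max_users: int = 200,
) -> tuple[int, list[tuple[int, float]]]:
    """Add users level by level until P95 exceeds the threshold."""
    levels: list[tuple[int, float]] = []
    max_sustainable = 0
    users = start_users
    while users <= max_users:
        p95 = measure_p95_ms(users)
        levels.append((users, p95))
        if p95 > threshold_ms:
            break  # This level is over the limit; the previous one stands.
        max_sustainable = users
        users += step
    return max_sustainable, levels


# P95 values taken from the example output in "Interpreting Chat UI Benchmark Results".
sample_p95 = {10: 731, 30: 881, 50: 1178, 70: 2133}
best, tested = find_max_sustainable_users(sample_p95.__getitem__, threshold_ms=2000)
```

With those sample values and a 2000ms threshold, the loop stops at 70 users and reports 50 as the last level that stayed under the limit.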

**Fixed mode:**
- Test with a specific number of concurrent users
- Enabled by specifying `--max-users` / `-m`

**How it works:**
- Launches real Chromium browser instances (or contexts)
- Each browser logs in as a different user
- Sends chat messages and waits for streaming responses
- Measures actual user-experienced response times including rendering
- Tests full stack performance: UI, backend, and LLM together

**Usage:**

```bash
# Auto-scale mode (default) - finds max sustainable users
owb run
owb run --response-threshold 2000  # Custom threshold (default: 1000ms)

# Fixed mode - test specific user count
owb run -m 50
owb run -m 50 --model gpt-4o-mini

# Debugging options
owb run --headed                    # Visible browsers
owb run --headed --slow-mo 500      # Slow down for inspection
```

**Configuration:**

```yaml
chat_ui:
  headless: true              # Run browsers in headless mode
  slow_mo: 0                  # Slow down operations by ms (debugging)
  viewport_width: 1280        # Browser viewport width
  viewport_height: 720        # Browser viewport height
  browser_timeout: 30000      # Default timeout in ms
  screenshot_on_error: true   # Capture screenshots on failure
  use_isolated_browsers: false # Use separate browser instances vs contexts
```

**Notes:**
- Browser benchmarks require more resources than API benchmarks
- For high concurrency (50+), use headless mode and browser contexts
- Headed mode is useful for debugging UI issues
- The benchmark measures actual streaming response detection

### Channel Concurrency (`channels-api`)

Tests concurrent user capacity in Open WebUI Channels:

- Creates a test channel
- Progressively adds users (10, 20, 30, ... up to max)
- Each user sends messages at a configured rate
- Measures response times and error rates
- Identifies the maximum sustainable user count

**Configuration options:**

```yaml
channels:
  max_concurrent_users: 100  # Maximum users to test
  user_step_size: 10         # Increment users by this amount
  sustain_time: 30           # Seconds to run at each level
  message_frequency: 0.5     # Messages per second per user
```
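
To get a feel for what these settings imply, the sketch below derives the user levels and the approximate message volume per level. Both helpers are hypothetical; in particular, treating the levels as multiples of `user_step_size` up to `max_concurrent_users` is an assumption based on the "10, 20, 30, ..." description above.

```python
def user_levels(max_concurrent_users: int = 100, user_step_size: int = 10) -> list[int]:
    """User counts tested: step, 2*step, ... up to the maximum (assumed scheme)."""
    return list(range(user_step_size, max_concurrent_users + 1, user_step_size))


def expected_messages(
    users: int, message_frequency: float = 0.5, sustain_time: int = 30
) -> float:
    """Approximate messages sent at one level: users x rate x duration."""
    return users * message_frequency * sustain_time
```

With the defaults shown, the first level (10 users) sends roughly 10 x 0.5 x 30 = 150 messages, and the last level (100 users) roughly 1500.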

### Channel WebSocket (`channels-ws`)

Tests WebSocket scalability for real-time message delivery in Channels:

- Establishes WebSocket connections for multiple users
- Tests real-time message broadcasting
- Measures message delivery latency
- Identifies WebSocket connection limits

## Configuration

Configuration files are located in `config/`:

- `benchmark_config.yaml` - Main benchmark settings
- `compute_profiles.yaml` - Resource profiles for Docker containers

### Environment Variables

All settings can be provided via environment variables, which are loaded from the `.env` file:

| Variable | Description | Default |
|----------|-------------|---------|
| `OPEN_WEBUI_URL` | Open WebUI URL for benchmarking | `http://localhost:8080` |
| `OLLAMA_BASE_URL` | Ollama API URL | `http://host.docker.internal:11434` |
| `ENABLE_CHANNELS` | Enable Channels feature | `true` |
| `ADMIN_USER_EMAIL` | Admin email | - |
| `ADMIN_USER_PASSWORD` | Admin password | - |
| `MAX_CONCURRENT_USERS` | Max concurrent users | `50` |
| `USER_STEP_SIZE` | User increment step | `10` |
| `SUSTAIN_TIME_SECONDS` | Test duration per level | `30` |
| `MESSAGE_FREQUENCY` | Messages/sec per user | `0.5` |
| `OPEN_WEBUI_PORT` | Container port | `8080` |
| `CPU_LIMIT` | CPU limit | `2.0` |
| `MEMORY_LIMIT` | Memory limit | `8g` |
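
A minimal sketch of reading a few of these variables with the documented defaults is shown below. It uses only the standard library; the `EnvSettings` class and `load_settings` function are hypothetical and do not reflect how `benchmark/core/config.py` is actually written.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EnvSettings:
    open_webui_url: str
    max_concurrent_users: int
    user_step_size: int
    sustain_time_seconds: int
    message_frequency: float
    enable_channels: bool


def load_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Read settings from the environment, falling back to the table defaults."""
    return EnvSettings(
        open_webui_url=env.get("OPEN_WEBUI_URL", "http://localhost:8080"),
        max_concurrent_users=int(env.get("MAX_CONCURRENT_USERS", "50")),
        user_step_size=int(env.get("USER_STEP_SIZE", "10")),
        sustain_time_seconds=int(env.get("SUSTAIN_TIME_SECONDS", "30")),
        message_frequency=float(env.get("MESSAGE_FREQUENCY", "0.5")),
        # Accepting "1", "true" and "yes" is an assumption about the format.
        enable_channels=env.get("ENABLE_CHANNELS", "true").strip().lower()
        in {"1", "true", "yes"},
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests.
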

## Extending the Benchmark Suite

### Adding a New Benchmark

1. Create a new file in `benchmark/scenarios/`:

```python
from benchmark.core.base import BaseBenchmark
from benchmark.core.metrics import BenchmarkResult

class MyNewBenchmark(BaseBenchmark):
    name = "My New Benchmark"
    description = "Tests something new"
    version = "1.0.0"

    async def setup(self) -> None:
        # Set up test environment
        pass

    async def run(self) -> BenchmarkResult:
        # Execute the benchmark
        # Use self.metrics to record timings
        return self.metrics.get_result(self.name)

    async def teardown(self) -> None:
        # Clean up
        pass
```

2. Register the benchmark in `benchmark/cli.py`

3. Add configuration options if needed in `config/benchmark_config.yaml`

### Custom Metrics Collection

```python
import asyncio

from benchmark.core.metrics import MetricsCollector


async def do_something() -> None:
    # Placeholder for the operation being measured
    await asyncio.sleep(0.1)


async def main() -> None:
    metrics = MetricsCollector()
    metrics.start()

    # Time individual operations
    with metrics.time_operation("my_operation"):
        await do_something()

    # Or record manually
    metrics.record_timing(
        operation="api_call",
        duration_ms=150.5,
        success=True,
    )

    metrics.stop()
    result = metrics.get_result("My Benchmark")


asyncio.run(main())
```

## Understanding Results

### Key Metrics

| Metric | Description | Good Threshold |
|--------|-------------|----------------|
| `avg_response_time_ms` | Average response time | < 2000ms |
| `p95_response_time_ms` | 95th percentile response time | < 3000ms |
| `error_rate_percent` | Percentage of failed requests | < 1% |
| `requests_per_second` | Throughput | > 10 |
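
For reference, the four metrics can be derived from raw timings roughly as follows. This is a sketch, not the suite's `MetricsCollector`: the `percentile` and `summarize` helpers are hypothetical, and the nearest-rank percentile method is an assumption, since the suite may interpolate instead.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(
    durations_ms: list[float], successes: list[bool], elapsed_seconds: float
) -> dict[str, float]:
    """Compute the key metrics from per-request timings and outcomes."""
    total = len(durations_ms)
    failed = successes.count(False)
    return {
        "avg_response_time_ms": sum(durations_ms) / total,
        "p95_response_time_ms": percentile(durations_ms, 95),
        "error_rate_percent": failed * 100 / total,
        "requests_per_second": total / elapsed_seconds,
    }
```

P95 is usually more informative than the average here, because a handful of slow requests can hide behind a healthy mean.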

### Result Files

Each run directory (`results/{benchmark}/{timestamp}/`) contains:

- `result.json` - Detailed results for the benchmark run
- `results.csv` - Results in CSV format
- `summary.txt` - Human-readable summary

### Interpreting Chat UI Benchmark Results

The chat-ui benchmark in auto-scale mode reports:

- **max_sustainable_users**: Maximum users where P95 stays under threshold
- **levels_tested**: Performance data at each user count level
- **% of Threshold**: How close P95 is to the configured limit

Example auto-scale result:

```
                   Auto-Scale Results
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Users ┃ P95 (ms) ┃ Avg (ms) ┃ % of Threshold ┃ Errors ┃
┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│    10 │      731 │      662 │            37% │   0.0% │
│    30 │      881 │      748 │            44% │   0.0% │
│    50 │     1178 │     1064 │            59% │   0.0% │
│    70 │     2133 │     1854 │           107% │   0.8% │
└───────┴──────────┴──────────┴────────────────┴────────┘

P95 Threshold: 2000ms
Maximum Sustainable Users: 50
```
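
The "% of Threshold" column appears to be the P95 value divided by the configured threshold, rounded to a whole percent. The hypothetical helper below reproduces the column from the example; the rounding rule is an assumption inferred from those numbers.

```python
def percent_of_threshold(p95_ms: float, threshold_ms: float) -> int:
    """P95 as a rounded percentage of the configured threshold."""
    return round(p95_ms / threshold_ms * 100)


# Rows from the example table, with the 2000ms threshold.
column = [percent_of_threshold(p95, 2000) for p95 in (731, 881, 1178, 2133)]
```

Any value above 100 means the level exceeded the threshold, which is why 70 users is rejected and 50 is reported as the maximum.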

### Interpreting Channel Benchmark Results

The channel benchmark reports:

- **max_sustainable_users**: Maximum users where performance thresholds are met
- **results_by_level**: Performance at each user count level
- **tested_levels**: All user counts that were tested

Example result analysis:

```
Users: 10  | P95: 150ms  | Errors: 0%    | ✓ PASS
Users: 20  | P95: 280ms  | Errors: 0.1%  | ✓ PASS
Users: 30  | P95: 520ms  | Errors: 0.3%  | ✓ PASS
Users: 40  | P95: 1200ms | Errors: 0.8%  | ✓ PASS
Users: 50  | P95: 3500ms | Errors: 2.1%  | ✗ FAIL

Maximum sustainable users: 40
```
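
The pass/fail decision per level can be sketched as below. The thresholds are an assumption borrowed from the Key Metrics table (P95 under 3000ms, error rate under 1%); the channel benchmark's real criteria may differ, and `LevelResult`, `passes`, and `max_sustainable_users` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelResult:
    users: int
    p95_ms: float
    error_rate_percent: float


def passes(
    level: LevelResult,
    max_p95_ms: float = 3000.0,
    max_error_rate_percent: float = 1.0,
) -> bool:
    """A level passes when both P95 and error rate are under their limits."""
    return level.p95_ms < max_p95_ms and level.error_rate_percent < max_error_rate_percent


def max_sustainable_users(levels: list[LevelResult]) -> int:
    """Highest user count reached before the first failing level."""
    best = 0
    for level in sorted(levels, key=lambda item: item.users):
        if not passes(level):
            break
        best = level.users
    return best


# Rows from the example analysis above.
example = [
    LevelResult(10, 150, 0.0),
    LevelResult(20, 280, 0.1),
    LevelResult(30, 520, 0.3),
    LevelResult(40, 1200, 0.8),
    LevelResult(50, 3500, 2.1),
]
```

Applied to the example rows, `max_sustainable_users(example)` gives 40, matching the reported maximum.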

## Architecture

```
benchmark/
├── benchmark/
│   ├── core/           # Core framework
│   │   ├── base.py     # Base benchmark class
│   │   ├── config.py   # Configuration management
│   │   ├── metrics.py  # Metrics collection
│   │   └── runner.py   # Benchmark orchestration
│   ├── clients/        # API clients
│   │   ├── http_client.py      # HTTP/REST client
│   │   ├── websocket_client.py # WebSocket client
│   │   └── browser_client.py   # Playwright browser automation
│   ├── scenarios/      # Benchmark implementations
│   │   ├── channels.py # Channel benchmarks
│   │   └── chat_ui.py  # Browser-based chat benchmark
│   ├── utils/          # Utilities
│   │   └── docker.py   # Docker management
│   └── cli.py          # Command-line interface
├── config/             # Configuration files
├── docker/             # Docker Compose for benchmarking
└── results/            # Benchmark output organized by {benchmark}/{timestamp}/
```

## Dependencies

The benchmark suite reuses Open WebUI dependencies where possible:

**From Open WebUI:**
- `httpx` - HTTP client
- `aiohttp` - Async HTTP
- `python-socketio` - WebSocket client
- `pydantic` - Data validation
- `pandas` - Data analysis

**Benchmark-specific:**
- `playwright` - Browser automation for UI testing
- `locust` - Load testing (optional, for advanced scenarios)
- `rich` - Terminal output
- `docker` - Docker SDK
- `matplotlib` - Plotting results

## Troubleshooting

### Common Issues

1. **Connection refused**: Ensure Open WebUI is running and accessible
2. **Authentication errors**: Check the admin credentials (`ADMIN_USER_EMAIL`, `ADMIN_USER_PASSWORD`) in `.env`
3. **Docker resource errors**: Ensure Docker has enough resources allocated
4. **WebSocket timeout**: Increase `websocket_timeout` in config
5. **Browser launch failures**: Run `playwright install chromium` to install browsers
6. **Login timeout in browser tests**: Check that `.env` has correct `ADMIN_USER_NAME` (with quotes if it contains spaces)
7. **High browser concurrency fails**: Run in headless mode (the default; do not pass `--headed`) and ensure sufficient system resources

### Debug Mode

Set logging level to DEBUG:

```bash
export BENCHMARK_LOG_LEVEL=DEBUG
owb run channels-api
```

## Contributing

When adding new benchmarks:

1. Follow the `BaseBenchmark` interface
2. Add tests for the new benchmark
3. Update configuration schema if needed
4. Add documentation to this README

## License

MIT License - See LICENSE file