Heretek OpenClaw Operations Documentation
Overview
This directory contains operational documentation for the Heretek OpenClaw system, including monitoring configurations, backup procedures, governance rules, and operational runbooks.
Directory Structure
docs/operations/
├── README.md # This file - Operations index
├── monitoring-config.json # Monitoring thresholds and alerting rules
├── backup-config.json # Backup schedules and retention policies
├── governance-quorum-rules.json # Governance and voting configuration
├── runbook-agent-restart.md # Agent restart procedures
├── runbook-service-failure.md # Service failure recovery
├── runbook-database-corruption.md # Database corruption handling
├── runbook-backup-restoration.md # Backup restoration procedures
├── runbook-emergency-shutdown.md # Emergency shutdown procedures
└── runbook-troubleshooting.md # General troubleshooting guide
Configuration Files
Monitoring Configuration (monitoring-config.json)
Defines health monitoring thresholds and alerting rules:
- Health Check Interval: 30 seconds
- Metrics Retention: 30 days
- Alert Cooldown: 15 minutes
Thresholds:
| Metric | Warning | Critical |
|---|---|---|
| CPU | 70% | 90% |
| Memory | 75% | 90% |
| Disk | 80% | 95% |
| Response Time | 5000ms | 15000ms |
| Agent Heartbeat Timeout | 120s | - |
Alert Channels: Console, File, Webhook (configurable), Email (configurable)
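As a rough illustration of how the thresholds and cooldown above could be applied, the following Python sketch classifies a metric reading and decides whether a repeat alert should be suppressed. The dictionary layout, the function names `classify` and `should_alert`, and the inclusive (`>=`) comparison are assumptions made for this example; they are not taken from `monitoring-config.json` or from the actual monitoring code.

```python
from typing import Optional

# Values copied from the thresholds table. The key names and dict layout are
# hypothetical and do not reflect the real schema of monitoring-config.json.
THRESHOLDS = {
    "cpu_percent": {"warning": 70, "critical": 90},
    "memory_percent": {"warning": 75, "critical": 90},
    "disk_percent": {"warning": 80, "critical": 95},
    "response_time_ms": {"warning": 5000, "critical": 15000},
}

ALERT_COOLDOWN_SECONDS = 15 * 60  # Alert Cooldown: 15 minutes


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading.

    Assumption: a reading equal to a threshold already counts as reaching it.
    """
    limits = THRESHOLDS[metric]
    if value >= limits["critical"]:
        return "critical"
    if value >= limits["warning"]:
        return "warning"
    return "ok"


def should_alert(now: float, last_alert_at: Optional[float]) -> bool:
    """Return False while a previous alert is still inside the cooldown window."""
    if last_alert_at is None:
        return True
    return now - last_alert_at >= ALERT_COOLDOWN_SECONDS
```

For example, `classify("cpu_percent", 72)` falls between the warning and critical thresholds, and `should_alert` takes timestamps in seconds (such as values from `time.time()`).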
Backup Configuration (backup-config.json)
Defines automated backup schedules, retention periods, and the minimum number of backups to keep:
| Backup Type | Schedule | Retention | Min Backups |
|---|---|---|---|
| Database | Daily 2 AM | 30 days | 7 |
| Redis | Daily 3 AM | 7 days | 3 |
| Workspace | Daily 4 AM | 30 days | 7 |
| Agent State | Every 6 hours | 7 days | 4 |
| Full System | Weekly Sunday 5 AM | 90 days | 4 |
Backup Location: /root/.openclaw/backups/
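The interaction between the Retention and Min Backups columns can be sketched as follows. This is a minimal Python example under one assumed reading of the table: a backup becomes eligible for cleanup once it is older than the retention period, but the newest "Min Backups" copies are never removed. The `POLICY` layout and the function name `select_for_deletion` are hypothetical; the real behaviour is whatever `production-backup.sh --cleanup` implements.

```python
from datetime import datetime, timedelta
from typing import List

# (retention_days, min_backups) per backup type, copied from the table above.
# The key names are hypothetical.
POLICY = {
    "database": (30, 7),
    "redis": (7, 3),
    "workspace": (30, 7),
    "agent_state": (7, 4),
    "full_system": (90, 4),
}


def select_for_deletion(
    backup_type: str, created: List[datetime], now: datetime
) -> List[datetime]:
    """Return the creation times of backups that may be pruned, newest first.

    Assumed semantics: prune backups older than the retention window, but
    always keep the newest `min_backups` copies regardless of age.
    """
    retention_days, min_backups = POLICY[backup_type]
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(created, reverse=True)
    protected = set(newest_first[:min_backups])
    return [ts for ts in newest_first if ts < cutoff and ts not in protected]
```

Under this reading, a Redis backup set with copies aged 1, 2, 10, 20, and 30 days would lose only the 20-day and 30-day copies: the 10-day copy is past the 7-day retention period but is one of the three newest.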
Governance Quorum Rules (governance-quorum-rules.json)
Defines the voting thresholds and decision categories used for triad decision-making:
Voting Thresholds:
- Simple Majority (>50%): Routine operational decisions, skill deployments
- Supermajority (>66.67%): Governance changes, new agents, security policies
- Unanimous (100%): Identity changes, core values, autonomy levels
Decision Categories:
- Operational: Simple majority, 5 min timeout
- Tactical: Supermajority, 10 min timeout, requires examiner review
- Strategic: Unanimous, 15 min timeout, requires sentinel approval
- Emergency: Simple majority, 1 min timeout
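The thresholds and categories above can be expressed as a small vote check. The Python sketch below is illustrative only: the names `VOTE_THRESHOLDS`, `CATEGORIES`, and `passes` are hypothetical, the share is assumed to be computed over the total number of voters, and the extra requirements (examiner review, sentinel approval) are not modelled. The actual rules live in `governance-quorum-rules.json`.

```python
from fractions import Fraction

# Approval shares that must be strictly exceeded, taken literally from the
# list above (>50%, >66.67%). Unanimity is handled as an exact match.
VOTE_THRESHOLDS = {
    "simple_majority": Fraction(1, 2),
    "supermajority": Fraction(6667, 10000),
}

# category -> (threshold rule, timeout in seconds)
CATEGORIES = {
    "operational": ("simple_majority", 5 * 60),
    "tactical": ("supermajority", 10 * 60),
    "strategic": ("unanimous", 15 * 60),
    "emergency": ("simple_majority", 1 * 60),
}


def passes(category: str, votes_for: int, total_voters: int) -> bool:
    """Return True if the approval share meets the category's threshold.

    Assumption: the share is votes_for / total_voters; the source does not
    say whether abstentions or absent voters count toward the total.
    """
    if total_voters <= 0:
        return False
    rule, _timeout_seconds = CATEGORIES[category]
    share = Fraction(votes_for, total_voters)
    if rule == "unanimous":
        return share == 1
    return share > VOTE_THRESHOLDS[rule]
```

One consequence of the strict "greater than" reading: with three voters, 2 of 3 (about 66.67%) clears a simple majority but does not exceed the supermajority threshold, so tactical decisions would need all three votes. Whether that is the intended behaviour should be confirmed against the configuration file.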
Operational Runbooks
Quick Reference
| Scenario | Runbook | Severity |
|---|---|---|
| Agent not responding | Agent Restart | Medium |
| Service failure | Service Failure | High |
| Database issues | Database Corruption | Critical |
| Need to restore data | Backup Restoration | High |
| Emergency situation | Emergency Shutdown | Critical |
| General problems | Troubleshooting | Variable |
Runbook Summaries
Agent Restart Procedures
Covers procedures for restarting individual agents or the entire collective:
- Single agent restart
- Force restart
- Full collective restart
- Clean state restart
- Rolling restart (zero downtime)
Service Failure Recovery
Recovery procedures for critical infrastructure failures:
- LiteLLM Gateway failure
- PostgreSQL failure
- Redis failure
- Ollama/GPU failure
- WebSocket Bridge failure
- Web Dashboard failure
Database Corruption Handling
Detection, diagnosis, and recovery for database corruption:
- Corruption detection methods
- Integrity checking procedures
- Recovery levels (1-5)
- Post-recovery verification
Backup Restoration Procedures
Step-by-step restoration from backups:
- Database restoration
- Redis restoration
- Workspace restoration
- Agent state restoration
- Full system restoration
- Restoration testing
Emergency Shutdown Procedures
Shutdown procedures at five escalation levels:
- Level 1: Graceful (maintenance)
- Level 2: Controlled (degradation)
- Level 3: Rapid (security)
- Level 4: Immediate (instability)
- Level 5: Nuclear (containment)
Troubleshooting Guide
Systematic troubleshooting for common issues:
- Agent offline
- High CPU/memory usage
- Connection errors
- A2A protocol issues
- Vector operation failures
- GPU/Ollama issues
Scripts
Health Check Script
# Check all services
./scripts/health-check.sh
# Check specific service
./scripts/health-check.sh litellm
# Continuous monitoring
./scripts/health-check.sh --watch
Backup Script
# Full backup
./scripts/production-backup.sh --all
# Specific backup
./scripts/production-backup.sh --database
./scripts/production-backup.sh --redis
./scripts/production-backup.sh --workspace
# Restore
./scripts/production-backup.sh --restore latest
# List backups
./scripts/production-backup.sh --list
# Verify backup
./scripts/production-backup.sh --verify <file>
# Cleanup old backups
./scripts/production-backup.sh --cleanup
Cron Schedules
To enable automated backups, backup cleanup, and periodic health checks, add the following entries to the crontab (edit it with crontab -e):
# Heretek OpenClaw Automated Backups
# Database backup - Daily at 2 AM
0 2 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --database >> /root/.openclaw/logs/backup-cron.log 2>&1
# Redis backup - Daily at 3 AM
0 3 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --redis >> /root/.openclaw/logs/backup-cron.log 2>&1
# Workspace backup - Daily at 4 AM
0 4 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --workspace >> /root/.openclaw/logs/backup-cron.log 2>&1
# Agent state backup - Every 6 hours
0 */6 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --agent-state >> /root/.openclaw/logs/backup-cron.log 2>&1
# Full system backup - Weekly on Sunday at 5 AM
0 5 * * 0 /root/heretek/heretek-openclaw/scripts/production-backup.sh --full >> /root/.openclaw/logs/backup-cron.log 2>&1
# Backup cleanup - Daily at 6 AM
0 6 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --cleanup >> /root/.openclaw/logs/backup-cron.log 2>&1
# Health check - Every 5 minutes
*/5 * * * * /root/heretek/heretek-openclaw/scripts/health-check.sh >> /root/.openclaw/logs/health-cron.log 2>&1
Dashboard Access
| Service | Port | URL |
|---|---|---|
| Web Dashboard | 3000 | http://localhost:3000 |
| LiteLLM UI | 4000 | http://localhost:4000 |
| WebSocket Bridge (HTTP) | 3002 | http://localhost:3002 |
| WebSocket Bridge (WS) | 3003 | ws://localhost:3003 |
Agent Ports
| Agent | Port | Container |
|---|---|---|
| Steward | 8001 | heretek-steward |
| Alpha | 8002 | heretek-alpha |
| Beta | 8003 | heretek-beta |
| Charlie | 8004 | heretek-charlie |
| Examiner | 8005 | heretek-examiner |
| Explorer | 8006 | heretek-explorer |
| Sentinel | 8007 | heretek-sentinel |
| Coder | 8008 | heretek-coder |
| Dreamer | 8009 | heretek-dreamer |
| Empath | 8010 | heretek-empath |
| Historian | 8011 | heretek-historian |
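When checking whether agents are reachable, the port table above can be turned into a quick TCP probe. The following Python sketch only tests whether something accepts connections on each port; it does not call any agent API and is not a substitute for `health-check.sh`. The function names and the default host are assumptions for this example.

```python
import socket
from typing import Dict

# Agent name -> port, copied from the table above.
AGENT_PORTS = {
    "steward": 8001,
    "alpha": 8002,
    "beta": 8003,
    "charlie": 8004,
    "examiner": 8005,
    "explorer": 8006,
    "sentinel": 8007,
    "coder": 8008,
    "dreamer": 8009,
    "empath": 8010,
    "historian": 8011,
}


def container_name(agent: str) -> str:
    """Container names in the table follow the pattern heretek-<agent>."""
    return f"heretek-{agent}"


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_agents(host: str = "localhost") -> Dict[str, bool]:
    """Probe every agent port and report which ones accept connections."""
    return {name: is_listening(host, port) for name, port in AGENT_PORTS.items()}


if __name__ == "__main__":
    for name, up in probe_agents().items():
        status = "up" if up else "DOWN"
        print(f"{container_name(name):20s} {AGENT_PORTS[name]}  {status}")
```

An agent reported as DOWN here is a candidate for the Agent Restart runbook; an open port alone does not prove that the agent is healthy.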
Log Locations
/root/.openclaw/logs/
├── alerts.log # Alert notifications
├── backup.log # Backup operations
├── health-cron.log # Health check logs
├── backup-cron.log # Cron backup logs
├── governance-events.log # Governance events
└── security-events.log # Security events
Emergency Contacts
| Role | Contact |
|---|---|
| System Administrator | [Configure] |
| Engineering Lead | [Configure] |
| Security Team | [Configure] |
| On-Call Engineer | [Configure] |
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-31 | Initial production hardening |