John Doe 58653d5091 Deploy Phases 5-9: Autonomous Ops, Production Hardening, Memory Enhancement, Multi-User, Plugin Expansion
Phase 5 - Autonomous Operations:
- Add session keeper, thought-loop integration, dreamer agent
- modules/memory/dreamer-agent.js
- skills/dreamer-agent/

Phase 6 - Production Hardening:
- docs/operations/ runbooks, monitoring, backup config, governance quorum
- scripts/production-backup.sh

Phase 7 - Memory Enhancement:
- modules/memory/graph-rag-neo4j.js - Neo4j GraphRAG implementation
- modules/memory/episodic-consolidation-config.js
- modules/memory/semantic-promotion.js
- docs/memory/MEMORY_ENHANCEMENT_ARCHITECTURE.md

Phase 8 - Multi-User Setup:
- skills/user-rolodex/ - Identity resolution, relationship tracker
- docs/users/USER_MANAGEMENT.md

Phase 9 - Plugin Expansion:
- plugins/openclaw-hybrid-search-plugin/ - Hybrid search with fusion
- plugins/openclaw-multi-doc-retrieval/ - Multi-document retrieval
- plugins/openclaw-mcp-connectors/ - MCP connectors
- plugins/openclaw-skill-extensions/ - Skill extensions
- docs/plugins/PLUGIN_EXPANSION.md
2026-03-30 22:49:14 -04:00

Heretek OpenClaw Operations Documentation

Overview

This directory contains operational documentation for the Heretek OpenClaw system, including monitoring configurations, backup procedures, governance rules, and operational runbooks.

Directory Structure

docs/operations/
├── README.md                          # This file - Operations index
├── monitoring-config.json             # Monitoring thresholds and alerting rules
├── backup-config.json                 # Backup schedules and retention policies
├── governance-quorum-rules.json       # Governance and voting configuration
├── runbook-agent-restart.md           # Agent restart procedures
├── runbook-service-failure.md         # Service failure recovery
├── runbook-database-corruption.md     # Database corruption handling
├── runbook-backup-restoration.md      # Backup restoration procedures
├── runbook-emergency-shutdown.md      # Emergency shutdown procedures
└── runbook-troubleshooting.md         # General troubleshooting guide

Configuration Files

Monitoring Configuration (monitoring-config.json)

Defines health monitoring thresholds and alerting rules:

  • Health Check Interval: 30 seconds
  • Metrics Retention: 30 days
  • Alert Cooldown: 15 minutes

Thresholds:

Metric                    Warning    Critical
CPU                       70%        90%
Memory                    75%        90%
Disk                      80%        95%
Response Time             5000 ms    15000 ms
Agent Heartbeat Timeout   120 s      -

Alert Channels: Console, File, Webhook (configurable), Email (configurable)
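
As an illustration of how the thresholds above might be applied, here is a minimal sketch. The function name classify_metric and the metric keys are hypothetical, and treating a reading equal to a threshold as crossing it is an assumption; monitoring-config.json remains the authoritative source.

```shell
# Hypothetical helper: classify an integer reading (percent or milliseconds)
# against the warning/critical thresholds from the table above.
# Assumption: a reading equal to a threshold counts as crossing it.
classify_metric() {
  local metric="$1" value="$2" warn crit
  case "$metric" in
    cpu)           warn=70;   crit=90 ;;
    memory)        warn=75;   crit=90 ;;
    disk)          warn=80;   crit=95 ;;
    response_time) warn=5000; crit=15000 ;;
    *) echo "unknown"; return 1 ;;
  esac
  if [ "$value" -ge "$crit" ]; then
    echo "critical"
  elif [ "$value" -ge "$warn" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

classify_metric cpu 72                # prints "warning"
classify_metric response_time 16000   # prints "critical"
```

The helper only accepts whole numbers; a real monitor would also need to handle fractional readings and the heartbeat timeout, which has no critical level.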

Backup Configuration (backup-config.json)

Automated backup schedules:

Backup Type    Schedule               Retention    Min Backups
Database       Daily, 2 AM            30 days      7
Redis          Daily, 3 AM            7 days       3
Workspace      Daily, 4 AM            30 days      7
Agent State    Every 6 hours          7 days       4
Full System    Weekly, Sunday 5 AM    90 days      4

Backup Location: /root/.openclaw/backups/
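
The interaction between Retention and Min Backups can be sketched as follows. This assumes that Min Backups is a floor that is kept even when the files are older than the retention window; that reading and the function name prune_decision are assumptions, not taken from production-backup.sh.

```shell
# Hypothetical helper: decide whether one backup file may be pruned.
# Arguments: age in days, retention in days, number of backups currently
# on disk for this type, and the minimum number that must be kept.
# Assumption: a file is pruned only if it is past retention AND removing
# it would still leave at least the minimum number of backups.
prune_decision() {
  local age_days="$1" retention_days="$2" total="$3" min_keep="$4"
  if [ "$age_days" -gt "$retention_days" ] && [ "$total" -gt "$min_keep" ]; then
    echo "prune"
  else
    echo "keep"
  fi
}

prune_decision 10 7 5 3     # Redis, 10 days old, 5 on disk: prints "prune"
prune_decision 10 7 3 3     # Redis, 10 days old, only 3 on disk: prints "keep"
prune_decision 20 30 12 7   # Database, within retention: prints "keep"
```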

Governance Quorum Rules (governance-quorum-rules.json)

Triad decision-making configuration:

Voting Thresholds:

  • Simple Majority (>50%): Routine operational decisions, skill deployments
  • Supermajority (>66.67%): Governance changes, new agents, security policies
  • Unanimous (100%): Identity changes, core values, autonomy levels

Decision Categories:

  • Operational: Simple majority, 5 min timeout
  • Tactical: Supermajority, 10 min timeout, requires examiner review
  • Strategic: Unanimous, 15 min timeout, requires sentinel approval
  • Emergency: Simple majority, 1 min timeout
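
To make the voting thresholds concrete, here is a minimal sketch using integer arithmetic. The function names and rule keys are hypothetical, and it assumes the thresholds are strict (greater than, not greater than or equal), as the ">" in the list above suggests.

```shell
# Hypothetical helper: does a vote pass under the given rule?
# Arguments: rule (simple | super | unanimous), yes votes, total voters.
# Integer math avoids floating point: >66.67% becomes 10000*yes > 6667*total.
vote_passes() {
  local rule="$1" yes="$2" total="$3"
  [ "$total" -gt 0 ] || return 1
  case "$rule" in
    simple)    [ $(( 2 * yes )) -gt "$total" ] ;;
    super)     [ $(( 10000 * yes )) -gt $(( 6667 * total )) ] ;;
    unanimous) [ "$yes" -eq "$total" ] ;;
    *)         return 2 ;;
  esac
}

# Print "pass" or "fail" for readability.
decide() {
  if vote_passes "$@"; then echo "pass"; else echo "fail"; fi
}

decide simple 2 3      # prints "pass"
decide super 2 3       # prints "fail"
decide unanimous 3 3   # prints "pass"
```

Under this strict reading, and assuming a triad means three voters, 2 of 3 votes is 66.66...% and falls just short of the supermajority threshold, so a supermajority would in practice require all three votes. Whether governance-quorum-rules.json intends that is worth confirming against the file itself.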

Operational Runbooks

Quick Reference

Scenario                Runbook                Severity
Agent not responding    Agent Restart          Medium
Service failure         Service Failure        High
Database issues         Database Corruption    Critical
Need to restore data    Backup Restoration     High
Emergency situation     Emergency Shutdown     Critical
General problems        Troubleshooting        Variable

Runbook Summaries

Agent Restart Procedures

Covers procedures for restarting individual agents or the entire collective:

  • Single agent restart
  • Force restart
  • Full collective restart
  • Clean state restart
  • Rolling restart (zero downtime)

Service Failure Recovery

Recovery procedures for critical infrastructure failures:

  • LiteLLM Gateway failure
  • PostgreSQL failure
  • Redis failure
  • Ollama/GPU failure
  • WebSocket Bridge failure
  • Web Dashboard failure

Database Corruption Handling

Detection, diagnosis, and recovery for database corruption:

  • Corruption detection methods
  • Integrity checking procedures
  • Recovery levels (1-5)
  • Post-recovery verification

Backup Restoration Procedures

Step-by-step restoration from backups:

  • Database restoration
  • Redis restoration
  • Workspace restoration
  • Agent state restoration
  • Full system restoration
  • Restoration testing

Emergency Shutdown Procedures

Emergency shutdown at various levels:

  • Level 1: Graceful (maintenance)
  • Level 2: Controlled (degradation)
  • Level 3: Rapid (security)
  • Level 4: Immediate (instability)
  • Level 5: Nuclear (containment)

Troubleshooting Guide

Systematic troubleshooting for common issues:

  • Agent offline
  • High CPU/memory usage
  • Connection errors
  • A2A protocol issues
  • Vector operation failures
  • GPU/Ollama issues

Scripts

Health Check Script

# Check all services
./scripts/health-check.sh

# Check specific service
./scripts/health-check.sh litellm

# Continuous monitoring
./scripts/health-check.sh --watch

Backup Script

# Full backup
./scripts/production-backup.sh --all

# Specific backup
./scripts/production-backup.sh --database
./scripts/production-backup.sh --redis
./scripts/production-backup.sh --workspace
./scripts/production-backup.sh --agent-state

# Restore
./scripts/production-backup.sh --restore latest

# List backups
./scripts/production-backup.sh --list

# Verify backup
./scripts/production-backup.sh --verify <file>

# Cleanup old backups
./scripts/production-backup.sh --cleanup

Cron Schedules

To enable automated backups and health checks, add the following entries to the crontab (crontab -e):

# Heretek OpenClaw Automated Backups
# Database backup - Daily at 2 AM
0 2 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --database >> /root/.openclaw/logs/backup-cron.log 2>&1

# Redis backup - Daily at 3 AM
0 3 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --redis >> /root/.openclaw/logs/backup-cron.log 2>&1

# Workspace backup - Daily at 4 AM
0 4 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --workspace >> /root/.openclaw/logs/backup-cron.log 2>&1

# Agent state backup - Every 6 hours
0 */6 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --agent-state >> /root/.openclaw/logs/backup-cron.log 2>&1

# Full system backup - Weekly on Sunday at 5 AM
0 5 * * 0 /root/heretek/heretek-openclaw/scripts/production-backup.sh --full >> /root/.openclaw/logs/backup-cron.log 2>&1

# Backup cleanup - Daily at 6 AM
0 6 * * * /root/heretek/heretek-openclaw/scripts/production-backup.sh --cleanup >> /root/.openclaw/logs/backup-cron.log 2>&1

# Health check - Every 5 minutes
*/5 * * * * /root/heretek/heretek-openclaw/scripts/health-check.sh >> /root/.openclaw/logs/health-cron.log 2>&1

Dashboard Access

Service                    Port    URL
Web Dashboard              3000    http://localhost:3000
LiteLLM UI                 4000    http://localhost:4000
WebSocket Bridge (HTTP)    3002    http://localhost:3002
WebSocket Bridge (WS)      3003    ws://localhost:3003

Agent Ports

Agent        Port    Container
Steward      8001    heretek-steward
Alpha        8002    heretek-alpha
Beta         8003    heretek-beta
Charlie      8004    heretek-charlie
Examiner     8005    heretek-examiner
Explorer     8006    heretek-explorer
Sentinel     8007    heretek-sentinel
Coder        8008    heretek-coder
Dreamer      8009    heretek-dreamer
Empath       8010    heretek-empath
Historian    8011    heretek-historian
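
For scripting against the agents, the mapping above could be wrapped in small lookup helpers. This is a sketch only: the function names are hypothetical, and the reachability check assumes the agents listen on localhost and that a netcat build supporting the -z and -w options is installed.

```shell
# Hypothetical helper: map a lowercase agent name to its port (from the table above).
agent_port() {
  case "$1" in
    steward)   echo 8001 ;;
    alpha)     echo 8002 ;;
    beta)      echo 8003 ;;
    charlie)   echo 8004 ;;
    examiner)  echo 8005 ;;
    explorer)  echo 8006 ;;
    sentinel)  echo 8007 ;;
    coder)     echo 8008 ;;
    dreamer)   echo 8009 ;;
    empath)    echo 8010 ;;
    historian) echo 8011 ;;
    *)         return 1 ;;
  esac
}

# Container names follow the heretek-<agent> pattern shown in the table.
agent_container() {
  echo "heretek-$1"
}

# Assumption: agents are reachable on localhost; nc -z only tests that the
# TCP port accepts connections, not that the agent is healthy.
check_agent() {
  local port
  port="$(agent_port "$1")" || { echo "$1: unknown agent"; return 1; }
  if nc -z -w 2 localhost "$port" 2>/dev/null; then
    echo "$1 ($(agent_container "$1")): port $port open"
  else
    echo "$1 ($(agent_container "$1")): port $port closed"
  fi
}
```

Calling check_agent sentinel would report whether port 8007 accepts connections. An open port is a weaker signal than the heartbeat monitoring described earlier, so this is best treated as a quick first check.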

Log Locations

/root/.openclaw/logs/
├── alerts.log              # Alert notifications
├── backup.log              # Backup operations
├── health-cron.log         # Health check logs
├── backup-cron.log         # Cron backup logs
├── governance-events.log   # Governance events
└── security-events.log     # Security events

Emergency Contacts

Role                    Contact
System Administrator    [Configure]
Engineering Lead        [Configure]
Security Team           [Configure]
On-Call Engineer        [Configure]

Version History

Version    Date          Changes
1.0.0      2026-03-31    Initial production hardening