Troubleshooting Guide
Overview
This guide provides systematic troubleshooting procedures for common issues in the Heretek OpenClaw system.
Diagnostic Tools
Health Check Script
# Full system health check
./scripts/health-check.sh
# Check specific service
./scripts/health-check.sh litellm
./scripts/health-check.sh postgres
./scripts/health-check.sh redis
# Continuous monitoring
./scripts/health-check.sh --watch
Backup Script
# List backups
./scripts/production-backup.sh --list
# Verify backup
./scripts/production-backup.sh --verify <backup-file>
# Cleanup old backups
./scripts/production-backup.sh --cleanup
Docker Commands
# List all containers
docker ps -a --filter "name=heretek-"
# View logs
docker logs heretek-<service-name> --tail 100
# Follow logs in real-time
docker logs -f heretek-<service-name>
# Check resource usage
docker stats heretek-<service-name>
# Inspect container
docker inspect heretek-<service-name>
# Execute command in container
docker exec -it heretek-<service-name> bash
Common Issues
Issue: Agent Offline
Symptoms: Dashboard shows agent as offline, health check fails
Diagnosis:
# Check container status
docker ps | grep heretek-<agent-name>
# Check logs
docker logs heretek-<agent-name> --tail 100
# Check health endpoint (replace 800X with the agent's port; agents listen on 8001-8011)
curl -f http://localhost:800X/health
Solutions:
# Restart agent
docker restart heretek-<agent-name>
# If the restart fails, check resources
docker stats --no-stream heretek-<agent-name>
# Check network connectivity
docker network inspect heretek-network
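When an agent is offline because its container has stopped, the container state usually says why. The sketch below turns the three state fields reported by `docker inspect` into a short hint. The function name `explain_state` and the wording of the hints are illustrative assumptions, not part of the OpenClaw tooling.

```shell
# Sketch (hypothetical helper): interpret "<status> <exit code> <oom killed>"
# as printed by:
#   docker inspect --format '{{.State.Status}} {{.State.ExitCode}} {{.State.OOMKilled}}' <container>
explain_state() {
  local status="$1" exit_code="$2" oom="$3"
  if [ "$oom" = "true" ]; then
    echo "killed by the kernel OOM killer; see High Memory Usage"
  elif [ "$status" = "running" ]; then
    echo "container is running; check the health endpoint and logs"
  elif [ "$exit_code" = "137" ]; then
    echo "received SIGKILL (exit 137); possibly docker kill or a memory limit"
  elif [ "$exit_code" = "0" ]; then
    echo "exited cleanly; check the restart policy"
  else
    echo "exited with code $exit_code; check the logs"
  fi
}

# Usage (requires Docker; substitute the real agent name):
#   explain_state $(docker inspect --format \
#     '{{.State.Status}} {{.State.ExitCode}} {{.State.OOMKilled}}' heretek-steward)
```

The hints only narrow the search; the container logs remain the authoritative source for the actual failure.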
Issue: High CPU Usage
Symptoms: System slow, fans running high, docker stats shows high CPU
Diagnosis:
# Identify high CPU container
docker stats --no-stream
# Check which process is using CPU
docker exec heretek-<service-name> ps aux --sort=-%cpu | head -5
Solutions:
# 1. Check for runaway processes
docker top heretek-<service-name>
# 2. Restart the service
docker restart heretek-<service-name>
# 3. If persistent, check for infinite loops in logs (2>&1 includes the container's stderr)
docker logs heretek-<service-name> --tail 1000 2>&1 | grep -i "loop\|retry\|error"
# 4. Consider scaling resources or limiting CPU
docker update --cpus=2.0 heretek-<service-name>
Issue: High Memory Usage
Symptoms: System swapping, OOM kills, slow performance
Diagnosis:
# Check memory usage
docker stats --no-stream
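# Check memory from inside the container (free reads /proc/meminfo, so it typically shows host totals, not the container limit)
docker exec heretek-<service-name> free -m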
# Check OOM events
dmesg | grep -i "out of memory"
Solutions:
# 1. Restart memory-heavy services
docker restart heretek-litellm heretek-<agents>
# 2. Clear Redis cache if needed (FLUSHDB deletes every key in the current database)
docker exec heretek-redis redis-cli FLUSHDB
# 3. Limit container memory
docker update --memory=2g heretek-<service-name>
# 4. Check for memory leaks in application logs (2>&1 includes the container's stderr)
docker logs heretek-<service-name> 2>&1 | grep -i "memory\|leak"
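To spot the heavy containers without reading the whole `docker stats` table, the output can be filtered against a threshold. This is a minimal sketch: `over_threshold` is a hypothetical helper name, and the 80% figure in the usage line is an arbitrary example rather than a recommended limit.

```shell
# Sketch (hypothetical helper): read "name percent%" lines on stdin and
# print only those at or above the given percentage.
over_threshold() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%$/, "", pct)            # strip the trailing percent sign
    if (pct + 0 >= limit + 0) print $1, $2
  }'
}

# Usage (requires Docker):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | over_threshold 80
# The same filter works for CPU by swapping {{.MemPerc}} for {{.CPUPerc}}.
```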
Issue: Database Connection Errors
Symptoms: Agents can't connect to database, queries failing
Diagnosis:
# Check PostgreSQL status
docker exec heretek-postgres pg_isready -U heretek
# Check connection count
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SELECT count(*) FROM pg_stat_activity;"
# Check for locks
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SELECT * FROM pg_locks WHERE NOT granted;"
Solutions:
# 1. Restart PostgreSQL
docker restart heretek-postgres
# 2. Check max connections
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SHOW max_connections;"
# 3. Kill idle connections (excluding the current session)
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND pid <> pg_backend_pid();"
# 4. Reclaim dead rows and refresh planner statistics (this does not verify integrity)
docker exec heretek-postgres psql -U heretek -d heretek -c "VACUUM ANALYZE;"
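Terminating every idle connection can also cut off healthy pooled connections that agents are about to reuse. A gentler variant, sketched below, only targets sessions that have been idle longer than a given number of minutes. The helper name `idle_kill_sql` and the 10-minute value in the usage line are assumptions for illustration.

```shell
# Sketch (hypothetical helper): print SQL that terminates sessions idle for
# more than N minutes, never including the session running the query.
idle_kill_sql() {
  local minutes="$1"
  case "$minutes" in
    ''|*[!0-9]*)
      echo "minutes must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '%s minutes' AND pid <> pg_backend_pid();\n" "$minutes"
}

# Usage (requires the running heretek-postgres container):
#   docker exec heretek-postgres psql -U heretek -d heretek -c "$(idle_kill_sql 10)"
```

The integer check keeps arbitrary text out of the generated statement, since the value is interpolated directly into SQL.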
Issue: Redis Connection Errors
Symptoms: Cache misses, pub/sub not working, rate limiting failing
Diagnosis:
# Check Redis status
docker exec heretek-redis redis-cli ping
# Check memory usage
docker exec heretek-redis redis-cli info memory
# Check connected clients
docker exec heretek-redis redis-cli client list
Solutions:
# 1. Restart Redis
docker restart heretek-redis
# 2. Ask the allocator to release unused memory (this does not delete keys)
docker exec heretek-redis redis-cli MEMORY PURGE
# 3. Check maxmemory setting
docker exec heretek-redis redis-cli CONFIG GET maxmemory
# 4. Get a memory diagnostic report (this does not evict keys)
docker exec heretek-redis redis-cli MEMORY DOCTOR
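To judge whether Redis is actually near its limit, the `used_memory` and `maxmemory` fields from `INFO memory` can be compared directly. The following is a sketch only; `redis_mem_pct` is a hypothetical helper, and it assumes both fields appear in the `INFO memory` output, as they do in current Redis releases.

```shell
# Sketch (hypothetical helper): read `redis-cli info memory` output on stdin
# and report how much of maxmemory is in use.
redis_mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }   # exact match skips used_memory_human etc.
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "no maxmemory limit set"; exit }
      printf "%.1f%% of maxmemory used\n", used * 100 / max
    }'
}

# Usage (requires the running heretek-redis container):
#   docker exec heretek-redis redis-cli info memory | redis_mem_pct
```

A `maxmemory` of 0 means Redis has no configured limit, in which case memory pressure has to be judged against the container or host limit instead.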
Issue: LiteLLM Gateway Errors
Symptoms: API returning errors, A2A not working, models unavailable
Diagnosis:
# Check health
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/health
# Check models
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/v1/models
# Check agents
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/v1/agents
# Check logs
docker logs heretek-litellm --tail 100
Solutions:
# 1. Restart LiteLLM
docker restart heretek-litellm
# 2. Check configuration
docker exec heretek-litellm cat /app/config.yaml
# 3. Check database connectivity
docker exec heretek-litellm python3 -c \
"import psycopg2; psycopg2.connect('postgresql://heretek:heretek@postgres:5432/heretek')"
# 4. Rebuild if needed
docker compose build litellm
docker compose up -d litellm
Issue: WebSocket Connection Failing
Symptoms: Dashboard not updating in real-time, no live data
Diagnosis:
# Check WebSocket bridge
docker ps | grep websocket-bridge
# Check HTTP health
curl -f http://localhost:3002/health
# Test WebSocket connection
wscat -c ws://localhost:3003
# Check logs
docker logs heretek-websocket-bridge --tail 100
Solutions:
# 1. Restart WebSocket bridge
docker restart heretek-websocket-bridge
# 2. Check Redis connectivity
docker exec heretek-websocket-bridge redis-cli -h redis ping
# 3. Check port binding
netstat -tlnp | grep 3003
# 4. Rebuild if needed
docker compose build websocket-bridge
docker compose up -d websocket-bridge
Issue: Dashboard Not Loading
Symptoms: Port 3000/7000 not responding, blank page, connection error
Diagnosis:
# Check web container
docker ps | grep heretek-web
# Check health
curl -f http://localhost:3000/api/agents
# Check logs
docker logs heretek-web --tail 100
# Check port binding
netstat -tlnp | grep 3000
Solutions:
# 1. Restart web service
docker restart heretek-web
# 2. Check backend connectivity
docker exec heretek-web curl -f http://litellm:4000/health
# 3. Rebuild frontend
docker compose build web
docker compose up -d web
# 4. Clear browser cache (client-side)
Issue: A2A Protocol Not Working
Symptoms: Agents not communicating, messages not delivered
Diagnosis:
# Check registered agents
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
http://localhost:4000/v1/agents | jq .
# Check agent health
for port in 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010 8011; do
echo -n "Port $port: "
curl -sf http://localhost:$port/health && echo "OK" || echo "FAILED"
done
# Check LiteLLM logs (2>&1 includes the container's stderr)
docker logs heretek-litellm --tail 100 2>&1 | grep -i "a2a\|agent"
Solutions:
# 1. Restart LiteLLM gateway
docker restart heretek-litellm
# 2. Re-register agents
for agent in steward alpha beta charlie examiner explorer sentinel coder dreamer empath historian; do
docker restart heretek-$agent
sleep 2
done
# 3. Check Redis pub/sub
docker exec heretek-redis redis-cli PUBSUB CHANNELS "*"
# 4. Verify agent configuration
docker exec heretek-steward cat /app/agent/openclaw.json
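The fixed `sleep 2` in the restart loop does not confirm that an agent came back. One option is to poll its health endpoint until it answers. This sketch assumes a hypothetical helper named `wait_for_health`; the `PROBE` override exists only so the function can be exercised without a live service, and the retry counts are arbitrary defaults.

```shell
# Sketch (hypothetical helper): poll a health URL until it responds or the
# attempts run out. PROBE defaults to a quiet curl that fails on HTTP errors.
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if ${PROBE:-curl -sf -o /dev/null} "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Usage (requires Docker; replace the port with the restarted agent's port):
#   docker restart heretek-steward && wait_for_health http://localhost:8001/health
```

Restarting agents one at a time and waiting for each to report healthy makes it easier to see which agent fails to re-register.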
Issue: Vector Operations Failing
Symptoms: RAG not working, embedding errors, similarity search failing
Diagnosis:
# Check pgvector extension
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SELECT extname, extversion FROM pg_extension WHERE extname='vector';"
# Test vector operation
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector;"
# Check vector tables
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SELECT tablename FROM pg_tables WHERE schemaname='public' ORDER BY tablename;"
Solutions:
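# 1. Reinstall pgvector extension
# WARNING: CASCADE also drops every column and index that uses the vector type; back up first
# (-i is required so the heredoc reaches psql on stdin)
docker exec -i heretek-postgres psql -U heretek -d heretek << 'EOF'
DROP EXTENSION IF EXISTS vector CASCADE;
CREATE EXTENSION vector;
EOF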
# 2. Restart PostgreSQL
docker restart heretek-postgres
# 3. Rebuild indexes (REINDEX DATABASE rebuilds every index in the database, not only vector indexes)
docker exec heretek-postgres psql -U heretek -d heretek -c \
"REINDEX DATABASE heretek;"
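If only the vector indexes need rebuilding, they can be selected by access method, since pgvector indexes use `hnsw` or `ivfflat`. The sketch below converts a list of index names into `REINDEX INDEX` statements. The helper name `reindex_statements` is hypothetical, and it assumes index names live in the `public` schema and contain no double quotes.

```shell
# Sketch (hypothetical helper): read index names on stdin, one per line,
# and print a REINDEX statement for each non-empty name.
reindex_statements() {
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf 'REINDEX INDEX "%s";\n' "$name"
  done
}

# Usage (requires the running heretek-postgres container):
#   docker exec heretek-postgres psql -U heretek -d heretek -At -c \
#     "SELECT indexname FROM pg_indexes WHERE schemaname = 'public'
#      AND (indexdef ILIKE '%USING hnsw%' OR indexdef ILIKE '%USING ivfflat%');" \
#     | reindex_statements \
#     | docker exec -i heretek-postgres psql -U heretek -d heretek
```

Reviewing the generated statements before piping them into `psql` is a cheap safeguard, because a plain `REINDEX` blocks writes to the affected table while it runs.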
Issue: Ollama/GPU Not Working
Symptoms: Local LLM failing, GPU not accessible, embedding generation errors
Diagnosis:
# Check Ollama status
curl -f http://localhost:11434/
# Check GPU access
docker exec heretek-ollama rocm-smi
# Check models
docker exec heretek-ollama ollama list
# Check logs
docker logs heretek-ollama --tail 100
Solutions:
# 1. Restart Ollama
docker restart heretek-ollama
# 2. Pull required models
docker exec heretek-ollama ollama pull nomic-embed-text-v2-moe
# 3. Check GPU device access
ls -la /dev/kfd /dev/dri
# 4. Verify ROCm configuration
docker exec heretek-ollama cat /etc/os-release
docker exec heretek-ollama rocminfo
Log Analysis
Common Error Patterns
# Search for errors across all containers
for container in $(docker ps --filter "name=heretek-" --format "{{.Names}}"); do
echo "=== $container ==="
docker logs "$container" --tail 50 2>&1 | grep -i error | head -5
done
# Find stack traces (2>&1 includes the container's stderr, where most errors are written)
docker logs heretek-<service> 2>&1 | grep -A 10 "Traceback\|Exception\|Error:"
# Find connection issues
docker logs heretek-<service> 2>&1 | grep -i "connection\|timeout\|refused"
# Find memory issues
docker logs heretek-<service> 2>&1 | grep -i "memory\|oom\|heap"
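When a service logs the same error repeatedly, counting identical lines shows the dominant failure faster than scrolling. The following is a rough sketch; `top_errors` is a hypothetical helper, and it only groups lines that are byte-for-byte identical, so logs with per-line timestamps would need those stripped first.

```shell
# Sketch (hypothetical helper): read log lines on stdin and print the N most
# frequent lines that mention "error" (case-insensitive), with their counts.
top_errors() {
  grep -i 'error' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Usage (requires Docker; substitute the real service name):
#   docker logs heretek-litellm --tail 1000 2>&1 | top_errors 5
```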
Real-time Log Monitoring
# Monitor all logs
docker compose logs -f
# Monitor specific service
docker logs -f heretek-<service-name>
# Monitor with grep
docker logs -f heretek-<service-name> 2>&1 | grep -i error
Performance Tuning
Database Optimization
# Analyze tables
docker exec heretek-postgres psql -U heretek -d heretek -c "ANALYZE;"
# Vacuum tables
docker exec heretek-postgres psql -U heretek -d heretek -c "VACUUM;"
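# Check slow queries (requires the pg_stat_statements extension; on PostgreSQL 13+ use total_exec_time instead of total_time)
docker exec heretek-postgres psql -U heretek -d heretek -c \
"SELECT query, calls, total_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"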
Redis Optimization
# Check memory policy
docker exec heretek-redis redis-cli CONFIG GET maxmemory-policy
# Set appropriate eviction policy
docker exec heretek-redis redis-cli CONFIG SET maxmemory-policy allkeys-lru
# Check slow log
docker exec heretek-redis redis-cli SLOWLOG GET 10
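For the slow-query check under Database Optimization, the timing column in `pg_stat_statements` depends on the server version: PostgreSQL 13 renamed `total_time` to `total_exec_time`. The sketch below picks the column from `server_version_num`. The helper name `stat_time_column` is an assumption for illustration.

```shell
# Sketch (hypothetical helper): choose the pg_stat_statements timing column
# from the numeric server version (for example 150004 for PostgreSQL 15.4).
stat_time_column() {
  if [ "$1" -ge 130000 ]; then
    echo total_exec_time
  else
    echo total_time
  fi
}

# Usage (requires the running heretek-postgres container):
#   ver="$(docker exec heretek-postgres psql -U heretek -d heretek -At -c 'SHOW server_version_num;')"
#   col="$(stat_time_column "$ver")"
#   docker exec heretek-postgres psql -U heretek -d heretek -c \
#     "SELECT query, calls, $col FROM pg_stat_statements ORDER BY $col DESC LIMIT 10;"
```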
Escalation
If issues persist after troubleshooting:
- Gather Information
  - All relevant logs
  - System state (docker inspect, docker ps)
  - Resource usage (docker stats)
  - Timeline of events
- Document Attempts
  - List all troubleshooting steps attempted
  - Note any changes observed
  - Record error messages
- Contact Engineering Team
  - Provide gathered information
  - Include documentation of attempts
  - Note system impact
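The information-gathering step can be scripted so that every escalation carries the same evidence. This is a minimal sketch, not an existing OpenClaw script: `collect_diagnostics`, the output file names, and the 500-line log tail are all assumptions, and the `DOCKER` override exists only so the function can be tried against a stub.

```shell
# Sketch (hypothetical helper): write container list, resource usage, and
# per-container logs and inspect output into a directory for escalation.
collect_diagnostics() {
  local out_dir="$1"
  local docker_cmd="${DOCKER:-docker}"
  local c
  mkdir -p "$out_dir" || return 1
  $docker_cmd ps -a --filter "name=heretek-" > "$out_dir/ps.txt" 2>&1
  $docker_cmd stats --no-stream > "$out_dir/stats.txt" 2>&1
  for c in $($docker_cmd ps -a --filter "name=heretek-" --format '{{.Names}}'); do
    $docker_cmd logs --tail 500 "$c" > "$out_dir/$c.log" 2>&1
    $docker_cmd inspect "$c" > "$out_dir/$c.inspect.json" 2>&1
  done
  # Record when the snapshot was taken, in UTC, for the incident timeline.
  date -u +"%Y-%m-%dT%H:%M:%SZ" > "$out_dir/collected_at.txt"
  echo "$out_dir"
}

# Usage (requires Docker):
#   collect_diagnostics "./diagnostics-$(date +%Y%m%d-%H%M%S)"
```

Logs and `docker inspect` output can contain API keys and passwords from environment variables, so the bundle should be reviewed before it is shared.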