- Add HERETEK_README.md: Comprehensive guide with architecture, configuration, and troubleshooting - Add RUNBOOK_A2A_DEBUG.md: A2A protocol debugging procedures - Add RUNBOOK_AGENT_OFFLINE.md: Agent offline detection and recovery - Add RUNBOOK_GATEWAY_FAILURE.md: Gateway failure recovery procedures - Add RUNBOOK_APPROVAL_ISSUES.md: Liberation plugin and approval bypass troubleshooting - Add DEPLOYMENT_HERETEK.md: Docker Compose, Kubernetes, and bare metal deployment guides - Add MIGRATION_FROM_UPSTREAM.md: Upstream to Heretek migration guide - Update docs/operations/README.md: Add Heretek-specific runbooks and quick reference table All runbooks include: - Clear step-by-step procedures - CLI commands with expected output - Escalation paths - References to source files
9.3 KiB
Runbook: A2A Protocol Debugging
Document ID: RUNBOOK-A2A-001
Version: 1.0.0
Last Updated: 2026-04-01
Severity Levels: Medium (degraded) to High (complete failure)
Overview
This runbook provides systematic procedures for diagnosing and resolving Agent-to-Agent (A2A) communication failures in the Heretek OpenClaw system.
A2A Architecture Summary
Agent → Gateway (ws:18789) → LiteLLM (http:4000) → Target Agent
↓
Redis Pub/Sub (6379)
Key Components:
- OpenClaw Gateway - WebSocket RPC router at
openclaw-gateway.js - Agent Client - A2A client library at
agent-client.js - LiteLLM - LLM proxy with agent registry
- Redis - Pub/sub messaging backbone
Symptoms of A2A Failures
Complete Failure
- Agents cannot communicate with each other
- Messages show "delivery failed" or timeout
- Gateway
/agent-statusshows all agents offline - LiteLLM
/v1/agentsreturns empty list
Partial/Degraded Operation
- Some agents can communicate, others cannot
- Intermittent message delivery
- High latency in agent responses (>5 seconds)
- Heartbeat gaps (>60 seconds between pings)
Silent Failures
- Messages appear sent but never received
- Agent shows online but doesn't respond
- Triad voting stalls indefinitely
- Discovery requests return no results
Diagnostic Steps
Step 1: Verify Gateway Status
# Check Gateway is running
curl -f http://localhost:18789/health
# Expected output:
# {"status":"ok","timestamp":"2026-04-01T12:00:00.000Z","connectedAgents":11}
# Check agent registration status
curl -s http://localhost:18789/agent-status | jq .
# Expected output:
# {
# "timestamp": "2026-04-01T12:00:00.000Z",
# "totalAgents": 11,
# "onlineCount": 11,
# "offlineCount": 0,
# "agents": [...]
# }
If Gateway is not responding:
- Proceed to
RUNBOOK_GATEWAY_FAILURE.md
Step 2: Verify LiteLLM Agent Registry
# Check registered agents in LiteLLM
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
http://localhost:4000/v1/agents | jq .
# Expected: List of all 11 agents with their models
# Check LiteLLM health
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
http://localhost:4000/health | jq .
If agents not registered:
- Agents may not have completed handshake with LiteLLM
- Check agent logs for registration errors
Step 3: Check Individual Agent Health
# Quick health check for all agents
for port in 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010 8011; do
echo -n "Port $port: "
curl -sf http://localhost:$port/health && echo "OK" || echo "FAILED"
done
Expected output:
Port 8001: OK # Steward
Port 8002: OK # Alpha
Port 8003: OK # Beta
Port 8004: OK # Charlie
Port 8005: OK # Examiner
Port 8006: OK # Explorer
Port 8007: OK # Sentinel
Port 8008: OK # Coder
Port 8009: OK # Dreamer
Port 8010: OK # Empath
Port 8011: OK # Historian
Step 4: Verify WebSocket Connectivity
# Install wscat if not available
npm install -g wscat
# Test WebSocket connection to Gateway
wscat -c ws://localhost:18789
# Expected: Connection opens, send handshake:
# {"type":"handshake","agent":"test-client"}
# Should receive:
# {"type":"handshake-ack","accepted":true}
Step 5: Check Redis Pub/Sub
# Connect to Redis CLI
docker exec heretek-redis redis-cli
# In Redis CLI:
# List all channels
PUBSUB CHANNELS "*"
# Check for A2A channels (should see agent-related channels)
# Test pub/sub connectivity
PUBLISH test-channel "test-message"
# Exit Redis CLI
EXIT
Step 6: Analyze Gateway Logs
# Check Gateway logs for A2A errors
docker logs heretek-gateway --tail 200 | grep -i "a2a\|agent\|error\|websocket"
# Look for specific patterns:
docker logs heretek-gateway --tail 500 | grep -E "Failed|Error|Timeout|Disconnected"
Key error patterns to look for:
Failed to register agent- Registration issueWebSocket connection closed- Connectivity issueMessage delivery failed- Routing issueHeartbeat timeout- Agent not responding
Step 7: Analyze Agent Logs
# Check specific agent logs (example: Steward)
docker logs heretek-steward --tail 200 | grep -i "gateway\|a2a\|connect"
# Check for connection errors
docker logs heretek-steward --tail 500 | grep -E "Failed|Error|Timeout|Reconnect"
Recovery Procedures
Procedure 1: Restart A2A Infrastructure
Use when: Complete A2A failure, multiple agents offline
cd /root/heretek/heretek-openclaw-core
# 1. Restart Redis (pub/sub backbone)
docker restart heretek-redis
sleep 5
# 2. Restart LiteLLM (agent registry)
docker restart heretek-litellm
sleep 10
# 3. Restart Gateway (A2A router)
docker restart heretek-gateway
sleep 5
# 4. Rolling restart of all agents
for agent in steward alpha beta charlie examiner explorer sentinel coder dreamer empath historian; do
echo "Restarting $agent..."
docker restart heretek-$agent
sleep 3
done
# 5. Verify recovery
sleep 15
curl -s http://localhost:18789/agent-status | jq '.onlineCount'
Expected result: onlineCount should be 11
Procedure 2: Re-register Agents
Use when: Agents show online but can't communicate
# Force agent re-registration by restarting Gateway connection
for agent in steward alpha beta charlie examiner explorer sentinel coder dreamer empath historian; do
docker exec heretek-$agent pkill -f "agent-client" || true
sleep 2
done
# Agents should auto-reconnect within 30 seconds
# Verify registration
sleep 35
curl -s http://localhost:18789/agent-status | jq '.agents[] | {agentId, status}'
Procedure 3: Clear Redis State
Use when: Stale agent data in Redis causing routing issues
# WARNING: This clears all Redis data
# Backup first if needed
docker exec heretek-redis redis-cli BGSAVE
# Clear all keys
docker exec heretek-redis redis-cli FLUSHALL
# Restart all services to re-populate
docker restart heretek-litellm heretek-gateway
# Wait for agents to re-register
sleep 30
# Verify
curl -s http://localhost:18789/agent-status | jq '.onlineCount'
Procedure 4: Debug Single Agent Communication
Use when: One specific agent has issues
AGENT_NAME="steward" # Change to affected agent
AGENT_PORT="8001" # Corresponding port
# 1. Check agent container status
docker ps | grep heretek-$AGENT_NAME
# 2. Check agent logs for errors
docker logs heretek-$AGENT_NAME --tail 100
# 3. Test agent health endpoint
curl -v http://localhost:$AGENT_PORT/health
# 4. Check agent's Gateway connection
docker exec heretek-$AGENT_NAME cat /proc/$(pidof node)/fd | wc -l
# 5. Force restart this agent only
docker restart heretek-$AGENT_NAME
# 6. Monitor reconnection
watch -n 2 "curl -s http://localhost:18789/agent-status/$AGENT_NAME | jq .status"
Prevention Measures
Monitoring Setup
# Add to crontab for continuous monitoring
# Check A2A health every 5 minutes
*/5 * * * * curl -sf http://localhost:18789/agent-status | jq -e '.onlineCount >= 10' || \
echo "A2A DEGRADED: $(date)" >> /root/.openclaw/logs/a2a-alerts.log
Heartbeat Configuration
Ensure heartbeat is enabled in agent-client.js:
// Default heartbeat interval: 30 seconds
this.heartbeatInterval = config.heartbeatInterval || 30000;
Gateway Health Endpoint
Monitor the Gateway's /agent-status endpoint:
# Continuous monitoring
watch -n 5 'curl -s http://localhost:18789/agent-status | jq "{online: .onlineCount, offline: .offlineCount}"'
Log Aggregation
Set up log collection for A2A debugging:
# Create A2A log aggregation script
cat > /root/heretek/scripts/a2a-logs.sh << 'EOF'
#!/bin/bash
echo "=== Gateway Logs ==="
docker logs heretek-gateway --tail 50 2>&1 | grep -i "a2a\|agent"
echo ""
echo "=== LiteLLM Logs ==="
docker logs heretek-litellm --tail 50 2>&1 | grep -i "agent\|register"
EOF
chmod +x /root/heretek/scripts/a2a-logs.sh
Escalation Path
Level 1: On-Call Engineer
- When: Initial diagnosis and recovery procedures fail
- Actions: Execute all diagnostic steps, document findings
- Contact: [Configure your on-call contact]
Level 2: Engineering Lead
- When: Recovery procedures don't restore A2A within 30 minutes
- Actions: Review logs, consider rollback to last known good state
- Contact: [Configure engineering lead contact]
Level 3: System Architect
- When: Complete A2A failure with data loss or corruption
- Actions: Full system audit, potential rollback to backup
- Contact: [Configure architect contact]
Reference
| Document | Purpose |
|---|---|
RUNBOOK_GATEWAY_FAILURE.md |
Gateway-specific failures |
RUNBOOK_AGENT_OFFLINE.md |
Individual agent offline |
AGENT-MONITORING.md |
Monitoring overview |
agent-client.js |
A2A client implementation |
openclaw-gateway.js |
Gateway implementation |
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-04-01 | Initial version |