mirror of
https://github.com/openclaw/clawdinators.git
synced 2026-07-01 20:24:02 -04:00
e869c7b5a7
- drop AWS SDK from control api - fetch status via AWS CLI in fleet control - update control plane docs
189 lines
7.4 KiB
Markdown
189 lines
7.4 KiB
Markdown
# Control Plane
|
||
|
||
Goal: manage CLAWDINATOR host lifecycle (create/recreate/replace) from **CLAWDINATOR chat** (Telegram/Discord) using an out‑of‑band control API. CLAWDINATOR agents can edit IaC, but **deploys run OOB** with no AWS creds inside agents.
|
||
|
||
## Goals
|
||
- **Plane‑safe control** from CLAWDINATOR chat (chat‑only).
|
||
- OOB execution (no CLAWDINATOR agent has infra creds).
|
||
- Repo is the source of truth for fleet state.
|
||
- Static fleet (Discord token pool constraint).
|
||
- Simple, auditable deploy flow.
|
||
|
||
## Non‑Goals
|
||
- Task routing, agent scheduling, or tool execution.
|
||
- Elastic scaling (no arbitrary cattle instances).
|
||
- Runtime config changes (agents handle their own work).
|
||
|
||
## Constraints
|
||
- Each CLAWDINATOR instance requires a unique Discord bot token.
|
||
- Fleet size == token pool size (static list).
|
||
- Persistent changes must land in repo + AMI.
|
||
- Infra state must be out‑of‑band and locked.
|
||
|
||
## Control Plane Components (KISS)
|
||
- **Control API (AWS Lambda)**
|
||
- Authenticated by a shared control token.
|
||
- Dispatches GitHub Actions workflows (deploy only).
|
||
- **Fleet status**
|
||
- Fetched locally via AWS CLI using control invoker credentials.
|
||
- **Fleet Control Skill** (runs inside CLAWDINATOR)
|
||
- Calls the Control API via `scripts/fleet-control.sh` (AWS IAM invoke).
|
||
- Enforces policy (no self‑deploy) before calling.
|
||
- **GitHub Actions** (execution)
|
||
- Runs OpenTofu apply.
|
||
- **OpenTofu** (infra state)
|
||
- Remote state in S3 + Dynamo lock table.
|
||
- **Instance Registry** (desired state)
|
||
- `nix/instances.json` (authoritative map).
|
||
- **Bootstrap + Secrets**
|
||
- S3 bootstrap prefix per instance.
|
||
- Agenix secrets per instance token.
|
||
|
||
## Control API Auth
|
||
- Shared control token stored as `clawdinator-control-token.age`.
|
||
- Control API is invoked via AWS IAM using a **minimal invoker key**:
|
||
- `clawdinator-control-aws-access-key-id.age`
|
||
- `clawdinator-control-aws-secret-access-key.age`
|
||
- Token is injected into instances via bootstrap and read from `/run/agenix/clawdinator-control-token`.
|
||
|
||
## Control API Env (Lambda)
|
||
- `CONTROL_API_TOKEN`
|
||
- `GITHUB_TOKEN`
|
||
- `GITHUB_REPO` (default `openclaw/clawdinators`)
|
||
- `GITHUB_WORKFLOW` (default `fleet-deploy.yml`)
|
||
- `GITHUB_REF` (default `main`)
|
||
|
||
## Desired State (Fleet Registry)
|
||
`nix/instances.json` is the fleet map (single source of truth for infra + host configs).
|
||
|
||
Example:
|
||
```json
|
||
{
|
||
"clawdinator-1": {
|
||
"host": "clawdinator-1",
|
||
"instanceType": "t3.large",
|
||
"bootstrapPrefix": "bootstrap/clawdinator-1",
|
||
"discordTokenSecret": "clawdinator-discord-token-1"
|
||
},
|
||
"clawdinator-2": {
|
||
"host": "clawdinator-2",
|
||
"instanceType": "t3.large",
|
||
"bootstrapPrefix": "bootstrap/clawdinator-2",
|
||
"discordTokenSecret": "clawdinator-discord-token-2"
|
||
}
|
||
}
|
||
```
|
||
|
||
## Command Semantics (Minimal)
|
||
### `/fleet deploy <target>`
|
||
- **Target required** (no implicit default): `all` or `<id>`.
|
||
- Always runs `tofu apply`.
|
||
- `all`: replace all instances using **latest successful AMI**.
|
||
- `<id>`: replace only that instance using latest successful AMI.
|
||
- Also creates new instances if present in desired state.
|
||
|
||
### `/fleet status`
|
||
- Returns live fleet status via AWS CLI (EC2 describe by tag).
|
||
|
||
## Access Control (Policy)
|
||
- Shared control token authorizes calls to the Control API.
|
||
- Policy enforced by the fleet-control skill:
|
||
- Humans: deploy any target (including `all`).
|
||
- Bots: deploy **only the other instance** (no self‑deploy).
|
||
- Control API also rejects `target == caller` when `caller` is provided.
|
||
|
||
## Lifecycle Flows
|
||
### Add a new instance (static token pool)
|
||
1) Create Discord bot token → `clawdinator-discord-token-2.age`.
|
||
2) Add entry to `nix/instances.json`.
|
||
3) Add host file `nix/hosts/clawdinator-2.nix`.
|
||
4) Run `/fleet deploy all` or `/fleet deploy clawdinator-2`.
|
||
5) Host boots, pulls its bootstrap prefix, starts CLAWDINATOR.
|
||
|
||
### Recreate a single instance
|
||
- `/fleet deploy clawdinator-2` (forces replace for that host).
|
||
|
||
### Roll the fleet
|
||
- `/fleet deploy all` replaces every host with latest AMI.
|
||
|
||
## Self‑Recycle (Out‑of‑Band)
|
||
- Agents call the Control API (no AWS creds) via the fleet-control skill.
|
||
- Control API dispatches GitHub Actions; AWS creds live in CI only.
|
||
|
||
## State + Audit
|
||
- **Desired state**: Git repo (`nix/instances.json`).
|
||
- **Actual state**: OpenTofu S3 backend.
|
||
- **Audit trail**: Git + Actions logs.
|
||
|
||
## AMI Selection (KISS)
|
||
- Use latest AMI tagged `clawdinator=true`.
|
||
- Optional override via workflow input `ami_override` for rollback.
|
||
|
||
## Deploy Execution (Workflow)
|
||
- Single workflow `fleet-deploy.yml`.
|
||
- Inputs: `target`, `ami_override` (optional).
|
||
- Concurrency group `fleet-deploy` (no overlaps).
|
||
- `target=all` runs `tofu apply` normally.
|
||
- `target=<id>` runs `tofu apply -replace aws_instance.clawdinator["<id>"]` (implementation detail).
|
||
|
||
## Bootstrap (Per‑Instance)
|
||
- Upload per instance:
|
||
- `bootstrap/clawdinator-1`
|
||
- `bootstrap/clawdinator-2`
|
||
- Each bundle contains **only that instance’s** Discord token.
|
||
|
||
## EC2 User-Data (Instance Boot)
|
||
- OpenTofu renders a per-instance user‑data script.
|
||
- Script writes `/etc/clawdinator/bootstrap-prefix`.
|
||
- Script writes `/etc/clawdinator/control-api-url`.
|
||
- Script starts `clawdinator-bootstrap.service` + `clawdinator-repo-seed.service`.
|
||
- Script runs `nixos-rebuild switch --flake /var/lib/clawd/repos/clawdinators#<host>`.
|
||
|
||
## Plane Ops Runbook (Chat‑only)
|
||
### Preflight (before flight)
|
||
1) Control API Lambda exists; URL is written to `/etc/clawdinator/control-api-url`.
|
||
2) Control secrets exist in `nix-secrets` and are in bootstrap bundles:
|
||
- `clawdinator-control-token.age`
|
||
- `clawdinator-control-aws-access-key-id.age`
|
||
- `clawdinator-control-aws-secret-access-key.age`
|
||
3) GitHub Action `fleet-deploy.yml` exists and can be dispatched.
|
||
4) `nix/instances.json` includes all desired instances.
|
||
5) Discord tokens are encrypted in `nix-secrets` and synced to S3 `age-secrets/`.
|
||
6) Latest AMI build succeeded (tagged `clawdinator=true`).
|
||
7) `/fleet status` returns the current fleet.
|
||
|
||
### On the plane
|
||
- `/fleet status` → verify fleet + AMI.
|
||
- `/fleet deploy clawdinator-2` → bring up new host.
|
||
- `/fleet deploy all` → roll the fleet to latest AMI.
|
||
- If rollback needed: rerun deploy with `ami_override` (exact AMI id).
|
||
|
||
## Implementation Checklist (From Design → Works)
|
||
1) Add `nix/instances.json` (clawdinator‑1 + clawdinator‑2).
|
||
2) Add `nix/hosts/clawdinator-2.nix` and wire host configs to read registry values.
|
||
3) Update OpenTofu:
|
||
- multi‑instance `for_each` using `nix/instances.json`.
|
||
- S3 backend + Dynamo lock table.
|
||
- Control API Lambda.
|
||
- Control invoker IAM user (lambda invoke only).
|
||
4) Add control secrets to `nix-secrets` and include in bootstrap bundles:
|
||
- `clawdinator-control-token.age`
|
||
- `clawdinator-control-aws-access-key-id.age`
|
||
- `clawdinator-control-aws-secret-access-key.age`
|
||
5) Add workflow `fleet-deploy.yml`:
|
||
- inputs: `target`, `ami_override` (optional).
|
||
- resolves latest AMI by tag when override not set.
|
||
- runs `tofu apply` (replace when target != all).
|
||
6) Add fleet-control skill + script (`scripts/fleet-control.sh`).
|
||
7) Validate:
|
||
- `/fleet status`
|
||
- `/fleet deploy clawdinator-2`
|
||
- verify new host in AWS + CLAWDINATOR service active.
|
||
|
||
## Decisions
|
||
- Control endpoint: AWS Lambda (Function URL).
|
||
- OpenTofu state: S3 backend + Dynamo lock table.
|
||
- Control auth: shared bearer token (`clawdinator-control-token.age`).
|
||
- Plane ops: CLAWDINATOR chat → fleet-control skill → Control API.
|
||
- Deploy command requires explicit target.
|