clawdinators/docs/DEPLOYMENT_MODEL.md
joshp123 280744ce0c infra: slim clawdinators aws footprint
What:
- bound CLAWDINATOR image artifact retention with S3 lifecycle, AMI pruning, and import provenance tags
- reduce the AWS fleet to Babelfish-only and make GitHub credentials opt-in per host
- disable the AMI build, nix-openclaw bump, and release workflows by moving them out of .github/workflows/
- update operator docs for the new explicit build and deploy model

Why:
- stop unbounded S3 and snapshot growth from image builds
- remove unattended resurrection paths and shut down the unused t3.large instances
- keep the remaining Babelfish host running without GitHub App credentials or sync timers

Tests:
- `nix shell nixpkgs#shellcheck nixpkgs#shfmt -c bash scripts/lint-shell.sh` (pass)
- `nix build .#nixosConfigurations.clawdinator-babelfish.config.system.build.toplevel .#nixosConfigurations.clawdinator-1.config.system.build.toplevel .#nixosConfigurations.clawdinator-2.config.system.build.toplevel` (pass)
- `AWS_PROFILE=homelab-admin TF_VAR_aws_region=eu-central-1 TF_VAR_ami_id=ami-0a9abe17feeee0079 TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)" nix shell nixpkgs#opentofu -c sh -lc 'tofu fmt -check && tofu validate'` (pass)
- live AWS apply: destroyed `clawdinator-1` and `clawdinator-2`, replaced Babelfish, and verified only `Fleet Deploy` remains active in GitHub Actions
2026-04-03 15:38:57 +02:00


Deployment model (fast + declarative)

This repo uses a two-lane delivery model:

  • Lane A: Base AMI (slow path, rare)

    • Purpose: reliable boot substrate (Nix + systemd + networking + EFS + SSM + bootstrap services).
    • Built by: explicit operator flow. The old .github/workflows/image-build.yml workflow has been moved to .github/workflows-disabled/ and is intentionally disabled.
    • Tradeoff: EC2 VM Import is slow/variable; do not run per-commit.
  • Lane B: Release + Fleet switch (fast path, manual)

    • Purpose: ship config/app changes quickly while staying reproducible.
    • Built by: explicit operator flow. The old .github/workflows/release.yml workflow has been moved to .github/workflows-disabled/ and is intentionally disabled.
    • Steps:
      1. Fail-fast eval of NixOS configs.
      2. Upload bootstrap bundles to S3 (repo seeds, workspace, secrets references).
      3. Deploy via SSM: nixos-rebuild switch --flake github:openclaw/clawdinators/<rev>#<host>.
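The three steps above can be sketched as a small dry-run helper. This is an illustration under assumptions, not the repo's actual operator flow: the function names, the `BUNDLE`, `BUCKET`, and `INSTANCE_ID` variables, and the S3 key layout are hypothetical placeholders. It only prints the commands so they can be reviewed before anything is run.

```shell
# Build the pinned flake reference used in step 3.
flake_ref() {
  local rev="$1" host="$2"
  printf 'github:openclaw/clawdinators/%s#%s\n' "$rev" "$host"
}

# Print the three commands for one host without running them.
# BUNDLE, BUCKET and INSTANCE_ID are hypothetical placeholders, not names from this repo.
plan_release() {
  local rev="$1" host="$2"
  local bundle="${BUNDLE:-bootstrap.tar.gz}"
  local bucket="${BUCKET:-example-bucket}"
  local instance="${INSTANCE_ID:-i-0123456789abcdef0}"
  local ref
  ref="$(flake_ref "$rev" "$host")"

  # 1. Fail-fast eval: evaluating drvPath surfaces config errors before any upload.
  echo "nix eval --raw .#nixosConfigurations.${host}.config.system.build.toplevel.drvPath"
  # 2. Upload the bootstrap bundle to S3.
  echo "aws s3 cp ${bundle} s3://${bucket}/${host}/${bundle}"
  # 3. Deploy via SSM by running nixos-rebuild on the instance.
  echo "aws ssm send-command --document-name AWS-RunShellScript --instance-ids ${instance} --parameters 'commands=[\"nixos-rebuild switch --flake ${ref}\"]'"
}
```

For example, `plan_release "$(git rev-parse HEAD)" clawdinator-babelfish` prints the eval, upload, and deploy commands for the current commit. Pinning the flake reference to a full SHA is what keeps the deploy reproducible.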

Primitives

  • Source of truth: git SHA + flake.lock.
  • Artifact: NixOS system closure for each host config.
  • Distribution: Nix substituters + S3 bootstrap bundle.
  • Activation: nixos-rebuild switch.
  • Rollout: canary order (clawdinator-1 then clawdinator-2).
  • Rollback: redeploy an older git SHA.
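The rollout and rollback primitives can be sketched as one loop. This is a minimal sketch, assuming a hypothetical `DEPLOY_CMD` hook: any command that takes `<rev> <host>` and exits non-zero on failure. It defaults to `true` so the loop can be exercised without touching AWS. Hosts are visited in the order given, and the loop halts at the first failure so later hosts keep their current release.

```shell
# Deploy one revision to hosts in canary order, stopping at the first failure.
# DEPLOY_CMD is a hypothetical hook, not something defined by this repo.
rollout() {
  local rev="$1"
  shift
  local host
  for host in "$@"; do
    echo "deploying ${rev} to ${host}"
    if ! "${DEPLOY_CMD:-true}" "$rev" "$host"; then
      echo "deploy failed on ${host}; halting rollout" >&2
      return 1
    fi
  done
}
```

A rollback is the same call with an older SHA, for example `rollout <older-sha> clawdinator-1 clawdinator-2`. No separate rollback mechanism is needed because every deploy is pinned to a revision.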

Tradeoffs

  • Pros:

    • Fast deploys (minutes) vs AMI import (tens of minutes).
    • Cattle-friendly: hosts stay disposable; state lives on EFS.
    • Reproducible: deploys are pinned to a git SHA.
  • Cons:

    • nixos-rebuild switch restarts services; expect brief bot downtime per release.
    • Requires AWS SSM permissions for the CI user (see infra/opentofu/aws/main.tf).
    • If the Nix substituters miss, the closure has to be built from source and deploys are slower (still typically faster than AMI import).

Infra requirement: CI SSM permissions

The old release.yml workflow used aws ssm send-command; that path is intentionally disabled now.

After pulling these changes, run tofu apply in infra/opentofu/aws with admin credentials so the CI IAM policy includes the FleetDeploySSM statement.
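The apply step can be guarded with a small pre-check. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the FleetDeploySSM statement is defined in main.tf. It also assumes it is run from infra/opentofu/aws with admin credentials and the required TF_VAR_* variables exported (the commit's validation command sets TF_VAR_aws_region, TF_VAR_ami_id, and TF_VAR_ssh_public_key).

```shell
# Return 0 if the given file mentions the FleetDeploySSM statement.
has_fleet_deploy_statement() {
  grep -q 'FleetDeploySSM' "$1"
}

# Refuse to apply unless the statement is present in the checked-out config.
# Assumes the working directory is infra/opentofu/aws and credentials are set.
apply_ci_policy() {
  local tf_file="${1:-main.tf}"
  if ! has_fleet_deploy_statement "$tf_file"; then
    echo "FleetDeploySSM not found in ${tf_file}; pull the latest changes first" >&2
    return 1
  fi
  # Save the plan so that exactly what was reviewed is what gets applied.
  tofu plan -out=tfplan && tofu apply tfplan
}
```

Running `apply_ci_policy` from a stale checkout fails before any AWS call is made, which avoids applying a policy that lacks the SSM statement.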