Cloud bootstrap — AgentKeys

Audience: the operator standing up a brand-new cloud account to host AgentKeys for the first time, or porting the deployment to a new cloud provider (AliCloud, GCP, Tencent Cloud). Scope: the per-account, run-once provisioning that has to happen before the broker host can come up (§§3–8 of this doc), followed by the per-broker OIDC federation activation (§9), broker host bring-up (§10), and tear-down (§11). Identifiers (DNS names, IAM principals, mail backend, object store, initial bucket policy) + runtime activation in one place. FAQ + troubleshooting: wiki/cloud-setup-faq.md.

After this doc is run, the operator returns here ONLY when:

  • Switching cloud providers (e.g. AWS → AliCloud)
  • Adding a second AWS account (test instance, regional shard)
  • Re-bootstrapping after a teardown
  • Auditing the identity surface (the security-audit checklist in §7)

The day-to-day broker re-deploys live in §10 below (setup-broker-host.sh); they re-run that section without touching §§1–9.

Quick start — five steps to a running stack

Tight five-step flow. Explanation + per-step reasoning are in §1–§11 below; the same flow works for prod (no flag), the CI/test stack (--ci, alias --test, swaps in -test identifiers everywhere + targets the test broker EIP agentkeys-broker-eip-test), or any test-fleet slot N ≥ 2 (--ci --slot N, issue #265 — -test-N identifiers + EIP agentkeys-broker-eip-test-N; see §0.3). The orchestrator scripts/setup-cloud.sh is idempotent — re-running is safe. Prod and every test broker are SEPARATE EC2 machines with SEPARATE EIPs; --ci (+ --slot N) is what keeps them apart, so never mix the flags.

1. Get the EC2 + EIP (manual, ~5 min per stack)

For each stack (prod and test) you stand up SEPARATELY:

  • Launch an EC2 — t3.small minimum (Ubuntu 22.04 LTS recommended). t3.micro runs the OS, but with only 1 GB RAM the build gets OOM-killed while compiling aws-sdk-s3 during setup-broker-host.sh. If you already have a t3.micro you can resize: aws ec2 stop-instances → modify-instance-attribute --instance-type t3.small → start-instances (EIP stays attached, INSTANCE_ID unchanged).
  • Allocate an EIP (or reuse one) and attach it to the EC2.
  • Open SG ports 22 (SSH), 80 (certbot HTTP-01 challenge), 443 (TLS) to 0.0.0.0/0. All three are required — port 80 is needed for Let's Encrypt to validate domain ownership during cert issuance (step 5b), even though steady-state traffic only flows over 443. Verify with aws ec2 describe-security-groups --group-ids <sg-id> --query 'SecurityGroups[].IpPermissions[].[FromPort,IpRanges[].CidrIp]' — you should see all three ports.
  • Generate or import an SSH key pair (the .pem you'll keep as the fallback when EC2 Instance Connect is down). Confirm SSH works: ssh -i your.pem ubuntu@<EIP>.
  • The default ubuntu user is enough for now — the agentkey SSH login user (used by EC2 Instance Connect later) is created automatically by setup-broker-host.sh in step 5, along with the ec2-instance-connect package.
  • Note INSTANCE_ID + EIP — both go into the env files in step 2.
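
The security-group check in the list above can be reduced to a small helper. This is a minimal sketch, not part of the repo: missing_ports is a hypothetical function name, and it only compares port numbers (it does not inspect CIDR ranges).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which required ingress ports are absent.
# Arguments: the ports the security group currently opens.
missing_ports() {
  local required=(22 80 443) open=" $* " port missing=()
  for port in "${required[@]}"; do
    # Pad with spaces so "80" does not match inside "8080".
    [[ "$open" == *" $port "* ]] || missing+=("$port")
  done
  echo "${missing[*]}"
}
```

One way to feed it, assuming SG_ID holds your security group ID: missing_ports $(aws ec2 describe-security-groups --group-ids "$SG_ID" --query 'SecurityGroups[].IpPermissions[].FromPort' --output text). An empty result means all three ports are open.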

2. Fill in the env files (one-time per environment)

Two files per environment: {operator-workstation, broker} × {prod, test, test-2, …, test-N}. The operator-workstation files carry account-wide identifiers; the broker files carry per-machine identifiers (INSTANCE_ID + EIP). The base 2×2 (prod + test) ships in-repo; each additional test-fleet slot adds its operator-workstation.test-N.env + broker.test-N.env pair (copy from the slot-2 files — see §0.3).

Both operator-workstation files are pre-populated with litentry.org / account 429071895007 defaults, and every derived value uses bash ${VAR} substitution off of ACCOUNT_ID / BROKER_HOST / ZONE. The script writes 2 values back automatically — operator never hand-edits them:

  • EIP=… persisted to broker env file by step 4 (after allocate-or-adopt)
  • DATA_ROLE_ARN=… persisted to operator env file by step 11 (after data role create)

| File | Operator edits | What to set |
|---|---|---|
| scripts/operator-workstation.env | None if your account is litentry.org / 429071895007. 5 keys if you're forking: ACCOUNT_ID, BROKER_HOST, ZONE, PARENT_ZONE_ID, MAIL_DOMAIN (the other ~20 keys all derive). | account-wide identifiers |
| scripts/operator-workstation.test.env | None in the same case. Same 5 keys (or just ZONE + PARENT_ZONE_ID) for a fork. | -test variants pre-derived |
| scripts/broker.env | INSTANCE_ID=i-… | EIP is written by the script |
| scripts/broker.test.env | INSTANCE_ID=i-… | EIP is written by the script |

In practice: paste INSTANCE_ID into the two broker env files. Done.
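
To illustrate how a handful of keys can drive every other value through ${VAR} substitution, here is a minimal sketch of a forked env file. The values example.org and 123456789012 are placeholders, the key selection is a subset, and the ARN prefix assumes the standard arn:aws:iam::<account> form; the real file's layout may differ.

```shell
# Hand-edited keys (placeholder values for a hypothetical fork).
ACCOUNT_ID="123456789012"
ZONE="example.org"
BROKER_HOST="broker.${ZONE}"
MAIL_DOMAIN="bots.${ZONE}"

# Derived keys: never edited by hand, they follow the values above.
VAULT_BUCKET="agentkeys-vault-${ACCOUNT_ID}"
MEMORY_BUCKET="agentkeys-memory-${ACCOUNT_ID}"
DATA_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-data-role"
OIDC_PROVIDER_ARN="arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${BROKER_HOST}"
BROKER_EMAIL_FROM_ADDRESS="noreply@${MAIL_DOMAIN}"
```

Because the file is plain bash, sourcing it after changing ZONE or ACCOUNT_ID re-derives every dependent value in one pass.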

3. Run setup-cloud.sh (~3 min, idempotent)

awsp agentkeys-admin

# Prod stack (no env flag):
bash scripts/setup-cloud.sh --yes

# CI/test stack — --ci (alias --test) auto-selects scripts/operator-workstation.test.env
# + scripts/broker.test.env, suffixes IAM identifiers with -test, and targets the
# test broker EIP (tag agentkeys-broker-eip-test):
bash scripts/setup-cloud.sh --ci --yes

# Test-fleet slot N >= 2 (issue #265) — --slot N selects the slot's env files
# (operator-workstation.test-N.env + broker.test-N.env), suffixes every
# identifier with -test-N, and targets EIP tag agentkeys-broker-eip-test-N:
bash scripts/setup-cloud.sh --ci --slot 2 --yes

# Base prod stack (#282 dual-stack, §0.4) — --base selects
# operator-workstation.base.env + broker.base.env, suffixes every identifier
# with -base, and targets EIP tag agentkeys-broker-eip-base:
bash scripts/setup-cloud.sh --base --yes

The orchestrator walks 15 idempotent steps (cloud-side AWS resources + IAM users + per-data-class roles + bucket policies + DNS UPSERTs). Steps 10 (agentkeys-daemon[-test]) and 12 (agentkeys-broker[-test]) print access keys to copy off — they're shown ONCE.

4. Configure local credentials + shell aliases (paste, one-time)

Append the two access-key blocks from step 3 to ~/.aws/credentials:

[agentkeys-daemon-test]
aws_access_key_id     = AKIA…
aws_secret_access_key = …
region                = us-east-1

[agentkeys-broker-test]
aws_access_key_id     = AKIA…
aws_secret_access_key = …
region                = us-east-1

(Drop the -test suffix for the prod variants. Account-owner agentkeys-admin is shared — no -test variant.)
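
A quick way to confirm the paste worked is to look for the profile headers in the credentials file. The sketch below uses two hypothetical helpers (has_profile, missing_profiles) and matches section headers only; it does not validate the keys themselves.

```shell
#!/usr/bin/env bash
# Succeeds when the named profile section exists in a credentials file.
has_profile() {
  local file="$1" name="$2"
  grep -qxF "[${name}]" "$file"
}

# Print every expected profile that is missing from the file.
missing_profiles() {
  local file="$1" name
  shift
  for name in "$@"; do
    has_profile "$file" "$name" || echo "$name"
  done
}
```

For the test stack: missing_profiles ~/.aws/credentials agentkeys-daemon-test agentkeys-broker-test prints nothing once both blocks are in place.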

Add to ~/.zshenv (the syntax works in zsh + bash; bash does not read ~/.zshenv, so bash users put it in ~/.bashrc instead):

export AGENTKEYS_REPO="$HOME/Projects/agentKeys"
alias ssh-agentkeys='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh prod'
alias ssh-agentkeys-test='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh test'
alias ssh-agentkeys-fallback='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh prod --fallback'
alias ssh-agentkeys-test-fallback='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh test --fallback'
# One pair per additional test-fleet slot (§0.3) — e.g. slot 2:
alias ssh-agentkeys-test-2='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh test-2'
alias ssh-agentkeys-test-2-fallback='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh test-2 --fallback'

Then run source ~/.zshenv. The fallback aliases use the .pem key + ubuntu user; the non-fallback ones use EC2 Instance Connect + the agentkey user (which comes online in step 5).

5. SSH in + run setup-broker-host.sh on the EC2

First-time SSH: use the fallback path (the agentkey user doesn't exist yet — setup-broker-host.sh creates it):

ssh-agentkeys-test-fallback   # ssh -i ~/.ssh/your.pem ubuntu@<test EIP>

# On the EC2 (~10-15 min on t3.small):
git clone https://github.com/litentry/agentKeys.git
cd agentKeys

sudo bash scripts/setup-broker-host.sh --ci --yes

Two flags. --ci (alias --test) triggers the -test suffix on every derived hostname / bucket / email; --issuer-url + --account-id auto-derive from ZONE + ACCOUNT_ID in scripts/operator-workstation.env (which the repo clone ships with). Override any flag explicitly if you need a non-conventional name. For prod, drop --ci:

sudo bash scripts/setup-broker-host.sh --yes

What --ci derives automatically:

  • signer-test.${ZONE}, audit-test.${ZONE}, email-test.${ZONE}, cred-test.${ZONE}, memory-test.${ZONE}, config-test.${ZONE}
  • agentkeys-vault-test-${ACCOUNT_ID}, agentkeys-memory-test-${ACCOUNT_ID}, agentkeys-config-test-${ACCOUNT_ID}
  • noreply-test@bots-test.${ZONE}
  • https://test-broker.${ZONE} for the OIDC issuer URL
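
The derivation in the list above is mechanical, which the following sketch makes explicit. derive_ci_names is a hypothetical function written for illustration; the real logic lives in setup-broker-host.sh and may be structured differently.

```shell
#!/usr/bin/env bash
# Print the identifiers listed above for a given zone and account ID.
derive_ci_names() {
  local zone="$1" account="$2" svc class
  # Signer + worker hostnames carry the -test suffix before the zone.
  for svc in signer audit email cred memory config; do
    echo "${svc}-test.${zone}"
  done
  # Bucket names carry -test before the account ID.
  for class in vault memory config; do
    echo "agentkeys-${class}-test-${account}"
  done
  echo "noreply-test@bots-test.${zone}"
  echo "https://test-broker.${zone}"
}
```

Running derive_ci_names "$ZONE" "$ACCOUNT_ID" before the deploy gives a list to compare against DNS and S3.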

When the script finishes (~10-15 min on t3.small cold; ~30-60s on re-runs), it does three things at the end so steady-state operator work is one keystroke from your laptop:

  1. Creates the agentkey SSH login user (separate from the agentkeys daemon system user).
  2. Installs ec2-instance-connect + writes the sshd AuthorizedKeysCommand config so EC2 Instance Connect can push ephemeral keys to agentkey.
  3. Relocates the repo /home/ubuntu/agentKeys → /home/agentkey/agentKeys (chowned to agentkey) so re-runs + ongoing edits happen as the steady-state user.

Then exit the ubuntu session and reconnect as agentkey for everything from here on:

exit                       # leave the ubuntu fallback session
ssh-agentkeys-test         # Instance Connect, no .pem needed
cd ~/agentKeys             # → /home/agentkey/agentKeys, files visible

Subsequent re-runs (git pull + sudo bash scripts/setup-broker-host.sh --ci --yes) happen from /home/agentkey/agentKeys — step 10's relocation is idempotent (existence check skips when already in place). The cargo build cache survives the move (it's inside target/). The Rust toolchain is KEPT across runs by default so re-deploys skip the slow rustup + crate-registry re-download (and sccache caches the compilations) — a no-source-change re-run drops to ~30-60s. Pass --reclaim-toolchain on a final deploy to delete /root/.cargo + /root/.rustup and free ~1.5 GB.

For prod, the same flow applies — drop --test everywhere and the relocation moves the repo from whichever home dir you bootstrapped in to /home/agentkey/.

Optional: install rustup for the agentkey user (dev-loop cargo). If you want to run cargo clippy / cargo test interactively as agentkey (e.g., to mirror the CI Linux env locally and catch cfg(target_os = "linux") clippy lints that don't fire on macOS), install rustup under your own $HOME once after reconnecting as agentkey:

ssh-agentkeys-test
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \
  | sh -s -- -y --default-toolchain stable --profile minimal
source "$HOME/.cargo/env"
echo 'source "$HOME/.cargo/env"' >> ~/.bashrc   # persist for future sessions

cd ~/agentKeys
cargo --version    # resolves the repo's rust-toolchain.toml pin — the exact CI toolchain
cargo clippy --workspace --all-targets -- -D warnings   # same toolchain + lint set as CI

This is optional; the broker itself runs from compiled binaries, not from a live toolchain. Operators who only manage the deployed broker (no compile-in-place dev work) can skip this.

5b. TLS certs — auto-issued by setup-broker-host.sh (re-run if a host was added later)

setup-broker-host.sh (step 7b) auto-issues the Let's Encrypt cert for every co-located vhost whose DNS already resolves to this host, then flips nginx onto :443 — no manual certbot step. Issuance is DNS-dependent (HTTP-01 validates against the hostname's public A record), so a host whose A record isn't live yet is skipped and left HTTP-only (503 "TLS cert not yet issued"; the OIDC federation step in ci-setup.md §1 also needs the broker cert). The order: DNS first (setup-cloud.sh), then this script — a host added to DNS later (e.g. the classify worker, #207) just needs a re-run.

Symptom of a skipped host: curl https://<host>/healthz → SSL: no alternative certificate subject name matches target host name. nginx served the broker's default cert because <host> has no :443 block. Fix: repair the A record, then re-run the script on the host.

Re-run to converge any missing certs:

# On the broker host (ssh-agentkeys; `curl -s ifconfig.me` must print the broker EIP):
cd ~/agentKeys
sudo bash scripts/setup-broker-host.sh --ref main      # prod (--ci --ref main = test broker)
# Optional LE registration email: --certbot-email ops@litentry.org (default: none)

The hostname env vars come from /etc/agentkeys/broker.env (which setup-broker-host.sh wrote at step 5). For test: BROKER_HOST=test-broker.${ZONE}, SIGNER_HOST=signer-test.${ZONE}, etc. For prod: drop the -test suffix.

Manual fallback — issue one cert by hand (step 7b failed or a one-off)

⚠️ Run ON THE BROKER HOST (ssh-agentkeys; curl -s ifconfig.me must print the broker EIP). The CA validates against the hostname's public IP, so certbot run on a laptop — especially behind a VPN that rewrites ${ZONE} — writes the challenge to the wrong machine and the CA gets a 404.

# PRE-CHECK (avoids burning LE's rate-limited attempts): the RUNNING nginx must
# serve the ACME path for <host> — `nginx -T` showing the vhost is not enough.
sudo nginx -t && sudo systemctl reload nginx
sudo mkdir -p /var/www/certbot/.well-known/acme-challenge
echo probe-ok | sudo tee /var/www/certbot/.well-known/acme-challenge/probe >/dev/null
curl -s http://localhost/.well-known/acme-challenge/probe -H "Host: <host>"   # → probe-ok (404 ⇒ reload)
sudo rm -f /var/www/certbot/.well-known/acme-challenge/probe

sudo certbot certonly --webroot -w /var/www/certbot -d <host> \
  --agree-tos --register-unsafely-without-email --non-interactive

# Flip nginx Phase A → B (the renderer picks B when /etc/letsencrypt/live/<host>/ exists):
cd ~/agentKeys
sudo bash scripts/setup-broker-host.sh --ref main      # or --ci --ref main (test broker)

Verify the cert is live (bypass laptop DNS, which may be rewritten by WARP / Zscaler / Tailscale to 198.18.x.y for ${ZONE}):

# DoH lookup — proves Route 53 has the right EIP, not your laptop's local resolver
curl -sS "https://dns.google/resolve?name=${BROKER_HOST}&type=A" | jq -r '.Answer[].data'
# → should be your EIP, not 198.18.x.y

# TLS handshake against the real EIP:
echo | openssl s_client -servername "${BROKER_HOST}" -connect "$(curl -sS "https://dns.google/resolve?name=${BROKER_HOST}&type=A" | jq -r '.Answer[0].data'):443" 2>&1 \
  | grep -E "subject="
# → subject=/CN=<your-BROKER_HOST>

If openssl s_client returns no peer certificate available, certbot didn't finish or nginx isn't on Phase B yet. Check:

  • sudo ls /etc/letsencrypt/live/ — should list all 8 hostnames as subdirs (broker + signer + 6 workers: audit/email/cred/memory/config/classify).
  • sudo ss -tlnp | grep ':443' — nginx should be on 0.0.0.0:443.
  • sudo tail /var/log/letsencrypt/letsencrypt.log for the actual certbot failure.
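
The first check in the list above can be scripted so that a skipped host stands out. A minimal sketch, assuming a hypothetical helper named missing_certs; it only tests for the per-host directory, not for certificate validity or expiry.

```shell
#!/usr/bin/env bash
# List hostnames that have no directory under the certbot live dir.
# First argument: the live dir (normally /etc/letsencrypt/live).
# Remaining arguments: the hostnames you expect to find.
missing_certs() {
  local live_dir="$1" host
  shift
  for host in "$@"; do
    [ -d "${live_dir}/${host}" ] || echo "$host"
  done
}
```

The live directory is readable by root only, so call the function from a root shell, passing the eight hostnames for your stack.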

Common failures + fixes:

  • Connection timeout to … port 80 — the SG is missing port 80 ingress. Re-check step 1's SG requirements (you need 22, 80, and 443).
  • DNS problem: NXDOMAIN — Route 53 doesn't have the A record yet, or DNS hasn't propagated. Wait 1-2 min, then retry. Quick check: curl -sS "https://dns.google/resolve?name=<host>&type=A" (do NOT rely on dig — local resolver may be lying).
  • No such file or directory: /var/www/certbot — Phase A nginx render didn't complete; re-run sudo bash scripts/setup-broker-host.sh --test --yes first.
  • A worker's cert fails but the broker's works (or the CA hits an IP that isn't your broker) — the worker A records point at a different EIP than the broker. Workers co-locate with the broker, so every worker A record MUST equal broker.${ZONE}'s. This bit us when the account had both a prod and a test broker EIP and the worker records were pointed at the test EIP while broker/signer stayed on prod. Check via DoH (never laptop DNS — a VPN rewrites it): for h in broker audit email cred memory config classify; do echo "$h → $(curl -s "https://dns.google/resolve?name=$h.${ZONE}&type=A" | jq -r '.Answer[0].data')"; done. If any worker differs from broker, re-run bash scripts/dns-upsert-workers.sh from your laptop (agentkeys-admin) — it derives the EIP from broker.${ZONE}'s own A record, so all six workers mirror the broker (pass --eip <broker-EIP> to be explicit).
  • unauthorized / Invalid response … /.well-known/acme-challenge/… : 404 — the CA reached the host but couldn't fetch the challenge file. Two causes: (1) certbot ran on the wrong machine: --webroot wrote the challenge to that box's /var/www/certbot, but the CA validates against the hostname's public IP (the broker). Run certbot ON THE BROKER (see the ⚠️ above). (2) a newly-added worker's vhost isn't live: nginx -T shows it on disk but the running process wasn't reloaded; run sudo nginx -t && sudo systemctl reload nginx, then re-run the PRE-CHECK probe. Confirm locally on the broker (not a VPN'd laptop): echo ok | sudo tee /var/www/certbot/.well-known/acme-challenge/probe >/dev/null && curl -s http://localhost/.well-known/acme-challenge/probe -H "Host: <host>" must print ok.
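
The worker-versus-broker A record comparison from the list above is easy to misread by eye, so a small filter can flag mismatches. This is a sketch under assumptions: check_worker_records is a hypothetical name, and it expects "host ip" pairs on stdin with the broker first.

```shell
#!/usr/bin/env bash
# Read "host ip" pairs on stdin; the first pair is the broker.
# Prints each host whose IP differs from the broker's; returns 1 if any do.
check_worker_records() {
  local host ip broker_ip="" bad=0
  while read -r host ip; do
    [ -n "$host" ] || continue
    if [ -z "$broker_ip" ]; then
      broker_ip="$ip"
    elif [ "$ip" != "$broker_ip" ]; then
      echo "MISMATCH ${host} -> ${ip} (broker is ${broker_ip})"
      bad=1
    fi
  done
  return "$bad"
}
```

Pipe the DoH loop into it: for h in broker audit email cred memory config classify; do echo "$h $(curl -s "https://dns.google/resolve?name=$h.${ZONE}&type=A" | jq -r '.Answer[0].data')"; done | check_worker_records. Silence plus exit status 0 means every worker mirrors the broker.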

The rest of this doc explains why each step exists and how to recover from failures. Operators following the quick start above can skip to docs/chain-setup.md once step 5b completes.

§1  Identities         — four IAM principals; concept first, then provider commands
§2  Domain + DNS       — subdomain ownership; parent-zone confirmation
§3  Email backend      — SES domain identity + receipt rule + S3 inbound bucket
§4  IAM users + roles  — agentkeys-{admin,broker,daemon} + agentkeys-data-role
§5  Bucket policy      — static-IAM variant (pre-OIDC; replaced in §9 below)
§6  Instance profile   — agentkeys-broker-host (optional, EC2-only)
§7  Security audit     — strip legacy over-broad attached policies
§8  Cloud portability  — AWS → AliCloud / GCP / Tencent Cloud mapping
§9  OIDC federation    — per-broker security upgrade after broker is reachable
§10 Broker host        — what setup-broker-host.sh does
§11 Cleanup            — full account teardown

Surgical re-run of any single step: bash scripts/setup-cloud.sh --only-step N (with --test for test).

Env files reference (2 per environment + CI runner)

Two env files per environment — {operator, broker} × {prod, test, test-N…} (the test fleet adds a pair per slot, §0.3). The GitHub Actions runner doesn't get its own file — it materializes the operator-workstation env inline at job start from TEST_* secrets.

| File | Lives on | Scope | Sourced by |
|---|---|---|---|
| scripts/operator-workstation.env | operator laptop | prod | every helper script + setup-cloud.sh + setup-heima.sh + harness/run.sh |
| scripts/operator-workstation.test.env | operator laptop | test (slot 1) | same scripts, via --env-file <path> or --ci |
| scripts/operator-workstation.test-2.env | operator laptop | test slot 2 (#265; template for slots ≥ 2) | same scripts, via --ci --slot 2 |
| scripts/broker.env | prod broker host at /etc/agentkeys/broker.env | prod | the broker process at boot (also setup-broker-host.sh writes equivalent systemd Environment= lines) |
| scripts/broker.test.env | test broker host at /etc/agentkeys/broker.env | test (slot 1) | same |
| scripts/broker.test-2.env | slot-2 test broker host | test slot 2 | same |
| GitHub Actions runner | ephemeral runner per job | test | harness-ci.yml writes scripts/operator-workstation.env inline from TEST_* secrets (see docs/ci-setup.md §7) |

Operator env — prod vs test side-by-side

| Variable | Prod | Test | Purpose |
|---|---|---|---|
| ACCOUNT_ID | 429071895007 | 429071895007 (same) | every cloud step |
| REGION | us-east-1 | us-east-1 | regional API calls |
| ZONE | litentry.org | litentry.org (same) | parent DNS zone |
| PARENT_ZONE_ID | Route 53 zone ID | same | DNS UPSERTs |
| BROKER_HOST | broker.${ZONE} | test-broker.${ZONE} | OIDC issuer hostname (byte-for-byte distinct → distinct IAM OIDC provider ARN) |
| MAIL_DOMAIN | bots.${ZONE} | bots-test.${ZONE} | SES inbound subdomain |
| BUCKET / MAIL_BUCKET | agentkeys-mail-${ACCT} | agentkeys-mail-test-${ACCT} | inbound mail bucket |
| VAULT_BUCKET | agentkeys-vault-${ACCT} | agentkeys-vault-test-${ACCT} | credentials bucket (arch.md §17) |
| MEMORY_BUCKET | agentkeys-memory-${ACCT} | agentkeys-memory-test-${ACCT} | memory bucket |
| DATA_ROLE_ARN | …:role/agentkeys-data-role | …:role/agentkeys-data-role-test | OIDC-federated data role |
| VAULT_ROLE_ARN | …:role/agentkeys-vault-role | …:role/agentkeys-vault-role-test | per-data-class vault role |
| MEMORY_ROLE_ARN | …:role/agentkeys-memory-role | …:role/agentkeys-memory-role-test | per-data-class memory role |
| OIDC_PROVIDER_ARN | …:oidc-provider/${BROKER_HOST} | …:oidc-provider/test-broker.${ZONE} | derived from BROKER_HOST |
| SIGNER_HOST + worker hosts | signer.${ZONE} etc. | signer-test.${ZONE} etc. | per-service public hostnames |
| BROKER_EMAIL_FROM_ADDRESS | noreply@bots.${ZONE} | noreply-test@bots-test.${ZONE} | SES verified sender |
| Heima contract *_HEIMA addresses | one set | a DIFFERENT set (same chain, different deployer key) | per-deploy pinned addresses |

Broker env — prod vs test side-by-side

| Variable | Prod | Test |
|---|---|---|
| ACCOUNT_ID | same | same |
| BROKER_DATA_ROLE_ARN | …:role/agentkeys-data-role | …:role/agentkeys-data-role-test |
| BROKER_AWS_REGION | us-east-1 | us-east-1 |
| BROKER_OIDC_ISSUER | https://broker.${ZONE} | https://test-broker.${ZONE} |
| BROKER_OIDC_KEYPAIR_PATH | /home/ubuntu/.agentkeys/broker/oidc-keypair.json | same |
| BROKER_SESSION_KEYPAIR_PATH | /home/ubuntu/.agentkeys/broker/session-keypair.json | same |
| BROKER_AUTH_METHODS | wallet_sig,email_link | same |
| BROKER_AUDIT_ANCHORS | sqlite | same |
| BROKER_EMAIL_SENDER | ses | ses |
| BROKER_EMAIL_FROM_ADDRESS | noreply@bots.${ZONE} | noreply-test@bots-test.${ZONE} |

The broker process never reads operator-workstation env vars directly — separation prevents a laptop value from silently shadowing the broker's own config (per scripts/broker.env header comment).

CI runner

The runner doesn't ship with a checked-in env file. harness-ci.yml writes one inline at job start, mapping TEST_* repo secrets into scripts/operator-workstation.env:

| TEST secret | Maps to operator var |
|---|---|
| TEST_ACCOUNT_ID | ACCOUNT_ID |
| TEST_AWS_REGION | REGION |
| TEST_BROKER_HOST | BROKER_HOST |
| TEST_VAULT_BUCKET / TEST_MEMORY_BUCKET | VAULT_BUCKET / MEMORY_BUCKET |
| TEST_DATA_ROLE_ARN / TEST_VAULT_ROLE_ARN / TEST_MEMORY_ROLE_ARN | DATA_ROLE_ARN / VAULT_ROLE_ARN / MEMORY_ROLE_ARN |
| TEST_HEIMA_DEPLOYER_KEY | written to ~/.agentkeys/heima-deployer.key |
| TEST_*_HEIMA contract addresses | *_HEIMA |
| TEST_OIDC_AWS_ROLE_ARN | the GH Actions OIDC role (gate; not a runtime var) |

Full list + activation flow: docs/ci-setup.md §7. setup-cloud.sh validates required keys at step 2 and dies with a precise pointer if missing.
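
As an illustration of the secret-to-variable mapping, the sketch below writes a partial operator env file from TEST_* variables. write_ci_env is a hypothetical function covering only five of the keys; the actual harness-ci.yml step may look quite different.

```shell
#!/usr/bin/env bash
# Write an operator env file from TEST_* variables (subset of the mapping).
# ${VAR:?msg} aborts with a named error when a secret is missing.
write_ci_env() {
  local out="$1"
  {
    echo "ACCOUNT_ID=${TEST_ACCOUNT_ID:?TEST_ACCOUNT_ID is required}"
    echo "REGION=${TEST_AWS_REGION:?TEST_AWS_REGION is required}"
    echo "BROKER_HOST=${TEST_BROKER_HOST:?TEST_BROKER_HOST is required}"
    echo "VAULT_BUCKET=${TEST_VAULT_BUCKET:?TEST_VAULT_BUCKET is required}"
    echo "MEMORY_BUCKET=${TEST_MEMORY_BUCKET:?TEST_MEMORY_BUCKET is required}"
  } > "$out"
}
```

Failing fast on a missing secret mirrors the behaviour described for setup-cloud.sh step 2, where a missing key stops the run with a pointer to the variable.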

§0.1 Manual prereqs (must exist before setup-cloud.sh runs)

setup-cloud.sh consumes already-existing identifiers — it does NOT register your domain, create a Route 53 hosted zone, or launch the EC2. Those are operator decisions (instance type, region, key pair, DNS provider choice) and don't belong in an automated script. Three manual prereqs before the orchestrator works:

1. Domain + Route 53 hosted zone

You own a domain (e.g. litentry.org). If not, register one with any registrar (Namecheap, GoDaddy, Route 53 Domains, etc.) — fully manual, out of scope here.

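Create a Route 53 hosted zone for the domain. Skip this if the zone already exists: the call is idempotent only per caller reference, and the timestamped reference below is new on every run, so re-running it creates a second zone with the same name: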

aws route53 create-hosted-zone \
  --name "$ZONE" \
  --caller-reference "agentkeys-$(date +%s)"

Look up the zone ID (strip the /hostedzone/ prefix):

aws route53 list-hosted-zones \
  --query 'HostedZones[?Name==`'"$ZONE"'.`].Id' --output text \
  | awk -F/ '{print $NF}'
# → Z09723983CFJOHAE3VC65

Paste it into operator-workstation.env as PARENT_ZONE_ID=Z….

Delegation: Route 53 outputs 4 NS records when you create the zone (visible via aws route53 get-hosted-zone --id $PARENT_ZONE_ID --query 'DelegationSet.NameServers'). Copy them into your registrar's DNS settings as the authoritative nameservers. Verify after propagation (usually <1h):

dig +short NS "$ZONE"
# Should return 4 ns-XX.awsdns-YY.{com,net,org,co.uk} entries.

If dig returns the registrar's default nameservers instead, delegation hasn't propagated. All downstream DNS UPSERTs in §6 will silently miss until it does.
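
The delegation check can be automated so a script refuses to continue until the NS records point at Route 53. A minimal sketch with a hypothetical helper name; it pattern-matches the ns-XX.awsdns-YY naming shown above and does not verify that the nameservers belong to your specific zone.

```shell
#!/usr/bin/env bash
# Read `dig +short NS` output on stdin. Succeeds only when at least one
# nameserver is listed and every one follows the ns-*.awsdns-* pattern.
is_route53_delegated() {
  local ns count=0
  while read -r ns; do
    [ -n "$ns" ] || continue
    case "$ns" in
      ns-*.awsdns-*) count=$((count + 1)) ;;
      *) return 1 ;;
    esac
  done
  [ "$count" -gt 0 ]
}
```

Usage: dig +short NS "$ZONE" | is_route53_delegated || echo "delegation not propagated yet".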

Non-Route 53 DNS providers: setup-cloud.sh step 6 hardcodes Route 53 API calls. To use Cloudflare / DigitalOcean / etc., skip step 6 (--to-step 5) and replicate the same 12 records manually — see §6 below for the canonical record set. Test isolation works identically: a test-broker.${ZONE} A record under any DNS provider is the same byte-for-byte trust scope as under Route 53.

2. EC2 instance (or any Linux host)

setup-broker-host.sh runs on any Linux box with sudo, systemd, public-internet egress, ports 22/80/443 open inbound. The host is your choice:

| Setting | Prod | Test |
|---|---|---|
| Instance type | t3.small minimum | t3.micro is fine |
| AMI | Ubuntu 22.04 LTS or Amazon Linux 2023 | same |
| Security group | 22 (SSH), 80 (certbot HTTP-01), 443 (broker + workers TLS), all from 0.0.0.0/0 | same (AWS validates OIDC JWKS over public TLS from AWS IPs that aren't pinnable) |
| Key pair | SSH key, EC2 Instance Connect, or SSM Session Manager | same |

Launch via AWS console, aws ec2 run-instances, or your IaC tool. The script doesn't care which.

Getting the IP — three workflows:

Both INSTANCE_ID and EIP live in the env file (scripts/operator-workstation.env or …test.env) — set them there once, not on the shell every run. The test stack is selected by --env-file <path> + the explicit --test flag (or auto-detected when the env-file name contains "test").

Workflow 0 (you already have EC2 + EIP attached): step 4 adopts the existing EIP

If the EC2 is already running with an EIP attached (whether allocated via the AWS Console, Terraform, or a previous setup-cloud.sh run), there's no need to allocate or re-associate. Step 4's precedence ladder detects it:

# 1. Find the existing EC2's instance id:
aws ec2 describe-instances --region "$REGION" \
  --filters "Name=ip-address,Values=<YOUR-EXISTING-EIP>" \
  --query 'Reservations[].Instances[].InstanceId' --output text

# 2. Paste it into the env file (one line edit):
echo 'INSTANCE_ID=i-0123…' >> scripts/operator-workstation.env

# 3. Run setup-cloud.sh — step 4 prints:
#      "skip  EIP <ip> already attached to <instance-id> (adopting; no allocation)"
#      "ok    tagged existing EIP as agentkeys-broker-eip (idempotency for re-runs)"
#    No new EIP is allocated. No re-association. The existing EIP gets
#    retroactively tagged so future re-runs find it via tag-lookup too.
AWS_PROFILE=agentkeys-admin bash scripts/setup-cloud.sh --yes

The precedence inside step 4 is: A adopt EIP attached to $INSTANCE_ID → B reuse tagged EIP → C use $EIP from env file → D allocate fresh. First match wins; no later branch fires if an earlier one resolves. Fully idempotent re-runs even when the operator pre-provisioned EC2 + EIP outside the script.
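
The first-match-wins ladder can be pictured as a chain of emptiness checks. This is an illustrative sketch only: resolve_eip and its three arguments are hypothetical, and the real step 4 performs AWS lookups to obtain each candidate.

```shell
#!/usr/bin/env bash
# Pick the EIP source in the order A, B, C, D; the first non-empty wins.
#   $1: EIP already attached to $INSTANCE_ID (branch A)
#   $2: EIP found by tag lookup              (branch B)
#   $3: EIP value from the env file          (branch C)
resolve_eip() {
  local attached="$1" tagged="$2" from_env="$3"
  if   [ -n "$attached" ]; then echo "adopt ${attached}"
  elif [ -n "$tagged" ];   then echo "reuse ${tagged}"
  elif [ -n "$from_env" ]; then echo "env ${from_env}"
  else                          echo "allocate"
  fi
}
```

For example, resolve_eip "" "192.0.2.5" "192.0.2.9" picks branch B: the tagged address is reused and the env-file value is never consulted.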

Workflow A (recommended): EC2-first, then attach via env-file edit + re-run

# 1. Launch EC2 → note INSTANCE_ID
aws ec2 run-instances --instance-type t3.small --image-id <ami> --key-name <key> ...

# 2. Paste INSTANCE_ID into the env file (one line edit):
echo 'INSTANCE_ID=<from-step-1>' >> scripts/operator-workstation.env
#    (or for test: scripts/operator-workstation.test.env)

# 3. Bootstrap (allocates EIP + attaches to INSTANCE_ID + persists EIP back to env)
AWS_PROFILE=agentkeys-admin bash scripts/setup-cloud.sh --yes
# Test stack:
AWS_PROFILE=agentkeys-admin bash scripts/setup-cloud.sh \
  --env-file scripts/operator-workstation.test.env --test --yes

# 4. SSH (EIP is now in the env file as EIP=…)
ssh ubuntu@$(grep ^EIP= scripts/operator-workstation.env | cut -d= -f2)

Workflow B: EIP-first, attach manually

# 1. Allocate EIP (printed at §14 summary; persisted to env file as EIP=…)
AWS_PROFILE=agentkeys-admin bash scripts/setup-cloud.sh --yes

# 2. Launch EC2
aws ec2 run-instances ...

# 3. Attach the EIP
aws ec2 associate-address --region "$REGION" \
  --instance-id <new-instance-id> \
  --public-ip $(grep ^EIP= scripts/operator-workstation.env | cut -d= -f2)

A is one fewer command; B is sometimes necessary when an existing EC2 needs to be repointed at the EIP later. For test, swap in --env-file scripts/operator-workstation.test.env --test everywhere — the EIP will be tagged agentkeys-broker-eip-test (the test env file has the test placeholders pre-populated).
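
The grep ^EIP= … | cut pattern used in both workflows can be wrapped in a helper. The sketch below assumes a hypothetical env_get function; it takes the last assignment because the echo … >> edits shown above append, so a key may appear more than once.

```shell
#!/usr/bin/env bash
# Print the value of KEY from a KEY=value env file.
# The last assignment wins, matching what sourcing the file would do.
# Surrounding quotes, if any, are returned as written.
env_get() {
  local file="$1" key="$2"
  grep "^${key}=" "$file" | tail -n 1 | cut -d= -f2-
}
```

Usage: ssh ubuntu@"$(env_get scripts/operator-workstation.env EIP)".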

2a. SSH into the broker host

Once the EC2 is launched + the EIP attached, SSH access goes through scripts/ssh-broker.sh — single entry point that reads INSTANCE_ID + EIP from scripts/broker.env or scripts/broker.test.env so it stays in lockstep with whatever setup-cloud.sh persisted.

# Prod broker via EC2 Instance Connect (no .pem needed):
bash scripts/ssh-broker.sh

# Test broker:
bash scripts/ssh-broker.sh test

# Fallback via .pem key (when EC2 Instance Connect is down):
bash scripts/ssh-broker.sh prod --fallback
bash scripts/ssh-broker.sh test --fallback

Default AWS profiles per stack (least-privilege, one-shot to provision):

| Stack | Default profile | Trust |
|---|---|---|
| prod | agentkeys-broker | ec2-instance-connect:SendSSHPublicKey on the prod instance ARN only |
| test | agentkeys-broker-test | same, scoped to the test instance ARN |

If agentkeys-broker or agentkeys-broker-test doesn't exist yet, setup-cloud.sh step 12 creates it idempotently (scoped to whatever INSTANCE_ID is set in the corresponding broker env file):

# Test stack — creates agentkeys-broker-test, scopes ec2-instance-connect
# to INSTANCE_ID from broker.test.env, mints an access key ONCE if none
# active. Re-run is a no-op once the user + policy + key already exist.
AWS_PROFILE=agentkeys-admin bash scripts/setup-cloud.sh \
  --env-file scripts/operator-workstation.test.env --test --only-step 12

# Prod stack (the canonical `agentkeys-broker` user from AGENTS.md):
AWS_PROFILE=agentkeys-admin bash scripts/setup-cloud.sh --only-step 12

The script prints the access key once (paste into ~/.aws/credentials as [agentkeys-broker] / [agentkeys-broker-test]) — it never re-mints on subsequent runs because the operator already holds the secret. If INSTANCE_ID is unset in the broker env file, step 12 skips with a pointer to paste it first.
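
For readers who want to see what "scoped to the instance ARN" can look like, here is a sketch that renders such a policy. It is an assumption about the shape, not the policy step 12 actually writes: the function name is hypothetical, and the ec2:osuser condition (restricting the login user) is an optional extra that the source does not confirm.

```shell
#!/usr/bin/env bash
# Render an IAM policy allowing EC2 Instance Connect key pushes to exactly
# one instance and one OS user. All four arguments are caller-supplied.
instance_connect_policy() {
  local region="$1" account="$2" instance="$3" os_user="$4"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "ec2-instance-connect:SendSSHPublicKey",
    "Resource": "arn:aws:ec2:${region}:${account}:instance/${instance}",
    "Condition": {"StringEquals": {"ec2:osuser": "${os_user}"}}
  }]
}
EOF
}
```

Because the Resource names a single instance, a key minted for the test profile cannot push SSH keys to the prod instance, and the reverse also holds.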

Shell wrappers (drop in ~/.zshrc) make the common case one keystroke:

AGENTKEYS_REPO="$HOME/Projects/agentKeys"
alias ssh-prod='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh prod'
alias ssh-test='bash $AGENTKEYS_REPO/scripts/ssh-broker.sh test'

3. agentkeys-admin AWS profile

A long-lived IAM user with IAMFullAccess + AmazonS3FullAccess + AmazonSESFullAccess + AmazonRoute53FullAccess permissions. Already provisioned per AGENTS.md "AWS local-profile ↔ remote-IAM mapping". Switch to it before any bootstrap call:

awsp agentkeys-admin
aws sts get-caller-identity   # → arn:aws:iam::…:user/agentkeys-admin

The bootstrap script intentionally doesn't auto-create the admin user — bootstrapping IAM root credentials onto disk is the kind of thing you only do once, by hand, with the IAM Console open.

§0.2 IAM isolation matrix (prod ↔ test, same AWS account)

Same AWS account is fine — isolation comes from the -test suffix on every identifier, not from the account boundary. Cross-trust is structurally impossible because the trust policy on every test role lists ONLY the test OIDC provider ARN (which is bound byte-for-byte to test-broker.${ZONE}, never broker.${ZONE}).

| Resource | Prod name | Test name | Created by |
|---|---|---|---|
| IAM user (daemon) | agentkeys-daemon | agentkeys-daemon-test | setup-cloud.sh step 10 (suffixed when --test flag is passed, or env-file path matches *test* as an ergonomic auto-detect) |
| IAM role (data) | agentkeys-data-role | agentkeys-data-role-test | setup-cloud.sh step 11 (same suffix logic) |
| IAM role (vault) | agentkeys-vault-role | agentkeys-vault-role-test | provision-vault-role.sh reads VAULT_ROLE_ARN from the active env file |
| IAM role (memory) | agentkeys-memory-role | agentkeys-memory-role-test | provision-memory-role.sh (same env-driven pattern) |
| IAM OIDC provider | …oidc-provider/broker.${ZONE} | …oidc-provider/test-broker.${ZONE} | manual aws iam create-open-id-connect-provider per §9.2 (one per broker URL — AWS validates byte-for-byte) |
| EC2 instance profile | agentkeys-broker-host | agentkeys-broker-host-test | §6 (optional) |
| EIP (tag) | agentkeys-broker-eip | agentkeys-broker-eip-test | setup-cloud.sh step 4 |
| Mail bucket | agentkeys-mail-${ACCT} | agentkeys-mail-test-${ACCT} | setup-cloud.sh step 7 (from BUCKET env var) |
| Vault bucket | agentkeys-vault-${ACCT} | agentkeys-vault-test-${ACCT} | provision-vault-bucket.sh (from VAULT_BUCKET env var) |
| Memory bucket | agentkeys-memory-${ACCT} | agentkeys-memory-test-${ACCT} | provision-memory-bucket.sh (from MEMORY_BUCKET env var) |
| SES sender | noreply@bots.${ZONE} | noreply-test@bots-test.${ZONE} | ses-verify-sender.sh (from BROKER_EMAIL_FROM_ADDRESS) |
| Heima contracts | one set of 6 addresses | a different set of 6 (same chain, different deployer key) | setup-heima.sh per deployer key |

Cross-trust isolation enforced by:

  1. OIDC provider URL is the trust scope. Each role's trust policy names exactly one provider ARN. The provider ARN derives from the broker URL. broker.${ZONE} and test-broker.${ZONE} produce distinct ARNs, so the test OIDC provider literally cannot mint JWTs that prod roles accept.
  2. PrincipalTag scoping (§9.4) layers on top. Even if a test JWT somehow reached a prod role, the bucket policy condition s3:prefix=bots/${aws:PrincipalTag/agentkeys_actor_omni}/* would still scope reads/writes by actor.
  3. Per-data-class bucket separation. Vault role's IAM grants reference vault bucket only; memory role references memory bucket only. Even within one stack, vault creds in the memory bucket → AccessDenied (defense-in-depth for the cap-mint layer).
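Point 1 can be spot-checked from a laptop. The following is a minimal sketch, assuming the federated trust policy from §9.3 is already applied to the role being checked. The helper name only_trusts_provider is hypothetical (it is not a repo script); it reads a trust-policy JSON document on stdin and succeeds only when exactly one OIDC provider ARN appears in it and that ARN equals the expected one.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repo.
# Reads a trust policy (JSON) on stdin; $1 is the only provider ARN allowed.
only_trusts_provider() {
  local expected="$1" found
  # Collect every distinct OIDC provider ARN mentioned in the document.
  found="$(grep -oE 'arn:aws:iam::[0-9]+:oidc-provider/[^"]+' | sort -u)"
  # Zero ARNs or more than one distinct ARN can never equal the single
  # expected string, so both cases fail here.
  [ "$found" = "$expected" ]
}

# Usage (needs AWS credentials, so it is left commented out):
#   aws iam get-role --role-name agentkeys-data-role-test \
#     --query 'Role.AssumeRolePolicyDocument' --output json \
#     | only_trusts_provider \
#         "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/test-broker.${ZONE}"
```

A non-zero exit means the role either is not federated yet or trusts a provider other than the one expected for its stack.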

setup-cloud.sh validates required env keys at step 2 and dies with a precise pointer if missing.

§0.3 Test-broker fleet — multiple test brokers (issue #265)

The account can host N test brokers in parallel (one EC2 per slot) so concurrent CI runs stop trampling each other's deploys. Prod stays single-broker. Slots are selected everywhere by --ci --slot N (or AGENTKEYS_TEST_SLOT=N, or auto-detected from a *test-N* env-file path); a bare --ci is slot 1.

Naming scheme. Slot 1 predates the fleet and keeps its names (the OIDC issuer URL is byte-frozen against the registered IAM provider — renaming would break every trust policy). Slots ≥ 2 use a uniform -test-N suffix:

| Identifier | Prod | Test slot 1 (grandfathered) | Test slot N ≥ 2 |
| --- | --- | --- | --- |
| Broker host / OIDC issuer | broker.${ZONE} | test-broker.${ZONE} | broker-test-N.${ZONE} |
| MCP host | mcp.${ZONE} | test-mcp.${ZONE} | mcp-test-N.${ZONE} |
| Signer + 6 workers | signer.${ZONE} | signer-test.${ZONE} | signer-test-N.${ZONE} |
| EIP tag | agentkeys-broker-eip | agentkeys-broker-eip-test | agentkeys-broker-eip-test-N |
| IAM users/roles | agentkeys-daemon | agentkeys-daemon-test | agentkeys-daemon-test-N |
| Buckets | agentkeys-vault-${ACCT} | agentkeys-vault-test-${ACCT} | agentkeys-vault-test-N-${ACCT} |
| Mail domain / sender | bots.${ZONE} | bots-test.${ZONE} / noreply-test@ | bots-test-N.${ZONE} / noreply-test-N@ |
| SES receipt rule | agentkeys-inbound | agentkeys-inbound-test | agentkeys-inbound-test-N |
| SSM instance profile | agentkeys-broker-host (§6) | agentkeys-test-broker-ssm | agentkeys-test-broker-ssm-N |
| Env files | …workstation.env + broker.env | ….test.env + broker.test.env | ….test-N.env + broker.test-N.env |
| SSH | ssh-broker.sh prod | ssh-broker.sh test | ssh-broker.sh test-N |
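The naming rule in the table can be sketched as a small function. This is an illustration only: stack_names is a hypothetical helper, not how setup-cloud.sh derives names internally, and it covers just five of the identifiers. The zone and account ID are caller-supplied placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a few identifiers for one stack.
# Args: env ("prod" or "test"), slot number, DNS zone, AWS account ID.
stack_names() {
  local env="$1" slot="$2" zone="$3" acct="$4" suffix broker
  if [ "$env" = "prod" ]; then
    suffix=""
    broker="broker.${zone}"
  elif [ "$slot" -eq 1 ]; then
    # Slot 1 is grandfathered: "-test" suffix, but a "test-" host prefix.
    suffix="-test"
    broker="test-broker.${zone}"
  else
    # Slots >= 2 use the uniform "-test-N" suffix everywhere.
    suffix="-test-${slot}"
    broker="broker${suffix}.${zone}"
  fi
  printf 'broker_host=%s\n' "$broker"
  printf 'signer_host=signer%s.%s\n' "$suffix" "$zone"
  printf 'eip_tag=agentkeys-broker-eip%s\n' "$suffix"
  printf 'daemon_user=agentkeys-daemon%s\n' "$suffix"
  printf 'vault_bucket=agentkeys-vault%s-%s\n' "$suffix" "$acct"
}
```

For example, `stack_names test 2 "$ZONE" "$ACCT"` prints the slot-2 column for those five rows; the only irregular case is the slot-1 broker host.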

Shared across the fleet vs replicated per slot (mirrors the issue #265 table):

  • Shared: the AWS account, parent DNS zone, SES rule set name (agentkeys — rules are per-slot), the Heima TEST contract set (Registry/Scope/EntryPoint/factory are multi-tenant, keyed by omni/account), the GH-Actions deploy role github-actions-agentkeys-deploy (its SendCommand scope covers the whole fleet).
  • Replicated per slot: EC2 + EIP (+ tag), the full DNS name set (9 A records), OIDC issuer + IAM provider, daemon/SSH users, data + per-data-class roles, all four buckets (mail/vault/memory/config — decision (a): per-slot buckets, zero cross-slot blast radius), SES domain identity + sender + receipt rule, SSM instance profile (per-machine — never shared, per #265's isolation requirement), and (phase 3) the slot's own Heima deployer wallet → distinct omnis + master account.

Cross-slot isolation is structural, same mechanism as prod↔test (§0.2): each slot's roles trust ONLY that slot's OIDC provider ARN, which derives from the slot's broker URL — slot 2's JWTs cannot assume slot 1's roles, and vice versa.

Adding slot N (checklist):

  1. Launch the EC2 (t3.small+; SG ports 22/80/443 open — the existing launch-wizard-2 SG is fine) and attach/allocate an EIP. Tag conventions are applied by the script, not by hand.
  2. Create the two env files by copying the slot-2 pair: scripts/operator-workstation.test-2.env….test-N.env (replace every test-2 with test-N; keep the SHARED contract addresses verbatim; leave DEPLOYER_ADDR_HEIMA empty until the slot's deployer key exists) and scripts/broker.test-2.envbroker.test-N.env (paste the new INSTANCE_ID + EIP).
  3. bash scripts/setup-cloud.sh --ci --slot N --yes --to-step 12 (laptop, agentkeys-admin). Step 13's per-data-class roles need the slot's OIDC provider to exist, so it runs later (step 6 of this checklist).
  4. On the EC2 (first time via bash scripts/ssh-broker.sh test-N --fallback): clone the repo, sudo bash scripts/setup-broker-host.sh --ci --slot N --yes, then issue certs per quick-start §5b (all 9 hostnames) and re-run the script to flip nginx onto :443. --slot (and even --ci) matter only on this FIRST bootstrap of the virgin host — once the broker unit exists, every re-run self-identifies the environment + slot from the deployed unit's BROKER_OIDC_ISSUER (so CI's flagless-slot SSM re-deploys can't cross-wire slots), and an explicit --slot that contradicts the deployed identity is a hard error.
  5. Register the slot's OIDC provider per §9.2 (issuer https://broker-test-N.${ZONE}).
  6. bash scripts/setup-cloud.sh --ci --slot N --yes --only-step 13 — per-data-class buckets + roles + bucket policies now succeed (their trust policies name the provider from step 5).
  7. bash scripts/provision-ci-deploy-role.sh --fix-ssm (no instance flag — it auto-discovers every broker.test*.env slot, re-scopes the CI deploy role to the whole fleet, and creates + associates the slot's agentkeys-test-broker-ssm-N instance profile).
  8. Verify: ENV_FILE=scripts/operator-workstation.test-N.env bash scripts/verify-workers.sh and curl -sf https://broker-test-N.${ZONE}/healthz.

CI slot routing (issue #265 phase 4) is wired but dormant by default. harness-ci.yml shards the deploy + harness concurrency per slot (slot 1 keeps the legacy group string heima-test-deployer-nonce so runs from pre-phase-4 branches still mutually exclude; slots ≥ 2 get heima-test-slot-N) and selects the slot's deployer key / instance id / env; the slot itself is picked least-loaded by the run's dependency-free slot-claim job (static PR# % N + 1 as tiebreak/fallback — spec/ci-parallel-test-fleet.md §4b). N is the repo variable AGENTKEYS_TEST_SLOT_COUNT, default 1 (every run on slot 1 = the pre-fleet behavior). Turn on parallelism with gh variable set AGENTKEYS_TEST_SLOT_COUNT --body N once each slot's CI is proven (validate a slot first via gh workflow run harness-ci.yml -f slot_override=N). Watch the fleet + run→slot mapping live with bash scripts/fleet-status.sh --watch (#279 v1). Full runbook: ci-setup.md "Activate parallel CI"; design + why the chain contract set is SHARED: spec/ci-parallel-test-fleet.md.

Live fleet inventory (re-verify with aws ec2 describe-addresses --region "$REGION" --filters Name=tag:Name,Values='agentkeys-broker-eip-test*' — never hardcode the IPs downstream; the env files are the source of truth):

| Slot | EC2 Name tag | Env files | Stood up |
| --- | --- | --- | --- |
| 1 | agentkeys-broker-test | *.test.env | pre-#265 (the original CI broker) |
| 2 | agentkeys-broker-test-2 | *.test-2.env | 2026-06-11 (#265 phase 2) |

Why jq -n --arg and not cat > file.json <<EOF: jq --arg passes values outside shell parameter expansion, sidestepping the zsh modifier bug ($VAR:r etc.) that silently corrupts ARNs. The JSON is validated on construction, command substitution feeds it straight into --policy-document, and no file lands on disk. The orchestrator and every helper script apply this convention.
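A minimal sketch of the convention, assuming jq is installed. The function name data_role_assume_policy is hypothetical; the policy body mirrors the §4.1 statement. The account ID enters as a jq variable and the ARN is assembled inside the jq program, so the shell never sees a $VAR:role sequence to expand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit the daemon's assume-role policy for one account.
# The colon-heavy ARN is built by jq string interpolation, not by the shell.
data_role_assume_policy() {
  jq -n --arg acct "$1" '{
    Version: "2012-10-17",
    Statement: [{
      Effect: "Allow",
      Action: "sts:AssumeRole",
      Resource: "arn:aws:iam::\($acct):role/agentkeys-data-role"
    }]
  }'
}

# Usage (needs AWS credentials, so it is left commented out):
#   aws iam put-user-policy --user-name agentkeys-daemon \
#     --policy-name agentkeys-daemon-assume-role \
#     --policy-document "$(data_role_assume_policy "$ACCOUNT_ID")"
```

Because jq serializes the value itself, an account ID or name containing quotes or backslashes is escaped correctly instead of producing invalid JSON.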

§0.4 Add a prod chain stack (spawn the second prod broker — #282)

Production runs one full broker stack per chain: the Heima stack (broker.${ZONE}, the consumer free tier, plain flags) and the Base stack (-base, the permissioned B2B2C partner tier — #282 D5 dual-stack), each on its own EC2. This is the same spawn-another-broker pattern as §0.3 / ci-setup.md "Add a test-fleet slot" — instantiated on the chain axis (one machine per chain, prod posture) instead of the slot axis (N machines, one chain). The chain migration itself (contracts, ceremonies, phases) is plan/chain/base-migration.md; this section is the end-to-end operator journey for the per-chain cloud + host substrate, traps inline.

Selection everywhere is --base (or auto-detect from a *base* env-file path). A PROD stack — mutually exclusive with --ci; real SES sender, hard-fail accept drift guard. Host re-runs are flagless (self-identify from the deployed unit's issuer; a contradicting --ci/--base is a hard cross-wiring error).

Naming — the §0.3 worker-style suffix scheme with -base on every identifier:

| Identifier | Base prod stack |
| --- | --- |
| Broker host / OIDC issuer | broker-base.${ZONE} |
| MCP host | mcp-base.${ZONE} |
| Signer + 6 workers | signer-base.${ZONE}, audit/email/cred/memory/config/classify-base.${ZONE} |
| EIP tag | agentkeys-broker-eip-base |
| IAM users/roles | agentkeys-daemon-base, agentkeys-broker-base, agentkeys-data-role-base, per-class agentkeys-{vault,memory,config}-role-base |
| Buckets | agentkeys-{mail,vault,memory,config}-base-${ACCT} |
| Mail domain / sender / receipt rule | bots-base.${ZONE} / noreply-base@ / agentkeys-inbound-base |
| EC2 instance profile | agentkeys-broker-host-base (§6 with -base substituted) |
| Env files / SSH | *.base.env / ssh-broker.sh base |

Shared vs replicated: shared with the Heima prod stack = the AWS account + parent DNS zone, nothing else. Everything in the table is replicated per stack. Isolation is structural, same mechanism as prod↔test (§0.2): each stack's roles trust only that stack's OIDC provider ARN.

The chain dimension (the one thing the slot pattern doesn't have): the env file pins AGENTKEYS_CHAIN=base; setup-broker-host.sh templates that chain into every unit + worker env it writes — env keys carry the _BASE suffix per the Rust env_profile convention, RPC default from chain-profiles/base.json. Until #282 Phase 4 deploys the Base contract set, the *_BASE addresses are empty → the stack runs healthz-green but chain-degraded (#241 posture: accept / cap-mint / chain-verify answer actionable 5xx; OIDC + IAM + S3 + SES fully live). Deployer key: ~/.agentkeys/deployer-base.key (0600; symlink base-deployer.key → it for the bring-up's <chain>-deployer.key path), address pinned as DEPLOYER_ADDR_BASE.

Add the stack (checklist — live since 2026-06-12 on EC2 agentkeys-broker-base; broker.base.env is the instance/EIP source of truth):

  1. Launch the EC2 (t3.medium; SG ports 22/80/443 — the broker SGs qualify) and attach an EIP. Paste INSTANCE_ID + EIP into scripts/broker.base.env; tags are applied by the script.
  2. Create the two env files. For Base they're committed (operator-workstation.base.env + broker.base.env); for a future chain <c>, copy that pair with s/base/<c>/, keep the *_<C> contract keys EMPTY, and pin the chain's deployer address.
  3. Cloud provision (laptop, agentkeys-admin): bash scripts/setup-cloud.sh --base --yes --to-step 12. Step 4 adopts + retro-tags the attached EIP; append the two printed access-key blocks to ~/.aws/credentials as [agentkeys-daemon-base] / [agentkeys-broker-base]. Trap: step 9's sender verify can exit non-zero while actually succeeding — re-run --only-step 9; "already verified" = done.
  4. Instance profile agentkeys-broker-host-base per §6 with -base substituted (SES grant scoped to the bots-base.${ZONE} identities + AmazonSSMManagedInstanceCore), associated to the instance.
  5. Broker host bring-up (first time via bash scripts/ssh-broker.sh base --fallback): clone the repo, sudo bash scripts/setup-broker-host.sh --base --yes. TLS for the 8 hostnames auto-issues in the same run (§5b — DNS from step 3 must be live first; mcp-base stays DNS-only until #152); if any host was skipped, re-run flagless after fixing its A record.
  6. Register the stack's OIDC provider (§9.2, issuer https://broker-base.${ZONE}), then bash scripts/setup-cloud.sh --base --yes --only-step 13 (the per-class roles' trust policies need the provider to exist first), then a full bash scripts/setup-cloud.sh --base --yes converge (all ok/skip).
  7. Verify: ENV_FILE=scripts/operator-workstation.base.env bash scripts/wait-stack-healthy.sh (8/8 green over public TLS) + OIDC discovery at https://broker-base.${ZONE}/.well-known/openid-configuration serves the issuer with a populated JWKS.

What this does NOT yet wire (deferred to #282 Phases 2–5): the Base contract set + P-256 wrapper (#170) + paymaster/EntryPoint wiring, binding ceremonies, harness coverage, web-app chain copy. No CI deploy role either — prod stacks are operator-deployed via ssh-broker.sh base, never SSM. fleet-status.sh (#279/#280) does not yet discover *.base.env.

§1 Identities — mental model

Cloud-agnostic. The four principals below (plus the optional instance profile) exist in every cloud the broker runs on; the cloud changes only which API creates them.

| Identity | Type | Holds | Purpose |
| --- | --- | --- | --- |
| agentkeys-admin | privileged user | Long-lived access key | One-shot provisioning. Runs every command in this doc. IAM-admin scope. |
| agentkeys-broker | scoped user | Long-lived access key | Operator's SSH-into-EC2 path via EC2 Instance Connect (AWS) / SSH key (other clouds). No data-plane access. |
| agentkeys-daemon | runtime user | Long-lived access key | The broker process uses this at runtime. Only permission: assume the data role. |
| agentkeys-data-role | assumed role | none (assumed) | Holds the actual storage + email permissions. Trusted by the runtime user (Stage 6) or by the OIDC provider (Stage 7). |
| agentkeys-broker-host | instance profile (optional) | none (bound to a VM) | If the broker runs on a managed VM, attach this so the daemon never sees a static key. Runtime creds come from IMDS / metadata server. |

Why "data role" and not "agent role": the project word "agent" already means three things (the AI agent, the AgentKeys product, an IAM role). The role holds data-plane permissions. The broker still accepts the legacy BROKER_AGENT_ROLE_ARN env var for backwards compatibility.

§2 Domain + DNS

The following subdomains live under the operator's parent zone (substitute ${ZONE} everywhere):

| Host | Purpose | Provisioned in |
| --- | --- | --- |
| ${MAIL_DOMAIN} (e.g. bots.${ZONE}) | SES / email backend inbound | §3 |
| ${BROKER_HOST} (e.g. broker.${ZONE}) | Broker public reverse proxy | §10.1 below |
| signer.${ZONE} | Signer service (issue #74 step 1b) | §10.1 below |
| audit.${ZONE} / email.${ZONE} / cred.${ZONE} / memory.${ZONE} / config.${ZONE} | Service workers (issue #90; config = #201 master-only taxonomy) | §10.1 below (dev co-location on broker EIP today) |

Confirm the parent zone is reachable before any record changes (AWS Route 53 example; the same get-hosted-zone shape exists on AliCloud DNS + Cloud DNS):

aws route53 get-hosted-zone --id "$PARENT_ZONE_ID" \
  --query 'HostedZone.{name:Name, private:Config.PrivateZone}'
# → {"name": "${ZONE}.", "private": false}

The bulk service-worker A-record creation is automated by scripts/dns-upsert-workers.sh (AWS Route 53 today). For other providers, replicate the same shape — the hostnames are the migration seam.

§3 Email backend

§3.1 Verify the SES domain identity (AWS)

aws sesv2 create-email-identity \
  --region "$REGION" --email-identity "$MAIL_DOMAIN" \
  --dkim-signing-attributes NextSigningKeyLength=RSA_2048_BIT

Then publish DKIM + SPF + DMARC + MX records in one DNS change. AWS Route 53:

read -r T1 T2 T3 <<<"$(aws sesv2 get-email-identity --region "$REGION" \
  --email-identity "$MAIL_DOMAIN" --query 'DkimAttributes.Tokens' --output text)"

aws route53 change-resource-record-sets --hosted-zone-id "$PARENT_ZONE_ID" \
  --change-batch "$(jq -n \
    --arg domain "$MAIL_DOMAIN" --arg region "$REGION" \
    --arg t1 "$T1" --arg t2 "$T2" --arg t3 "$T3" '{
      Comment: "AgentKeys email infra for \($domain)",
      Changes: [
        {Action:"UPSERT", ResourceRecordSet:{Name:"\($t1)._domainkey.\($domain)", Type:"CNAME", TTL:300, ResourceRecords:[{Value:"\($t1).dkim.amazonses.com"}]}},
        {Action:"UPSERT", ResourceRecordSet:{Name:"\($t2)._domainkey.\($domain)", Type:"CNAME", TTL:300, ResourceRecords:[{Value:"\($t2).dkim.amazonses.com"}]}},
        {Action:"UPSERT", ResourceRecordSet:{Name:"\($t3)._domainkey.\($domain)", Type:"CNAME", TTL:300, ResourceRecords:[{Value:"\($t3).dkim.amazonses.com"}]}},
        {Action:"UPSERT", ResourceRecordSet:{Name:$domain, Type:"MX",  TTL:300, ResourceRecords:[{Value:"10 inbound-smtp.\($region).amazonaws.com"}]}},
        {Action:"UPSERT", ResourceRecordSet:{Name:$domain, Type:"TXT", TTL:300, ResourceRecords:[{Value:"\"v=spf1 include:amazonses.com -all\""}]}},
        {Action:"UPSERT", ResourceRecordSet:{Name:"_dmarc.\($domain)", Type:"TXT", TTL:300, ResourceRecords:[{Value:"\"v=DMARC1; p=quarantine; rua=mailto:dmarc@\($domain)\""}]}}
      ]
    }')"

Wait ~5 min for DKIM propagation, then verify:

aws sesv2 get-email-identity --region "$REGION" --email-identity "$MAIL_DOMAIN" \
  --query '{verified: VerifiedForSendingStatus, dkim: DkimAttributes.Status}'
# → {"verified": true, "dkim": "SUCCESS"}
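Instead of a fixed five-minute wait, the verification check can be polled. The sketch below is one way to do that; poll_until and dkim_verified are hypothetical helper names, and the retry count and delay in the usage line are arbitrary choices, not values from the repo. REGION and MAIL_DOMAIN are the same variables used above.

```shell
#!/usr/bin/env bash
# poll_until MAX_TRIES DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_TRIES attempts have been made.
poll_until() {
  local tries="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$tries" ]; then sleep "$delay"; fi
  done
  return 1
}

# Succeeds once SES reports DKIM as verified for $MAIL_DOMAIN.
dkim_verified() {
  [ "$(aws sesv2 get-email-identity --region "$REGION" \
        --email-identity "$MAIL_DOMAIN" \
        --query 'DkimAttributes.Status' --output text)" = "SUCCESS" ]
}

# Usage (needs AWS credentials, so it is left commented out):
#   poll_until 20 30 dkim_verified || echo "DKIM still pending" >&2
```

Keeping the retry loop separate from the check means the same helper can poll any other readiness condition in this doc, such as the broker's /healthz endpoint.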

DKIM key custody: in this interim setup, the email service holds the private DKIM key (AWS-internal on SES, AliCloud-internal on DirectMail, etc.). The trust surface is that the provider could forge mail signed as us, so the blast radius is bounded (reputation, not user-data custody). The migration target is TEE-held BYODKIM, tracked in docs/spec/heima-gaps-vs-desired-architecture.md §4. Do not adopt "BYODKIM with a file-stored key" as an intermediate step; it is strictly worse than provider-managed.

§3.2 Create the S3 bucket for inbound mail

aws s3api create-bucket \
  --region "$REGION" --bucket "$BUCKET" \
  $([ "$REGION" != "us-east-1" ] && echo "--create-bucket-configuration LocationConstraint=$REGION")

aws s3api put-public-access-block --region "$REGION" --bucket "$BUCKET" \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# 30-day TTL on inbound objects (throwaway-inbox model)
aws s3api put-bucket-lifecycle-configuration --region "$REGION" --bucket "$BUCKET" \
  --lifecycle-configuration "$(jq -n '{
    Rules: [{ID:"inbound-30d-ttl", Status:"Enabled", Filter:{Prefix:"inbound/"}, Expiration:{Days:30}}]
  }')"

§3.3 Create the SES receipt rule

aws ses create-receipt-rule-set --rule-set-name agentkeys --region "$REGION" 2>/dev/null || true
aws ses create-receipt-rule --region "$REGION" --rule-set-name agentkeys \
  --rule "$(jq -n --arg domain "$MAIL_DOMAIN" --arg bucket "$BUCKET" '{
    Name: "agentkeys-inbound", Enabled: true, ScanEnabled: true, TlsPolicy: "Optional",
    Recipients: [$domain],
    Actions: [{S3Action: {BucketName: $bucket, ObjectKeyPrefix: "inbound/"}}]
  }')"
aws ses set-active-receipt-rule-set --rule-set-name agentkeys --region "$REGION"

Inbound MIME lands at s3://$BUCKET/inbound/<msg_id>. First object: AMAZON_SES_SETUP_NOTIFICATION (provider's "I successfully wrote to your bucket" marker). Real mail follows.

Sandbox vs production sending: inbound is unaffected by SES sandbox; outbound to arbitrary addresses needs Console → Support → "SES Sending Limits" → "Request Production Access".

§4 IAM users + roles

§4.1 agentkeys-daemon — broker runtime user

aws iam create-user --user-name agentkeys-daemon
aws iam create-access-key --user-name agentkeys-daemon
# → save AccessKeyId + SecretAccessKey to your secret manager. NEVER to git.

aws iam put-user-policy --user-name agentkeys-daemon \
  --policy-name agentkeys-daemon-assume-role \
  --policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{
    Version:"2012-10-17",
    Statement:[{
      Effect:"Allow", Action:"sts:AssumeRole",
      Resource:"arn:aws:iam::\($acct):role/agentkeys-data-role"
    }]
  }')"

The daemon user can do exactly one thing: assume agentkeys-data-role. Any storage / email action goes through the role's permissions, never the user's.

§4.2 agentkeys-data-role (static-IAM-user trust variant)

The role's trust policy starts with the static-IAM-user variant. After the broker is publicly reachable, §9.3 swaps it for the OIDC-federated variant.

aws iam create-role --role-name agentkeys-data-role \
  --assume-role-policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{
    Version:"2012-10-17",
    Statement:[{
      Effect:"Allow",
      Principal:{AWS:"arn:aws:iam::\($acct):user/agentkeys-daemon"},
      Action:"sts:AssumeRole"
    }]
  }')"

aws iam put-role-policy --role-name agentkeys-data-role \
  --policy-name agentkeys-data-role-inline \
  --policy-document "$(jq -n \
    --arg bucket "$BUCKET" --arg region "$REGION" \
    --arg acct "$ACCOUNT_ID" --arg domain "$MAIL_DOMAIN" '{
      Version:"2012-10-17",
      Statement:[
        {Effect:"Allow", Action:"s3:ListBucket", Resource:"arn:aws:s3:::\($bucket)"},
        {Effect:"Allow", Action:"s3:GetObject",  Resource:"arn:aws:s3:::\($bucket)/*"},
        {Effect:"Allow", Action:["ses:SendEmail","ses:GetEmailIdentity"],
         Resource:["arn:aws:ses:\($region):\($acct):identity/\($domain)",
                   "arn:aws:ses:\($region):\($acct):identity/*@\($domain)"]}
      ]
    }')"

export ROLE_ARN=$(aws iam get-role --role-name agentkeys-data-role --query 'Role.Arn' --output text)
echo "ROLE_ARN=$ROLE_ARN"

§4.3 Per-data-class roles (agentkeys-vault-role, agentkeys-memory-role)

Per arch.md §17.2: separate roles for credentials + memory data classes. Same trust shape as §4.2, distinct inline policies + PrincipalTag scoping. Provisioned by per-data-class helpers (idempotent):

bash scripts/provision-vault-bucket.sh        # agentkeys-vault-${ACCOUNT_ID}
bash scripts/provision-vault-role.sh          # agentkeys-vault-role
bash scripts/apply-vault-bucket-policy.sh     # v3 split-statement PrincipalTag policy

bash scripts/provision-memory-bucket.sh
bash scripts/provision-memory-role.sh
bash scripts/apply-memory-bucket-policy.sh

bash scripts/cleanup-mail-bucket-policy.sh    # restore email-only grants on $BUCKET

These scripts are the source of truth for the policy shape — read them, don't transcribe.

§4.4 agentkeys-admin, agentkeys-broker (already provisioned)

If you reached this section, agentkeys-admin exists (you're using it). agentkeys-broker is whatever IAM user you SSH into the broker host with — its perms are out of scope (ec2-instance-connect:SendSSHPublicKey on the host's instance ID is sufficient for AWS Instance Connect).

§5 S3 bucket policy (initial, static-IAM variant)

aws s3api put-bucket-policy --region "$REGION" --bucket "$BUCKET" \
  --policy "$(jq -n --arg bucket "$BUCKET" --arg acct "$ACCOUNT_ID" '{
    Version:"2012-10-17",
    Statement:[
      {
        Sid:"AllowSESWriteInbound", Effect:"Allow",
        Principal:{Service:"ses.amazonaws.com"},
        Action:"s3:PutObject",
        Resource:"arn:aws:s3:::\($bucket)/*",
        Condition:{StringEquals:{"aws:Referer":$acct}}
      },
      {
        Sid:"AllowDaemonRead", Effect:"Allow",
        Principal:{AWS:"arn:aws:iam::\($acct):role/agentkeys-data-role"},
        Action:["s3:GetObject","s3:ListBucket"],
        Resource:["arn:aws:s3:::\($bucket)","arn:aws:s3:::\($bucket)/*"]
      }
    ]
  }')"

The PrincipalTag-scoped federated variant (which replaces this once OIDC federation is up) lives in §9.4.

§6 agentkeys-broker-host instance profile (EC2-only, optional)

If the broker runs on AWS EC2, attach this so the daemon never holds a static key. Runtime creds come from IMDS.

ROLE=agentkeys-broker-host

aws iam create-role --role-name "$ROLE" \
  --assume-role-policy-document "$(jq -n '{
    Version:"2012-10-17",
    Statement:[{Effect:"Allow", Principal:{Service:"ec2.amazonaws.com"}, Action:"sts:AssumeRole"}]
  }')"

aws iam put-role-policy --role-name "$ROLE" --policy-name BrokerAssumeData \
  --policy-document "$(jq -n --arg acct "$ACCOUNT_ID" '{
    Version:"2012-10-17",
    Statement:[{Effect:"Allow", Action:"sts:AssumeRole",
                Resource:"arn:aws:iam::\($acct):role/agentkeys-data-role"}]
  }')"

aws iam create-instance-profile --instance-profile-name "$ROLE"
aws iam add-role-to-instance-profile --instance-profile-name "$ROLE" --role-name "$ROLE"
aws ec2 associate-iam-instance-profile --region "$REGION" \
  --instance-id "$INSTANCE_ID" \
  --iam-instance-profile Name="$ROLE"

Caller-region trap: agentkeys-admin profile defaults to us-west-2; the broker EC2 usually lives in us-east-1. Without --region "$REGION", describe-instances silently returns empty and downstream put-role-policy runs with --role-name "". Pass --region explicitly on every regional call. See AGENTS.md "AWS local-profile ↔ remote-IAM mapping".
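One way to keep an empty lookup from reaching a mutating call is a guard that runs before it. This is a minimal bash-only sketch (it relies on bash indirect expansion, so it will not work unchanged in zsh); the helper name require is hypothetical.

```shell
#!/usr/bin/env bash
# require VAR_NAME...
# Fails if any named variable is unset or empty, naming the first offender.
require() {
  local name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      echo "refusing to continue: \$$name is empty" >&2
      return 1
    fi
  done
}

# Usage (left commented out because it calls AWS):
#   require REGION INSTANCE_ID ROLE && \
#     aws ec2 associate-iam-instance-profile --region "$REGION" \
#       --instance-id "$INSTANCE_ID" --iam-instance-profile Name="$ROLE"
```

With the guard in place, a describe-instances call that returned nothing stops the sequence instead of running the next command with an empty role name.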

§6.1 ses:SendEmail grant on the runtime role

The broker calls SES v2 SendEmail with its own runtime credentials (instance profile), not via the assumed agentkeys-data-role. Without ses:SendEmail on the broker's role, the operator hits:

broker rejected /v1/auth/email/request: status=502 body=
{"error":"backend_unreachable","message":"… ses SendEmail:
 unhandled error (AccessDeniedException)"}

The IAM action is ses:SendEmail (sesv2), NOT ses:SendRawEmail (v1; different code path the broker doesn't use). The grant lives on the broker's runtime role (agentkeys-broker-host on EC2; the user agentkeys-daemon otherwise) — see docs/cloud-bootstrap.md §3.3 for the exact statement.

§7 Security audit — strip legacy over-broad attached policies

Some early deploys ship with AmazonS3FullAccess (or similar wide permissions) attached to the broker's runtime role. The broker at runtime ONLY uses aws-sdk-sts (the GetCallerIdentity startup probe) + aws-sdk-sesv2 (the §6.1 grant) — it never accesses S3 with its own creds. Per-user S3 is via JWT-assumed agentkeys-{data,vault,memory}-role, not the broker's runtime role.

A broker compromise with AmazonS3FullAccess would expose every inbound email in the SES bucket (verification tokens, magic links). Strip it:

# Discover the actual role attached to the broker host (canonical name:
# agentkeys-broker-host; some early deploys landed on different names):
INSTANCE_PROFILE_ARN=$(aws ec2 describe-instances --region "$REGION" \
  --filters "Name=ip-address,Values=$EIP" \
  --query 'Reservations[].Instances[].IamInstanceProfile.Arn' --output text)

ROLE=$(aws iam get-instance-profile \
  --instance-profile-name "${INSTANCE_PROFILE_ARN##*/}" \
  --query 'InstanceProfile.Roles[0].RoleName' --output text)
echo "broker runtime role: $ROLE"

# Audit attached policies:
aws iam list-attached-role-policies --role-name "$ROLE"

# Detach AmazonS3FullAccess if present:
aws iam detach-role-policy --role-name "$ROLE" \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Verify only the narrow inline policy (BrokerSendEmail + AssumeDataRole) remains:
aws iam list-role-policies --role-name "$ROLE"
aws iam list-attached-role-policies --role-name "$ROLE"
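The audit step can be made mechanical with a small filter. The sketch below is an illustration, not a repo script: flag_broad_policies is a hypothetical name, and the patterns it treats as over-broad (any managed policy ending in FullAccess, plus AdministratorAccess and PowerUserAccess) are an assumption about what should never sit on the broker's runtime role.

```shell
#!/usr/bin/env bash
# Reads attached policy names on stdin, one per line.
# Prints each over-broad name and returns 1 if any were found.
flag_broad_policies() {
  local name rc=0
  while IFS= read -r name; do
    case "$name" in
      *FullAccess | AdministratorAccess | PowerUserAccess)
        echo "over-broad: $name"
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Usage (needs AWS credentials, so it is left commented out):
#   aws iam list-attached-role-policies --role-name "$ROLE" \
#     --query 'AttachedPolicies[].PolicyName' --output text \
#     | tr '\t' '\n' | flag_broad_policies
```

A zero exit with no output means nothing matching those patterns is attached; inline policies still need the separate list-role-policies check above.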

§8 Cloud-provider portability

Every layer in §3–§5 has a 1:1 analog on the major providers. The provisioning shape carries; only the API endpoints + JSON dialects differ.

| Layer | AWS (current) | AliCloud (in progress) | GCP | Tencent Cloud |
| --- | --- | --- | --- | --- |
| Privileged user | IAM user with IAMFullAccess | RAM user with AliyunRAMFullAccess | IAM service account with roles/iam.securityAdmin | CAM user with AdministratorAccess |
| Runtime user | IAM user + access key | RAM user + AK/SK | Service account + key file (or Workload Identity) | CAM user + SecretId/SecretKey |
| Data role | IAM role + assume policy | RAM role + assume policy | Service account + IAM bindings | CAM role + assume policy |
| Federation | IAM OIDC provider | RAM IDaaS / OIDC provider | Workload Identity Pool | CAM OIDC provider |
| Object store | S3 + bucket policy | OSS + bucket policy | Cloud Storage + IAM bindings | COS + bucket policy |
| Email backend | SES + S3 receipt rule | DirectMail / SimpleDM + OSS event notification | SendGrid / Mailgun (no GCP-native) | SimpleDM + COS |
| TLS termination | nginx + Let's Encrypt | nginx + Let's Encrypt | nginx + Let's Encrypt | nginx + Let's Encrypt |
| Compute (broker host) | EC2 + EIP | ECS + EIP | Compute Engine + external IP | CVM + EIP |
| DNS | Route 53 | AliCloud DNS | Cloud DNS | DNSPod / Cloud DNS |
| Secrets storage | Secrets Manager / SSM Parameter Store | KMS Secrets Manager | Secret Manager | KMS |

Migration playbook (cloud → cloud):

  1. Re-bind operator-workstation.env to the new provider's identifiers (account ID, region, role ARNs, bucket name).
  2. Re-run this doc top-to-bottom against the new provider.
  3. Re-run §9 (OIDC federation activation) — substitute the provider's OIDC API.
  4. Re-run scripts/setup-broker-host.sh on the new host (the script doesn't care which cloud — it consumes already-provisioned identifiers).
  5. Re-run scripts/setup-heima.sh — the chain side is cloud-agnostic.
  6. Re-run the harness scripts to validate end-to-end.

The boundary is sharp: the broker process itself contains zero cloud-specific code — it talks STS-compatible OIDC + S3-compatible PutObject/GetObject + SMTP-compatible SendEmail. Every cloud above offers all three primitives. The provisioner-scripts/email-backends/ directory documents the email-backend trait; a new backend slots in as tencent-simpledm-cos (or similar) with the same upstream API as ses-s3.

§9 OIDC federation activation (after broker is publicly reachable)

The broker mints OIDC JWTs that AWS STS validates via the broker's public JWKS endpoint. Three one-shot steps per broker (each broker URL needs its own IAM OIDC provider, per §0.2), run AFTER setup-broker-host.sh finishes and the broker is reachable at https://${BROKER_HOST} over public TLS.

§9.1 Prereqs

  • https://${BROKER_HOST}/.well-known/openid-configuration returns 200 with the expected issuer + jwks_uri.
  • https://${BROKER_HOST}/.well-known/jwks.json returns at least one ES256 key.
  • curl -sf "https://${BROKER_HOST}/healthz" returns 200.

§9.2 Register the OIDC provider

# DoH-resolved EIP (immune to local DNS interception; see §5b verify steps):
broker_ip=$(curl -sS "https://dns.google/resolve?name=${BROKER_HOST}&type=A" | jq -r '.Answer[0].data')

# -sha1 is REQUIRED. macOS LibreSSL 3.3 + OpenSSL 3.x default to SHA256
# (64 hex chars) but AWS IAM CreateOpenIDConnectProvider rejects anything
# that isn't exactly 40 hex chars (SHA1).
thumb=$(echo | openssl s_client -servername "$BROKER_HOST" \
                                 -connect "${broker_ip}:443" 2>/dev/null \
          | openssl x509 -fingerprint -sha1 -noout \
          | awk -F'=' '{print $2}' | tr -d ':' | tr 'A-Z' 'a-z')
[ ${#thumb} -eq 40 ] || { echo "thumb length ${#thumb} != 40 — check -sha1 flag" >&2; return 1; }

aws iam create-open-id-connect-provider \
  --url "https://${BROKER_HOST}" \
  --client-id-list "sts.amazonaws.com" \
  --thumbprint-list "$thumb"

AWS validates the issuer URL byte-for-byte against the JWT iss claim. Once registered, the URL is effectively immutable — switching means a new provider ARN + new trust policy + new federated grants.
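The thumbprint handling in the registration snippet can be isolated into a function, which also gives the length check a well-defined return. This is a minimal sketch under that assumption; normalize_thumbprint is a hypothetical name, and its input is the single line printed by openssl x509 -fingerprint -sha1 -noout.

```shell
#!/usr/bin/env bash
# normalize_thumbprint "SHA1 Fingerprint=AA:BB:..."
# Prints the 40-char lowercase hex form, or fails if the input is not SHA-1.
normalize_thumbprint() {
  local line="$1" hex
  hex="${line#*=}"   # drop everything up to and including the first "="
  hex="$(printf '%s' "$hex" | tr -d ':' | tr 'A-F' 'a-f')"
  if [[ ! "$hex" =~ ^[0-9a-f]{40}$ ]]; then
    echo "not a SHA-1 fingerprint: got ${#hex} chars" >&2
    return 1
  fi
  printf '%s\n' "$hex"
}

# Usage (needs network access, so it is left commented out):
#   fp="$(echo | openssl s_client -servername "$BROKER_HOST" \
#           -connect "${broker_ip}:443" 2>/dev/null \
#         | openssl x509 -fingerprint -sha1 -noout)"
#   thumb="$(normalize_thumbprint "$fp")" || echo "check the -sha1 flag" >&2
```

A SHA-256 fingerprint (64 hex characters) is rejected before it can reach create-open-id-connect-provider.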

§9.3 Trust policy (federated variant)

Apply to each of the three data roles. Use $ROLE{agentkeys-data-role, agentkeys-vault-role, agentkeys-memory-role} (or the -test variants when bootstrapping the CI test instance).

aws iam update-assume-role-policy --role-name "$ROLE" --policy-document "$(jq -n \
  --arg acct "$ACCOUNT_ID" --arg host "$BROKER_HOST" '{
    Version:"2012-10-17",
    Statement:[{
      Effect:"Allow",
      Principal:{Federated:"arn:aws:iam::\($acct):oidc-provider/\($host)"},
      Action:"sts:AssumeRoleWithWebIdentity",
      Condition:{StringEquals:{"\($host):aud":"sts.amazonaws.com"}}
    }]
  }')"
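The trust policy only matches tokens whose iss equals the registered provider URL and whose aud is sts.amazonaws.com. Given an OIDC JWT minted by the broker (how you obtain one is out of scope here), the two claims can be inspected locally before blaming IAM. The sketch below makes several assumptions: the helper names are hypothetical, base64 -d is available (older macOS releases spell it -D), the payload is compact JSON, and aud is a single string rather than an array. It does not verify the signature.

```shell
#!/usr/bin/env bash
# Decode the payload (second dot-separated segment) of a JWT.
jwt_payload() {
  local payload
  # base64url -> base64: "_" becomes "/", "-" becomes "+".
  payload="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # base64url drops padding; restore it to a multiple of 4 characters.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d
}

# jwt_claims_match TOKEN BROKER_HOST
# Succeeds when iss is exactly https://BROKER_HOST and aud is sts.amazonaws.com.
jwt_claims_match() {
  local token="$1" host="$2" claims
  claims="$(jwt_payload "$token")" || return 1
  # -F: fixed-string match, so dots in the hostname are literal and a
  # trailing slash in iss is a mismatch (the closing quote is in the pattern).
  printf '%s' "$claims" | grep -qF "\"iss\":\"https://${host}\"" &&
    printf '%s' "$claims" | grep -qF '"aud":"sts.amazonaws.com"'
}
```

If this check fails for a token the broker just minted, the issuer configured on the broker and the URL registered in §9.2 have drifted apart.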

§9.4 PrincipalTag-scoped bucket policy

Per AGENTS.md "Per-actor + per-data-class isolation invariants": every S3 read/write is scoped to bots/${aws:PrincipalTag/agentkeys_actor_omni}/{credentials,memory}/*. The split-statement v3 bucket policy is applied by scripts/apply-{vault,memory}-bucket-policy.sh — those scripts are the source of truth for the policy shape.

After §9.3 + §9.4, strip the broad-bucket inline grant from the role's policy. The bucket-side policy now does the enforcing, and within one account an unscoped role-side grant would allow access on its own, bypassing the per-actor prefix scoping:

aws iam delete-role-policy --role-name "$ROLE" --policy-name "${ROLE}-inline"

§9.5 End-to-end proof

Run harness/v2-stage3-demo.sh (or bash harness/run.sh --stage 3). It mints session JWT → OIDC JWT → STS creds, then proves both POSITIVE (own prefix) and NEGATIVE (cross-actor prefix → AccessDenied) writes for both data classes, plus the cross-role isolation matrix. Together these checks walk the full §17.2 isolation table from AGENTS.md.

§10 Broker host bring-up: setup-broker-host.sh

§§3–8 set up identifiers. This step stands up the actual processes — broker + mock-server + signer + 5 service workers (audit/email/cred/memory/config) — on the EC2 host (or any Linux box with public-internet egress + the broker's hostname).

§10.1 Prereqs

  • Fresh Linux host with sudo, systemd, public-internet egress, ports 80 + 443 open inbound (for certbot + nginx).
  • DNS A records for ${BROKER_HOST} + signer.${ZONE} + audit.${ZONE} + email.${ZONE} + cred.${ZONE} + memory.${ZONE} + config.${ZONE} all pointing at the host's public IP (provisioned by setup-cloud.sh step 6 — broker/signer/mcp inline, the 5 service workers via dns-upsert-workers.sh).
  • AWS credentials in /etc/agentkeys/broker.env (the script writes the template; operator pastes the agentkeys-daemon access key from §4.1).
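The DNS prerequisite can be checked from any machine before running the script. This sketch assumes `getent` is available and that `$BROKER_HOST` and `$ZONE` are set as in the earlier sections. `required_hosts` and `check_dns` are hypothetical helpers, and the expected IP is supplied by the operator.

```shell
# Hostnames named in the prerequisite above: broker, signer, 5 workers.
required_hosts() {
  local w
  printf '%s\n' "$BROKER_HOST" "signer.${ZONE}"
  for w in audit email cred memory config; do
    printf '%s\n' "${w}.${ZONE}"
  done
}

# $1 = the host's public IPv4. Reports every name that does not resolve to it.
check_dns() {
  local want="$1" h got rc=0
  for h in $(required_hosts); do
    # ahostsv4 restricts the lookup to A records; first column is the address.
    got=$(getent ahostsv4 "$h" | awk 'NR==1 {print $1}')
    if [ "$got" != "$want" ]; then
      echo "DNS mismatch: $h -> ${got:-<unresolved>} (want $want)" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Running `check_dns <public-ip>` first avoids a certbot pass that would skip unresolved hosts and need a re-run.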

§10.2 Run

# Bootstrap a fresh host:
sudo bash scripts/setup-broker-host.sh \
  --issuer-url "https://${BROKER_HOST}" \
  --account-id "${ACCOUNT_ID}" \
  --signer-host "signer.${ZONE}" \
  --audit-host  "audit.${ZONE}" \
  --email-host  "email.${ZONE}" \
  --cred-host   "cred.${ZONE}" \
  --memory-host "memory.${ZONE}" \
  --config-host "config.${ZONE}" \
  --yes

# After a `git pull`, the same command re-deploys:
sudo bash scripts/setup-broker-host.sh --yes

The script:

  • Builds agentkeys-broker-server (+ auth-email-link feature), agentkeys-mock-server, the 5 service workers (audit/email/cred/memory/config), and the signer. Compilations are cached with sccache (a content-addressed compiler cache, auto-installed best-effort) so re-deploys + --ref branch switches reuse the cache instead of recompiling — even when git checkout churns mtimes or target/ is cold. The build prints sccache stats afterward (re-deploys should be mostly cache hits). Opt out with AGENTKEYS_NO_SCCACHE=1; pin a version with SCCACHE_VERSION=vX.Y.Z.
  • Creates the agentkeys system user + state dir /var/lib/agentkeys/.
  • Writes the dev_key_service master secret (one-shot at first boot, never rotated — rotation invalidates every previously-derived wallet).
  • Writes per-worker env files at /etc/agentkeys/worker-{audit,email,creds,memory,config}.env.
  • Writes systemd units for broker + signer + each worker, enables + starts.
  • Configures nginx vhosts for ${BROKER_HOST} + signer.${ZONE} + 6 worker hosts (audit/email/cred/memory/config/classify) (skip via --without-nginx). Vhost is rendered in two phases: Phase A (HTTP-only on :80, with the ACME challenge path under /.well-known/acme-challenge/ and a 503 placeholder on /) when no cert is on disk; Phase B (HTTPS on :443, broker proxy on /) when /etc/letsencrypt/live/<host>/fullchain.pem exists.
  • Installs certbot AND auto-issues certs (step 7b) for every co-located vhost whose DNS resolves to this host, flipping nginx Phase A → B in the same run; an unresolved host is skipped (fix DNS, re-run). Optional --certbot-email <addr> (default: no email). Details + manual fallback: quick-start §5b.
  • Mints broker keypairs (oidc + session) under /var/lib/agentkeys/keys/.
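The two-phase vhost rendering described in the nginx bullet comes down to a single file test. This sketch of that decision assumes only the certificate path quoted there. `vhost_phase` is a hypothetical name, and the optional second argument exists only so the function can be exercised outside `/etc`.

```shell
# Prints "B" when a cert for the host is on disk, otherwise "A".
# $1 = hostname, $2 = letsencrypt live dir (defaults to the real location).
vhost_phase() {
  local host="$1" live_dir="${2:-/etc/letsencrypt/live}"
  if [ -e "${live_dir}/${host}/fullchain.pem" ]; then
    echo B   # HTTPS on :443, broker proxy on /
  else
    echo A   # HTTP-only on :80, ACME challenge path + 503 placeholder on /
  fi
}
```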

The script auto-detects bootstrap vs upgrade by reading the existing systemd unit's Environment= lines. Pass --ref <branch> to opt into an in-script git fetch + pull.

§10.3 Verify

curl -sf "https://${BROKER_HOST}/healthz"                  # → 200
curl -sf "https://${BROKER_HOST}/.well-known/openid-configuration" | jq .
curl -sf "https://${BROKER_HOST}/.well-known/jwks.json"    | jq '.keys | length'
curl -sf "https://audit.${ZONE}/healthz"                   # → 200 (and friends)
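The "and friends" in the last check can be spelled out as a loop. This sketch assumes every worker exposes `/healthz` the way `audit` does (the commands above only show the audit check); `check_workers` is a hypothetical helper.

```shell
# Probe /healthz on each worker host; report all failures, not just the first.
check_workers() {
  local svc rc=0
  for svc in audit email cred memory config; do
    if curl -sf -o /dev/null --max-time 10 "https://${svc}.${ZONE}/healthz"; then
      echo "ok   ${svc}.${ZONE}"
    else
      echo "FAIL ${svc}.${ZONE}" >&2
      rc=1
    fi
  done
  return "$rc"
}
```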

For full E2E (broker + workers + chain + AWS), run bash harness/run.sh — see docs/chain-setup.md for the chain side and docs/ci-setup.md for the automated path.

§11 Cleanup (full account teardown)

Tear down the whole AgentKeys footprint in one account. Use only when retiring the deployment.

# Drain the buckets
for b in "$BUCKET" "agentkeys-vault-${ACCOUNT_ID}" "agentkeys-memory-${ACCOUNT_ID}"; do
  aws s3 rm "s3://$b" --recursive 2>/dev/null || true
  aws s3api delete-bucket --bucket "$b" --region "$REGION" 2>/dev/null || true
done

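# Roles (delete-role fails while inline policies, managed policies, or instance-profile memberships remain)
for r in agentkeys-data-role agentkeys-vault-role agentkeys-memory-role agentkeys-broker-host; do
  for p in $(aws iam list-role-policies --role-name "$r" --query 'PolicyNames[]' --output text 2>/dev/null); do
    aws iam delete-role-policy --role-name "$r" --policy-name "$p"
  done
  for a in $(aws iam list-attached-role-policies --role-name "$r" --query 'AttachedPolicies[].PolicyArn' --output text 2>/dev/null); do
    aws iam detach-role-policy --role-name "$r" --policy-arn "$a"
  done
  for ip in $(aws iam list-instance-profiles-for-role --role-name "$r" --query 'InstanceProfiles[].InstanceProfileName' --output text 2>/dev/null); do
    aws iam remove-role-from-instance-profile --instance-profile-name "$ip" --role-name "$r"
  done
  aws iam delete-role --role-name "$r" 2>/dev/null || true
done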

# OIDC provider
aws iam delete-open-id-connect-provider \
  --open-id-connect-provider-arn "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${BROKER_HOST}"

# Daemon user
for k in $(aws iam list-access-keys --user-name agentkeys-daemon --query 'AccessKeyMetadata[].AccessKeyId' --output text); do
  aws iam delete-access-key --user-name agentkeys-daemon --access-key-id "$k"
done
aws iam delete-user-policy --user-name agentkeys-daemon --policy-name agentkeys-daemon-assume-role 2>/dev/null || true
aws iam delete-user --user-name agentkeys-daemon

# SES + DNS
# Omitting --rule-set-name deactivates whichever receipt rule set is currently active.
aws ses set-active-receipt-rule-set --region "$REGION" 2>/dev/null || true
aws sesv2 delete-email-identity --email-identity "$MAIL_DOMAIN" --region "$REGION" 2>/dev/null || true
# DNS records: operator-managed (Route 53 / your DNS provider) — delete by hand.

# EC2 + EIP: manual via console or aws ec2 CLI

For the test instance, substitute -test on every identifier above; for a test-fleet slot N ≥ 2 (§0.3), substitute -test-N (and delete that slot's OIDC provider …oidc-provider/broker-test-N.${ZONE}, its agentkeys-test-broker-ssm-N instance profile/role, and its agentkeys-broker-test-N SSH user); for the Base prod stack (§0.4), substitute -base (its OIDC provider is …oidc-provider/broker-base.${ZONE}, its instance profile agentkeys-broker-host-base). Tearing down one stack must not touch any other stack's resources — every identifier is stack-suffixed precisely so the blast radius stays per-stack.
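Deriving the suffix in one place helps keep a teardown from straying into another stack. This is a sketch under stated assumptions: `stack_suffix` and the stack names `prod`, `test`, `test-N`, and `base` passed to it are hypothetical, and where exactly the suffix lands inside each identifier is defined by the repo's scripts, not by this helper.

```shell
# $1 = stack name: prod | test | base | test-N (N >= 2)
# Prints the identifier suffix for that stack; fails on anything else.
stack_suffix() {
  local n
  case "$1" in
    prod) printf '' ;;
    test) printf '%s' "-test" ;;
    base) printf '%s' "-base" ;;
    test-*)
      n="${1#test-}"
      case "$n" in ''|*[!0-9]*) return 1 ;; esac   # slot must be a plain integer
      [ "$n" -ge 2 ] || return 1                   # fleet slots start at 2
      printf '%s' "-test-${n}" ;;
    *) return 1 ;;
  esac
}
```

A teardown wrapper could then start with `sfx=$(stack_suffix "$STACK") || exit 1` (with `$STACK` an assumed variable) and refuse to run on an unrecognised stack name.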

Related