Fleet status dashboard — which server/CI-slot is running (read-only; grow into DevOps dashboard) #279

@hanwencheng

Context

The test-broker fleet is growing (#265): prod + N CI test slots, each a full broker stack (broker, signer, 6 workers, bundler, nginx) on its own EC2 + EIP, plus the local daemon and the chain layer. Today an operator answers "which server is running, and which CI run is on which slot?" by hand — aws ec2 describe-addresses by tag, curl …/healthz per host, ssh-broker.sh test-N 'systemctl status …', GitHub Actions tab for run→slot. There's no single view.
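As a rough illustration of the first manual step above (finding a slot's EIP by tag), the lookup could be wrapped in a small read-only helper. This is a sketch under assumptions, not the project's code: `find_eips_by_tag` is a hypothetical name, the tag key is assumed to be `Name` (the issue only gives the tag value pattern), and `ec2_client` is assumed to behave like `boto3.client("ec2")`.

```python
from typing import Any, Dict, List


def find_eips_by_tag(
    ec2_client: Any, tag_value: str, tag_key: str = "Name"
) -> List[Dict[str, str]]:
    """Return the Elastic IPs whose tag `tag_key` equals `tag_value`.

    `ec2_client` is expected to behave like boto3.client("ec2"); only the
    read-only describe_addresses call is used. The default tag key "Name"
    is an assumption, not something the issue states.
    """
    response = ec2_client.describe_addresses(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )
    return [
        {
            "public_ip": addr.get("PublicIp", ""),
            # InstanceId is absent when the EIP is not attached to an instance.
            "instance_id": addr.get("InstanceId", ""),
        }
        for addr in response.get("Addresses", [])
    ]
```

Passing the client in, rather than creating it inside the helper, keeps the credential question (see the design questions) outside the lookup logic and makes the function easy to exercise with a stub.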

Ask

A read-only fleet status dashboard — start narrow (which broker/slot is up, healthy, and what CI run currently holds it), designed to grow into a general DevOps dashboard later. Do not implement yet — this issue is to scope + design.

v1 scope (status board)

Per environment (prod, test slot 1..N):

  • Machine: EC2 instance id + state, EIP (by tag agentkeys-broker-eip[-test[-N]]), instance type, uptime.
  • Stack health: broker + signer + 6 workers /healthz (green/red/degraded), TLS cert expiry per host, nginx :443 up.
  • CI occupancy: which GitHub Actions run (if any) currently holds the slot's concurrency group (heima-test-slot-N once #265 phase 4 lands; heima-test-deployer-nonce today), and the queue depth behind it.
  • Chain identity: the slot's deployer address + HEI balance (feeds the existing check-wallet-balances.sh), master account registration state.
  • Drift flags: worker A record ≠ broker A record (the #201 co-location trap), broker BROKER_OIDC_ISSUER ≠ the slot it's tagged as, stale binary vs origin/main sha.
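To make the per-environment row concrete, here is one possible shape for it, with the health roll-up and the three drift checks as pure functions. Everything here is illustrative: `SlotStatus`, `rollup_health`, and `drift_flags` are hypothetical names, and the green/degraded/red rule (all up, some up, none up) is an assumption, since the scope does not define where "degraded" ends and "red" begins.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def rollup_health(component_ok: Dict[str, bool]) -> str:
    """Collapse per-component /healthz results into one colour.

    Assumed rule: all up -> green, none up (or no data) -> red,
    anything in between -> degraded.
    """
    if not component_ok or not any(component_ok.values()):
        return "red"
    if all(component_ok.values()):
        return "green"
    return "degraded"


def drift_flags(
    broker_a: str,
    worker_a_records: Dict[str, str],
    oidc_issuer: str,
    expected_issuer: str,
    deployed_sha: str,
    main_sha: str,
) -> List[str]:
    """Return human-readable flags for the three drift checks in the scope."""
    flags: List[str] = []
    mismatched = sorted(
        name for name, ip in worker_a_records.items() if ip != broker_a
    )
    if mismatched:
        flags.append("worker A record != broker A record: " + ", ".join(mismatched))
    if oidc_issuer != expected_issuer:
        flags.append("BROKER_OIDC_ISSUER does not match slot tag")
    if deployed_sha != main_sha:
        flags.append("stale binary vs origin/main")
    return flags


@dataclass
class SlotStatus:
    env: str                          # e.g. "prod" or "test-1"
    instance_id: str
    instance_state: str
    component_ok: Dict[str, bool]     # component name -> /healthz passed
    ci_run_id: Optional[int] = None   # Actions run holding the slot, if any
    drift: List[str] = field(default_factory=list)

    @property
    def health(self) -> str:
        return rollup_health(self.component_ok)
```

Keeping the checks free of I/O means the same functions can back either a static page or a live service, whichever the design settles on.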

Data sources (all already exist, read-only)

  • AWS: ec2 describe-addresses/describe-instances (by env-aware tag), Route 53 record sets, ACM/letsencrypt cert via TLS probe.
  • Per-host: the /healthz endpoints + scripts/wait-stack-healthy.sh logic; ssh-broker.sh test-N for systemd/journald.
  • GitHub: Actions API for run→concurrency-group→slot mapping.
  • Chain: cast balance + SidecarRegistry.operatorMasterWallet(omni); scripts/check-wallet-balances.sh.
  • SoT for the fleet inventory: scripts/broker.test*.env + the slot env files (instance ids, EIPs, hostnames).
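Two of the sources above need no new dependencies and can be sketched with the Python standard library: reading a slot env file as the inventory source of truth, and probing a host's TLS certificate for its expiry. The helper names are hypothetical, and the sketch assumes the env files are plain `KEY=VALUE` lines (optionally prefixed with `export`); the actual key names in scripts/broker.test*.env are not given in this issue, so none are hard-coded.

```python
import socket
import ssl
import time
from typing import Dict, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Strip surrounding quotes only; no shell expansion is attempted.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def probe_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A caller would combine the last two as `days_until_expiry(probe_cert_not_after(host))`. Because the default context verifies the certificate, an already expired or otherwise invalid certificate surfaces as an `ssl.SSLError` rather than a negative number, which the dashboard would need to treat as its own red state.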

Future (the "DevOps dashboard" expansion — out of v1 scope, listed so v1 doesn't paint into a corner)

Deploy history + rollback, per-worker log tail, cost/idle-stop controls (#265 phase 6), alerting on health flaps, prod broker too (not just test fleet), the cloud (AWS/IAM/DNS) and chain-contract layers.

Design questions to settle first

  • Surface: static page regenerated by a scheduled job (cheap, no new always-on service) vs a small live service. Lean static-first to avoid adding an always-on component beyond the broker.
  • Auth: it reads AWS + GH + chain — where does it run and whose creds (operator laptop one-shot? a locked-down read-only IAM principal?).
  • Where it lives: a new viz/ page (there's already a viz/ dashboard pattern in-repo) vs a separate tool.
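For the static-first option, the scheduled job reduces to "collect rows, render one HTML file". A minimal sketch of the render step follows, assuming a flat list of per-environment dicts; the column names and `render_status_page` are hypothetical, and where the output file lives is exactly the open question above.

```python
import html
from typing import Dict, List

# Hypothetical column set for the v1 status board.
COLUMNS = ["env", "instance_state", "health", "ci_run", "drift"]


def render_status_page(rows: List[Dict[str, str]], generated_at: str) -> str:
    """Render per-environment status rows into a self-contained HTML page."""
    header = "".join(f"<th>{html.escape(col)}</th>" for col in COLUMNS)
    body_rows = []
    for row in rows:
        # Escape every value: hostnames, run titles, etc. come from outside.
        cells = "".join(
            f"<td>{html.escape(str(row.get(col, '')))}</td>" for col in COLUMNS
        )
        body_rows.append(f"<tr>{cells}</tr>")
    return (
        "<!doctype html>\n"
        "<html><head><meta charset=\"utf-8\"><title>Fleet status</title></head>\n"
        "<body>\n"
        f"<p>Generated at {html.escape(generated_at)}</p>\n"
        f"<table><thead><tr>{header}</tr></thead>\n"
        f"<tbody>{''.join(body_rows)}</tbody></table>\n"
        "</body></html>\n"
    )
```

Stamping the page with its generation time matters for a static surface: a reader needs to tell a healthy fleet from a stale page whose job stopped running.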

References

Scope/design only — do not implement until the design questions above are answered.

Labels: area/infra (Deployment, broker host, scripts/setup-*.sh, AWS / chain provisioning)
