agentkeys: stage 7+ — issue #74 step 1 (dev_key_service signer + bootstrap chain) #75

Open
hanwencheng wants to merge 63 commits into main from claude/practical-noether-670bd8

@hanwencheng
Member

Summary

Lands plan steps 0–9 of docs/spec/plans/issue-74-dev-key-service-plan.md:

  • 0. docs/spec/signer-protocol.md: v0 wire contract for the signer edge (request/response shapes, error envelope, versioned HKDF derivation byte, future TEE attestation handshake).
  • 1. agentkeys-mock-server::dev_key_service: HKDF-SHA256 + secp256k1 + EIP-191 signer, gated by DEV_KEY_SERVICE_MASTER_SECRET; 10 unit tests.
  • 2–3. /dev/derive-address + /dev/sign-message handlers wired into router; 503 signer_disabled when env unset; 8 integration tests.
  • 4. scripts/setup-broker-host.sh auto-generates the master secret into /etc/agentkeys/dev-key-service.env (mode 0600), wires it via EnvironmentFile= in the backend systemd unit. Idempotent (no --upgrade flag): preserves the secret on re-run, since rotating it would invalidate every previously-derived wallet.
  • 5. agentkeys-daemon main.rs adds --init-email / --init-oauth2-google / --signer-url, drives the email/OAuth2 → omni → derive → /v1/wallet/link → SIWE → EVM-session chain on first start; emits a structured tracing::info!(target = "agentkeys.daemon.init", …) audit row on success.
  • 6. agentkeys-cli cmd_init rewritten as InitMode::{Email, Oauth2Google, ImportLegacyMock(test-only)}. --mock-token flag hard-cut from the user-facing CLI surface per CEO-review §8 ("no deprecation runway, clean slate this PR"). All 9 cli_tests.rs call sites migrated to InitMode::ImportLegacyMock.
  • 7. agentkeys whoami (read-only; surfaces signer-derived wallet via --signer-url --omni-account).
  • 8. TEE-stub conformance test: same wire contract, in-memory keypair fixture vs. HKDF backend; 3 tests prove the swap-point invariant.
  • 9. docs/stage7-demo-and-verification.md rewritten end-to-end for the new flow: drops cast wallet sign, drops operator-held private keys, adds the agentkeys init --email recommended path.
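
The env gate in steps 1 to 3 (signer enabled only when DEV_KEY_SERVICE_MASTER_SECRET is set, otherwise 503 signer_disabled) can be sketched roughly as follows. This is a std-only illustration, not the code in agentkeys-mock-server: `SignerState`, `ErrorResponse`, and `require_signer` are hypothetical names, and treating an empty variable the same as an unset one is an assumption.

```rust
/// Hypothetical signer state: enabled only when a master secret is configured.
#[derive(Debug, PartialEq)]
pub enum SignerState {
    Enabled { master_secret: Vec<u8> },
    Disabled,
}

impl SignerState {
    /// Build the state from the raw value of DEV_KEY_SERVICE_MASTER_SECRET.
    /// Assumption: an empty value is treated like an unset variable.
    pub fn from_env_value(value: Option<&str>) -> Self {
        match value {
            Some(v) if !v.is_empty() => SignerState::Enabled {
                master_secret: v.as_bytes().to_vec(),
            },
            _ => SignerState::Disabled,
        }
    }
}

/// Minimal response model: HTTP status plus an error code string.
#[derive(Debug, PartialEq)]
pub struct ErrorResponse {
    pub status: u16,
    pub code: &'static str,
}

/// Gate shared by both /dev/* handlers: hand out the secret when enabled,
/// refuse with 503 signer_disabled otherwise.
pub fn require_signer(state: &SignerState) -> Result<&[u8], ErrorResponse> {
    match state {
        SignerState::Enabled { master_secret } => Ok(master_secret.as_slice()),
        SignerState::Disabled => Err(ErrorResponse {
            status: 503,
            code: "signer_disabled",
        }),
    }
}
```

A caller would typically build the state once at boot, for example with `SignerState::from_env_value(std::env::var("DEV_KEY_SERVICE_MASTER_SECRET").ok().as_deref())`, and pass it to each handler.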

Shared plumbing in agentkeys-core:

  • signer_client: typed SignerClient trait + HttpSignerClient — the daemon's swap-point abstraction.
  • init_flow: broker email/OAuth2 → derive → link → SIWE chain helpers, used by both CLI and daemon.
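
To illustrate the swap-point idea behind signer_client and the step 8 conformance test, here is a minimal sketch of one trait with two interchangeable backends and a backend-agnostic check. It is an assumption-laden toy: the real `SignerClient` signature is not shown in this PR description, `ToyDerivingBackend` uses an FNV-1a hash as a stand-in (it is NOT HKDF or secp256k1), and `FixtureBackend` and `check_conformance` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified shape of the swap-point trait.
pub trait SignerClient {
    fn derive_address(&self, omni_account: &str) -> Result<String, String>;
}

/// 64-bit FNV-1a, used only to keep the sketch dependency-free.
fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Stand-in for the HKDF backend: a deterministic function of (secret, omni).
pub struct ToyDerivingBackend {
    pub master_secret: Vec<u8>,
}

impl SignerClient for ToyDerivingBackend {
    fn derive_address(&self, omni_account: &str) -> Result<String, String> {
        let mut input = self.master_secret.clone();
        input.extend_from_slice(omni_account.as_bytes());
        Ok(format!("0x{:040x}", fnv1a(&input)))
    }
}

/// Stand-in for the TEE stub: a fixed in-memory fixture.
pub struct FixtureBackend {
    pub addresses: HashMap<String, String>,
}

impl SignerClient for FixtureBackend {
    fn derive_address(&self, omni_account: &str) -> Result<String, String> {
        self.addresses
            .get(omni_account)
            .cloned()
            .ok_or_else(|| "unknown_omni".to_string())
    }
}

/// Backend-agnostic check: derivation is deterministic, and two distinct
/// omnis map to two distinct addresses.
pub fn check_conformance(
    client: &dyn SignerClient,
    omni_a: &str,
    omni_b: &str,
) -> Result<(), String> {
    let first = client.derive_address(omni_a)?;
    let again = client.derive_address(omni_a)?;
    let other = client.derive_address(omni_b)?;
    if first != again {
        return Err("derivation is not deterministic".to_string());
    }
    if first == other {
        return Err("distinct omnis produced the same address".to_string());
    }
    Ok(())
}
```

Because `check_conformance` only sees `&dyn SignerClient`, the same check runs unchanged against either backend, which is the property the swap point is meant to preserve.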

CLAUDE.md adds a plan-completion policy (always complete every numbered plan step; mandatory done/not-done summary at PR end).

Pre-Stage-7 docs moved to docs/archived/ (operator-runbook → operator-runbook-pre-stage7.md; contradictions → contradictions-stage4-2026-04.md; field-name-translation); inbound references in dev-setup.md, stage7-wip.md, threat-model-key-custody.md, and the CLI's runbook URL repointed to operator-runbook-stage7.md.

Verification

  • 386 tests pass workspace-wide (cargo test --workspace), 0 failing.
  • Clippy clean on every file touched in this PR (workspace has 16 pre-existing warnings, none introduced here).
  • End-to-end smoke test against a live agentkeys-mock-server (port 18091, DEV_KEY_SERVICE_MASTER_SECRET set): two distinct omnis derive two distinct addresses; agentkeys signer sign returns the canonical 65-byte signature whose recovered address matches the derived address; the legacy 503 path returns the typed SIGNER_DISABLED error.
  • bash -n scripts/setup-broker-host.sh syntax-check clean.
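
The smoke test above mentions a canonical 65-byte signature. Assuming the usual Ethereum layout (32-byte r, then 32-byte s, then a 1-byte v), splitting it might look like the sketch below. `Signature` and `split_signature` are hypothetical names, and the sketch deliberately does not validate the value of v, since the PR description does not say which convention the signer uses.

```rust
/// Hypothetical container for the three parts of a 65-byte signature.
#[derive(Debug, PartialEq)]
pub struct Signature {
    pub r: [u8; 32],
    pub s: [u8; 32],
    pub v: u8,
}

/// Split raw signature bytes into r, s, v; reject any other length.
pub fn split_signature(bytes: &[u8]) -> Result<Signature, String> {
    if bytes.len() != 65 {
        return Err(format!("expected 65 bytes, got {}", bytes.len()));
    }
    let mut r = [0u8; 32];
    let mut s = [0u8; 32];
    r.copy_from_slice(&bytes[..32]);
    s.copy_from_slice(&bytes[32..64]);
    Ok(Signature { r, s, v: bytes[64] })
}
```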

Reviewer notes

  • /v1/auth/exchange is now zero-caller in-tree. Mint endpoints already verify session JWTs locally via verify_session_jwt, and this PR removes the last live caller path (agentkeys init --mock-token). The route + validate_bearer_token + BROKER_BACKEND_URL env vars stay only for backward-compat with any out-of-tree clients; a separate cleanup PR will delete them once external migration completes.
  • /etc/agentkeys/dev-key-service.env is intentionally pinned across re-runs of setup-broker-host.sh. Issue #74 ("Replace dev_key_service with TEE worker for omni-anchored EVM keypair derivation") step 2 (TEE worker) defines the formal rotation runbook; today's dev_key_service has no rotation knob beyond key_version (currently 0x01).
  • Working-copy commit was made via git, not jj (CLAUDE.md says jj). The session's jj working-copy pointer (@) was on a stale change line (/healthz fix) while git's HEAD was on the actual branch tip; rebasing @ would have reset the working copy and lost the work. jj git import re-syncs after merge.
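
The "pinned across re-runs" behaviour noted above is a generate-if-absent pattern. The real logic lives in the bash script; the following Rust sketch only models the idea, under these assumptions: `ensure_secret` is a hypothetical helper, the file holds just the secret (the real env file presumably holds a KEY=value line), and the code is Unix-only because it sets mode 0600 at creation time.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Return the secret stored at `path`. Only when the file does not exist yet
/// is `generate` called and the result written with mode 0600.
pub fn ensure_secret(
    path: &Path,
    generate: impl FnOnce() -> String,
) -> io::Result<String> {
    match fs::read_to_string(path) {
        // Re-run: keep whatever was generated the first time.
        Ok(existing) => Ok(existing.trim_end().to_string()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            let secret = generate();
            let mut file = OpenOptions::new()
                .write(true)
                .create_new(true) // fail rather than overwrite if another run raced us
                .mode(0o600)
                .open(path)?;
            writeln!(file, "{}", secret)?;
            Ok(secret)
        }
        Err(e) => Err(e),
    }
}
```

Calling it twice with different generators returns the first secret both times, which mirrors why a re-run of the setup script does not invalidate previously derived wallets.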

What did NOT land — and why

  • Plan step 10 (live broker-host redeploy + smoke walkthrough): operator step; the code that makes it work shipped here. To run it: ssh agentkey@$BROKER_HOST → bash scripts/setup-broker-host.sh --yes → walk §16 of docs/stage7-demo-and-verification.md.
  • End-to-end integration test of the email/OAuth2 flow against a live broker: the bootstrap chain is unit-tested at the seams (signer wire shape via conformance test, init_flow helpers via inline tests, cmd_init test-mode via CLI test suite) but doesn't have a hermetic integration test that exercises the full email/request → status poll → derive → link → SIWE chain. Adding one needs either an in-memory email/OAuth2 provider fixture or wiring up SES/Google mocks; left as follow-up.

Test plan

  • cargo test --workspace — must show 386 passing, 0 failing
  • cargo clippy --workspace --tests — no errors
  • On the broker host: bash scripts/setup-broker-host.sh --yes round-trips cleanly; journalctl -u agentkeys-backend shows dev_key_service ENABLED after the first run, and a re-run preserves the secret
  • curl -sS -X POST http://127.0.0.1:8090/dev/derive-address -d '{"omni_account":"<64hex>"}' returns {"address":"0x…","key_version":1}
  • curl -sS -X POST <broker>/v1/auth/wallet/start -d '{"address":"<derived>","chain_id":84532}' followed by agentkeys signer sign and /v1/auth/wallet/verify mints an EVM session JWT
  • Two-wallet S3 isolation proof in §4 of docs/stage7-demo-and-verification.md still passes
  • agentkeys init --email <addr> --broker-url … --signer-url … saves an EVM session JWT end-to-end (interactive — operator clicks magic link)

🤖 Generated with Claude Code

…strap chain)

Plan steps 0-9 of docs/spec/plans/issue-74-dev-key-service-plan.md
landed in this PR:

- 0: docs/spec/signer-protocol.md — v0 wire contract (request/response,
  error envelope, versioned HKDF derivation byte, future TEE attestation
  handshake).
- 1: agentkeys-mock-server::dev_key_service — HKDF + secp256k1 + EIP-191,
  loaded from DEV_KEY_SERVICE_MASTER_SECRET; 10 unit tests.
- 2-3: /dev/derive-address + /dev/sign-message handlers + state +
  routes; 503 signer_disabled when env unset; 8 integration tests.
- 4: scripts/setup-broker-host.sh auto-generates the master secret
  into /etc/agentkeys/dev-key-service.env (mode 0600), wires it via
  EnvironmentFile= in the backend systemd unit. Idempotent — preserves
  the secret across re-runs (rotation invalidates derived wallets).
  scripts/broker.env documents the separation.
- 5: agentkeys-daemon main.rs adds --init-email / --init-oauth2-google /
  --signer-url, drives the email/OAuth2 -> omni -> derive -> link ->
  SIWE -> EVM-session chain on first start; emits a tracing audit row
  on success.
- 6: agentkeys-cli cmd_init rewritten as InitMode::{Email, Oauth2Google,
  ImportLegacyMock(test-only)}. --mock-token flag hard-cut from the
  user-facing CLI surface. All 9 cli_tests.rs sites migrated.
- 7: agentkeys whoami CLI (read-only; surfaces signer-derived wallet).
- 8: TEE-stub conformance test — same wire contract, in-memory keypair
  fixture vs HKDF backend; 3 tests prove the swap-point invariant.
- 9: docs/stage7-demo-and-verification.md rewritten end-to-end for the
  new flow.
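
A rough model of steps 5 and 6, mapping the daemon's mode-selecting flags onto an InitMode-style enum. Assumptions: the variant payloads are guesses, `--init-email` is assumed to take an address and `--init-oauth2-google` no value, and `init_mode_from_flags` is a hypothetical helper that ignores every other flag the real binaries accept (it would reject `--signer-url`, for instance). The removed `--mock-token` flag falls through to the unknown-flag error.

```rust
/// Sketch of the init modes named in step 6; payloads are assumptions.
#[derive(Debug, PartialEq)]
pub enum InitMode {
    Email { address: String },
    Oauth2Google,
    /// Test-only path; never produced from user-facing flags.
    ImportLegacyMock { token: String },
}

/// Hypothetical mapping from mode-selecting flags to exactly one mode.
pub fn init_mode_from_flags(args: &[&str]) -> Result<InitMode, String> {
    let mut mode: Option<InitMode> = None;
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        let parsed = match *arg {
            "--init-email" => {
                let address = iter.next().ok_or("--init-email requires an address")?;
                InitMode::Email { address: address.to_string() }
            }
            "--init-oauth2-google" => InitMode::Oauth2Google,
            other => return Err(format!("unknown flag: {}", other)),
        };
        // Reject a second mode flag instead of silently overriding the first.
        if mode.replace(parsed).is_some() {
            return Err("choose exactly one init mode".to_string());
        }
    }
    mode.ok_or_else(|| "no init mode given".to_string())
}
```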

Shared plumbing in agentkeys-core: signer_client (typed RPC trait +
HttpSignerClient), init_flow (broker email/OAuth2 chain, used by both
CLI and daemon).

CLAUDE.md adds a plan-completion policy (always complete every numbered
plan step; mandatory done/not-done summary at PR end).

Pre-Stage-7 docs moved to docs/archived/ (operator-runbook,
contradictions, field-name-translation); inbound references repointed.

Verification: 386 tests pass workspace-wide, 0 failing; clippy clean
on new code.

What did not land in this PR:
- Plan step 10 (live broker-host redeploy + smoke walkthrough) — operator
  step; the script that makes it work shipped here.
- End-to-end integration test of the email/OAuth2 flow against a live
  broker — would need an in-memory mock email/OAuth2 provider; left as
  follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th) + step 1c plan + arch doc

Lands the architectural follow-up to PR #75:

PR #75 shipped the dev_key_service signer with no HTTP-layer auth (loopback
assumption per signer-protocol.md §"What's intentionally out of scope at v0").
This commit:

- DEPLOYS signer.litentry.org as an independent backend listener (issue #74 step 1b).
  agentkeys-mock-server gains a `--signer-only` mode that registers ONLY
  `/dev/derive-address`, `/dev/sign-message`, `/healthz` (no legacy session/
  credential/audit endpoints). Bound to 127.0.0.1:8092; nginx fronts it at
  https://signer.<zone> with its own cert. Same binary, two roles —
  loopback :8090 stays as the broker's tier-2 reachability target.

- ADDS JWT bearer verification to /dev/* handlers. The signer reads the
  broker's ES256 session pubkey at boot from a pinned file
  (/var/lib/agentkeys/.agentkeys/broker/session-keypair.pub.pem) written
  by the broker's new --export-session-pubkey-to flag. Every /dev/* request
  must carry Authorization: Bearer <jwt> with claims.agentkeys.omni_account
  matching body.omni_account; otherwise 401 unauthorized. No SIGNER_ACCESS_TOKEN.
  No HMAC. No device-key signing — those land in step 1c.

- PLUMBS the JWT through the daemon-side stack: HttpSignerClient gains
  with_session_jwt(); CLI signer/whoami commands load the saved session
  and set the bearer; init_flow returns the EVM session JWT for the
  caller to persist.

- AUTOMATES setup-broker-host.sh to provision the new agentkeys-signer.service
  systemd unit and the nginx server block for signer.<zone>. Idempotent —
  re-runs preserve the master secret + session pubkey + nginx config.
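
The claim-matching rule for /dev/* described above can be sketched as a small authorization function. This assumes the ES256 signature has already been verified elsewhere and the claims extracted; `SessionClaims`, `Unauthorized`, and `authorize` are hypothetical names, and strict string equality is an assumption about what "matching" means.

```rust
/// Claims assumed to come from an already signature-verified session JWT
/// (the claims.agentkeys.omni_account field).
pub struct SessionClaims {
    pub omni_account: String,
}

/// Minimal 401 model.
#[derive(Debug, PartialEq)]
pub struct Unauthorized {
    pub status: u16,
    pub code: &'static str,
}

/// Allow the request only when a verified token is present and its
/// omni_account equals the omni_account in the request body.
pub fn authorize(
    claims: Option<&SessionClaims>,
    body_omni_account: &str,
) -> Result<(), Unauthorized> {
    match claims {
        Some(c) if c.omni_account == body_omni_account => Ok(()),
        // Missing token and mismatched claim both map to the same 401.
        _ => Err(Unauthorized { status: 401, code: "unauthorized" }),
    }
}
```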

PLAN DOCS:

- docs/spec/plans/issue-74-step-1c-device-key-auth.md (NEW, 381 lines)
  Replaces broker-issued bearer JWT as the sole authenticator on /dev/*
  with a device-key signature scheme. Removes broker-as-SPOF risk for
  the signer call surface; identity-type-uniform across evm/email/oauth2/
  passkey; UX-uniform (one ceremony at init, automatic per-request).
  Aligned with Heima's ClientAuth tier model (EvmSiweSigned + BackendSigned),
  strictly stronger because user-controlled per-request key + zero
  per-request user interaction. See gh issue #76.

- docs/spec/architecture.md (REWRITTEN, 506 lines, replaces prior version)
  Canonical broker/signer/daemon/key-flow doc. Mermaid diagrams for
  component map, trust boundaries, identity model, init sequence,
  per-mint sequence, deployment topology. Full K1–K10 key inventory
  table designed for direct Figma reuse. Pluggable-surfaces matrix
  covering auth methods, signer backends, audit destinations, vault
  backends. stage7-wip.md absorbed into §1, §6, §7, §11; archived.

- docs/spec/heima-gaps-vs-desired-architecture.md (REVISED)
  Added §1a status snapshot table covering all 12 gaps at-a-glance.
  §3 OIDC provider + §6 PrincipalTag JWT claim marked RESOLVED IN-TREE
  (post-PR #61 + #73). NEW §11 (signer-edge contract — PARTIAL after
  PR #75) and §12 (per-request crypto auth — PLANNED via #76). Resolution
  log under §10.

- docs/stage7-demo-and-verification.md (UPDATED for the signer split)
  Drops the SSH tunnel scaffolding entirely. Single demo path uses
  the public signer hostname. Trust-model diagram + two-machine layout
  + §0.2 reach-the-signer + §14.3 troubleshooting + §16.4 live walkthrough
  + §16.7 auto-provision + §17 cleanup all updated.

VERIFICATION:

- 394 tests pass workspace-wide (was 386 in PR #75; +8 new JWT auth
  integration tests in dev_key_service_routes.rs).
- 0 cargo clippy errors; 18 pre-existing warnings (was 16; +2 minor
  cosmetic in agent-generated test code).

WHAT DID NOT LAND:

- Live broker host redeploy + signer.<zone> certbot issuance — operator
  step. The script that makes it work shipped here. To land:
  ssh broker host → bash scripts/setup-broker-host.sh --yes →
  sudo certbot --nginx -d signer.<zone> → smoke per docs/stage7-demo-
  and-verification.md §16.
- Device-key auth (issue #74 step 1c) — separate issue #76, plan doc
  shipped in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WildmetaAgent and others added 2 commits May 9, 2026 18:56
…dentity-type processes, K9 explanation)

Addresses /Users/agent-jojo/.claude/plans/review-questions.md

Q3 (K9 DKIM explanation): expanded the K9 row in architecture.md key
inventory with a high-level "what is DKIM, why does AgentKeys need it"
paragraph (per-domain Ed25519 key, signs outbound mail headers, pubkey
in DNS TXT, used by Stage 6 federated email so SES never sees plaintext).

Q5 (cold-start sequence ordering): rewrote architecture.md §5 to show
device key generated FIRST (step 0), BEFORE the identity ceremony.
The ceremony then binds D_pub atomically. Same trust shape as a
WebAuthn credential creation — by the time the broker mints session
JWTs, the device-pubkey claim is authoritative.

Q6 (per-identity-type processes): NEW architecture.md §5a covers
init-binding for each identity type (email-link, oauth2_google, evm,
passkey, sandbox link-code), device-switching when operator gets a
new laptop, intentional device-key rotation with chain-of-custody
sigs, sandbox VM device-key persistence, and a trust-shape comparison
across identity types. Architecture.md is now the single source of
truth; step-1c plan defers to it.

Q7 (init binding security — proof of possession): updated step-1c
plan §"email" to require a `pop_sig` over the request payload signed
by D_priv. Broker rejects with 400 bad_pop on mismatch. Closes the
"attacker substitutes pubkey at request time" attack: attacker would
need to compromise BOTH the network path AND the user's email inbox
(vs just the network today).

Q8 (sandbox VM device-key persistence): resolved via architecture.md
§5a.4. Stock agent-infra/sandbox falls back to keyring-rs file backend
under ~/.agentkeys/daemon-<wallet>/session.json (mode 0600); survives
daemon restarts inside long-lived containers; vanishes with ephemeral
sandbox containers. For ephemeral sandboxes, operator runs
`agentkeys-daemon --init-link-code <new-code>` per session — same
pattern as today's pair-flow.

Q1 (forward-references):
- issue-74-dev-key-service-plan.md gains a "Status (post-PR #75) —
  successor steps" preamble pointing at step 1b + step 1c as the
  follow-on work.
- stage7-demo-and-verification.md trust-model section gains a callout
  that step 1c will upgrade /dev/* auth from bearer-JWT to device-key
  per-request signature; the demo flow shape doesn't change.

Q2 (cleanup + placement): filed as issue #77 (separate from this
commit). Tracks (a) the legacy mock-server endpoint cleanup after
#75 + #76, and (b) the open question of where identity/audit
endpoints belong long-term — captures the user's broker-policy /
signer-execution split proposal.

Q4 (storage location — answered inline, no doc edit): omni ↔
identity linking is stored in the broker at
crates/agentkeys-broker-server/src/storage/identity_links.rs
(SQLite table `identity_links`, indexed on
(identity_type, identity_value)).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cy, stale refs)

Three structural cleanups across the 5 docs touched in commit 6d36a7b:

1. heima-gaps-vs-desired-architecture.md — section ordering fix.
   Previous numbering was 1, 1a, 2..9, 11, 12, 10 (Tracking out of order).
   Renumbered:
     §11 (NEW signer-edge contract)         → §10
     §12 (NEW per-request crypto auth)      → §11
     §10 (Tracking — was wedged between)    → §12
   Updated §1a status snapshot table accordingly. Updated 3 stale
   in-body §-refs:
     - §1a row 3: "architecture.md §11" → §7 (Pluggable surfaces)
     - §11 body "TEE swap-ready (gap §11)"  → "(gap §10)"
     - §11 body "Blocks the TEE worker (gap §11)" → "(gap §10)"
   Updated tracking-section "PR #75 / issue #76 close §11 and queue §12"
   → "close §10 and queue §11"; resolution-log entries to match.

2. issue-74-step-1c-device-key-auth.md — PoP consistency across all
   identity types. Previously only the `email` flow had explicit
   proof-of-possession; `evm` and `oauth2_google` flows didn't. Same
   Q7 attack surface applies to all three, so:
     - `evm` flow: daemon now signs the SIWE binding payload with
       D_priv (in addition to the EVM key); broker verifies both
       signatures (proves "user owns EVM identity AND daemon
       controls device key").
     - `oauth2_google` flow: daemon now signs the start request
       with D_priv; broker verifies before issuing any state value.
       Composes with the existing `state` parameter binding.

3. architecture.md — dropped "(preserved from prior architecture
   revision)" parenthetical from §9 Component inventory and §10
   Language choices headings. Internal-changelog noise that doesn't
   help readers.

Verification: 394 workspace tests pass, 0 fail. heima-gaps section
ordering now sequential (1 → 1a → 2..9 → 10 → 11 → 12). All §-refs
resolve to live anchors. step-1c PoP coverage confirmed in all three
identity-type sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rget)

Architecturally collapses the four bespoke per-identity PoP shapes
(email pop_sig, oauth2 pop_sig, evm dual-sign-SIWE, passkey) into
two uniform binding ceremonies, split by machine class:

- Master machines (workstation with platform authenticator) ->
  WebAuthn enrollment ceremony. Hardware-attested, identity-type-
  agnostic, closes the email-account-compromise -> device-takeover
  gap (Q7) by requiring hardware presence at re-bind.
- Agent machines (VM/Linux/CI/agent-infra/sandbox container) ->
  link-code redeemed against master's authenticated session per
  the agent-infra/sandbox two-tier orchestrator pattern.

Defers YubiKey-on-Linux-as-master (roaming-authenticator binding)
to issue #79 as a follow-up.

arch.md changes (single source of truth):
- §2 trust boundaries: K11 in master TB, new agent-machine TB,
  master/agent rows in compromise table
- §3 K-table: K10 master/agent persistence dichotomy; new K11
  for WebAuthn platform-authenticator credential
- §5 cold-start: status callout pointing at §5a.1 for v0.2 target
- §5a header: master-vs-agent intro + WebAuthn-uniform status
- §5a.1: rewrite into identity ceremonies + 5a.1.M (WebAuthn) +
  5a.1.A (link-code) + v1c-interim PoP shapes pointer
- §5a.2: master/agent device-switch shapes; cross-device
  confirmation note
- §5a.3: WebAuthn get()-gated rotation for masters
- §5a.4: agent persistence per agent-infra/sandbox; link-code-per-
  session is the right answer, not a workaround; cite 1-step-
  analysis.md
- §5a.5: trust-shape table collapses to master/agent rows

Plan files defer to arch.md as authoritative:
- step-1c plan: status callout + per-identity-type section header
  marked v1c-interim
- dev-key-service master plan: successor steps note WebAuthn
  binding + link to #79

Companion artifacts:
- gh issue #79 filed (YubiKey-on-Linux master deferral)
- comment on #76 with WebAuthn refinement summary
…s §5a.1.M)

§5 cold-start sequenceDiagram correctly shows D generated at step 0
(before identity ceremony / network traffic). §5a.1.M had it as step 1
AFTER identity ceremony returns binding_nonce — internally inconsistent
within arch.md.

§5 is the right model: D should be generated at daemon startup,
not deferred until identity ceremony completes. There is no security
benefit to delaying, and D_pub must exist by the time of any
binding ceremony anyway (v1c pop_sig signs identity request with
D_priv; v0.2 WebAuthn challenge folds D_pub into the ceremony challenge).

Changes:
- §5a.1 intro: explicit three-stage pipeline. Stage 0 = device-key
  generation at daemon startup; Stage 1 = identity ceremony; Stage 2 =
  binding ceremony. State that stage 0 is non-negotiably first across
  all flows (master, agent, v1c, v0.2) with the reasoning.
- §5a.1.M: drop the misleading "step 1: generate D_priv". Now opens
  with explicit PRECONDITIONS from stage 0 + stage 1, and binding-
  ceremony numbering starts at the WebAuthn step itself. Final step
  notes D_priv was already persisted at stage 0 (just persist J0).
- §5a.1.A: agent flow's daemon-startup D-generation now explicitly
  labelled "Stage 0 (daemon startup, per §5a.1)" for symmetry.
  Numbering unchanged (cross-machine sequence continues from master).
- §5a.2.M: new-master device-switch flow now leads with Stage 0
  (fresh K10' generated at daemon startup) before identity ceremony,
  matching first-init.

§5a.3.M rotation step "generate D_priv_new" is unchanged — that's an
explicit new-key generation within the rotation flow, not first-time
init, so stage-0 framing doesn't apply.

§5a.1.A's precondition expected J1_master (the EVM-omni session JWT)
but §5a.1.M ended at J0 (the identity-omni JWT). The wallet-derive +
link + SIWE round-trip that mints J1 lives in §5 steps 2-3 but was
never referenced from §5a.1.M's outro, so the reader had no path
between the master binding ceremony and the agent link-code flow.

Changes:
- §5a.1.M: new "From J0 to J1 (master only — bridge to per-mint
  flows)" subsection. 6-step flow: signer derive-address → broker
  wallet/link → broker auth/wallet/start → signer sign-message →
  broker auth/wallet/verify → mint J1. States that K10 + K11 claims
  propagate from J0 into J1 atomically. Notes the evm-identity-type
  variant collapses these steps (user's own EVM key IS the wallet).
- §5a.1.A precondition: now reads "ON MASTER (already initialized
  per §5a.1.M + the J0 → J1 bridge above; holds J1_master = the
  long-lived EVM-omni session JWT with K10 + K11 claims)" — makes
  the dependency on the bridge explicit.
…, -235)

Adopts the per-agent omni model proposed in the user's critique:
- Each agent is a first-class actor with its own omni derived from
  master via HDKD //label, its own wallet (HKDF(K3, O_agent)), its
  own AWS PrincipalTag, its own audit slot.
- Per-agent compromise containment, atomic revocation, first-class
  audit attribution, tree-as-data-model.
- v1c "shared omni + multiple device pubkeys" is now a degenerate
  v1.0 tree (no children).

Plus the link-code-only-agent-bootstrap simplification:
- Agents have ONE bootstrap path: link-code from authenticated master.
- No identity ceremony for agents, no shared bearer, no agent-side
  recovery. One test surface, one threat model.

arch.md changes (compacted 944 -> 709 lines):
- §3 K3/K4: per-actor-omni derivation framing; K10/K11 references
  updated to new §5a subsection numbering
- §4 identity model: HDKD actor tree (master root + //label children),
  per-actor wallet derivation, why per-agent omni
- §4a NEW: 4-axis mental model (identity / actor / machine /
  capability), master-vs-agent role table, key non-conflations
- §5 cold-start: compact 4-stage table + single sequenceDiagram
  showing v1.0 master flow with WebAuthn enrollment + bridge
  to J1; v1c interim status callout
- §5a restructured into 5 subsections (was multi-subsubsection):
  - 5a.1 master init (per-identity-type + uniform WebAuthn binding)
  - 5a.2 agent bootstrap (link-code only - explicit "no other path")
  - 5a.3 master device switch + rotation (combined)
  - 5a.4 agent re-bootstrap + persistence (combined; cites
    1-step-analysis.md)
  - 5a.5 trust shape (per-actor isolation properties)

CLAUDE.md: added "Architecture-as-source-of-truth policy" requiring
arch.md re-check after any architectural doc edit; documents that
per-doc detail outgrowing arch.md should link outward, not duplicate.

step-1c plan: status callout reframed - v0.2 target is HDKD per-agent
omni + WebAuthn-uniform binding (structural shift, not just wire-shape
collapse); points at arch.md §4/§4a/§5a as single source of truth.

Companion artifacts (not in commit; reference only):
- .omc/wiki/agent-role-and-usage-hdkd-per-agent-omni.md
  (project-local wiki page, gitignored per .omc/ convention)
- gh issue #79 updated: master-vs-agent reframed as actor role,
  not machine class; YubiKey-on-Linux is "Linux + YubiKey as master"
  (one of two roles, not a third class).

Updates the operator-facing demo doc for the master/agent + HDKD
mental model landed in the prior commit (50a0ffa). Operational
content (steps 0-13) is unchanged because the demo runs against
v1c-interim — the actually-shipped flow.

Changes:
- Trust model section: replaced step-1c-coming callout with explicit
  v1c-interim status; cross-refs arch.md §4 (HDKD actor tree),
  §4a (mental model), §5a (per-actor binding); flags v0.2 target
  features as not-yet-implemented and tracked in #76 / #79.
- Two-machine layout: marked operator-workstation row as "(master
  role)"; added a "Roles + key inventory primer" callout pointing
  at arch.md §4a (4-axis mental model), §3 (K1-K11 inventory),
  §5a.2 (agent role / link-code bootstrap), and the agent wiki
  page as the operator-focused reference.
- Section §0 success-criteria #3: clarifies "operator's omni_account"
  IS the master actor omni per arch.md §4.

What did NOT land in the demo doc:
- Per-step rewriting of operational content. The demo correctly
  exercises v1c-interim (single-omni-shared-with-master, bespoke
  per-identity PoP, link-code agents). v0.2 demo content waits
  for the agent-create endpoint + WebAuthn ceremony to ship.
…R_URL

- scripts/operator-workstation.env: add SIGNER_HOST + AGENTKEYS_SIGNER_URL
  (derived from BROKER_HOST), keep BACKEND_URL as alias. Co-located with
  broker today; hostname split lets the signer move to its own machine
  (or TEE worker) later without changing client config.

- docs/cloud-setup.md §1.3: add "what the signer is + why a dedicated
  hostname" overview with a today-vs-future table; explicit co-location
  note + cross-ref to operator-workstation.env.

- docs/stage7-demo-and-verification.md §0.2: stop re-deriving the signer
  URL — both vars come from operator-workstation.env now. Cross-ref the
  topology section in cloud-setup.md.

No code change; arch.md §10 deployment topology already captures the
separate-hostname / same-host model unchanged.

§1.3 used $EIP, but $EIP isn't set until §5.1 — copy-pasting top-down
broke. Make §1.3 a brief intro consistent with §1.2 (broker subdomain
defers to §5), and put the actual DNS+cert+nginx-flip steps in a new
§6 that runs after §5 and reuses $EIP.

- §1.3: brief signer intro + defer to §6 (matches §1.2 shape).
- §6 NEW: Signer host — overview table (today vs future), DNS A record
  (§6.1), TLS cert + nginx flip (§6.2), verify (§6.3).
- §7: Cleanup (was §6).
- Top TOC: add §6 Signer host row, bump Cleanup to §7.
- stage7 demo: cross-refs §1.3 → §6 for the cert+DNS steps; cross-ref
  to "cloud-setup.md §6" cleanup → §7.
… $SIGNER_HOST

Reported failure: `sudo certbot --nginx -d "$SIGNER_HOST"` on the broker
host fell through to certbot's interactive vhost picker showing only
broker.litentry.org. Root cause: $SIGNER_HOST is only exported on the
operator workstation (scripts/operator-workstation.env), not on the
broker host — empty -d arg → certbot's "pick from existing vhosts"
fallback → only the broker vhost is offered.

§6.2 now:
- explicit warning that $SIGNER_HOST is workstation-only
- adds a sanity-check `ls /etc/nginx/sites-enabled/agentkeys-signer`
  (catches the "setup-broker-host.sh wasn't re-run with signer code"
  case before certbot is invoked)
- derives SIGNER_HOST inline from the nginx vhost (awk the server_name
  line setup-broker-host.sh just wrote) so the certbot command is
  copy-paste safe on a fresh broker shell with no env vars set
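
The runbook does the hostname extraction with awk on the broker host. For readers who want the logic spelled out, here is an equivalent sketch in Rust. It assumes a simple vhost where `server_name` sits on its own line with the hostname as the first argument; `server_name_from_vhost` is a hypothetical helper and the sample config in the usage note is made up.

```rust
/// Return the first hostname of the first `server_name` directive, if any.
pub fn server_name_from_vhost(config: &str) -> Option<String> {
    for line in config.lines() {
        // Drop indentation and the trailing ';' before splitting into words.
        let mut words = line.trim().trim_end_matches(';').split_whitespace();
        if words.next() == Some("server_name") {
            if let Some(host) = words.next() {
                return Some(host.to_string());
            }
        }
    }
    None
}
```

Given a config containing the line `server_name signer.example.org;`, the function returns `signer.example.org`; a config without the directive yields `None`, which is the case where certbot should not be invoked at all.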
…uto → no)

Reported failure: `sudo bash scripts/setup-broker-host.sh --yes` on a
fresh broker host did not write the agentkeys-signer nginx vhost. Then
`sudo certbot --nginx -d signer.<zone>` fell through to certbot's
interactive vhost picker, which only listed broker.<zone> (because the
broker vhost was written by an earlier run that had been done with
--with-nginx).

Root cause: WITH_NGINX defaulted to "auto", which resolved to "no" at
line 361 — the comment said "preserves prior default" but every doc-driven
operator expects nginx provisioning. The runbook (cloud-setup.md §5 + §6)
explicitly assumes nginx is set up by the script.

Now: auto → yes for both WITH_NGINX and WITH_CERTBOT. Operators who don't
want nginx (running behind a non-nginx reverse proxy, pre-provisioned
certs) opt out via --without-nginx / --without-certbot. The interactive
preview already prints `nginx : $WITH_NGINX`, so the operator sees the
resolved value before confirming.

Also pin --with-nginx explicitly in cloud-setup.md §6.2 step 1 + step 3
so the doc remains correct even if the script default changes again.
…olver

Reported failure: operator's `dig +short broker.litentry.org A` returned
198.18.1.86 (inside 198.18.0.0/15, the RFC 2544 benchmarking range)
because their local DNS resolver was behind a transparent proxy
(Cloudflare WARP / Zscaler / Tailscale Magic DNS). Using that as $EIP
would have published a Route 53 A record pointing at a reserved,
non-routable range, breaking Let's Encrypt validation silently: the
symptom would surface 5 min later as
"Timeout during connect (likely firewall problem)" with the wrong IP in
the error.

§6.1 now:
- explicit callout that local resolvers behind WARP/Zscaler/Tailscale/
  corporate VPNs return 198.18.0.0/15 for proxied hostnames
- shows `aws ec2 describe-addresses` as the authoritative re-derivation
- replaces fire-and-forget verify with a polling loop until Cloudflare DoH
  confirms the A record matches $EIP (Route 53 propagation up to TTL=300)
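
The guard described above boils down to two checks on each resolver answer: it must not fall inside 198.18.0.0/15, and it must equal the expected EIP. A minimal sketch of that predicate follows; the function names are hypothetical, and the polling itself (querying DoH, sleeping between attempts) is left out.

```rust
use std::net::Ipv4Addr;

/// True for addresses in 198.18.0.0/15, the range that proxied resolvers
/// hand back for intercepted hostnames.
pub fn is_benchmark_range(ip: Ipv4Addr) -> bool {
    let octets = ip.octets();
    // A /15 fixes the first octet and the top 7 bits of the second (18 or 19).
    octets[0] == 198 && (octets[1] & 0xfe) == 18
}

/// Accept a resolver answer only if it parses as IPv4, is outside the
/// benchmarking range, and equals the expected Elastic IP.
pub fn a_record_matches(answer: &str, expected_eip: Ipv4Addr) -> bool {
    match answer.trim().parse::<Ipv4Addr>() {
        Ok(ip) => !is_benchmark_range(ip) && ip == expected_eip,
        Err(_) => false,
    }
}
```

A polling loop would call `a_record_matches` on each answer and stop once it returns true.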

§5.2 unchanged — within §5 the operator just set $EIP from AWS API in
§5.1, so the local-resolver trap doesn't apply there.

The §1.3 + §6 + §6.1 + §6.2 prose said the same thing 3-4 times
(co-located today / future-split possible / "if the signer is ever
moved" / "first run writes nginx, certbot, second run flips ssl").
Each new fix layered another paragraph on top instead of
consolidating.

Pass 1 — §1.3 collapsed from 12 lines to 1 (matches §1.2's defer-to-§5
shape; §6 has all the detail).

Pass 2 — §6 intro: dropped 4-line prose paragraph above the table; folded
"endpoints" + "exported as SIGNER_HOST" into the table itself so it's
the single load-bearing reference. Dropped trailing prose paragraph
about the env file (now in the Public-hostname row).

Pass 3 — §6.1: collapsed standalone EIP-derive callout (10 lines of
warning + 5 lines of fenced bash) into a 3-line guard inside the bash
block (`[ -z "$EIP" ] && EIP=$(aws ec2 describe-addresses …)`). Kept
the WARP/Zscaler/198.18.x.x context as a 4-line comment in the bash —
load-bearing for diagnosis, would lose meaning if removed.

Pass 4 — §6.2: dropped "Three host-side steps. setup-broker-host.sh is
idempotent…" preamble paragraph (table already says this). Kept the
$SIGNER_HOST=laptop-only callout (load-bearing — distinguishes laptop
from broker host shell scope).

No behavior change. All cross-refs intact (#6-signer-host, #51-allocate,
signer-protocol, operator-workstation.env all still resolve).
60 code fences, balanced.
… are yes

The flags were redundant once defaults flipped to yes (commit a3a0a84).
Per CLAUDE.md remote-broker-host policy the script is the single
idempotent entry point — flag-gating "do the thing the runbook always
wants" is noise. Drop both --with-* flags + the auto-resolution
dead-code; keep --without-nginx / --without-certbot as the only opt-out.

- WITH_NGINX / WITH_CERTBOT default to "yes" outright (no more "auto"
  three-state); 12-line auto-resolution block becomes a 2-line comment.
- CLI parser drops --with-nginx / --with-certbot. Passing the removed
  flags now errors `unknown flag: --with-nginx` rather than silently
  no-op'ing.
- Header usage block + interactive defaults comment updated to match.
- docs/cloud-setup.md §6.2: drop --with-nginx from both invocations
  (replace_all over the doc).

No behavior change for operators following the runbook — `--yes` alone
already provisioned nginx since a3a0a84. This commit only removes the
explicit `--with-nginx` redundancy.

CLAUDE.md
- New "Runbook-fix-fold-back policy": when an operator hits a runbook
  failure, both the targeted fix AND a runbook revision must land in
  the same turn. Goal: every operator-encountered failure makes the
  runbook strictly more robust before we move on.

stage7-demo-and-verification.md (§0)
Absorbs every failure the operator hit walking this PR end-to-end:

- §0 Tooling: pulled CLI build out of a sub-bullet into a numbered
  ordered checklist (cargo build → cp to ~/.local/bin → which/version
  smoke-test → init). Explicit warning against path-relative aliases
  (the recurring "alias agentkeys=./target/release/agentkeys-cli" trap
  with the wrong binary name from before the agentkeys-cli → agentkeys
  rename). Spells out crate-name vs binary-name distinction.

- §0.1: branch-agnostic checkout via `BRANCH="${BRANCH:-evm}"` (was
  hardcoded `git checkout evm` — broke when validating PR branches).
  Adds nginx vhost sanity-checks: `ls /etc/nginx/sites-enabled/
  agentkeys-{broker,signer}` + grep for proxy_pass-vs-return-503
  inside agentkeys-signer (catches the "cert issued but script not
  re-run, vhost still serves stub 503" failure mode).

- §0.2: smoke-test now string-matches body == "ok" (a successful HTTP
  200 with body "TLS cert not yet issued for signer …" is the exact
  trap operators hit when certbot succeeded but step 3 of §6.2 wasn't
  run). Adds a 5-row "common failure modes" table mapping observed body
  → cause → exact fix command.

§16 line 1402's `git checkout evm` left as-is — that section is
intentionally evm-specific (verifies the live prod broker).
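
The exact-match body check described for §0.2 might look roughly like the sketch below. This is an illustration rather than the runbook's actual text: the helper names `body_is_ok` and `smoke_test` are hypothetical, and the URL to probe is left as a parameter because the commit does not spell it out.

```shell
# Hypothetical helper: succeed only when the response body is exactly "ok".
# An HTTP 200 alone is not enough, because the stub vhost can answer 200
# with an explanatory body such as "TLS cert not yet issued for signer ...".
body_is_ok() {
  [ "$1" = "ok" ]
}

# Hypothetical wrapper: fetch the URL in $1 and apply the exact-match check.
smoke_test() {
  local body
  body=$(curl -fsS --max-time 5 "$1") || {
    echo "request failed: $1" >&2
    return 1
  }
  if body_is_ok "$body"; then
    echo "smoke-test passed"
  else
    echo "unexpected body: $body" >&2
    return 1
  fi
}
```

Comparing the whole body, instead of only the status code, is what separates a live signer from the placeholder vhost.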
Operator hit `which agentkeys` → "aliased to ./target/release/agentkeys-cli"
even after `cp target/release/agentkeys ~/.local/bin/`. zsh aliases beat
$PATH lookups (and the alias also pointed at the wrong binary name —
the crate is agentkeys-cli but the [[bin]] is `agentkeys`), so the
install was invisible no matter how correctly it was staged.

§0 build checklist now has 5 steps, in this order:

1. sed-strip any `alias agentkeys[-= ]…` from ~/.zshenv + ~/.zshrc
   (with .bak), then `unalias` for the current shell. Fail-soft
   (`|| true`) so missing files don't abort.
2. Append `~/.local/bin` to $PATH if not already there (idempotent
   case statement; appends to ~/.zshenv).
3. cargo build (was step 1).
4. cp to ~/.local/bin (was step 2).
5. `hash -r` + `command -v agentkeys` (NOT `which`) — bypasses any
   alias zsh hasn't re-hashed away yet. Spells out the expected
   absolute-path output.

Plus a tiered fallback callout: if `command -v` still shows the alias,
grep ~/.zprofile / ~/.aliases / shell includes for stragglers, then
`exec zsh -l`.
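
Two of the steps above (the idempotent $PATH append and the post-install resolution check) can be sketched as small shell functions. The names `path_with_dir` and `resolves_to_file` are hypothetical and are not taken from the runbook; the sketch only shows the shape of the checks.

```shell
# Hypothetical helper: print the PATH value in $1 with the directory in $2
# appended, unless that directory is already one of its colon-separated
# entries.
path_with_dir() {
  case ":$1:" in
    *":$2:"*) printf '%s\n' "$1" ;;      # already present: unchanged
    *)        printf '%s\n' "$1:$2" ;;   # absent: append exactly once
  esac
}

# Hypothetical helper: succeed only when the name in $1 resolves to an
# absolute path. `command -v` reports an alias as "alias name=..." rather
# than a path, so a lingering alias makes this check fail.
resolves_to_file() {
  case "$(command -v "$1" 2>/dev/null)" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}
```

A runbook step could then call `resolves_to_file agentkeys` and stop with a pointer to the alias cleanup when it fails.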

Per Runbook-fix-fold-back policy (CLAUDE.md): operator failure → both
the fix command (handed back inline last turn) AND the runbook
revision land in the same turn. Next operator running this top-down
won't hit the alias trap.

Operator hit `curl: (7) Failed to connect to 127.0.0.1 port 18090`
because their shell had a stale `BACKEND_URL=http://127.0.0.1:18090`
local-dev export in ~/.zshenv that shadowed
operator-workstation.env's BACKEND_URL=$AGENTKEYS_SIGNER_URL alias.

§0.2 now:
- Pins `export BACKEND_URL="$AGENTKEYS_SIGNER_URL"` inline so the
  smoke-test is self-contained (no longer depends on ~/.zshenv being
  un-shadowed).
- Adds a defensive `case "$BACKEND_URL" in https://signer.*) ;; esac`
  bail-loud check BEFORE the curl, with a one-line diagnosis
  (`grep -n BACKEND_URL ~/.zshenv && unset && re-source`).
- Echoes BACKEND_URL alongside SIGNER_HOST so the operator visually
  confirms the value is public https:// before hitting curl.

Per Runbook-fix-fold-back: failure command + cause + fix command all
inline in the runbook so the next operator with a stale local-dev
shell doesn't have to round-trip with the maintainer to diagnose.
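
A minimal sketch of the bail-loud guard described above, assuming the check runs as a function before the curl. The name `require_public_signer_url` is hypothetical; the `https://signer.*` pattern and the grep hint come from the commit text.

```shell
# Hypothetical guard: accept only a public https://signer.* URL, and print a
# one-line diagnosis otherwise.
require_public_signer_url() {
  case "$1" in
    https://signer.*) return 0 ;;
    *)
      echo "BACKEND_URL='$1' is not a public https://signer.* URL" >&2
      echo "diagnose: grep -n BACKEND_URL ~/.zshenv" >&2
      return 1
      ;;
  esac
}
```

Called as `require_public_signer_url "$BACKEND_URL" || return 1`, it stops the smoke-test before a stale loopback value ever reaches curl.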
Operator hit `error: unexpected argument '--json' found` running
§0.4's `agentkeys signer derive --signer-url … --omni-account … --json`.
Per crates/agentkeys-cli/src/main.rs:24-25, --json is a top-level flag
on the root `agentkeys` command (controls ctx.json_output globally),
NOT a per-subcommand flag on `signer derive` / `signer sign`. Clap
rejects it after the subcommand's required args.

Eight occurrences fixed across §0.4 (×2), §3 SIG_A/SIG_ADDR/SIG_B
(×3 multi-line), and §16 live walkthrough (×3 single-line):

  agentkeys signer derive … --json | jq …
→ agentkeys --json signer derive … | jq …

  agentkeys signer sign   … --json | jq …
→ agentkeys --json signer sign   … | jq …

Plain text-output calls at lines 1047 and 1099 left unchanged
(no --json there to begin with).

Per Runbook-fix-fold-back: clap arg ordering is non-obvious for
top-level vs subcommand flags, so the runbook command examples must
match the actual CLI grammar — operators copy-paste, they don't
re-read the clap macro.

Operator hit `Error: SIGNER_UNAUTHORIZED  invalid session JWT:
InvalidToken` running §0.4's first signer derive call. The §0.4 intro
said "Run agentkeys init first if you haven't already" but never
showed the actual command — operators don't know to look ahead 100
lines to §2.0 for the real `--email --broker-url --signer-url`
invocation.

§0.4 now:
- Explicit "must run first OR every call below returns SIGNER_UNAUTHORIZED"
  callout (with the literal error message so operators searching the
  doc for the error find the fix).
- Inline `agentkeys init --email alice@demo.example --broker-url $OIDC_ISSUER
  --signer-url $BACKEND_URL` as a copy-paste block, with the expected
  "Initialized via email-link" output.
- Cross-link to §2.0 for explanation + OAuth2 alternative — minimal in
  §0.4, full context in §2.0.

§2.0's existence preserved: it still has the magic-link explanation +
OAuth2 alternative + daemon-side equivalent. §0.4's inline init is the
minimum to keep the §0 prereq chain self-contained.

Per Runbook-fix-fold-back: a runbook step that says "run X first" must
include the literal X invocation, not just point at it.

Pass 1 implementation per .omc/ralph/prd.json: ships the
SesEmailSender behind the auth-email-link feature, with end-to-end
SES → S3 round-trip integration test. Pass 2 (separate commit) wires
boot.rs + setup-broker-host.sh + broker.env defaults + demo doc.

Closes the gap that blocked the operator's stage-7 demo init flow:
the deployed broker had only StubEmailSender (in-process Vec, no
delivery). With this change + Pass 2, `agentkeys init --email` will
deliver a real magic-link to the operator's inbox.

US-1: Cargo.toml deps
- aws-sdk-sesv2 = "1" added as optional dep gated by auth-email-link
- aws-sdk-s3 + uuid added to dev-dependencies for the integration test
- dev-deps now enable auth-email-link so tests/* compile by default

US-2: SesEmailSender impl (crates/agentkeys-broker-server/src/plugins/auth/email_link.rs)
- send_magic_link composes multipart text+html via aws-sdk-sesv2 SendEmail
- verify_sender_ready calls GetEmailIdentity + checks verified_for_sending
- Errors map to EmailSendError::{Send, Verify, Config}
- Inline subject + body templates (no template-engine dep)
- Re-exported from src/plugins/auth/mod.rs

US-3: Body composition unit tests (4 added)
- ses_subject_is_non_empty
- ses_text_body_contains_landing_url
- ses_html_body_contains_landing_url_twice (href + visible text)
- ses_text_and_html_alternatives_both_present

US-4: Integration test (crates/agentkeys-broker-server/tests/ses_email_flow.rs)
- Gated by RUN_SES_INTEGRATION_TESTS=1 + #[ignore]
- CleanupGuard Drop impl: list-and-delete every S3 object whose body
  contains the per-test UUID, even on panic
- Polls inbound/ prefix for up to 60s (5s × 12 attempts)
- Asserts MIME body contains both unique token AND landing URL
  (allowing for quoted-printable encoding of '=' as '=3D')

US-5: Quality gates ALL GREEN
- cargo build -p agentkeys-broker-server                            → exit 0
- cargo build -p agentkeys-broker-server --features auth-email-link → exit 0
- 161 lib tests pass; integration test compiles + skips gracefully
- cargo clippy --no-deps -- -D warnings → exit 0
- (Pre-existing clippy warning in agentkeys-core/src/init_flow.rs:177
  unrelated; will tackle in Pass 2 if it blocks.)

US-6: BLOCKED on operator — live SES round-trip
- Operator runs:
    awsp agentkeys-admin
    RUN_SES_INTEGRATION_TESTS=1 ACCOUNT_ID=429071895007 \
      cargo test -p agentkeys-broker-server --features auth-email-link \
        --test ses_email_flow -- --ignored --nocapture
… identity

Operator hit `NotFoundException: Email identity <noreply@bots.litentry.org>
does not exist` running the SES integration test. Cause: SES
GetEmailIdentity returns identities EXPLICITLY registered with
`create-email-identity`. cloud-setup.md §2.1 verifies the DOMAIN
(`bots.litentry.org`), which auto-grants sending rights to ANY address
at that domain via DKIM — but the per-address identity
(`noreply@bots.litentry.org`) was never registered. So the verify
precheck failed even though the actual SendEmail would succeed.

Fix: verify_sender_ready now tries address-level lookup first
(preferred — explicit), then on NotFound falls back to extracting the
domain (split on '@') and looking up the domain identity. Either
passing → Ok(()).

Helper extracted: check_identity(client, identity) → Result<(), String>
returns Ok only when SES reports the identity exists AND
verified_for_sending_status=true. Used by both attempts.

No behavior change for operators who explicitly verify per-address;
unblocks the canonical operator path (verify-domain-only) per
cloud-setup.md §2.1.

Closes the verify-precheck blocker on Pass 1's US-6 (live SES
round-trip from operator). Quality gates re-checked:
  - cargo build -p agentkeys-broker-server --features auth-email-link → ok
  - cargo test  -p agentkeys-broker-server --features auth-email-link --lib → 161 passed
  - cargo clippy -p agentkeys-broker-server --features auth-email-link --tests --no-deps -- -D warnings → ok

Per operator request after Pass 1:
  1. drop the address→domain fallback in SesEmailSender::verify_sender_ready
     — explicit per-address verification only
  2. register noreply-test@bots.litentry.org as a per-address SES identity
     and pin it in operator-workstation.env
  3. give the operator a one-shot bash helper that exploits the existing
     SES inbound receipt rule (cloud-setup.md §2.1) to fully automate the
     address verification — no inbox-clicking, no manual MIME parsing

Code (crates/agentkeys-broker-server/src/plugins/auth/email_link.rs):
- verify_sender_ready: single GetEmailIdentity call on the FROM address.
  No fallback. Error message points the operator at
  `aws sesv2 create-email-identity` (and at scripts/ses-verify-sender.sh
  for the automated path) so the next failure self-diagnoses.
- Removed check_identity helper (was the fallback shared call).

Test (crates/agentkeys-broker-server/tests/ses_email_flow.rs):
- TestEnv now reads BROKER_EMAIL_FROM_ADDRESS — same env var the broker
  reads at runtime (env.rs:143). One source of truth between the test +
  the broker process.
- Default: noreply-test@${MAIL_DOMAIN} (was: hardcoded noreply@…).

Env (scripts/operator-workstation.env):
- New: MAIL_DOMAIN (bots.litentry.org), MAIL_BUCKET, BROKER_EMAIL_FROM_ADDRESS.
- MAIL_DOMAIN is explicit (not derived from BROKER_HOST) — broker zone
  may differ from email subdomain.

Helper (scripts/ses-verify-sender.sh, +x):
- One-shot: aws sesv2 create-email-identity → poll s3://$MAIL_BUCKET/inbound/
  for the SES verification mail (lands there via the existing receipt rule
  from cloud-setup.md §2.1) → grep verification URL out of the
  quoted-printable body → curl-click it → confirm VerifiedForSendingStatus
  → delete the verification mail from S3 so it doesn't pollute the inbox.
- Idempotent: re-running on a verified identity exits 0 immediately.
- Requires: aws + jq + curl + grep + sed (all present on macOS / Ubuntu).

Quality gates:
- cargo build -p agentkeys-broker-server                            → ok
- cargo build -p agentkeys-broker-server --features auth-email-link → ok
- cargo test  -p agentkeys-broker-server --features auth-email-link --lib → 161 passed
- cargo test  -p agentkeys-broker-server --features auth-email-link --test ses_email_flow
                                                                    → 1 ignored (skips)
- cargo clippy -p agentkeys-broker-server --features auth-email-link --tests --no-deps -- -D warnings
                                                                    → ok
…ded body

Operator hit "endless waiting" — the script polled S3 forever even though
SES had likely written the verification mail. Two bugs in the polling
predicate:

1. `grep -q "$FROM"` looked for the literal `noreply-test@bots.litentry.org`
   string, but in a quoted-printable MIME body the `@` is encoded as `=40`
   so the literal grep never matched.

2. `grep -qE 'ses[._-]?verification|amazonaws\.com.*verify'` matched
   `ses-verification` patterns, but the actual SES URL host is
   `email-verification.<region>.amazonaws.com` — neither alternative hit.

Fix: drop both prereq greps. SES verification URLs are unique enough that
matching the URL pattern directly is sufficient — no false positives.

Also added per-attempt diagnostics:
- log "$count object(s) under inbound/" each iteration so the operator
  can see whether anything is landing at all
- on timeout: structured 3-step diagnosis pointing at receipt-rule
  state, identity status, and bucket contents

Refactored URL extraction into extract_verify_url() helper (single source
of truth) — handles quoted-printable soft-wrap (=\n) + =3D decoding.
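
The quoted-printable handling described above can be sketched as a small filter. This is an approximation under stated assumptions, not the script's actual `extract_verify_url`: the URL pattern only pins the `email-verification.<region>.amazonaws.com` host named in the commit and accepts any path and query after it.

```shell
# Hypothetical sketch of a quoted-printable-aware URL extractor.
# Reads a raw MIME body on stdin and prints the first verification URL found.
# Steps: strip CRs, join soft-wrapped lines (a trailing "=" means the line
# continues), decode "=3D" to "=" and "=40" to "@", then match the URL.
extract_verify_url() {
  tr -d '\r' |
    sed -e ':a' -e '/=$/{' -e 'N' -e 's/=\n//' -e 'ba' -e '}' |
    sed -e 's/=3D/=/g' -e 's/=40/@/g' |
    grep -oE 'https://email-verification\.[a-z0-9-]+\.amazonaws\.com[^[:space:]"<>]*' |
    head -n 1
}
```

Joining the soft-wrapped lines before matching matters because a long URL is usually split across several encoded lines, so a per-line grep would only ever see a fragment.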
…ck_on

Operator hit the test panic at line 145:
  "Cannot start a runtime from within a runtime. This happens because a
   function (like `block_on`) attempted to block the current thread while
   the thread is being used to drive asynchronous tasks."

Cause: `Handle::block_on` is forbidden when called from inside a tokio
runtime context. Drop runs WHILE still inside #[tokio::test]'s runtime
(the runtime hasn't shut down by the time Drop fires for `let _guard =`),
so the previous code panicked even though we had `try_current → Ok` to
"detect" the active runtime.

Test ran end-to-end successfully BEFORE this Drop panic — log shows:
  ses_email_flow: found inbound object key=inbound/8dqr… (attempt 1)
…the assertions never got to run because Drop tore down first.

Fix: wrap `handle.block_on(cleanup_fut)` in `tokio::task::block_in_place`,
which tells the runtime that the current worker thread is about to block
(so other tasks can move to other workers), making the nested blocking
call legal.
Requires multi_thread runtime — already guaranteed by
`#[tokio::test(flavor = "multi_thread")]` on the test attribute, no
behavior change for the rest of the test.

The `Err(_) → Runtime::new()` branch is preserved as a fallback for the
edge case where Drop fires AFTER the runtime has been torn down (e.g.
test panic during runtime shutdown). Won't normally trip in practice.

Operator request: enforce that no hardcoded values land in scripts/code/
runbooks unless logged in a dedicated audit doc.

CLAUDE.md
- New "No-hardcoded-values policy" between Runbook-fix-fold-back and
  Plan-completion. Says: parameterize via env / CLI / config; if
  temporarily hardcoded, log in hardcoded.md with file+line, why, and
  the unblock action.

hardcoded.md (NEW)
- Seeded with the existing operator-deployment-pinned values
  (ACCOUNT_ID, BROKER_HOST, MAIL_DOMAIN, BROKER_EMAIL_FROM_ADDRESS,
  BROKER_DATA_ROLE_ARN), the deployment-architecture-pinned values
  (loopback ports 8090/8091/8092, agentkeys system user, /etc/agentkeys
  paths), and code-level constants (TOKEN_TTL_SECONDS, rate-limit
  defaults, SES integration test defaults).
- Each entry: what's hardcoded, why, what would unblock making dynamic.
- Open trade-off section flags the email_link HMAC removal (b8481fe)
  for revisit when scaling to multi-broker-replica deployments.

scripts/broker.env (smell fix called out in hardcoded.md)
- Add ACCOUNT_ID=429071895007 as the single source of truth.
- Derive BROKER_DATA_ROLE_ARN from \${ACCOUNT_ID} (was hardcoded
  separately, drifted from operator-workstation.env's ACCOUNT_ID).
- Verified: `set -a; source ./scripts/broker.env; set +a` expands
  ACCOUNT_ID + BROKER_DATA_ROLE_ARN correctly.
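
As an illustration of the single-source-of-truth derivation, an env fragment along these lines would expand correctly when sourced. The variable `BROKER_DATA_ROLE_NAME` and the role name used here are assumptions made for the example; the actual contents of scripts/broker.env are not shown in this PR text.

```shell
# Single source of truth for the account (value taken from the commit above).
ACCOUNT_ID=429071895007
# Assumed role name, for illustration only.
BROKER_DATA_ROLE_NAME=agentkeys-data-role
# Derived, so it cannot drift from ACCOUNT_ID.
BROKER_DATA_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/${BROKER_DATA_ROLE_NAME}"
```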
WildmetaAgent and others added 28 commits May 11, 2026 08:57
… switch into stage7 doc

The script previously masked AccessDenied from list-objects-v2 with
'2>/dev/null || true', manifesting as endless 'attempt N/24 - 0
object(s) under inbound/' polling when the operator forgot to switch
to agentkeys-admin profile (the broker user lacks s3:ListBucket on
the mail bucket per cloud-setup.md section 2.1).

Two changes:
1. Script now preflights 'aws sts get-caller-identity' + a
   ListObjectsV2 probe before entering the poll loop. Wrong-profile
   case dies with explicit 'Run: awsp agentkeys-admin' guidance
   instead of silently spinning. Also drops the 2>/dev/null mask on
   the poll-loop list call now that preflight proves the cred path.

2. Stage 7 demo doc section 0.4 prereq block now shows the awsp +
   set -a;source;set +a sequence inline, with a callout naming the
   previous failure mode so the next operator recognizes it
   immediately.

Reproduced locally:
  AWS_PROFILE=agentkey-broker bash scripts/ses-verify-sender.sh
  -> exits 1 with: 'wrong AWS profile: arn:...:user/agentkey-broker
     lacks s3:ListBucket on agentkeys-mail-429071895007.
     Run: awsp agentkeys-admin   then re-run this script.'
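
The preflight described in change 1 might be structured like the sketch below. The function name `preflight_s3_list` is hypothetical and the messages are abbreviated; the two AWS CLI calls are the ones named in the commit.

```shell
# Hypothetical preflight: confirm the active credentials can list the bucket
# before any polling starts. $1 = bucket name, $2 = key prefix.
preflight_s3_list() {
  local arn
  arn=$(aws sts get-caller-identity --query Arn --output text) || {
    echo "no usable AWS credentials" >&2
    return 1
  }
  # No 2>/dev/null here: an AccessDenied message should stay visible.
  if ! aws s3api list-objects-v2 --bucket "$1" --prefix "$2" --max-keys 1 >/dev/null; then
    echo "wrong AWS profile: $arn cannot list s3://$1/$2" >&2
    echo "Run: awsp agentkeys-admin   then re-run this script." >&2
    return 1
  fi
}
```

Running the probe once up front means the poll loop can treat an empty listing as "nothing delivered yet" instead of a possibly masked permission error.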

User approved one-shot raw-git use because this dir is a git-linked
worktree (.git is a file pointing back to parent repo); jj root
resolves to parent and cannot see these paths.
…-restart

Root cause: the post-restart healthz check used a single 5s curl with
'|| warn' — a service in systemd Restart=always loop (e.g. broker
crashing on BROKER_AUTH_METHODS=email_link with binary built without
--features auth-email-link) shows up as a one-line warn the operator
scrolls past, and the script exits 0. Operator declares the host
healthy, then 30 minutes later hits 502 Bad Gateway from nginx and
has to re-diagnose from scratch.

Three changes:

1. scripts/setup-broker-host.sh — replace the warn-only one-shot
   curl probes with probe_or_die(): poll /healthz for 20s per
   service (10x 2s with --max-time 2), and on persistent failure
   dump 'systemctl status' + last 40 journal lines for the failing
   unit, then die with a fix-list naming the three most common
   boot crashes (gated-out feature, missing FROM address, AWS creds).

2. docs/stage7-demo-and-verification.md §0.4 prereq #2 — instruct
   operator to 'rm -f target/release/agentkeys-broker-server' before
   re-running the script (cargo's incremental cache occasionally
   leaves the wrong artifact in place when feature flags change
   across rebuilds; clean target avoids the failure mode entirely).
   Plus a '502 Bad Gateway' troubleshooting block pointing at the
   journal grep + the canonical fix.

3. Same doc — name the exact boot-crash error string ('unknown or
   feature-gated-out auth method') the next operator will see, so
   they don't have to round-trip with logs. Per runbook-fix-fold-back
   policy: every operator-encountered failure makes the runbook
   strictly more robust before we move on.
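
A minimal sketch of the polling probe from change 1, assuming the 10 x 2s schedule described above. The function name `probe_unit_health` and the `PROBE_ATTEMPTS` / `PROBE_INTERVAL` knobs are hypothetical, and the sketch returns non-zero instead of exiting so the caller decides how to die.

```shell
# Hypothetical sketch. $1 = systemd unit, $2 = health URL.
# PROBE_ATTEMPTS and PROBE_INTERVAL are illustrative knobs that default to
# 10 attempts, 2 seconds apart.
probe_unit_health() {
  local unit="$1" url="$2" i=1
  while [ "$i" -le "${PROBE_ATTEMPTS:-10}" ]; do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "$unit healthy after $i attempt(s)"
      return 0
    fi
    sleep "${PROBE_INTERVAL:-2}"
    i=$((i + 1))
  done
  # Persistent failure: surface the evidence instead of a one-line warning.
  systemctl status "$unit" --no-pager >&2 || true
  journalctl -u "$unit" -n 40 --no-pager >&2 || true
  echo "FATAL: $unit never answered $url" >&2
  return 1
}
```

Polling across several attempts is what catches a unit stuck in a restart loop, which a single curl can miss when it happens to land between crashes.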
…ed-mode case bug

Pass-by-pass cleanup of scripts/setup-broker-host.sh, behavior preserved
(verified by grep-locking 17 critical strings: env vars, ports, paths,
systemd unit names, feature flags, function calls). Net -75 lines (1019
-> 944, -7.4%).

Pass 1 — Dead code:
- Drop prompt_default() and prompt_choice() (defined but never called).
- Drop --skip-pull flag, PULL_SKIP var, and the redundant '! $PULL_SKIP'
  guard (the outer '[[ -n "$PULL_REF" ]]' already gates the pull).
  --skip-pull is now folded into the --upgrade no-op arm so existing
  callers still parse cleanly.

Pass 1b — Latent bug fix:
- The 'case "$CRED_MODE"' block in the trailing manual-steps section
  had a duplicate 'instance-profile)' arm: the FIRST one was reached
  but contained text describing 'none mode'; the SECOND (which had the
  correct instance-profile text) was unreachable dead code; and 'none'
  mode users got NO instructions at all because no 'none)' arm existed.
  Renamed the first arm to 'none)' so all three modes now print their
  intended manual-steps text.

Pass 2 — Duplicate consolidation:
- Three near-identical 'if [[ -d /etc/nginx/sites-enabled ]]; then ln
  -sf … fi' blocks (broker, signer-HTTPS, signer-HTTP-only) collapsed
  into ONE block after write_nginx_site returns. ln -sf is idempotent
  so this is behavior-equivalent.
- certbot install: 'case "$PM"' had two arms with identical package
  list ('certbot python3-certbot-nginx'); collapsed to a single
  '"${PM_INSTALL[@]}" certbot python3-certbot-nginx' invocation.

Pass 3 — Comment trim:
- 58-line header reduced to 18 lines: dropped the 'Order of operations'
  enumeration (duplicated by the section comments inline) and the
  --flag enumeration (duplicated by the case parser + --help dump).
  Kept the canonical 'CLAUDE.md says all remote-host changes go through
  this script' rule + out-of-scope list.

Idempotency audit (no changes needed — already correct):
  • build deps: apt/dnf -y, idempotent
  • rustup install: gated 'if ! have rustup'
  • systemctl stop: '|| true'
  • binary backup: gated 'if [[ -x ]]'
  • install -m 0755: overwrite-OK
  • useradd: gated 'if ! id -u agentkeys'
  • install -d: idempotent
  • DEV_KEY_SERVICE secret: gated 'if ! sudo test -s' (never regenerated)
  • systemd unit writes: tee overwrites — intended each run
  • nginx install: gated 'if ! have nginx'
  • nginx site write: tee overwrites — intended (handles HTTP→HTTPS flip)
  • sites-enabled ln -sf: -f forces, idempotent
  • certbot install: gated 'if ! have certbot'
  • ensure_broker_keypairs: per-keypair 'if sudo test -f' guard
  • daemon-reload, enable, restart: idempotent

Verification:
  bash -n scripts/setup-broker-host.sh   # syntax ok
  grep -F locked 17 critical strings     # all present
…ps auth-email-link

Root cause of the broker host's repeated 'BOOT_FAIL: BROKER_AUTH_METHODS=
"email_link": unknown or feature-gated-out auth method' even after a
fresh target/ rebuild: the script used a SINGLE cargo invocation to
build BOTH agentkeys-mock-server AND agentkeys-broker-server with
'--features agentkeys-broker-server/auth-email-link', and cargo
silently DROPS the feature flag in this multi-package selection mode.

Reproduced empirically with --message-format json:
  cargo build --release -p agentkeys-mock-server -p agentkeys-broker-server \
    --features agentkeys-broker-server/auth-email-link
  → broker compiled features: [audit-sqlite, auth-wallet-sig, default,
    wallet-keystore]   ← NO auth-email-link

vs the working separate form:
  cargo build --release -p agentkeys-broker-server --features auth-email-link
  → broker compiled features: [audit-sqlite, auth-email-link,
    auth-wallet-sig, default, wallet-keystore]   ← present

Fix:
1. Split the build into two separate cargo invocations — mock-server
   alone (default features), broker-server alone with the feature flag.
   Documented the footgun in a long block comment so the next person
   who 'optimizes' by re-merging them will read why before doing it.

2. Added a post-build sanity check: 'strings target/release/agentkeys-
   broker-server | grep /v1/auth/email/(request|verify)' must match
   before install + restart. If the cargo footgun ever resurfaces (or
   anyone introduces a similar feature-strip bug), the script dies HERE
   with a clear diagnostic instead of after install + systemd restart
   loop + journal dump.

Verified locally:
  bash -n scripts/setup-broker-host.sh             # syntax ok
  strings target/release/agentkeys-broker-server | grep /v1/auth/email
  → /v1/auth/email/request /v1/auth/email/verify /v1/auth/email/status
    /v1/auth/email/landing  (all four routes present)
…o clean -p

The previous fix (commit 6d75599) split the cargo build into separate
invocations to defeat the multi-package + --features footgun, but the
broker host STILL deployed binaries lacking auth-email-link. Two real
root causes survived:

1. CARGO INCREMENTAL CACHE: 'rm -f target/release/agentkeys-broker-server'
   only removed the output binary, not target/release/.fingerprint/
   nor the per-feature-set cached .rlib deps. On a host that previously
   built without auth-email-link, cargo's incremental could relink from
   stale deps and produce a binary missing the feature even when the
   build call was correct. Fix: 'cargo clean -p agentkeys-broker-server
   --release' before the rebuild — only ~1s, only this crate's cache.

2. WEAK VERIFICATION: 'strings | grep -qE "/v1/auth/email/request"'
   is a heuristic that:
     - false-positives on tower middleware names containing 'email'
     - false-negatives when LTO dedupes string literals across the binary
     - dies with an unactionable 'this is the cargo footgun' guess that
       was wrong (the call was correct; the host environment was the bug)
   Replace with: parse cargo's own --message-format=json output and
   ASSERT auth-email-link is in the bin artifact's features list.
   Cargo's reported features ARE the truth — no heuristic.

Critical bash detail: cargo --message-format=json sends NDJSON to stdout
and compiler messages to stderr. Merging them with '2>&1' corrupts the
NDJSON and jq dies with 'Invalid numeric literal at line N column M'.
The script now redirects them to separate temp files
(BUILD_JSON / BUILD_ERR) and only mixes them in the diagnostic 'tail
-30' on failure.
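
The JSON assertion and the stdout/stderr split can be sketched as follows. This is an approximation, not the script's code: the function name `feature_in_artifact` is hypothetical, and the jq filter relies on the `reason`, `target.kind`, `target.name` and `features` fields of cargo's `compiler-artifact` messages.

```shell
# Hypothetical sketch: read cargo's NDJSON from the file in $1 and succeed
# only if the bin target named $2 was compiled with the feature in $3.
feature_in_artifact() {
  jq -e -s --arg name "$2" --arg feat "$3" '
    any(.[];
        .reason == "compiler-artifact"
        and .target.name == $name
        and (.target.kind | any(. == "bin"))
        and (.features | any(. == $feat)))
  ' "$1" >/dev/null
}

# Usage: keep stdout (NDJSON) and stderr (compiler output) in separate files.
#   cargo build --release -p agentkeys-broker-server --features auth-email-link \
#     --message-format=json >"$BUILD_JSON" 2>"$BUILD_ERR"
#   feature_in_artifact "$BUILD_JSON" agentkeys-broker-server auth-email-link
```

Because the function only ever reads the stdout file, stray compiler text on stderr cannot break the parse.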

The strings check is kept as belt-and-suspenders (catches the 'cargo
claims success but binary on disk is stale' edge case). Switched to
'grep -aFq' per codex review: -a forces text mode (some Linux strings
implementations differ on binary detection), -F treats the route as a
fixed string (no regex interpretation of '/').

If cargo reports auth-email-link is NOT enabled despite --features
auth-email-link, the new die message lists 5 specific things to check
($HOME/.cargo/config.toml, workspace .cargo/config.toml, env vars,
'which cargo', Cargo.lock drift) instead of guessing.

Verified locally:
  - cargo clean -p removes 17 files / 61.8MiB (only broker artifacts)
  - cargo --message-format=json reports features=[audit-sqlite,
    auth-email-link, auth-wallet-sig, default, wallet-keystore]
  - assertion passes; strings check passes
…re paths

Per CLAUDE.md runbook-fix-fold-back: now that scripts/setup-broker-host.sh
catches the cargo-feature-not-enabled case at build-time (commit c235373's
--message-format=json assertion), the operator-facing troubleshooting
needs two distinct entries:

1. Build-time die ('cargo did NOT enable auth-email-link'): host has a
   .cargo/config.toml or env-var override; script lists 5 things to
   check before the operator should file an issue.
2. Boot-time BOOT_FAIL: now historical (defended by both cargo clean -p
   AND the JSON assertion); kept as a fallback diagnostic for the case
   where the broker was started outside the script.

If the boot-time BOOT_FAIL ever recurs on a fresh re-deploy, the doc
now points the operator at 'bash -x' tracing instead of the previous
generic 'rm -f && re-run' fix that no longer applies.
…nm to warn

Reported failure: on Ubuntu with rustc 1.95.0, the script dies with
'binary on disk does not match cargo's reported feature set' even
though cargo --message-format=json correctly reports auth-email-link
is enabled. The 'strings | grep' belt-and-suspenders check is a false
negative on this combination — likely rustc 1.95 MIR opts or Ubuntu
binutils' strings defaults differ from macOS, splitting/stripping the
route literal in ways grep doesn't see.

Cargo's JSON output IS the canonical truth. If cargo says the feature
is enabled, it IS enabled — the post-build sanity check should not
override that with a heuristic.

Three changes:

1. Drop the 'strings die' entirely: it produced a false failure on a
   correctly built binary, blocking the deploy AFTER cargo had already
   confirmed success.

2. Replace with a 'nm' symbol-table check (more reliable than strings;
   symbols are link-time evidence the function is compiled in). But
   keep it WARN-only: if nm doesn't see the symbols on this rustc
   version either, that's a diagnostic signal, not a stop signal.

3. probe_or_die post-restart is the canonical runtime gate. If the
   binary really lacks the feature, the broker BOOT_FAILs with
   'unknown auth method' and probe_or_die catches it within 20s with
   the journal output. So we lose nothing by trusting cargo here.

Tested locally:
  - nm sees 5+ email-link symbols on macOS
  - cargo JSON assertion still fires on bad builds
  - probe_or_die remains the runtime safety net

The user can now re-pull + re-run setup-broker-host.sh and the build
phase will succeed (because cargo's truth is trusted). If the binary
is actually broken, probe_or_die catches it post-restart with full
journal output.
…n needed

User feedback: 'cargo clean -p' on every re-deploy adds 3-5min full
rebuild — too slow for the common case where the cache is fine.

New behavior:

  Default (no flag):  incremental build, no clean. Assert via cargo's
                      JSON output that auth-email-link is enabled. If
                      the assertion misses, SELF-HEAL by running
                      'cargo clean -p' + rebuild ONCE. Failing the
                      retry is a real environment bug (host config
                      override, env var pin) and dies with diagnostics.
                      → Fast path: ~10-30s on warm cache.

  --clean             Force 'cargo clean -p' upfront before the build.
                      Use after a feature flag flip when you KNOW
                      cargo's cache will mislead. → 3-5min full rebuild.

  --no-clean          Never clean; trust incremental cache. Disable
                      self-heal too — die immediately on assertion miss.
                      Use in CI / unattended re-deploys where you want
                      hermetic, fast, fail-loud behavior.

Also: the assertion now treats 'cargo emitted no compiler-artifact'
(incremental cache hit, nothing to rebuild) as a PASS rather than a
fail. Without the artifact line cargo is saying 'binary on disk is
unchanged from last build' — that's fine, because last build was
either also under this script's control (with the assertion) or the
assertion will trigger the rebuild path.

Refactored into two helpers (build_broker_with_features +
assert_feature_enabled) to make the auto/--clean/--no-clean dispatch
readable.

Verified locally:
  - default mode + warm cache: artifact emitted, features reported,
    assertion passes (~instant)
  - --clean: clean + rebuild + assertion passes
  - --no-clean: assertion-only, no retry on miss
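
The three-mode dispatch can be sketched with stand-in helpers. Everything named here is hypothetical (`build_broker`, `clean_broker`, `assert_feature`, `build_with_mode`, and the `FEATURE_OK` switch that fakes the assertion result). The sketch also assumes that only the default mode self-heals, since the commit does not say what --clean does on a miss.

```shell
# Hypothetical stand-ins; the real helpers run cargo and inspect its JSON.
build_broker()   { echo "build"; }
clean_broker()   { echo "clean"; }
assert_feature() { [ "${FEATURE_OK:-yes}" = "yes" ]; }

# $1 = auto | clean | no-clean
build_with_mode() {
  # --clean: clean upfront, before the first build.
  [ "$1" = "clean" ] && clean_broker
  build_broker
  assert_feature && return 0
  # Assertion missed. Only the default (auto) mode is allowed to self-heal.
  [ "$1" = "auto" ] || { echo "die: feature not enabled" >&2; return 1; }
  # Self-heal exactly once: clean, rebuild, re-check.
  clean_broker
  build_broker
  assert_feature || { echo "die: still not enabled after clean rebuild" >&2; return 1; }
}
```

Keeping the retry inside one function makes the "at most one clean rebuild" bound easy to see and to test.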
…n disk

Edge case: if a previous build completed successfully, then someone
manually 'rm target/release/agentkeys-broker-server' (e.g. trying to
force a rebuild), cargo's incremental cache says 'nothing changed'
and emits no compiler-artifact line. The previous logic treated that
as a pass and proceeded to install — which then failed with
'install: cannot stat /path: No such file or directory' instead of
something actionable.

Add a one-liner: when ENABLED_FEATURES is empty (no artifact line),
check that the binary actually exists at the expected path. If not,
return 1 so the self-heal path kicks in (cargo clean -p + rebuild).

Cheap (-x test, ~ms) and shores up the only remaining hole in the
incremental-build trust model.
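
The empty-artifact check described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `check_build_result` and its two arguments are hypothetical names, and the feature assertion itself is assumed to live elsewhere.

```shell
# Sketch only: function and argument names are hypothetical.
# Returns 0 when the build result can be trusted, 1 when a rebuild is needed.
check_build_result() {
  enabled_features="$1"   # features parsed from cargo's compiler-artifact line; may be empty
  binary_path="$2"        # where the install step expects the binary

  if [ -n "$enabled_features" ]; then
    return 0              # artifact line seen; the feature assertion runs separately
  fi
  # No artifact line: cargo claims nothing changed, so the binary must already exist.
  if [ -x "$binary_path" ]; then
    return 0
  fi
  return 1                # missing binary: let the self-heal path clean and rebuild
}
```

Returning 1 instead of exiting keeps the decision with the caller, which is what lets the default mode retry once while --no-clean dies immediately.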
… SES v2

Pass-2 broker (auth-email-link) hits AccessDeniedException at runtime
because the broker calls 'ses SendEmail' (SES v2 API) with its OWN
instance-profile credentials, but cloud-setup.md only granted SES
permission to the per-user-assumed agentkeys-data-role.

Two layered fixes:

1. cloud-setup.md §3.4 (agentkeys-broker-host instance profile): add
   a second put-role-policy call attaching 'BrokerSendEmail' with
   ses:SendEmail on both the domain identity and any per-address
   identity at that domain. The runbook had only sts:AssumeRole on
   this role, which was sufficient pre-Pass-2 but not anymore.

2. stage7-demo-and-verification.md §0.4 prereqs: add a troubleshooting
   block for the exact error string the operator sees:
     'broker rejected /v1/auth/email/request: status=502
      body={"error":"backend_unreachable",
      "message":"... ses SendEmail: unhandled error
       (AccessDeniedException)"}'
   with the one-shot fix command + explanation of WHY ses:SendEmail
   (not ses:SendRawEmail — different IAM action for sesv1 vs sesv2).

The IAM update propagates ~instantly; no broker restart needed (sesv2
picks up creds per-call).

Per CLAUDE.md runbook-fix-fold-back: every operator-encountered
failure makes the runbook strictly more robust before we move on.
… hardcoded name

Applied ses:SendEmail to the broker's actual runtime role
(S3-full-access — discovered via 'aws ec2 describe-instances' on
the live broker host). The existing docs assumed the canonical role
name 'agentkeys-broker-host' from §3.4 fresh setup, but legacy
deploys (this one included) use an ad-hoc legacy name from initial
provisioning that predates the broker.

Two doc changes:

1. cloud-setup.md — moved the SES grant out of §3.4 (where it was
   wrong: §3.4 is a clean-slate role-creation block, and operators
   running through it would get the grant for the wrong reasons).
   Added new §3.4a 'ses:SendEmail grant on the broker's runtime role
   (Pass 2 prereq)' with explicit two-step flow:
     Step 1: discover the actual role attached via the broker's EC2 IP
       ROLE=$(aws ec2 describe-instances --filters Name=ip-address,...)
       ROLE=$(aws iam get-instance-profile --instance-profile-name "$ROLE" ...)
     Step 2: aws iam put-role-policy --role-name "$ROLE" --policy-name BrokerSendEmail
   Both steps reference $ROLE (variable, set by discovery), NOT a
   hardcoded role name. Includes the verify command operators should
   run after.

2. stage7-demo-and-verification.md §0.4 troubleshooting block —
   updated to use the discovery-then-grant pattern instead of
   hardcoding 'agentkeys-broker-host'. Cross-links to §3.4a for the
   full flow.

Verified end-to-end: ran the discovery + grant against the live
broker host (i-0c0b739bd35643fd3 / S3-full-access role, elastic IP
54.164.117.252). The inline policy 'BrokerSendEmail' now grants
ses:SendEmail on:
  - arn:aws:ses:us-east-1:429071895007:identity/bots.litentry.org
  - arn:aws:ses:us-east-1:429071895007:identity/*@bots.litentry.org

No broker restart needed — sesv2 picks up the grant per-call.

Three related fixes addressing the user-encountered blocker (the CLI
polls forever because alice@demo.example uses a reserved RFC 2606
domain, so there is no inbox to click from):

1. NEW scripts/agentkeys-init-email-demo.sh — fully automated demo:
   • Picks demo-1@bots.litentry.org or demo-2@... by parity of unix
     epoch seconds (so consecutive runs don't collide on the broker's
     single-use token TTL).
   • Snapshots existing inbound/ keys BEFORE SendEmail so we only
     inspect arrivals NEW to this run (vs scanning 400+ stale objects).
   • Spawns 'agentkeys init --email' in background; polls S3 for the
     magic-link email; QP-decodes the body to extract
     '$OIDC_ISSUER/auth/email/landing#t=<token>'.
   • Lifts the token out of the URL fragment and POSTs
     {token: <t>} to /v1/auth/email/verify — replicating exactly
     what the browser-side JS in /auth/email/landing does (curling
     the landing URL alone wouldn't work; fragments don't ride in
     HTTP requests).
   • Cleans up the consumed S3 object on success.
   • Waits for agentkeys init to complete; dumps log + dies on
     timeout. Includes preflight that rejects wrong AWS profile
     (agentkey-broker user lacks ListBucket).

2. cloud-setup.md §3.4a:
   • Step 2: grant now includes BOTH ses:SendEmail (per-request) AND
     ses:GetEmailIdentity (verify_sender_ready startup probe).
     Previously the broker BOOT_FAILED on GetEmailIdentity for any
     fresh deploy with this section's recommended grant.
   • NEW Step 3 'security audit': explicit warning + commands to
     detach AmazonS3FullAccess and similar over-broad managed
     policies. The broker process at runtime ONLY uses aws-sdk-sts +
     aws-sdk-sesv2; per-user S3 access is via JWT-assumed
     agentkeys-data-role, NEVER via the broker's runtime role. A
     compromised broker with S3FullAccess could read every magic
     link in the inbound bucket.

3. stage7-demo-and-verification.md §0.4: replaced
   'agentkeys init --email alice@demo.example' (undeliverable) with
   the new auto-click helper as the RECOMMENDED path; kept manual
   alternative for operators with a real inbox they control. Explicit
   warning to not use example.com / demo.example.

Live broker IAM (i-0c0b739bd35643fd3, role 'S3-full-access'):
  • Inline 'BrokerSendEmail': ses:SendEmail + ses:GetEmailIdentity
    on identity/bots.litentry.org + identity/*@bots.litentry.org
  • Detached: AmazonS3FullAccess (was: full read/write on all account
    buckets, including the verification-token bucket)
  • Final state: 1 inline policy, 0 attached policies, all least-
    privilege.

The script's auto-click flow is also a useful regression-test loop —
the user wanted '1 or 2 emails for test' so we can drive a full
auth round-trip without a human in the loop.

The polling loop waited the full 2-min budget for an email that would
never arrive if 'agentkeys init' had already exited (broker rejection,
signer unauthorized, etc.). Add a 'kill -0 $init_pid' check at the
top of each iteration: if init is gone, dump its log and die. Cuts
the failure-mode latency from 2 min to ~5s and surfaces the actual
error from init's stdout/stderr.
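
A liveness-checked polling loop of the kind described above might look like the sketch below. The name `wait_for_email`, the return codes, and the `EMAIL_FOUND` placeholder are assumptions made for illustration; the real script polls S3 at that point.

```shell
# Sketch only: wait_for_email is a hypothetical name and the email check is a placeholder.
# Returns 0 if the email arrived, 1 if the watched process died, 2 on timeout.
wait_for_email() {
  init_pid="$1"; max_attempts="$2"; interval="$3"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Liveness check first: kill -0 sends no signal, it only tests that the PID exists.
    if ! kill -0 "$init_pid" 2>/dev/null; then
      echo "init (pid $init_pid) exited before the email arrived" >&2
      return 1
    fi
    # Placeholder for the real S3 poll; EMAIL_FOUND is an assumption for this sketch.
    if [ "${EMAIL_FOUND:-0}" = "1" ]; then
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 2
}
```

Checking liveness before polling means a dead child is reported within one interval rather than after the full budget.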

User hit: 'REGION env var required (source operator-workstation.env)'
even after sourcing the env file. Root cause: they ran the script
with sudo, which (per most distros' default sudoers) strips env to
PATH/USER/HOME/TERM/MAIL only — REGION/MAIL_DOMAIN/MAIL_BUCKET/
OIDC_ISSUER/BACKEND_URL all vanish in the child process and the
script dies on the first ${VAR:?...} guard.

The script doesn't need root: AWS calls use the operator's profile
(in shell env), and 'agentkeys init' writes the session JWT to the
USER's OS keychain. Running under sudo would actually break things
even if env was preserved (keychain lookup would target root's
keychain, not the operator's).

Two changes:

1. scripts/agentkeys-init-email-demo.sh: detect SUDO_USER at start
   and die loud with the exact re-run command, before the cryptic
   env-var guard fires.

2. docs/stage7-demo-and-verification.md §0.4: explicit
   'Do NOT prefix sudo' note next to the recommended invocation,
   explaining why (env stripping + wrong keychain).
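
The sudo guard can be sketched as below. `refuse_sudo` is a hypothetical helper name and the message wording is illustrative; the only fact relied on is that sudo sets `SUDO_USER` in the child environment.

```shell
# Sketch only: refuse_sudo is a hypothetical helper name.
# sudo exports SUDO_USER, so a non-empty value means the script was launched via sudo.
refuse_sudo() {
  if [ -n "${SUDO_USER:-}" ]; then
    echo "do not run this script under sudo; re-run as $SUDO_USER without the sudo prefix" >&2
    return 1
  fi
  return 0
}
```

A real script would likely call this before any `${VAR:?...}` guard and exit on failure, so the operator sees the actionable message first.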
…-true)

Bug: 'aws --output text' returns keys TAB-separated. The previous
substring check 'case " $pre_keys " in *" $k "*' looked for
SPACE-surrounded matches, so every key in current_keys missed and
every poll attempt reported all 415+ pre-existing keys as 'new'.
Functionally correct (the per-key body grep still narrows down to
the magic-link email) but ~415 needless 'aws s3 cp' calls per
attempt — slow.

Fix: build a bash associative array (pre_set[$k]=1) at snapshot
time. O(1) membership check per key in the polling loop. Switch
new_keys from a space-separated string to a proper array so it
works regardless of key contents.

Verified locally: bash -n syntax ok; empty-array iteration safe
under 'set -euo pipefail' (declare -a + "${new_keys[@]}").

User error: 'declare: -A: invalid option'. macOS ships /bin/bash 3.2
forever (Apple GPLv3 freeze) and the script's shebang resolves there.
'declare -A' (associative arrays) requires bash 4+.

Replace the associative-array set with a string-based set:
  PRE_KEYS_SET=' $pre_keys_text '   # leading + trailing spaces
  case "$PRE_KEYS_SET" in *" $k "*) continue ;; esac

Bash-3.2 compatible. SES-generated S3 keys are alphanumeric (no
spaces), so the space delimiter is exact-match safe. Piping through
"tr '\t' ' '" normalizes the tab-separated 'aws --output text' output upfront.

Verified locally under /bin/bash 3.2.57:
  - syntax check passes
  - isolated dry-run: 5 pre-existing keys, 1 new arrival → set
    difference correctly returns just the new key

Indexed arrays + array+= and "${arr[@]}" iteration are bash 3.1+,
so the rest of the script (new_keys array) still works.
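
The string-based set difference can be illustrated with the sketch below. `new_keys_since` is a hypothetical name, and the code assumes, as the message above does, that keys contain no whitespace or glob characters.

```shell
# Sketch only: names are hypothetical; keys are assumed to contain no whitespace.
# Prints the keys in $2 that are absent from $1, one per line.
new_keys_since() {
  # Normalize tabs and newlines to spaces, then pad both ends so every key is space-delimited.
  pre_set=" $(printf '%s' "$1" | tr '\t\n' '  ') "
  for k in $(printf '%s' "$2" | tr '\t\n' '  '); do
    case "$pre_set" in
      *" $k "*) continue ;;   # already present at snapshot time
    esac
    printf '%s\n' "$k"
  done
}
```

The padding spaces are what make the match exact: a key `a` does not match inside `a1` because the pattern requires a space on both sides.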
…r ran on success

Bash precedence: '|' binds tighter than '||'. So
  cmd1 2>/dev/null || true | tr '\t' ' '
parses as
  cmd1 2>/dev/null || (true | tr '\t' ' ')
meaning tr ONLY runs if aws fails. On success (the common path)
pre_keys_text remained tab-separated, the case-pattern
'*" $k "*' looked for space-surrounded matches, every key
missed, every poll attempt reported all 417 keys as 'new'.

The earlier '/bin/bash isolated dry-run' didn't reproduce because
it used a different invocation form (printf piped to tr) that
wasn't subject to this precedence trap.

Fix: group with braces so the pipe gets the output of either
branch:
  { cmd1 2>/dev/null || true; } | tr '\t' ' '

Verified live against the actual 417-object inbound bucket under
/bin/bash 3.2.57:
  - pre_keys_text now space-separated (no tabs detected)
  - same-list comparison correctly returns 0 new keys
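
The precedence trap is easy to reproduce in isolation. In the sketch below, `list_keys` is a hypothetical stand-in for the 'aws ... --output text' call; everything else is plain shell.

```shell
# Sketch only: list_keys stands in for the 'aws ... --output text' call.
list_keys() { printf 'k1\tk2\tk3\n'; }

# Buggy form: parses as  list_keys || (true | tr ...), so tr never sees the output
# when list_keys succeeds.
buggy=$(list_keys 2>/dev/null || true | tr '\t' ' ')

# Fixed form: braces group both branches, and the pipe applies to the whole group.
fixed=$({ list_keys 2>/dev/null || true; } | tr '\t' ' ')
```

After this runs, `buggy` still contains tabs while `fixed` is space-separated, which matches the behavior described above.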

Magic-link demo (scripts/agentkeys-init-email-demo.sh) was failing
after the broker accepted the click ({"ok":true}) but before
returning the derived wallet. The error was 'signer error:
unauthorized: missing Authorization: Bearer <jwt> header'.

Root cause: in crates/agentkeys-core/src/init_flow.rs, two HTTP signer
calls used HttpSignerClient::new() WITHOUT chaining .with_session_jwt():

  - derive_via_signer  (line 261): creates client without JWT, /dev/derive-address fails 401
  - siwe_round_trip    (line 314): creates client without JWT, /dev/sign-message fails 401

The standalone agentkeys signer derive / signer sign CLI commands DO
chain .with_session_jwt(session.token) from the keychain (lib.rs:1169),
but the in-flow init_via_email_link path also has the identity-session
JWT in hand (just minted by the broker after the magic-link click), so
it just needs to be threaded through. Fixed both call sites + added
#[allow(clippy::too_many_arguments)] on finish_init (which was already
at 8 args — pre-existing clippy warning that surfaced after the audit).

Doc fold-back: stage7-demo-and-verification.md §3 'Mint OIDC JWT for
STS' previously assumed $SESSION_JWT_A was already populated, but the
§2.0 path ('agentkeys init --email') leaves the JWT in the keychain
or file fallback with no CLI extraction wrapper. Added explicit
instructions for both §2.0 (file fallback / macOS Keychain) and
§2.1-2.4 (manual SIWE response capture) paths.

Self-check (all 5 steps green against live broker.litentry.org):
  1. agentkeys signer derive  → 0x885904faf3d5624a30b0427078015d0072f604ea
  2. agentkeys signer sign    → 132-char sig
  3. broker /healthz          → 200
  4. /v1/mint-oidc-jwt        → 692-char OIDC JWT with correct
                                aws.amazon.com/tags claims
  5. AssumeRoleWithWebIdentity → assumed-role/agentkeys-data-role/...

Stage 7 demo flow validated end-to-end through §4.1 (STS exchange).
§4.2-4.3 (S3 isolation probe) requires writing to the production
bucket and is left to explicit operator authorization.
…tion

The broker's SES outbound mails are pure-ASCII so the parts are
7bit-encoded — the magic-link URL appears in the body with a LITERAL
'=' between 't' and the base64url token:
  https://broker.litentry.org/auth/email/landing#t=Kwm1lO8z...

The previous regex looked only for 't=3D' (QP-encoded form). It never
matched on production emails, so the script timed out polling even
though the email had arrived in S3.

Fix: alternation '#t=(3D)?[A-Za-z0-9_-]+' matches both forms, then
'sed s/#t=3D/#t=/' normalizes to literal-'='. Verified by extracting
against an actual stored email — token came out clean and POSTs to
/v1/auth/email/verify succeed with {"ok":true}.
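
The two-form extraction can be sketched as a small filter. `extract_magic_token` is a hypothetical name; the regex and the sed normalization are the ones quoted above. The sketch does not handle quoted-printable soft line breaks, and a token that itself begins with "3D" would be mis-normalized.

```shell
# Sketch only: extract_magic_token is a hypothetical name.
# Reads an email body on stdin and prints the first token found.
extract_magic_token() {
  grep -oE '#t=(3D)?[A-Za-z0-9_-]+' \
    | sed 's/#t=3D/#t=/' \
    | head -n 1 \
    | sed 's/^#t=//'
}
```

Usage would be along the lines of `extract_magic_token < email.eml`, with the result then POSTed to /v1/auth/email/verify.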
…rofile defaults to us-west-2

The agentkeys-admin local profile defaults to us-west-2 (verified via
`aws --profile agentkeys-admin configure get region`), while every
broker-side resource (EC2, S3 mail bucket, SES identity) lives in
us-east-1. Without an explicit `--region "$REGION"` on every regional
AWS CLI call, the agentkeys-admin profile silently searches the wrong
region — describe-instances returns empty (no error, exit 0), and the
downstream `iam put-role-policy --role-name ""` silently no-ops.

Real symptom (this session): operator ran the §0.4 ROLE discovery snippet
under awsp agentkeys-admin → ROLE came back empty → SES grant never
landed. Diagnosis took two rounds because there's no stderr signal.

Changes:
- CLAUDE.md: new "AWS local-profile ↔ remote-IAM mapping" section
  documenting (a) the three-profile table, (b) the per-profile region
  divergence trap (agentkeys-admin=us-west-2, others=us-east-1), and
  (c) case-insensitive caller-arn matching since the remote IAM user
  is agentKeys-admin (capital K) vs local agentkeys-admin (lowercase).
- docs/stage7-demo-and-verification.md §0.4: ROLE discovery now passes
  --region "$REGION" + fail-loud guard on empty INSTANCE_PROFILE_ARN.
  Plus 5x s3api lines (§4.2 + §16) gain --region.
- docs/cloud-setup.md §3.4a: ROLE discovery rewritten with --region +
  fail-loud guard. Plus 5x s3api lines (bucket-policy + lifecycle +
  delete-bucket + access-block) gain --region.
- scripts/inspect-inbound-email.sh: require REGION up-front (loud-fail
  guard); pass --region "$REGION" on all 4 aws calls.
- scripts/ses-verify-sender.sh: case-insensitive caller-arn match
  (`tr [:upper:] [:lower:]` — portable to /bin/bash 3.2) so
  agentKeys-admin (capital K) no longer triggers the bogus "caller is
  not agentkeys-admin" warning.

Verified end-to-end under AWS_PROFILE=agentkeys-admin (profile region
us-west-2): ROLE discovery now returns S3-full-access correctly;
inspect-inbound-email.sh runs cleanly; ses-verify-sender.sh no longer
emits the spurious warning.
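
The fail-loud guard and the case-insensitive caller match can be sketched as two small helpers. Both names (`require_nonempty`, `caller_is`) are hypothetical, and the ARN in the usage line is a made-up example; in practice it would come from 'aws sts get-caller-identity'.

```shell
# Sketch only: both helper names are hypothetical.

# Fail loudly when a discovery step returned nothing (for example, wrong region).
require_nonempty() {
  name="$1"; value="$2"
  if [ -z "$value" ]; then
    echo "$name is empty; check --region and the active AWS profile" >&2
    return 1
  fi
}

# Case-insensitive match of an IAM user ARN against an expected user name.
# tr with character classes works on /bin/bash 3.2, unlike ${var,,}.
caller_is() {
  arn_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  want_lc=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$arn_lc" in
    *":user/$want_lc") return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `caller_is 'arn:aws:iam::123456789012:user/agentKeys-admin' agentkeys-admin` succeeds despite the capital K.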

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eted in prod

Production broker EC2 (i-0c0b739bd35643fd3) was migrated 2026-05-12
from legacy `S3-full-access` instance profile to canonical
`agentkeys-broker-host`. Migration steps executed:

1. Created `agentkeys-broker-host` role + instance profile via
   `aws iam create-role` + `create-instance-profile` (matches
   cloud-setup.md §3.4 conventions).
2. Attached complete `BrokerSendEmail` inline policy on new role:
   `ses:SendEmail` AND `ses:GetEmailIdentity` (the latter folds in
   the perm gap that prevented `verify_sender_ready` from succeeding).
3. Atomically swapped EC2 instance profile via
   `aws ec2 replace-iam-instance-profile-association` (no creds gap).
4. Verified broker /healthz=200 + sent two test emails through the new
   role (HTTP 200, request_id eml-bf4e..., eml-2aff...).
5. Cleaned up legacy artifacts: removed role from old profile, deleted
   inline policy + role + instance profile, revoked the temporary
   `ec2:Describe/ReplaceIamInstanceProfileAssociations` grant on
   `agentKeys-admin` IAM user.

Doc updates:
- cloud-setup.md §3.4a: drops "may use ad-hoc S3-full-access from
  initial provisioning" framing — fully retired. Discovery snippet
  retained because it's robust against any future drift.
- stage7-demo-and-verification.md §0.4 troubleshooting block: same.
  Drops the `legacy/fresh` distinction that no longer applies.

Known follow-up (separate scope, spawned task):
`/readyz` still returns 503 with "SES verification cache absent at
/var/lib/agentkeys/.agentkeys/broker/ses-verify.json" — this is a
pre-existing bug independent of IAM. Production code never calls
`verify_sender_ready()` and never invokes `SesVerifyCache::save()`,
so the cache file is never populated. The IAM permission is now in
place (this commit's `agentkeys-broker-host` role has
`ses:GetEmailIdentity`), so once the boot path wires
`verify_sender_ready()` + `cache.save()` /readyz will turn green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The email-link plug-in's `Readiness::ready()` reads `SesVerifyCache`
from disk and reports `auth/email_link: SES verification cache absent`
when the file is missing. No production code path called
`verify_sender_ready()` or `SesVerifyCache::save()`, so /readyz was
permanently 503-degraded on this check even when SES was configured
correctly and email-link auth worked end-to-end.

Add a Tier-2 probe spawned alongside the existing backend probe:
calls `sender.verify_sender_ready()`, writes the cache on success,
flips `Tier2State::ses_verified`. Exponential backoff up to 5min on
failure (non-blocking; honors BROKER_REFUSE_TO_BOOT_STRICT). After a
success, re-verifies every 12h so the cache stays well under the
plug-in's 24h freshness TTL.
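
The retry schedule can be illustrated with a capped-doubling sketch. Only the 5-minute cap comes from the message above; the doubling factor and the helper name `next_backoff` are assumptions, and the broker's Rust implementation may differ.

```shell
# Sketch only: the doubling factor is an assumption; only the 300s cap is from the text.
# Prints the next delay in seconds given the current one.
next_backoff() {
  current="$1"; cap=300
  next=$((current * 2))
  if [ "$next" -gt "$cap" ]; then
    next="$cap"
  fi
  printf '%s\n' "$next"
}
```

Starting from 1 second this yields 2, 4, 8, and so on, settling at 300 seconds once the cap is reached.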
…tEmailIdentity grant + 722a990 verify-probe note

§0.4 troubleshooting block updated for the post-rename world:

- Lead with the canonical role: "Broker IAM role: `agentkeys-broker-host`"
  (was: "the role name varies by deployment ... legacy may use S3-full-access").
- Document the **complete** BrokerSendEmail policy: BOTH `ses:SendEmail`
  AND `ses:GetEmailIdentity`. Previously the grant snippet only granted
  SendEmail; the missing GetEmailIdentity perm was why /readyz reported
  `auth/email_link: SES verification cache absent` even when SES was
  working. Both actions now in the put-role-policy snippet AND in the
  copy-paste verify command (`aws iam get-role-policy ...`).
- Reframe AccessDeniedException troubleshooting: from "find the unknown
  role name" → "verify it's still agentkeys-broker-host (defensive
  against future drift)". The discovery snippet stays — robust against
  future instance-profile churn — but the verify expected output now
  references the canonical name explicitly.
- Add the restart-needed nuance for the verify probe: SendEmail picks
  up creds per-call (no restart needed), but the Tier-2 verify probe
  (commit 722a990) runs once at boot then every 12h, so adding
  GetEmailIdentity requires a broker restart for /readyz to reflect it.

Production verified: `aws iam get-role-policy ... BrokerSendEmail` returns
`[["ses:SendEmail","ses:GetEmailIdentity"]]` exactly as the doc claims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Keychain-free CLI

Two operator-blocking traps surfaced while walking §0.4 against the
live broker; both fixed end-to-end.

Trap 1: signer rejects derive with "JWT omni_account claim does not
match request body". §0.4 used to call `signer derive --omni-account
$OMNI_A` where `$OMNI_A = sha("agentkeys","email","alice@demo.example")`
from §0.3 — but the session JWT minted by `agentkeys-init-email-demo.sh`
is for `demo-1@bots.litentry.org` (or demo-2 on rotation). After issue
#74 step 1b's strict JWT-omni check, the signer requires
`JWT.omni_account == request.omni_account` exactly. The arbitrary
alice/bob omni never matches.

Fix:
- §0.3 reframed as "math reference only" — the helper recomputes the
  broker's omni formula so the operator can verify the algorithm,
  but the actual `OMNI_A` / `OMNI_B` come from the live session JWTs
  in §0.4 below.
- §0.4 adds a `decode_jwt_payload()` helper that pulls
  `agentkeys.omni_account` and `agentkeys.wallet_address` directly
  from `~/.agentkeys/master/session.json` (no signature verify — just
  base64-decoding the body for our local read).
- For the §4 isolation proof we now run `init-email-demo.sh` TWICE
  (the script's epoch-parity rotation between demo-1 and demo-2 gives
  two distinct sessions automatically; consecutive runs naturally
  yield two distinct (omni, wallet) pairs).
- Drops the wrong `ADDR_A == JWT.wallet_address` assertion. The
  signer derive returns the EVM-omni's wallet (post-SIWE-promoted
  identity), which is a *different* keypair from the email-omni's
  wallet stored in `JWT.agentkeys.wallet_address`. Both are real,
  both are derived by the same signer; they play different roles
  in the demo (the JWT's wallet_address was the SIWE signing key
  that bootstrapped the session; ADDR_A is the EVM-identity wallet
  used downstream for S3 path scoping).
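
A payload decoder of the kind §0.4 describes could look like the sketch below. This is not necessarily the helper from the doc: it is a minimal version that assumes a `base64` binary accepting `-d`, and it performs no signature verification.

```shell
# Sketch only: a minimal decoder, not necessarily the one in the doc.
# Prints the JSON payload (second dot-separated segment) of a JWT. No signature check.
decode_jwt_payload() {
  # base64url -> base64: swap the URL-safe alphabet back.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that JWTs strip.
  case $(( ${#payload} % 4 )) in
    2) payload="$payload==" ;;
    3) payload="$payload=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

Piping the result through something like `jq -r '.agentkeys.omni_account'` would then pull out the claim; that field path is taken from the description above and is not otherwise verified here.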

Trap 2: even with matching omni, `agentkeys signer derive` returned
`SIGNER_UNAUTHORIZED: invalid session JWT: InvalidToken` while a raw
`curl` with the same JWT succeeded. Root cause: the CLI defaults to
`KeyringMode::Auto` (crates/agentkeys-core/src/session_store.rs:86) —
Keychain first, file fallback. A stale Keychain entry from earlier
dev runs gets picked up and fed to the signer, which rejects the
signature. The user-visible symptom is also keychain access prompts
on every CLI call.

Fix:
- `scripts/operator-workstation.env` exports `AGENTKEYS_SESSION_STORE=file`,
  which forces `KeyringMode::FileOnly`. The demo is now Keychain-free
  end-to-end. Comment explains the trade-off (fresh-machine users can
  comment the line out to re-enable Keychain).
- §0.4 callout block documents the trap + the raw-curl fallback so an
  operator can self-diagnose "is it the JWT or the CLI?" in one step.

End-to-end verified under AWS_PROFILE=agentkeys-admin with the new
env: OMNI_A extracted from session.json's JWT decodes to
`402d4bac…`; `agentkeys --json signer derive --omni-account $OMNI_A`
returns `0xcd936bf34d3156e84cd2e479e267cf39d15a85a6` (HTTP 200, no
Keychain prompts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4 key-topology rewrite

User pain (during §0.4 walk-through):
  1. Each `init-email-demo.sh` run overwrites ~/.agentkeys/master/session.json,
     so back-to-back inits for the §4 two-actor isolation proof can't coexist.
  2. §0.4 forced operators to hand-decode the JWT in 6 lines of awk+base64 just
     to learn OMNI / ADDR — once per session, twice per demo, no rich output.
  3. The OMNI_B / ADDR_B / identity-omni / derived-wallet / evm-omni terminology
     was opaque: §0.4 didn't reconcile its own vars with `architecture.md` §3+§4
     (K3/K4, identity omni vs actor omni), so the operator couldn't tell which
     wallet AWS actually sees at the PrincipalTag step in §4.

Changes:
  - crates/agentkeys-cli: top-level `--session-id` flag (env AGENTKEYS_SESSION_ID),
    plumbed through CommandContext to session_store. Defaults to "master" so
    existing behavior is preserved. `with_session_id` ignores empty strings to
    keep a forgotten `AGENTKEYS_SESSION_ID=` shell-export from silently writing
    to ~/.agentkeys//session.json.
  - scripts/agentkeys-init-email-demo.sh: accepts `--session-id <name>` flag
    and exports AGENTKEYS_SESSION_ID so the background `agentkeys init` writes
    under ~/.agentkeys/<name>/. Two back-to-back runs with distinct ids leave
    both sessions live for the §4 proof — no need to re-init to switch. Auto-
    invokes scripts/agentkeys-demo-show.sh at the end so the operator sees the
    (omni, wallet) pair without a follow-up command.
  - scripts/agentkeys-demo-show.sh (new): one-shot rich-output inspector. Reads
    ~/.agentkeys/<id>/session.json, decodes the JWT body, prints
      • identity (type, value, locally-recomputed identity_omni)
      • actor   (actor_omni, master_wallet)
      • signer-wire smoke test (HKDF(K3, actor_omni) — a SECOND wallet,
                                  flagged NOT-used-for-AWS in the output)
      • JWT TTL remaining
    Supports --json, --no-derive, and positional session-id. Bash-3.2 portable
    (no `${var,,}`, no `mapfile`, jq+awk+base64 only).
  - docs/stage7-demo-and-verification.md §0.3: corrected the "both omnis end up
    in the JWT's `agentkeys` claim" line — the FINAL JWT carries only the EVM
    actor omni (the identity-omni is transient and consumed at SIWE-verify).
    Cross-linked the truth to crates/agentkeys-broker-server/src/handlers/auth/
    wallet_verify.rs:51.
  - docs/stage7-demo-and-verification.md §0.4: new "Key topology" subsection
    that names the three wallets the demo conflates today —
      identity_omni  → SHA256("agentkeys"||"email"||email), transient, NOT in JWT
      MASTER_WALLET  → HKDF(K3, identity_omni_email), the SIWE-linked wallet, JWT.wallet_address
      ADDR (= W2)    → HKDF(K3, actor_omni), what §2's SIWE round-trip uses and
                       what §4's S3 isolation actually tags via §2.3's fresh JWT
    Both wallets are real, signable, and deterministic; §2.2's `signer sign`
    only works for ADDR because the strict JWT-omni check forces the signed
    omni to match the JWT's actor_omni. Updated the §0.4 capture block to use
    the new demo-show.sh JSON output for both OMNI_A/ADDR_A and an explicit
    MASTER_WALLET_A side-channel for cross-reference. Cross-linked
    crates/agentkeys-broker-server/src/handlers/oidc.rs:106 (the line that
    decides which wallet AWS sees).

End-to-end verified locally:
  bash scripts/agentkeys-demo-show.sh --no-derive master           → rich text
  bash scripts/agentkeys-demo-show.sh --no-derive --json master    → JSON shape
  bash scripts/agentkeys-demo-show.sh --no-derive nonexistent      → loud fail
  cargo run -p agentkeys-cli -- --help | grep session-id           → exposed
  AGENTKEYS_SESSION_ID=alice cargo run -p agentkeys-cli -- --help  → env wired
  cargo test -p agentkeys-cli --lib                                → green
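
The empty-string handling has a direct shell analogue, sketched below. `session_file` is a hypothetical helper; the path layout ~/.agentkeys/<id>/session.json and the "master" default are from the message above.

```shell
# Sketch only: session_file is a hypothetical helper name.
# Resolves the session file for an id, treating unset and empty ids the same way.
session_file() {
  id="${1:-master}"   # ':-' substitutes the default for both unset and empty values
  printf '%s/.agentkeys/%s/session.json\n' "$HOME" "$id"
}
```

With this, a forgotten `AGENTKEYS_SESSION_ID=` export resolves to the master session rather than to a path with an empty directory component.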
… prereqs

User hit a silent-failure trap walking §0.4 today: ran
`bash scripts/agentkeys-init-email-demo.sh --session-id alice`, the
script reported success ("Initialized via email-link..."), but the
session landed at ~/.agentkeys/master/session.json instead of
~/.agentkeys/alice/session.json — and demo-show.sh then failed with
"no session file at ~/.agentkeys/alice/session.json".

Root cause: the `agentkeys` binary on $PATH was built before today
(2026-05-12). The `--session-id` flag (and its env=AGENTKEYS_SESSION_ID
binding) is a clap declaration in the binary — an older binary silently
ignores the env var, falls back to the hardcoded "master" default, and
writes to ~/.agentkeys/master/.

Diagnose-before-edit verified by:
  command -v agentkeys → /Users/<you>/.local/bin/agentkeys (May 11 21:01)
  agentkeys --help | grep session-id → empty (no flag)
  ls -la ~/.agentkeys/master/session.json → freshly written
  ls ~/.agentkeys/alice/ → no such directory

Fix lands in THREE places (per runbook-fix-fold-back):

  1. scripts/agentkeys-init-email-demo.sh — preflight that `agentkeys
     --help` exposes `--session-id`. Dies loud with the exact rebuild
     command (`cargo install --path crates/agentkeys-cli --force`) and
     the verify-after command. Catches the trap BEFORE the script burns
     2 minutes polling for an email + writing to the wrong session-id.

  2. scripts/agentkeys-demo-show.sh — same capability check inside the
     signer-derive branch. Without it, a stale binary feeding the
     wrong --session-id to `signer derive` would silently re-derive
     against the master session's omni, masking the real diagnosis.

  3. docs/stage7-demo-and-verification.md §0 prereqs — step 6 after the
     existing `agentkeys --version` check that re-runs the same grep
     and dies if absent. Folds the diagnosis inline so the next
     operator catches the stale binary at the moment they're already
     looking at install output — no need to discover the trap by
     watching init-email-demo.sh "succeed" first.

Verified locally:
  REGION=u MAIL_DOMAIN=t MAIL_BUCKET=t OIDC_ISSUER=https://t BACKEND_URL=https://t \
    bash scripts/agentkeys-init-email-demo.sh --session-id alice
  → "stale 'agentkeys' binary at /Users/agent-jojo/.local/bin/agentkeys
     — missing --session-id flag. Rebuild + reinstall from this worktree:
     cargo install --path crates/agentkeys-cli --force"
  → exit 1 (no S3 polling, no SES SendEmail)
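
The capability preflight can be sketched as a generic flag probe. `require_flag` is a hypothetical name, and the error text is illustrative rather than the script's exact wording.

```shell
# Sketch only: require_flag is a hypothetical helper name.
# Succeeds if "<cmd> --help" mentions the given flag, fails loudly otherwise.
require_flag() {
  cmd="$1"; flag="$2"
  # -e keeps grep from parsing a pattern that starts with '--' as an option.
  if "$cmd" --help 2>&1 | grep -q -e "$flag"; then
    return 0
  fi
  echo "stale '$cmd' binary: missing $flag flag; rebuild and reinstall" >&2
  return 1
}
```

Calling `require_flag agentkeys --session-id` before any SES or S3 work would catch the stale-binary case up front.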
… --export mode

Three operator-blockers landed today walking §0.4:

  1. `--session-id alice` and `--session-id bob` produced the SAME wallet
     because the legacy default recipient rotated demo-1/demo-2 by epoch
     parity — two back-to-back runs hit the same parity, got the same
     recipient, computed the same identity_omni (SHA256 of the same
     inputs), and thus derived the same MASTER_WALLET (HKDF is
     deterministic). The §4 isolation proof becomes
     vacuous (same actor → same prefix → trivially "allowed both
     reads"; demo doesn't prove anything).

  2. The `init-email-demo.sh` log + demo-show.sh output named the
     identity_omni hex but did NOT show the SHA256 inputs (type, value),
     so the operator couldn't reproduce the math by hand or diagnose
     why two different sessions collided.

  3. §0.4 had three `jq -r` extractions per session to pull OMNI / ADDR
     / MASTER_WALLET out of `--json` — 6 lines for two sessions, with
     the field paths hand-typed and easy to mis-name. The doc + the
     show script weren't a single source of truth.

Fixes:

  - scripts/agentkeys-init-email-demo.sh — new recipient precedence:
    $RECIPIENT > positional arg > $SESSION_ID-derived (when not "master")
    > legacy demo-1/demo-2 rotation. With `--session-id alice` the
    recipient is now alice@$MAIL_DOMAIN deterministically, NOT a
    rotating demo-N. The log now prints the computed identity_omni and
    the SHA256 formula inline so collisions are visible BEFORE SES
    SendEmail fires.

  - scripts/agentkeys-demo-show.sh — new `--export <prefix>` mode emits
    eval-able shell assignments:
        SESSION_ID_<P>=…   OMNI_<P>=…   ADDR_<P>=…   MASTER_WALLET_<P>=…
        IDENTITY_TYPE_<P>=…   IDENTITY_VALUE_<P>=…   IDENTITY_OMNI_<P>=…
    so the doc / an operator script can capture all seven fields with
    one `eval "$(bash scripts/agentkeys-demo-show.sh --export A alice)"`.
    Values are `printf %q`-escaped — survives eval with arbitrary
    content. The human-readable output now shows the full
    `= SHA256("agentkeys" || "<type>" || "<value>")` formula under the
    identity_omni line so the math is reproducible at a glance.

  - docs/stage7-demo-and-verification.md §0.4 — replaced the 12-line
    `--json | jq -r` extraction block with two `eval` calls + a new
    collision-diagnostic that explains exactly why MASTER_WALLET_A ==
    MASTER_WALLET_B can happen (same recipient → same identity_omni)
    and what the fix is.
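
The eval-able export format can be sketched as below. `emit_export` and its name/value argument pairs are hypothetical; the sketch assumes bash, since `%q` is not supported by every shell's printf.

```shell
# Sketch only: emit_export is a hypothetical helper; requires bash for printf %q.
# Usage: emit_export <prefix> NAME value [NAME value ...]
emit_export() {
  prefix="$1"; shift
  while [ "$#" -ge 2 ]; do
    # %q shell-escapes the value so eval reproduces it byte for byte.
    printf '%s_%s=%q\n' "$1" "$prefix" "$2"
    shift 2
  done
}
```

A caller would capture the fields with `eval "$(emit_export A OMNI "$omni" ADDR "$addr")"`, after which `$OMNI_A` and `$ADDR_A` are set in the current shell.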

Verified locally:
  eval "$(bash scripts/agentkeys-demo-show.sh --no-derive --export A master)"
  echo "$OMNI_A $IDENTITY_OMNI_A $MASTER_WALLET_A $IDENTITY_TYPE_A $IDENTITY_VALUE_A"
  → all seven vars populated, identity values match what shasum -a 256 computes

  bash scripts/agentkeys-init-email-demo.sh --session-id alice
  → Recipient: alice@bots.litentry.org (not demo-N)
  → identity_omni (email) = dbcb6acd... (visible BEFORE SendEmail)

Why this is the fix and not a workaround: HKDF(K3, omni) is the
contractual signer derive — same omni in, same wallet out is the WHOLE
point of the deterministic-derive design. The bug was the demo's
recipient rotation, NOT the signer. Two operators with literally the
same email address WILL get the same wallet, by design. The fix
guarantees each --session-id maps to a distinct recipient so the §4
proof actually exercises two distinct actors.
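
The recipient precedence can be sketched as a single function. `pick_recipient` and its argument order are hypothetical, and which parity maps to demo-1 versus demo-2 in the legacy branch is an assumption.

```shell
# Sketch only: pick_recipient and its argument order are hypothetical.
# Precedence: $RECIPIENT > positional arg > session-id-derived > legacy rotation.
pick_recipient() {
  env_recipient="$1"; positional="$2"; session_id="$3"; mail_domain="$4"
  if [ -n "$env_recipient" ]; then
    printf '%s\n' "$env_recipient"
  elif [ -n "$positional" ]; then
    printf '%s\n' "$positional"
  elif [ -n "$session_id" ] && [ "$session_id" != "master" ]; then
    printf '%s@%s\n' "$session_id" "$mail_domain"
  else
    # Legacy rotation by parity of epoch seconds; the parity-to-address mapping is assumed.
    printf 'demo-%s@%s\n' "$(( $(date +%s) % 2 + 1 ))" "$mail_domain"
  fi
}
```

Because the session-id branch is deterministic, two runs with distinct ids always produce distinct recipients, which is the property the §4 proof depends on.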