agentkeys: stage 7+ — issue #74 step 1 (dev_key_service signer + bootstrap chain)#75
Open
hanwencheng wants to merge 63 commits into
…strap chain)
Plan steps 0-9 of docs/spec/plans/issue-74-dev-key-service-plan.md
landed in this PR:
- 0: docs/spec/signer-protocol.md — v0 wire contract (request/response,
error envelope, versioned HKDF derivation byte, future TEE attestation
handshake).
- 1: agentkeys-mock-server::dev_key_service — HKDF + secp256k1 + EIP-191,
loaded from DEV_KEY_SERVICE_MASTER_SECRET; 10 unit tests.
- 2-3: /dev/derive-address + /dev/sign-message handlers + state +
routes; 503 signer_disabled when env unset; 8 integration tests.
- 4: scripts/setup-broker-host.sh auto-generates the master secret
into /etc/agentkeys/dev-key-service.env (mode 0600), wires it via
EnvironmentFile= in the backend systemd unit. Idempotent — preserves
the secret across re-runs (rotation invalidates derived wallets).
scripts/broker.env documents the separation.
- 5: agentkeys-daemon main.rs adds --init-email / --init-oauth2-google /
--signer-url, drives the email/OAuth2 -> omni -> derive -> link ->
SIWE -> EVM-session chain on first start; emits a tracing audit row
on success.
- 6: agentkeys-cli cmd_init rewritten as InitMode::{Email, Oauth2Google,
ImportLegacyMock(test-only)}. --mock-token flag hard-cut from the
user-facing CLI surface. All 9 cli_tests.rs sites migrated.
- 7: agentkeys whoami CLI (read-only; surfaces signer-derived wallet).
- 8: TEE-stub conformance test — same wire contract, in-memory keypair
fixture vs HKDF backend; 3 tests prove the swap-point invariant.
- 9: docs/stage7-demo-and-verification.md rewritten end-to-end for the
new flow.
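
As a rough illustration of step 6, the sketch below shows how an `InitMode`-style enum might be resolved from already-parsed flag values. Only the variant names come from the commit message; the payload types, the `resolve_init_mode` helper, and the rule that exactly one mode must be chosen are assumptions made for this example, not the CLI's actual code.

```rust
/// Hypothetical mirror of the `InitMode` enum named in step 6.
/// Payload types are assumptions for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InitMode {
    Email(String),
    Oauth2Google,
    /// Test-only in the real CLI per the commit message; left unconditional
    /// here so the sketch compiles on its own.
    ImportLegacyMock(String),
}

/// Resolve the init mode from parsed flag values (hypothetical helper).
/// Assumption: exactly one of the two user-facing modes must be selected.
pub fn resolve_init_mode(
    email: Option<&str>,
    oauth2_google: bool,
) -> Result<InitMode, String> {
    match (email, oauth2_google) {
        (Some(addr), false) => Ok(InitMode::Email(addr.to_string())),
        (None, true) => Ok(InitMode::Oauth2Google),
        (Some(_), true) => Err("choose exactly one init mode".to_string()),
        (None, false) => Err("no init mode selected".to_string()),
    }
}
```

Modelling the modes as one enum means an invalid combination is rejected once at the edge, rather than being re-checked by every later step of the init chain.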
Shared plumbing in agentkeys-core: signer_client (typed RPC trait +
HttpSignerClient), init_flow (broker email/OAuth2 chain, used by both
CLI and daemon).
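
The swap-point invariant from step 8 (two backends behind one contract) can be sketched with a small trait. Everything here is hypothetical: the `Signer` trait is a simplified stand-in for the typed RPC trait, and `ToyDerivingSigner` uses a toy FNV-1a hash purely so the example needs no crates. It is not HKDF, secp256k1, or EIP-191.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for the typed signer trait.
pub trait Signer {
    fn derive_address(&self, omni_account: &str) -> Result<String, String>;
}

/// Stand-in for the derivation backend. Toy hash only, NOT real HKDF.
pub struct ToyDerivingSigner {
    pub master_secret: Vec<u8>,
}

/// Stand-in for the in-memory keypair fixture backend.
pub struct FixtureSigner {
    pub addresses: HashMap<String, String>,
}

fn fnv1a(data: &[u8], seed: u64) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325 ^ seed;
    for b in data {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

impl Signer for ToyDerivingSigner {
    fn derive_address(&self, omni_account: &str) -> Result<String, String> {
        let mut input = self.master_secret.clone();
        input.extend_from_slice(omni_account.as_bytes());
        // Three 64-bit hashes give 48 hex chars; keep 40 (address-shaped).
        let hex: String = (0..3u64)
            .map(|i| format!("{:016x}", fnv1a(&input, i)))
            .collect();
        Ok(format!("0x{}", &hex[..40]))
    }
}

impl Signer for FixtureSigner {
    fn derive_address(&self, omni_account: &str) -> Result<String, String> {
        self.addresses
            .get(omni_account)
            .cloned()
            .ok_or_else(|| "unknown_account".to_string())
    }
}

/// Contract check shared by every backend: the result is deterministic
/// and shaped like a 20-byte hex address.
pub fn check_conformance(signer: &dyn Signer, omni_account: &str) -> bool {
    match (
        signer.derive_address(omni_account),
        signer.derive_address(omni_account),
    ) {
        (Ok(a), Ok(b)) => a == b && a.len() == 42 && a.starts_with("0x"),
        _ => false,
    }
}
```

Because `check_conformance` only sees `&dyn Signer`, the same assertions run unchanged against either backend, which is the property a conformance suite of this kind relies on.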
CLAUDE.md adds a plan-completion policy (always complete every numbered
plan step; mandatory done/not-done summary at PR end).
Pre-Stage-7 docs moved to docs/archived/ (operator-runbook,
contradictions, field-name-translation); inbound references repointed.
Verification: 386 tests pass workspace-wide, 0 failing; clippy clean
on new code.
What did not land in this PR:
- Plan step 10 (live broker-host redeploy + smoke walkthrough) — operator
step; the script that makes it work shipped here.
- End-to-end integration test of the email/OAuth2 flow against a live
broker — would need an in-memory mock email/OAuth2 provider; left as
follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th) + step 1c plan + arch doc

Lands the architectural follow-up to PR #75: PR #75 shipped the
dev_key_service signer with no HTTP-layer auth (loopback assumption per
signer-protocol.md §"What's intentionally out of scope at v0"). This commit:

- DEPLOYS signer.litentry.org as an independent backend listener (issue #74
  step 1b). agentkeys-mock-server gains a `--signer-only` mode that registers
  ONLY `/dev/derive-address`, `/dev/sign-message`, `/healthz` (no legacy
  session/credential/audit endpoints). Bound to 127.0.0.1:8092; nginx fronts
  it at https://signer.<zone> with its own cert. Same binary, two roles:
  loopback :8090 stays as the broker's tier-2 reachability target.
- ADDS JWT bearer verification to /dev/* handlers. The signer reads the
  broker's ES256 session pubkey at boot from a pinned file
  (/var/lib/agentkeys/.agentkeys/broker/session-keypair.pub.pem) written by
  the broker's new --export-session-pubkey-to flag. Every /dev/* request must
  carry Authorization: Bearer <jwt> with claims.agentkeys.omni_account
  matching body.omni_account; otherwise 401 unauthorized. No
  SIGNER_ACCESS_TOKEN. No HMAC. No device-key signing; those land in step 1c.
- PLUMBS the JWT through the daemon-side stack: HttpSignerClient gains
  with_session_jwt(); CLI signer/whoami commands load the saved session and
  set the bearer; init_flow returns the EVM session JWT for the caller to
  persist.
- AUTOMATES setup-broker-host.sh to provision the new agentkeys-signer.service
  systemd unit and the nginx server block for signer.<zone>. Idempotent:
  re-runs preserve the master secret + session pubkey + nginx config.

PLAN DOCS:
- docs/spec/plans/issue-74-step-1c-device-key-auth.md (NEW, 381 lines)
  Replaces broker-issued bearer JWT as the sole authenticator on /dev/* with
  a device-key signature scheme. Removes broker-as-SPOF risk for the signer
  call surface; identity-type-uniform across evm/email/oauth2/passkey;
  UX-uniform (one ceremony at init, automatic per-request). Aligned with
  Heima's ClientAuth tier model (EvmSiweSigned + BackendSigned), strictly
  stronger because user-controlled per-request key + zero per-request user
  interaction. See gh issue #76.
- docs/spec/architecture.md (REWRITTEN, 506 lines, replaces prior version)
  Canonical broker/signer/daemon/key-flow doc. Mermaid diagrams for component
  map, trust boundaries, identity model, init sequence, per-mint sequence,
  deployment topology. Full K1–K10 key inventory table designed for direct
  Figma reuse. Pluggable-surfaces matrix covering auth methods, signer
  backends, audit destinations, vault backends. stage7-wip.md absorbed into
  §1, §6, §7, §11; archived.
- docs/spec/heima-gaps-vs-desired-architecture.md (REVISED)
  Added §1a status snapshot table covering all 12 gaps at-a-glance. §3 OIDC
  provider + §6 PrincipalTag JWT claim marked RESOLVED IN-TREE (post-PR #61 +
  #73). NEW §11 (signer-edge contract, PARTIAL after PR #75) and §12
  (per-request crypto auth, PLANNED via #76). Resolution log under §10.
- docs/stage7-demo-and-verification.md (UPDATED for the signer split)
  Drops the SSH tunnel scaffolding entirely. Single demo path uses the public
  signer hostname. Trust-model diagram + two-machine layout + §0.2
  reach-the-signer + §14.3 troubleshooting + §16.4 live walkthrough + §16.7
  auto-provision + §17 cleanup all updated.

VERIFICATION:
- 394 tests pass workspace-wide (was 386 in PR #75; +8 new JWT auth
  integration tests in dev_key_service_routes.rs).
- 0 cargo clippy errors; 18 warnings (16 pre-existing; +2 minor cosmetic in
  agent-generated test code).

WHAT DID NOT LAND:
- Live broker host redeploy + signer.<zone> certbot issuance: operator step.
  The script that makes it work shipped here. To land: ssh broker host →
  bash scripts/setup-broker-host.sh --yes → sudo certbot --nginx -d
  signer.<zone> → smoke per docs/stage7-demo-and-verification.md §16.
- Device-key auth (issue #74 step 1c): separate issue #76, plan doc shipped
  in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
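
The request gate described for /dev/* (503 when the signer is disabled, 401 when the bearer is missing or its omni_account does not match the body) can be sketched as a pure function. This is a minimal illustration under stated assumptions: `VerifiedClaims` stands for claims taken from a JWT whose ES256 signature was already checked elsewhere, the type and function names are hypothetical, and the ordering of the 503 and 401 checks is a guess.

```rust
/// Outcome of the gate; names are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    Ok,
    Unauthorized,
    SignerDisabled,
}

impl Gate {
    /// HTTP status code the outcome maps to.
    pub fn status(&self) -> u16 {
        match self {
            Gate::Ok => 200,
            Gate::Unauthorized => 401,
            Gate::SignerDisabled => 503,
        }
    }
}

/// Claims assumed to come from an ALREADY signature-verified JWT.
/// Signature verification itself is out of scope for this sketch.
pub struct VerifiedClaims {
    pub omni_account: String,
}

/// Decide whether a /dev/* request may proceed.
/// Assumption: the disabled check runs before the auth check.
pub fn gate_dev_request(
    master_secret_loaded: bool,
    claims: Option<&VerifiedClaims>,
    body_omni_account: &str,
) -> Gate {
    if !master_secret_loaded {
        return Gate::SignerDisabled;
    }
    match claims {
        // The bearer must name the same omni_account as the request body.
        Some(c) if c.omni_account == body_omni_account => Gate::Ok,
        _ => Gate::Unauthorized,
    }
}
```

Keeping the decision separate from the HTTP layer makes the mismatch case easy to cover in unit tests without standing up a server.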
…dentity-type processes, K9 explanation)

Addresses /Users/agent-jojo/.claude/plans/review-questions.md

Q3 (K9 DKIM explanation): expanded the K9 row in architecture.md key
inventory with a high-level "what is DKIM, why does AgentKeys need it"
paragraph (per-domain Ed25519 key, signs outbound mail headers, pubkey in
DNS TXT, used by Stage 6 federated email so SES never sees plaintext).

Q5 (cold-start sequence ordering): rewrote architecture.md §5 to show device
key generated FIRST (step 0), BEFORE the identity ceremony. The ceremony
then binds D_pub atomically. Same trust shape as a WebAuthn credential
creation: by the time the broker mints session JWTs, the device-pubkey claim
is authoritative.

Q6 (per-identity-type processes): NEW architecture.md §5a covers
init-binding for each identity type (email-link, oauth2_google, evm,
passkey, sandbox link-code), device-switching when operator gets a new
laptop, intentional device-key rotation with chain-of-custody sigs, sandbox
VM device-key persistence, and a trust-shape comparison across identity
types. Architecture.md is now the single source of truth; step-1c plan
defers to it.

Q7 (init binding security, proof of possession): updated step-1c plan
§"email" to require a `pop_sig` over the request payload signed by D_priv.
Broker rejects with 400 bad_pop on mismatch. Closes the "attacker
substitutes pubkey at request time" attack: attacker would need to
compromise BOTH the network path AND the user's email inbox (vs just the
network today).

Q8 (sandbox VM device-key persistence): resolved via architecture.md §5a.4.
Stock agent-infra/sandbox falls back to keyring-rs file backend under
~/.agentkeys/daemon-<wallet>/session.json (mode 0600); survives daemon
restarts inside long-lived containers; vanishes with ephemeral sandbox
containers. For ephemeral sandboxes, operator runs
`agentkeys-daemon --init-link-code <new-code>` per session, the same pattern
as today's pair-flow.

Q1 (forward-references):
- issue-74-dev-key-service-plan.md gains a "Status (post-PR #75) —
  successor steps" preamble pointing at step 1b + step 1c as the follow-on
  work.
- stage7-demo-and-verification.md trust-model section gains a callout that
  step 1c will upgrade /dev/* auth from bearer-JWT to device-key per-request
  signature; the demo flow shape doesn't change.

Q2 (cleanup + placement): filed as issue #77 (separate from this commit).
Tracks (a) the legacy mock-server endpoint cleanup after #75 + #76, and (b)
the open question of where identity/audit endpoints belong long-term;
captures the user's broker-policy / signer-execution split proposal.

Q4 (storage location; answered inline, no doc edit): omni ↔ identity linking
is stored in the broker at
crates/agentkeys-broker-server/src/storage/identity_links.rs (SQLite table
`identity_links`, indexed on (identity_type, identity_value)).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cy, stale refs)

Three structural cleanups across the 5 docs touched in commit 6d36a7b:

1. heima-gaps-vs-desired-architecture.md: section ordering fix.
   Previous numbering was 1, 1a, 2..9, 11, 12, 10 (Tracking out of order).
   Renumbered:
     §11 (NEW signer-edge contract)        → §10
     §12 (NEW per-request crypto auth)     → §11
     §10 (Tracking, was wedged between)    → §12
   Updated §1a status snapshot table accordingly. Updated 3 stale in-body
   §-refs:
   - §1a row 3: "architecture.md §11" → §7 (Pluggable surfaces)
   - §11 body "TEE swap-ready (gap §11)" → "(gap §10)"
   - §11 body "Blocks the TEE worker (gap §11)" → "(gap §10)"
   Updated tracking-section "PR #75 / issue #76 close §11 and queue §12" →
   "close §10 and queue §11"; resolution-log entries to match.

2. issue-74-step-1c-device-key-auth.md: PoP consistency across all identity
   types. Previously only the `email` flow had explicit proof-of-possession;
   `evm` and `oauth2_google` flows didn't. Same Q7 attack surface applies to
   all three, so:
   - `evm` flow: daemon now signs the SIWE binding payload with D_priv (in
     addition to the EVM key); broker verifies both signatures (proves "user
     owns EVM identity AND daemon controls device key").
   - `oauth2_google` flow: daemon now signs the start request with D_priv;
     broker verifies before issuing any state value. Composes with the
     existing `state` parameter binding.

3. architecture.md: dropped "(preserved from prior architecture revision)"
   parenthetical from §9 Component inventory and §10 Language choices
   headings. Internal-changelog noise that doesn't help readers.

Verification: 394 workspace tests pass, 0 fail. heima-gaps section ordering
now sequential (1 → 1a → 2..9 → 10 → 11 → 12). All §-refs resolve to live
anchors. step-1c PoP coverage confirmed in all three identity-type sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rget)

Architecturally collapses the four bespoke per-identity PoP shapes (email
pop_sig, oauth2 pop_sig, evm dual-sign-SIWE, passkey) into two uniform
binding ceremonies, split by machine class:

- Master machines (workstation with platform authenticator) -> WebAuthn
  enrollment ceremony. Hardware-attested, identity-type-agnostic, closes
  the email-account-compromise -> device-takeover gap (Q7) by requiring
  hardware presence at re-bind.
- Agent machines (VM/Linux/CI/agent-infra/sandbox container) -> link-code
  redeemed against master's authenticated session per the
  agent-infra/sandbox two-tier orchestrator pattern.

Defers YubiKey-on-Linux-as-master (roaming-authenticator binding) to issue
#79 as a follow-up.

arch.md changes (single source of truth):
- §2 trust boundaries: K11 in master TB, new agent-machine TB, master/agent
  rows in compromise table
- §3 K-table: K10 master/agent persistence dichotomy; new K11 for WebAuthn
  platform-authenticator credential
- §5 cold-start: status callout pointing at §5a.1 for v0.2 target
- §5a header: master-vs-agent intro + WebAuthn-uniform status
- §5a.1: rewrite into identity ceremonies + 5a.1.M (WebAuthn) + 5a.1.A
  (link-code) + v1c-interim PoP shapes pointer
- §5a.2: master/agent device-switch shapes; cross-device confirmation note
- §5a.3: WebAuthn get()-gated rotation for masters
- §5a.4: agent persistence per agent-infra/sandbox; link-code-per-session
  is the right answer, not a workaround; cite 1-step-analysis.md
- §5a.5: trust-shape table collapses to master/agent rows

Plan files defer to arch.md as authoritative:
- step-1c plan: status callout + per-identity-type section header marked
  v1c-interim
- dev-key-service master plan: successor steps note WebAuthn binding + link
  to #79

Companion artifacts:
- gh issue #79 filed (YubiKey-on-Linux master deferral)
- comment on #76 with WebAuthn refinement summary
…s §5a.1.M)

§5 cold-start sequenceDiagram correctly shows D generated at step 0 (before
identity ceremony / network traffic). §5a.1.M had it as step 1 AFTER
identity ceremony returns binding_nonce, which was internally inconsistent
within arch.md.

§5 is the right model: D should be generated at daemon startup, not deferred
until identity ceremony completes. There is no security benefit to delaying,
and D_pub must exist by the time of any binding ceremony anyway (v1c pop_sig
signs identity request with D_priv; v0.2 WebAuthn challenge folds D_pub into
the ceremony challenge).

Changes:
- §5a.1 intro: explicit three-stage pipeline. Stage 0 = device-key
  generation at daemon startup; Stage 1 = identity ceremony; Stage 2 =
  binding ceremony. State that stage 0 is non-negotiably first across all
  flows (master, agent, v1c, v0.2) with the reasoning.
- §5a.1.M: drop the misleading "step 1: generate D_priv". Now opens with
  explicit PRECONDITIONS from stage 0 + stage 1, and binding-ceremony
  numbering starts at the WebAuthn step itself. Final step notes D_priv was
  already persisted at stage 0 (just persist J0).
- §5a.1.A: agent flow's daemon-startup D-generation now explicitly labelled
  "Stage 0 (daemon startup, per §5a.1)" for symmetry. Numbering unchanged
  (cross-machine sequence continues from master).
- §5a.2.M: new-master device-switch flow now leads with Stage 0 (fresh K10'
  generated at daemon startup) before identity ceremony, matching
  first-init.

§5a.3.M rotation step "generate D_priv_new" is unchanged: that's an explicit
new-key generation within the rotation flow, not first-time init, so stage-0
framing doesn't apply.
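
The "stage 0 is always first" rule from this commit can be expressed as a typestate sketch, where the binding step is only reachable from a value that already carries a device key. All type and function names are hypothetical, and the strings are placeholders: no real key generation, identity ceremony, or WebAuthn call happens here.

```rust
/// Placeholder for the device key; a real daemon would hold key material.
pub struct DeviceKey {
    pub d_pub: String,
}

/// Stage 0 done: the device key exists, nothing else has happened yet.
pub struct Stage0 {
    device: DeviceKey,
}

/// Stage 1 done: an identity ceremony has completed on top of stage 0.
pub struct Stage1 {
    device: DeviceKey,
    identity: String,
}

/// Stage 2 done: identity and device key are bound together.
pub struct Bound {
    pub d_pub: String,
    pub identity: String,
}

/// Stage 0 entry point. Assumption: called once at daemon startup.
pub fn start_daemon(d_pub: &str) -> Stage0 {
    Stage0 {
        device: DeviceKey { d_pub: d_pub.to_string() },
    }
}

impl Stage0 {
    /// Stage 1 consumes stage 0, so it cannot run without a device key.
    pub fn identity_ceremony(self, identity: &str) -> Stage1 {
        Stage1 {
            device: self.device,
            identity: identity.to_string(),
        }
    }
}

impl Stage1 {
    /// Stage 2 consumes stage 1; only here does a `Bound` value exist.
    pub fn bind(self) -> Bound {
        Bound {
            d_pub: self.device.d_pub,
            identity: self.identity,
        }
    }
}
```

With this shape, an attempt to bind before generating the device key fails to compile, because there is no way to obtain a `Stage1` without first holding a `Stage0`.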
§5a.1.A's precondition expected J1_master (the EVM-omni session JWT) but
§5a.1.M ended at J0 (the identity-omni JWT). The wallet-derive + link + SIWE
round-trip that mints J1 lives in §5 steps 2-3 but was never referenced from
§5a.1.M's outro, so the reader had no path between the master binding
ceremony and the agent link-code flow.

Changes:
- §5a.1.M: new "From J0 to J1 (master only — bridge to per-mint flows)"
  subsection. 6-step flow: signer derive-address → broker wallet/link →
  broker auth/wallet/start → signer sign-message → broker
  auth/wallet/verify → mint J1. States that K10 + K11 claims propagate from
  J0 into J1 atomically. Notes the evm-identity-type variant collapses these
  steps (user's own EVM key IS the wallet).
- §5a.1.A precondition: now reads "ON MASTER (already initialized per
  §5a.1.M + the J0 → J1 bridge above; holds J1_master = the long-lived
  EVM-omni session JWT with K10 + K11 claims)", which makes the dependency
  on the bridge explicit.
…, -235)
Adopts the per-agent omni model proposed by user critique:
- Each agent is a first-class actor with its own omni derived from
master via HDKD //label, its own wallet (HKDF(K3, O_agent)), its
own AWS PrincipalTag, its own audit slot.
- Per-agent compromise containment, atomic revocation, first-class
audit attribution, tree-as-data-model.
- v1c "shared omni + multiple device pubkeys" is now a degenerate
v1.0 tree (no children).
Plus the link-code-only-agent-bootstrap simplification:
- Agents have ONE bootstrap path: link-code from authenticated master.
- No identity ceremony for agents, no shared bearer, no agent-side
recovery. One test surface, one threat model.
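
To make the tree-as-data-model idea concrete, here is a small sketch of an actor tree with a master root and `//label` children, where revoking one agent leaves its siblings and the master untouched. It models only the bookkeeping: the `ActorTree` type and its methods are hypothetical, and no HDKD or wallet derivation is performed.

```rust
use std::collections::BTreeSet;

/// Hypothetical actor tree: one master root plus `master//label` agents.
pub struct ActorTree {
    master: String,
    agents: BTreeSet<String>,
    revoked: BTreeSet<String>,
}

impl ActorTree {
    pub fn new(master: &str) -> Self {
        ActorTree {
            master: master.to_string(),
            agents: BTreeSet::new(),
            revoked: BTreeSet::new(),
        }
    }

    /// Register a child actor and return its path, e.g. `master//ci`.
    pub fn add_agent(&mut self, label: &str) -> String {
        let path = format!("{}//{}", self.master, label);
        self.agents.insert(path.clone());
        path
    }

    /// Revoke a single agent; returns false for unknown or already
    /// revoked paths. Other actors are not affected.
    pub fn revoke(&mut self, path: &str) -> bool {
        if self.agents.contains(path) {
            self.revoked.insert(path.to_string())
        } else {
            false
        }
    }

    pub fn is_active(&self, path: &str) -> bool {
        path == self.master
            || (self.agents.contains(path) && !self.revoked.contains(path))
    }
}
```

A tree with no children behaves like the single shared-omni case, which matches the commit's description of the earlier model as a degenerate tree.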
arch.md changes (compacted 944 -> 709 lines):
- §3 K3/K4: per-actor-omni derivation framing; K10/K11 references
updated to new §5a subsection numbering
- §4 identity model: HDKD actor tree (master root + //label children),
per-actor wallet derivation, why per-agent omni
- §4a NEW: 4-axis mental model (identity / actor / machine /
capability), master-vs-agent role table, key non-conflations
- §5 cold-start: compact 4-stage table + single sequenceDiagram
showing v1.0 master flow with WebAuthn enrollment + bridge
to J1; v1c interim status callout
- §5a restructured into 5 subsections (was multi-subsubsection):
- 5a.1 master init (per-identity-type + uniform WebAuthn binding)
- 5a.2 agent bootstrap (link-code only - explicit "no other path")
- 5a.3 master device switch + rotation (combined)
- 5a.4 agent re-bootstrap + persistence (combined; cites
1-step-analysis.md)
- 5a.5 trust shape (per-actor isolation properties)
CLAUDE.md: added "Architecture-as-source-of-truth policy" requiring
arch.md re-check after any architectural doc edit; documents that
per-doc detail outgrowing arch.md should link outward, not duplicate.
step-1c plan: status callout reframed - v0.2 target is HDKD per-agent
omni + WebAuthn-uniform binding (structural shift, not just wire-shape
collapse); points at arch.md §4/§4a/§5a as single source of truth.
Companion artifacts (not in commit; reference only):
- .omc/wiki/agent-role-and-usage-hdkd-per-agent-omni.md
(project-local wiki page, gitignored per .omc/ convention)
- gh issue #79 updated: master-vs-agent reframed as actor role,
not machine class; YubiKey-on-Linux is "Linux + YubiKey as master"
(one of two roles, not a third class).
Updates the operator-facing demo doc for the master/agent + HDKD mental
model landed in the prior commit (50a0ffa). Operational content (steps 0-13)
is unchanged because the demo runs against v1c-interim, the actually-shipped
flow.

Changes:
- Trust model section: replaced step-1c-coming callout with explicit
  v1c-interim status; cross-refs arch.md §4 (HDKD actor tree), §4a (mental
  model), §5a (per-actor binding); flags v0.2 target features as
  not-yet-implemented and tracked in #76 / #79.
- Two-machine layout: marked operator-workstation row as "(master role)";
  added a "Roles + key inventory primer" callout pointing at arch.md §4a
  (4-axis mental model), §3 (K1-K11 inventory), §5a.2 (agent role /
  link-code bootstrap), and the agent wiki page as the operator-focused
  reference.
- Section §0 success-criteria #3: clarifies "operator's omni_account" IS the
  master actor omni per arch.md §4.

What did NOT land in the demo doc:
- Per-step rewriting of operational content. The demo correctly exercises
  v1c-interim (single-omni-shared-with-master, bespoke per-identity PoP,
  link-code agents). v0.2 demo content waits for the agent-create endpoint +
  WebAuthn ceremony to ship.
…R_URL

- scripts/operator-workstation.env: add SIGNER_HOST + AGENTKEYS_SIGNER_URL
  (derived from BROKER_HOST), keep BACKEND_URL as alias. Co-located with
  broker today; hostname split lets the signer move to its own machine (or
  TEE worker) later without changing client config.
- docs/cloud-setup.md §1.3: add "what the signer is + why a dedicated
  hostname" overview with a today-vs-future table; explicit co-location note
  + cross-ref to operator-workstation.env.
- docs/stage7-demo-and-verification.md §0.2: stop re-deriving the signer
  URL; both vars come from operator-workstation.env now. Cross-ref the
  topology section in cloud-setup.md.

No code change; arch.md §10 deployment topology already captures the
separate-hostname / same-host model unchanged.
§1.3 used $EIP, but $EIP isn't set until §5.1, so copy-pasting top-down
broke. Make §1.3 a brief intro consistent with §1.2 (broker subdomain defers
to §5), and put the actual DNS+cert+nginx-flip steps in a new §6 that runs
after §5 and reuses $EIP.

- §1.3: brief signer intro + defer to §6 (matches §1.2 shape).
- §6 NEW: Signer host: overview table (today vs future), DNS A record
  (§6.1), TLS cert + nginx flip (§6.2), verify (§6.3).
- §7: Cleanup (was §6).
- Top TOC: add §6 Signer host row, bump Cleanup to §7.
- stage7 demo: cross-refs §1.3 → §6 for the cert+DNS steps; cross-ref to
  "cloud-setup.md §6" cleanup → §7.
… $SIGNER_HOST

Reported failure: `sudo certbot --nginx -d "$SIGNER_HOST"` on the broker
host fell through to certbot's interactive vhost picker showing only
broker.litentry.org.

Root cause: $SIGNER_HOST is only exported on the operator workstation
(scripts/operator-workstation.env), not on the broker host. Empty -d arg →
certbot's "pick from existing vhosts" fallback → only the broker vhost is
offered.

§6.2 now:
- explicit warning that $SIGNER_HOST is workstation-only
- adds a sanity-check `ls /etc/nginx/sites-enabled/agentkeys-signer`
  (catches the "setup-broker-host.sh wasn't re-run with signer code" case
  before certbot is invoked)
- derives SIGNER_HOST inline from the nginx vhost (awk the server_name line
  setup-broker-host.sh just wrote) so the certbot command is copy-paste safe
  on a fresh broker shell with no env vars set
…uto → no)

Reported failure: `sudo bash scripts/setup-broker-host.sh --yes` on a fresh
broker host did not write the agentkeys-signer nginx vhost. Then
`sudo certbot --nginx -d signer.<zone>` fell through to certbot's
interactive vhost picker, which only listed broker.<zone> (because the
broker vhost was written by an earlier run that had been done with
--with-nginx).

Root cause: WITH_NGINX defaulted to "auto", which resolved to "no" at line
361. The comment said "preserves prior default" but every doc-driven
operator expects nginx provisioning. The runbook (cloud-setup.md §5 + §6)
explicitly assumes nginx is set up by the script.

Now: auto → yes for both WITH_NGINX and WITH_CERTBOT. Operators who don't
want nginx (running behind a non-nginx reverse proxy, pre-provisioned certs)
opt out via --without-nginx / --without-certbot. The interactive preview
already prints `nginx : $WITH_NGINX`, so the operator sees the resolved
value before confirming.

Also pin --with-nginx explicitly in cloud-setup.md §6.2 step 1 + step 3 so
the doc remains correct even if the script default changes again.
…olver

Reported failure: operator's `dig +short broker.litentry.org A` returned
198.18.1.86 (inside 198.18.0.0/15, the RFC 2544 benchmarking range) because
their local DNS resolver was behind a transparent proxy (Cloudflare WARP /
Zscaler / Tailscale Magic DNS). Using that as $EIP would have published a
Route 53 A record pointing at a reserved, non-routable range, breaking Let's
Encrypt validation silently; the symptom would surface 5 min later as
"Timeout during connect (likely firewall problem)" with the wrong IP in the
error.

§6.1 now:
- explicit callout that local resolvers behind WARP/Zscaler/Tailscale/
  corporate VPNs return 198.18.0.0/15 for proxied hostnames
- shows `aws ec2 describe-addresses` as the authoritative re-derivation
- replaces fire-and-forget verify with a polling loop until Cloudflare DoH
  confirms the A record matches $EIP (Route 53 propagation up to TTL=300)

§5.2 unchanged: within §5 the operator just set $EIP from AWS API in §5.1,
so the local-resolver trap doesn't apply there.
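
The resolver trap above comes down to a range check: an address inside 198.18.0.0/15 should never be accepted as the Elastic IP. The runbook handles this in shell; the sketch below renders the same check in Rust using only the standard library. The function names are hypothetical and not part of the repository.

```rust
use std::net::Ipv4Addr;

/// True if the address falls in 198.18.0.0/15 (198.18.x.x or 198.19.x.x).
pub fn is_benchmark_range(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    o[0] == 198 && (o[1] == 18 || o[1] == 19)
}

/// Validate a resolver answer before treating it as the EIP.
/// Hypothetical helper: rejects unparsable input and proxied answers.
pub fn usable_as_eip(resolved: &str) -> Result<Ipv4Addr, String> {
    let ip: Ipv4Addr = resolved
        .trim()
        .parse()
        .map_err(|e| format!("not an IPv4 address: {}", e))?;
    if is_benchmark_range(ip) {
        return Err(format!(
            "{} is in 198.18.0.0/15; the local resolver is likely proxied",
            ip
        ));
    }
    Ok(ip)
}
```

A check like this only detects the symptom; the authoritative value still has to come from the cloud provider's API rather than from a local DNS lookup.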
The §1.3 + §6 + §6.1 + §6.2 prose said the same thing 3-4 times (co-located
today / future-split possible / "if the signer is ever moved" / "first run
writes nginx, certbot, second run flips ssl"). Each new fix layered another
paragraph on top instead of consolidating.

Pass 1: §1.3 collapsed from 12 lines to 1 (matches §1.2's defer-to-§5 shape;
§6 has all the detail).

Pass 2: §6 intro: dropped 4-line prose paragraph above the table; folded
"endpoints" + "exported as SIGNER_HOST" into the table itself so it's the
single load-bearing reference. Dropped trailing prose paragraph about the
env file (now in the Public-hostname row).

Pass 3: §6.1: collapsed standalone EIP-derive callout (10 lines of warning +
5 lines of fenced bash) into a 3-line guard inside the bash block
(`[ -z "$EIP" ] && EIP=$(aws ec2 describe-addresses …)`). Kept the
WARP/Zscaler/198.18.x.x context as a 4-line comment in the bash; it is
load-bearing for diagnosis and would lose meaning if removed.

Pass 4: §6.2: dropped "Three host-side steps. setup-broker-host.sh is
idempotent…" preamble paragraph (table already says this). Kept the
$SIGNER_HOST=laptop-only callout (load-bearing: it distinguishes laptop from
broker host shell scope).

No behavior change. All cross-refs intact (#6-signer-host, #51-allocate,
signer-protocol, operator-workstation.env all still resolve). 60 code
fences, balanced.
… are yes

The flags were redundant once defaults flipped to yes (commit a3a0a84). Per
CLAUDE.md remote-broker-host policy the script is the single idempotent
entry point, so flag-gating "do the thing the runbook always wants" is
noise. Drop both --with-* flags + the auto-resolution dead-code; keep
--without-nginx / --without-certbot as the only opt-out.

- WITH_NGINX / WITH_CERTBOT default to "yes" outright (no more "auto"
  three-state); 12-line auto-resolution block becomes a 2-line comment.
- CLI parser drops --with-nginx / --with-certbot. Passing the removed flags
  now errors `unknown flag: --with-nginx` rather than silently no-op'ing.
- Header usage block + interactive defaults comment updated to match.
- docs/cloud-setup.md §6.2: drop --with-nginx from both invocations
  (replace_all over the doc).

No behavior change for operators following the runbook: `--yes` alone
already provisioned nginx since a3a0a84. This commit only removes the
explicit `--with-nginx` redundancy.
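
The resulting flag grammar (defaults on, opt-out only, removed flags rejected loudly) is easy to state as a tiny parser. The real entry point is a shell script; this Rust rendering only mirrors the logic described in the commit, and the `SetupOptions` type and `parse_setup_flags` function are hypothetical.

```rust
/// Resolved options; field names are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub struct SetupOptions {
    pub with_nginx: bool,
    pub with_certbot: bool,
    pub assume_yes: bool,
}

/// Parse flags with "yes" defaults and opt-out switches only.
/// Anything else, including the removed --with-* flags, is an error.
pub fn parse_setup_flags(args: &[&str]) -> Result<SetupOptions, String> {
    let mut opts = SetupOptions {
        with_nginx: true,
        with_certbot: true,
        assume_yes: false,
    };
    for arg in args {
        match *arg {
            "--yes" => opts.assume_yes = true,
            "--without-nginx" => opts.with_nginx = false,
            "--without-certbot" => opts.with_certbot = false,
            other => return Err(format!("unknown flag: {}", other)),
        }
    }
    Ok(opts)
}
```

Rejecting unknown flags instead of ignoring them is what turns a stale runbook invocation into an immediate, diagnosable error.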
CLAUDE.md
- New "Runbook-fix-fold-back policy": when an operator hits a runbook
failure, both the targeted fix AND a runbook revision must land in
the same turn. Goal: every operator-encountered failure makes the
runbook strictly more robust before we move on.
stage7-demo-and-verification.md (§0)
Absorbs every failure the operator hit walking this PR end-to-end:
- §0 Tooling: pulled CLI build out of a sub-bullet into a numbered
ordered checklist (cargo build → cp to ~/.local/bin → which/version
smoke-test → init). Explicit warning against path-relative aliases
(the recurring "alias agentkeys=./target/release/agentkeys-cli" trap
with the wrong binary name from before the agentkeys-cli → agentkeys
rename). Spells out crate-name vs binary-name distinction.
- §0.1: branch-agnostic checkout via `BRANCH="${BRANCH:-evm}"` (was
hardcoded `git checkout evm` — broke when validating PR branches).
Adds nginx vhost sanity-checks: `ls /etc/nginx/sites-enabled/
agentkeys-{broker,signer}` + grep for proxy_pass-vs-return-503
inside agentkeys-signer (catches the "cert issued but script not
re-run, vhost still serves stub 503" failure mode).
- §0.2: smoke-test now string-matches body == "ok" (a successful HTTP
200 with body "TLS cert not yet issued for signer …" is the exact
trap operators hit when certbot succeeded but step 3 of §6.2 wasn't
run). Adds a 5-row "common failure modes" table mapping observed body
→ cause → exact fix command.
§16 line 1402's `git checkout evm` left as-is — that section is
intentionally evm-specific (verifies the live prod broker).
Operator hit `which agentkeys` → "aliased to
./target/release/agentkeys-cli" even after
`cp target/release/agentkeys ~/.local/bin/`. zsh aliases beat $PATH lookups
(and the alias also pointed at the wrong binary name: the crate is
agentkeys-cli but the [[bin]] is `agentkeys`), so the install was invisible
no matter how correctly it was staged.

§0 build checklist now goes 5 steps in this order:
1. sed-strip any `alias agentkeys[-= ]…` from ~/.zshenv + ~/.zshrc (with
   .bak), then `unalias` for the current shell. Fail-soft (`|| true`) so
   missing files don't abort.
2. Append `~/.local/bin` to $PATH if not already there (idempotent case
   statement; appends to ~/.zshenv).
3. cargo build (was step 1).
4. cp to ~/.local/bin (was step 2).
5. `hash -r` + `command -v agentkeys` (NOT `which`), which bypasses any
   alias zsh hasn't re-hashed away yet. Spells out the expected
   absolute-path output.

Plus a tiered fallback callout: if `command -v` still shows the alias, grep
~/.zprofile / ~/.aliases / shell includes for stragglers, then
`exec zsh -l`.

Per Runbook-fix-fold-back policy (CLAUDE.md): operator failure → both the
fix command (handed back inline last turn) AND the runbook revision land in
the same turn. Next operator running this top-down won't hit the alias trap.
Operator hit `curl: (7) Failed to connect to 127.0.0.1 port 18090` because
their shell had a stale `BACKEND_URL=http://127.0.0.1:18090` local-dev
export in ~/.zshenv that shadowed operator-workstation.env's
BACKEND_URL=$AGENTKEYS_SIGNER_URL alias.

§0.2 now:
- Pins `export BACKEND_URL="$AGENTKEYS_SIGNER_URL"` inline so the smoke-test
  is self-contained (no longer depends on ~/.zshenv being un-shadowed).
- Adds a defensive `case "$BACKEND_URL" in https://signer.*) ;; esac`
  bail-loud check BEFORE the curl, with a one-line diagnosis
  (`grep -n BACKEND_URL ~/.zshenv && unset && re-source`).
- Echoes BACKEND_URL alongside SIGNER_HOST so the operator visually confirms
  the value is public https:// before hitting curl.

Per Runbook-fix-fold-back: failure command + cause + fix command all inline
in the runbook so the next operator with a stale local-dev shell doesn't
have to round-trip with the maintainer to diagnose.
…ale value"

This reverts commit 11e59ce.
Operator hit `error: unexpected argument '--json' found` running §0.4's
`agentkeys signer derive --signer-url … --omni-account … --json`.

Per crates/agentkeys-cli/src/main.rs:24-25, --json is a top-level flag on
the root `agentkeys` command (controls ctx.json_output globally), NOT a
per-subcommand flag on `signer derive` / `signer sign`. Clap rejects it
after the subcommand's required args.

Eight occurrences fixed across §0.4 (×2), §3 SIG_A/SIG_ADDR/SIG_B (×3
multi-line), and §16 live walkthrough (×3 single-line):

  agentkeys signer derive … --json | jq …
    → agentkeys --json signer derive … | jq …
  agentkeys signer sign … --json | jq …
    → agentkeys --json signer sign … | jq …

Plain text-output calls at lines 1047 and 1099 left unchanged (no --json
there to begin with).

Per Runbook-fix-fold-back: clap arg ordering is non-obvious for top-level vs
subcommand flags, so the runbook command examples must match the actual CLI
grammar. Operators copy-paste; they don't re-read the clap macro.
Operator hit `Error: SIGNER_UNAUTHORIZED invalid session JWT: InvalidToken`
running §0.4's first signer derive call. The §0.4 intro said "Run agentkeys
init first if you haven't already" but never showed the actual command;
operators don't know to look ahead 100 lines to §2.0 for the real
`--email --broker-url --signer-url` invocation.

§0.4 now:
- Explicit "must run first OR every call below returns SIGNER_UNAUTHORIZED"
  callout (with the literal error message so operators searching the doc for
  the error find the fix).
- Inline `agentkeys init --email alice@demo.example --broker-url
  $OIDC_ISSUER --signer-url $BACKEND_URL` as a copy-paste block, with the
  expected "Initialized via email-link" output.
- Cross-link to §2.0 for explanation + OAuth2 alternative: minimal in §0.4,
  full context in §2.0.

§2.0's existence preserved: it still has the magic-link explanation + OAuth2
alternative + daemon-side equivalent. §0.4's inline init is the minimum to
keep the §0 prereq chain self-contained.

Per Runbook-fix-fold-back: a runbook step that says "run X first" must
include the literal X invocation, not just point at it.
Pass 1 implementation per .omc/ralph/prd.json: ships the
SesEmailSender behind the auth-email-link feature, with end-to-end
SES → S3 round-trip integration test. Pass 2 (separate commit) wires
boot.rs + setup-broker-host.sh + broker.env defaults + demo doc.
Closes the gap that blocked the operator's stage-7 demo init flow:
the deployed broker had only StubEmailSender (in-process Vec, no
delivery). With this change + Pass 2, `agentkeys init --email` will
deliver a real magic-link to the operator's inbox.
US-1: Cargo.toml deps
- aws-sdk-sesv2 = "1" added as optional dep gated by auth-email-link
- aws-sdk-s3 + uuid added to dev-dependencies for the integration test
- dev-deps now enable auth-email-link so tests/* compile by default
US-2: SesEmailSender impl (crates/agentkeys-broker-server/src/plugins/auth/email_link.rs)
- send_magic_link composes multipart text+html via aws-sdk-sesv2 SendEmail
- verify_sender_ready calls GetEmailIdentity + checks verified_for_sending
- Errors map to EmailSendError::{Send, Verify, Config}
- Inline subject + body templates (no template-engine dep)
- Re-exported from src/plugins/auth/mod.rs
US-3: Body composition unit tests (4 added)
- ses_subject_is_non_empty
- ses_text_body_contains_landing_url
- ses_html_body_contains_landing_url_twice (href + visible text)
- ses_text_and_html_alternatives_both_present
US-4: Integration test (crates/agentkeys-broker-server/tests/ses_email_flow.rs)
- Gated by RUN_SES_INTEGRATION_TESTS=1 + #[ignore]
- CleanupGuard Drop impl: list-and-delete every S3 object whose body
contains the per-test UUID, even on panic
- Polls inbound/ prefix for up to 60s (5s × 12 attempts)
- Asserts MIME body contains both unique token AND landing URL
(allowing for quoted-printable encoding of '=' as '=3D')
US-5: Quality gates ALL GREEN
- cargo build -p agentkeys-broker-server → exit 0
- cargo build -p agentkeys-broker-server --features auth-email-link → exit 0
- 161 lib tests pass; integration test compiles + skips gracefully
- cargo clippy --no-deps -- -D warnings → exit 0
- (Pre-existing clippy warning in agentkeys-core/src/init_flow.rs:177
unrelated; will tackle in Pass 2 if it blocks.)
US-6: BLOCKED on operator — live SES round-trip
- Operator runs:
awsp agentkeys-admin
RUN_SES_INTEGRATION_TESTS=1 ACCOUNT_ID=429071895007 \
cargo test -p agentkeys-broker-server --features auth-email-link \
--test ses_email_flow -- --ignored --nocapture
… identity
Operator hit `NotFoundException: Email identity <noreply@bots.litentry.org>
does not exist` running the SES integration test.
Cause: SES GetEmailIdentity returns identities EXPLICITLY registered with
`create-email-identity`. cloud-setup.md §2.1 verifies the DOMAIN
(`bots.litentry.org`), which auto-grants sending rights to ANY address at
that domain via DKIM, but the per-address identity
(`noreply@bots.litentry.org`) was never registered. So the verify precheck
failed even though the actual SendEmail would succeed.
Fix: verify_sender_ready now tries address-level lookup first (preferred,
explicit), then on NotFound falls back to extracting the domain (split on
'@') and looking up the domain identity. Either passing → Ok(()).
Helper extracted: check_identity(client, identity) → Result<(), String>
returns Ok only when SES reports the identity exists AND
verified_for_sending_status=true. Used by both attempts.
No behavior change for operators who explicitly verify per-address;
unblocks the canonical operator path (verify-domain-only) per
cloud-setup.md §2.1. Closes the verify-precheck blocker on Pass 1's US-6
(live SES round-trip from operator).
Quality gates re-checked:
- cargo build -p agentkeys-broker-server --features auth-email-link → ok
- cargo test -p agentkeys-broker-server --features auth-email-link --lib → 161 passed
- cargo clippy -p agentkeys-broker-server --features auth-email-link --tests --no-deps -- -D warnings → ok
Per operator request after Pass 1:
1. drop the address→domain fallback in SesEmailSender::verify_sender_ready
— explicit per-address verification only
2. register noreply-test@bots.litentry.org as a per-address SES identity
and pin it in operator-workstation.env
3. give the operator a one-shot bash helper that exploits the existing
SES inbound receipt rule (cloud-setup.md §2.1) to fully automate the
address verification — no inbox-clicking, no manual MIME parsing
Code (crates/agentkeys-broker-server/src/plugins/auth/email_link.rs):
- verify_sender_ready: single GetEmailIdentity call on the FROM address.
No fallback. Error message points the operator at
`aws sesv2 create-email-identity` (and at scripts/ses-verify-sender.sh
for the automated path) so the next failure self-diagnoses.
- Removed check_identity helper (was the fallback shared call).
Test (crates/agentkeys-broker-server/tests/ses_email_flow.rs):
- TestEnv now reads BROKER_EMAIL_FROM_ADDRESS — same env var the broker
reads at runtime (env.rs:143). One source of truth between the test +
the broker process.
- Default: noreply-test@${MAIL_DOMAIN} (was: hardcoded noreply@…).
Env (scripts/operator-workstation.env):
- New: MAIL_DOMAIN (bots.litentry.org), MAIL_BUCKET, BROKER_EMAIL_FROM_ADDRESS.
- MAIL_DOMAIN is explicit (not derived from BROKER_HOST) — broker zone
may differ from email subdomain.
Helper (scripts/ses-verify-sender.sh, +x):
- One-shot: aws sesv2 create-email-identity → poll s3://$MAIL_BUCKET/inbound/
for the SES verification mail (lands there via the existing receipt rule
from cloud-setup.md §2.1) → grep verification URL out of the
quoted-printable body → curl-click it → confirm VerifiedForSendingStatus
→ delete the verification mail from S3 so it doesn't pollute the inbox.
- Idempotent: re-running on a verified identity exits 0 immediately.
- Requires: aws + jq + curl + grep + sed (all present on macOS / Ubuntu).
Quality gates:
- cargo build -p agentkeys-broker-server → ok
- cargo build -p agentkeys-broker-server --features auth-email-link → ok
- cargo test -p agentkeys-broker-server --features auth-email-link --lib → 161 passed
- cargo test -p agentkeys-broker-server --features auth-email-link --test ses_email_flow
→ 1 ignored (skips)
- cargo clippy -p agentkeys-broker-server --features auth-email-link --tests --no-deps -- -D warnings
→ ok
…ded body
Operator hit "endless waiting": the script polled S3 forever even though
SES had likely written the verification mail. Two bugs in the polling
predicate:
1. `grep -q "$FROM"` looked for the literal `noreply-test@bots.litentry.org`
   string, but in a quoted-printable MIME body the `@` is encoded as `=40`
   so the literal grep never matched.
2. `grep -qE 'ses[._-]?verification|amazonaws\.com.*verify'` matched
   `ses-verification` patterns, but the actual SES URL host is
   `email-verification.<region>.amazonaws.com`, so neither alternative hit.
Fix: drop both prereq greps. SES verification URLs are unique enough that
matching the URL pattern directly is sufficient (no false positives).
Also added per-attempt diagnostics:
- log "$count object(s) under inbound/" each iteration so the operator can
  see whether anything is landing at all
- on timeout: structured 3-step diagnosis pointing at receipt-rule state,
  identity status, and bucket contents
Refactored URL extraction into extract_verify_url() helper (single source
of truth); it handles quoted-printable soft-wrap (=\n) + =3D decoding.
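The decoding step described above can be sketched roughly as follows. This is an illustration only, not the script's actual extract_verify_url(): the function name qp_extract_url, the URL regex, and the sample input are assumptions. Only the soft-wrap (=\n) joining and the =3D decoding come from the commit text.

```shell
# Hypothetical sketch: undo quoted-printable soft wraps and '=3D',
# then print the first https URL found on stdin.
qp_extract_url() {
  # 1) strip CR, and join lines that end in '=' (QP soft line break)
  # 2) decode '=3D' back to '='
  # 3) keep only the first URL-looking token
  awk '{ sub(/\r$/, ""); if (sub(/[=]$/, "")) printf "%s", $0; else print }' \
    | sed 's/=3D/=/g' \
    | grep -oE 'https://[^[:space:]"<>]+' \
    | head -n 1
}

# Made-up sample body; the query parameters are placeholders.
printf 'Click:\r\nhttps://email-verification.us-east-1.amazonaws.com/?Context=3D123&X=\r\n=3Dabc&Y=3Dz\r\nbye\r\n' \
  | qp_extract_url
```

Joining the soft wraps before decoding matters: a wrap can fall in the middle of the URL, so a per-line grep would only ever see a fragment.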
…ck_on
Operator hit the test panic at line 145: "Cannot start a runtime from
within a runtime. This happens because a function (like `block_on`)
attempted to block the current thread while the thread is being used to
drive asynchronous tasks."
Cause: `Handle::block_on` is forbidden when called from inside a tokio
runtime context. Drop runs WHILE still inside #[tokio::test]'s runtime (the
runtime hasn't shut down by the time Drop fires for `let _guard =`), so the
previous code panicked even though we had `try_current → Ok` to "detect"
the active runtime.
Test ran end-to-end successfully BEFORE this Drop panic. Log shows:
  ses_email_flow: found inbound object key=inbound/8dqr… (attempt 1)
…the assertions never got to run because Drop tore down first.
Fix: wrap `handle.block_on(cleanup_fut)` in `tokio::task::block_in_place`,
which suspends the current async task so a nested blocking call is legal.
Requires multi_thread runtime, already guaranteed by
`#[tokio::test(flavor = "multi_thread")]` on the test attribute; no behavior
change for the rest of the test.
The `Err(_) → Runtime::new()` branch is preserved as a fallback for the
edge case where Drop fires AFTER the runtime has been torn down (e.g. test
panic during runtime shutdown). Won't normally trip in practice.
Operator request: enforce that no hardcoded values land in scripts/code/
runbooks unless logged in a dedicated audit doc.
CLAUDE.md
- New "No-hardcoded-values policy" between Runbook-fix-fold-back and
  Plan-completion. Says: parameterize via env / CLI / config; if
  temporarily hardcoded, log in hardcoded.md with file+line, why, and the
  unblock action.
hardcoded.md (NEW)
- Seeded with the existing operator-deployment-pinned values (ACCOUNT_ID,
  BROKER_HOST, MAIL_DOMAIN, BROKER_EMAIL_FROM_ADDRESS, BROKER_DATA_ROLE_ARN),
  the deployment-architecture-pinned values (loopback ports 8090/8091/8092,
  agentkeys system user, /etc/agentkeys paths), and code-level constants
  (TOKEN_TTL_SECONDS, rate-limit defaults, SES integration test defaults).
- Each entry: what's hardcoded, why, what would unblock making dynamic.
- Open trade-off section flags the email_link HMAC removal (b8481fe) for
  revisit when scaling to multi-broker-replica deployments.
scripts/broker.env (smell fix called out in hardcoded.md)
- Add ACCOUNT_ID=429071895007 as the single source of truth.
- Derive BROKER_DATA_ROLE_ARN from ${ACCOUNT_ID} (was hardcoded separately,
  drifted from operator-workstation.env's ACCOUNT_ID).
- Verified: `set -a; source ./scripts/broker.env; set +a` expands
  ACCOUNT_ID + BROKER_DATA_ROLE_ARN correctly.
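The single-source-of-truth derivation can be illustrated with a throwaway env file. The ARN shape below (role name agentkeys-data-role under the standard arn:aws:iam::ACCOUNT:role/NAME form) is an assumption about what broker.env contains; only ACCOUNT_ID and the `set -a; source; set +a` sequence are taken from the commit text.

```shell
# Write a stand-in for scripts/broker.env to a temp file (contents assumed).
env_file=$(mktemp)
cat >"$env_file" <<'EOF'
ACCOUNT_ID=429071895007
BROKER_DATA_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-data-role"
EOF

set -a            # auto-export every variable assigned while sourcing
. "$env_file"
set +a

echo "$BROKER_DATA_ROLE_ARN"
```

Because the ARN is expanded at source time, changing ACCOUNT_ID in one place is enough; there is no second copy to drift.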
…al traceability
… switch into stage7 doc
The script previously masked AccessDenied from list-objects-v2 with
'2>/dev/null || true', manifesting as endless 'attempt N/24 - 0
object(s) under inbound/' polling when the operator forgot to switch
to agentkeys-admin profile (the broker user lacks s3:ListBucket on
the mail bucket per cloud-setup.md section 2.1).
Two changes:
1. Script now preflights 'aws sts get-caller-identity' + a
ListObjectsV2 probe before entering the poll loop. Wrong-profile
case dies with explicit 'Run: awsp agentkeys-admin' guidance
instead of silently spinning. Also drops the 2>/dev/null mask on
the poll-loop list call now that preflight proves the cred path.
2. Stage 7 demo doc section 0.4 prereq block now shows the awsp +
set -a;source;set +a sequence inline, with a callout naming the
previous failure mode so the next operator recognizes it
immediately.
Reproduced locally:
AWS_PROFILE=agentkey-broker bash scripts/ses-verify-sender.sh
-> exits 1 with: 'wrong AWS profile: arn:...:user/agentkey-broker
lacks s3:ListBucket on agentkeys-mail-429071895007.
Run: awsp agentkeys-admin then re-run this script.'
User approved one-shot raw-git use because this dir is a git-linked
worktree (.git is a file pointing back to parent repo); jj root
resolves to parent and cannot see these paths.
…-restart
Root cause: the post-restart healthz check used a single 5s curl with
'|| warn'. A service in systemd Restart=always loop (e.g. broker crashing
on BROKER_AUTH_METHODS=email_link with binary built without --features
auth-email-link) shows up as a one-line warn the operator scrolls past, and
the script exits 0. Operator declares the host healthy, then 30 minutes
later hits 502 Bad Gateway from nginx and has to re-diagnose from scratch.
Three changes:
1. scripts/setup-broker-host.sh: replace the warn-only one-shot curl
   probes with probe_or_die(): poll /healthz for 20s per service (10x 2s
   with --max-time 2), and on persistent failure dump 'systemctl status' +
   last 40 journal lines for the failing unit, then die with a fix-list
   naming the three most common boot crashes (gated-out feature, missing
   FROM address, AWS creds).
2. docs/stage7-demo-and-verification.md §0.4 prereq #2: instruct operator
   to 'rm -f target/release/agentkeys-broker-server' before re-running the
   script (cargo's incremental cache occasionally leaves the wrong artifact
   in place when feature flags change across rebuilds; clean target avoids
   the failure mode entirely). Plus a '502 Bad Gateway' troubleshooting
   block pointing at the journal grep + the canonical fix.
3. Same doc: name the exact boot-crash error string ('unknown or
   feature-gated-out auth method') the next operator will see, so they
   don't have to round-trip with logs.
Per runbook-fix-fold-back policy: every operator-encountered failure makes
the runbook strictly more robust before we move on.
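A bounded retry loop in the spirit of probe_or_die() might look like the sketch below. The helper name probe_until_ok and its argument order are hypothetical, and the probe command is passed in rather than hardcoded to curl so the loop can be exercised without a running service.

```shell
# Hypothetical helper: retry a probe command a bounded number of times.
# Usage: probe_until_ok ATTEMPTS SLEEP_SECS CMD [ARGS...]
probe_until_ok() {
  attempts=$1; pause=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0                 # probe succeeded; service is up
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  return 1                     # never became healthy within the budget
}

# Roughly matching the 10x 2s budget described above ($HEALTHZ_URL and
# $UNIT are placeholders, not names from the script):
#   probe_until_ok 10 2 curl -fsS --max-time 2 "$HEALTHZ_URL" \
#     || { systemctl status "$UNIT"; journalctl -u "$UNIT" -n 40; exit 1; }
```

The point of the loop over a single curl is that a unit stuck in a restart loop fails every attempt, so the failure becomes a hard stop instead of a warning that scrolls past.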
…ed-mode case bug
Pass-by-pass cleanup of scripts/setup-broker-host.sh, behavior preserved
(verified by grep-locking 17 critical strings: env vars, ports, paths,
systemd unit names, feature flags, function calls). Net -75 lines (1019
-> 944, -7.4%).
Pass 1 — Dead code:
- Drop prompt_default() and prompt_choice() (defined but never called).
- Drop --skip-pull flag, PULL_SKIP var, and the redundant '! $PULL_SKIP'
guard (the outer '[[ -n "$PULL_REF" ]]' already gates the pull).
--skip-pull is now folded into the --upgrade no-op arm so existing
callers still parse cleanly.
Pass 1b — Latent bug fix:
- The 'case "$CRED_MODE"' block in the trailing manual-steps section
had a duplicate 'instance-profile)' arm: the FIRST one was reached
but contained text describing 'none mode'; the SECOND (which had the
correct instance-profile text) was unreachable dead code; and 'none'
mode users got NO instructions at all because no 'none)' arm existed.
Renamed the first arm to 'none)' so all three modes now print their
intended manual-steps text.
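The duplicate-arm behavior is easy to reproduce in isolation. In this sketch the function names, the echoed strings, and the third mode name (static-keys) are placeholders, not taken from the script; only the instance-profile and none arms follow the description above.

```shell
# In a case statement the first matching arm wins, so a second arm with the
# same pattern can never run.
steps_buggy() {
  case "$1" in
    instance-profile) echo "text meant for none mode" ;;
    instance-profile) echo "text meant for instance-profile" ;;  # unreachable
    static-keys)      echo "text meant for static-keys" ;;        # placeholder mode
  esac
}

# After renaming the first arm, every mode reaches its own text.
steps_fixed() {
  case "$1" in
    none)             echo "text meant for none mode" ;;
    instance-profile) echo "text meant for instance-profile" ;;
    static-keys)      echo "text meant for static-keys" ;;
  esac
}

steps_buggy instance-profile   # prints the wrong text
steps_buggy none               # prints nothing
steps_fixed none
```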
Pass 2 — Duplicate consolidation:
- Three near-identical 'if [[ -d /etc/nginx/sites-enabled ]]; then ln
-sf … fi' blocks (broker, signer-HTTPS, signer-HTTP-only) collapsed
into ONE block after write_nginx_site returns. ln -sf is idempotent
so this is behavior-equivalent.
- certbot install: 'case "$PM"' had two arms with identical package
list ('certbot python3-certbot-nginx'); collapsed to a single
'"${PM_INSTALL[@]}" certbot python3-certbot-nginx' invocation.
Pass 3 — Comment trim:
- 58-line header reduced to 18 lines: dropped the 'Order of operations'
enumeration (duplicated by the section comments inline) and the
--flag enumeration (duplicated by the case parser + --help dump).
Kept the canonical 'CLAUDE.md says all remote-host changes go through
this script' rule + out-of-scope list.
Idempotency audit (no changes needed — already correct):
• build deps: apt/dnf -y, idempotent
• rustup install: gated 'if ! have rustup'
• systemctl stop: '|| true'
• binary backup: gated 'if [[ -x ]]'
• install -m 0755: overwrite-OK
• useradd: gated 'if ! id -u agentkeys'
• install -d: idempotent
• DEV_KEY_SERVICE secret: gated 'if ! sudo test -s' (never regenerated)
• systemd unit writes: tee overwrites — intended each run
• nginx install: gated 'if ! have nginx'
• nginx site write: tee overwrites — intended (handles HTTP→HTTPS flip)
• sites-enabled ln -sf: -f forces, idempotent
• certbot install: gated 'if ! have certbot'
• ensure_broker_keypairs: per-keypair 'if sudo test -f' guard
• daemon-reload, enable, restart: idempotent
Verification:
bash -n scripts/setup-broker-host.sh # syntax ok
grep -F locked 17 critical strings # all present
…ps auth-email-link
Root cause of the broker host's repeated 'BOOT_FAIL: BROKER_AUTH_METHODS=
"email_link": unknown or feature-gated-out auth method' even after a
fresh target/ rebuild: the script used a SINGLE cargo invocation to
build BOTH agentkeys-mock-server AND agentkeys-broker-server with
'--features agentkeys-broker-server/auth-email-link', and cargo
silently DROPS the feature flag in this multi-package selection mode.
Reproduced empirically with --message-format json:
cargo build --release -p agentkeys-mock-server -p agentkeys-broker-server \
--features agentkeys-broker-server/auth-email-link
→ broker compiled features: [audit-sqlite, auth-wallet-sig, default,
wallet-keystore] ← NO auth-email-link
vs the working separate form:
cargo build --release -p agentkeys-broker-server --features auth-email-link
→ broker compiled features: [audit-sqlite, auth-email-link,
auth-wallet-sig, default, wallet-keystore] ← present
Fix:
1. Split the build into two separate cargo invocations — mock-server
alone (default features), broker-server alone with the feature flag.
Documented the footgun in a long block comment so the next person
who 'optimizes' by re-merging them will read why before doing it.
2. Added a post-build sanity check: 'strings target/release/agentkeys-
broker-server | grep /v1/auth/email/(request|verify)' must match
before install + restart. If the cargo footgun ever resurfaces (or
anyone introduces a similar feature-strip bug), the script dies HERE
with a clear diagnostic instead of after install + systemd restart
loop + journal dump.
Verified locally:
bash -n scripts/setup-broker-host.sh # syntax ok
strings target/release/agentkeys-broker-server | grep /v1/auth/email
→ /v1/auth/email/request /v1/auth/email/verify /v1/auth/email/status
/v1/auth/email/landing (all four routes present)
…o clean -p
The previous fix (commit 6d75599) split the cargo build into separate
invocations to defeat the multi-package + --features footgun, but the
broker host STILL deployed binaries lacking auth-email-link. Two real root
causes survived:
1. CARGO INCREMENTAL CACHE: 'rm -f target/release/agentkeys-broker-server'
   only removed the output binary, not target/release/deps/.fingerprint/
   nor the per-feature-set cached .rlib deps. On a host that previously
   built without auth-email-link, cargo's incremental could relink from
   stale deps and produce a binary missing the feature even when the build
   call was correct.
   Fix: 'cargo clean -p agentkeys-broker-server --release' before the
   rebuild (only ~1s, only this crate's cache).
2. WEAK VERIFICATION: 'strings | grep -qE "/v1/auth/email/request"' is a
   heuristic that:
   - false-positives on tower middleware names containing 'email'
   - false-negatives when LTO dedupes string literals across the binary
   - dies with an unactionable 'this is the cargo footgun' guess that was
     wrong (the call was correct; the host environment was the bug)
   Replace with: parse cargo's own --message-format=json output and ASSERT
   auth-email-link is in the bin artifact's features list. Cargo's reported
   features ARE the truth; no heuristic.
Critical bash detail: cargo --message-format=json sends NDJSON to stdout
and compiler messages to stderr. Merging them with '2>&1' corrupts the
NDJSON and jq dies with 'Invalid numeric literal at line N column M'. The
script now redirects them to separate temp files (BUILD_JSON / BUILD_ERR)
and only mixes them in the diagnostic 'tail -30' on failure.
The strings check is kept as belt-and-suspenders (catches the 'cargo claims
success but binary on disk is stale' edge case). Switched to 'grep -aFq'
per codex review: -a forces text mode (some Linux strings implementations
differ on binary detection), -F treats the route as a fixed string (no
regex interpretation of '/').
If cargo reports auth-email-link is NOT enabled despite --features
auth-email-link, the new die message lists 5 specific things to check
($HOME/.cargo/config.toml, workspace .cargo/config.toml, env vars, 'which
cargo', Cargo.lock drift) instead of guessing.
Verified locally:
- cargo clean -p removes 17 files / 61.8MiB (only broker artifacts)
- cargo --message-format=json reports features=[audit-sqlite,
  auth-email-link, auth-wallet-sig, default, wallet-keystore]
- assertion passes; strings check passes
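The stdout/stderr separation can be demonstrated without cargo or jq. In this sketch fake_build is a hypothetical stand-in that mimics the split described above (JSON lines on stdout, a human-readable warning on stderr); the BUILD_JSON / BUILD_ERR names follow the commit text, while MIXED is added here only for the comparison.

```shell
# Stand-in for `cargo build --message-format=json` (hypothetical).
fake_build() {
  echo '{"reason":"compiler-artifact","features":["auth-email-link","default"]}'
  echo 'warning: unused variable' >&2
  echo '{"reason":"build-finished","success":true}'
}

BUILD_JSON=$(mktemp); BUILD_ERR=$(mktemp); MIXED=$(mktemp)

fake_build >"$BUILD_JSON" 2>"$BUILD_ERR"   # separate files: stdout stays pure NDJSON
fake_build >"$MIXED" 2>&1                  # merged: a non-JSON line lands in the stream

# Count lines that do not start with '{' in each capture.
bad_separate=$(grep -cv '^[{]' "$BUILD_JSON")
bad_mixed=$(grep -cv '^[{]' "$MIXED")
echo "non-JSON lines: separate=$bad_separate mixed=$bad_mixed"
# prints: non-JSON lines: separate=0 mixed=1
```

A line-oriented JSON parser fed the merged capture fails at the first stray line, which is consistent with the jq error quoted above.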
…re paths
Per CLAUDE.md runbook-fix-fold-back: now that scripts/setup-broker-host.sh
catches the cargo-feature-not-enabled case at build-time (commit c235373's
--message-format=json assertion), the operator-facing troubleshooting needs
two distinct entries:
1. Build-time die ('cargo did NOT enable auth-email-link'): host has a
   .cargo/config.toml or env-var override; script lists 5 things to check
   before the operator should file an issue.
2. Boot-time BOOT_FAIL: now historical (defended by both cargo clean -p AND
   the JSON assertion); kept as a fallback diagnostic for the case where
   the broker was started outside the script.
If the boot-time BOOT_FAIL ever recurs on a fresh re-deploy, the doc now
points the operator at 'bash -x' tracing instead of the previous generic
'rm -f && re-run' fix that no longer applies.
…nm to warn
Reported failure: on Ubuntu with rustc 1.95.0, the script dies with 'binary
on disk does not match cargo's reported feature set' even though cargo
--message-format=json correctly reports auth-email-link is enabled. The
'strings | grep' belt-and-suspenders check is a false negative on this
combination: likely rustc 1.95 MIR opts or Ubuntu binutils' strings
defaults differ from macOS, splitting/stripping the route literal in ways
grep doesn't see.
Cargo's JSON output IS the canonical truth. If cargo says the feature is
enabled, it IS enabled; the post-build sanity check should not override
that with a heuristic.
Three changes:
1. Drop the 'strings die' entirely. It produced wrong-failure on a
   correctly-built binary, blocking the deploy AFTER cargo had already
   confirmed success.
2. Replace with a 'nm' symbol-table check (more reliable than strings;
   symbols are link-time evidence the function is compiled in). But keep it
   WARN-only: if nm doesn't see the symbols on this rustc version either,
   that's a diagnostic signal, not a stop signal.
3. probe_or_die post-restart is the canonical runtime gate. If the binary
   really lacks the feature, the broker BOOT_FAILs with 'unknown auth
   method' and probe_or_die catches it within 20s with the journal output.
   So we lose nothing by trusting cargo here.
Tested locally:
- nm sees 5+ email-link symbols on macOS
- cargo JSON assertion still fires on bad builds
- probe_or_die remains the runtime safety net
The user can now re-pull + re-run setup-broker-host.sh and the build phase
will succeed (because cargo's truth is trusted). If the binary is actually
broken, probe_or_die catches it post-restart with full journal output.
…n needed
User feedback: 'cargo clean -p' on every re-deploy adds 3-5min full
rebuild — too slow for the common case where the cache is fine.
New behavior:
Default (no flag): incremental build, no clean. Assert via cargo's
JSON output that auth-email-link is enabled. If
the assertion misses, SELF-HEAL by running
'cargo clean -p' + rebuild ONCE. Failing the
retry is a real environment bug (host config
override, env var pin) and dies with diagnostics.
→ Fast path: ~10-30s on warm cache.
--clean Force 'cargo clean -p' upfront before the build.
Use after a feature flag flip when you KNOW
cargo's cache will mislead. → 3-5min full rebuild.
--no-clean Never clean; trust incremental cache. Disable
self-heal too — die immediately on assertion miss.
Use in CI / unattended re-deploys where you want
hermetic, fast, fail-loud behavior.
Also: the assertion now treats 'cargo emitted no compiler-artifact'
(incremental cache hit, nothing to rebuild) as a PASS rather than a
fail. Without the artifact line cargo is saying 'binary on disk is
unchanged from last build' — that's fine, because last build was
either also under this script's control (with the assertion) or the
assertion will trigger the rebuild path.
Refactored into two helpers (build_broker_with_features +
assert_feature_enabled) to make the auto/--clean/--no-clean dispatch
readable.
Verified locally:
- default mode + warm cache: artifact emitted, features reported,
assertion passes (~instant)
- --clean: clean + rebuild + assertion passes
- --no-clean: assertion-only, no retry on miss
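The three-mode dispatch can be modeled with stubs. Everything in this sketch is hypothetical scaffolding: build, clean, and assert_feature only simulate a stale cache through the CACHE and FEATURE variables, and build_with_policy is an illustrative name rather than the script's own helper.

```shell
# Stubs that simulate a cache which only yields the feature after a clean.
LOG=""
build()          { LOG="$LOG build"; if [ "$CACHE" = ok ]; then FEATURE=1; else FEATURE=0; fi; }
clean()          { LOG="$LOG clean"; CACHE=ok; }
assert_feature() { [ "$FEATURE" = 1 ]; }

# mode: auto (default) | clean | no-clean
build_with_policy() {
  mode=$1
  if [ "$mode" = clean ]; then clean; fi   # --clean: clean upfront
  build
  if assert_feature; then return 0; fi
  if [ "$mode" = auto ]; then              # default: self-heal exactly once
    clean
    build
    if assert_feature; then return 0; fi
  fi
  return 1                                 # --no-clean, or the retry also failed
}

CACHE=stale
build_with_policy auto && echo "steps:$LOG"
# prints: steps: build clean build
```

Keeping the retry to a single pass means a second miss is reported as an environment problem instead of looping on rebuilds.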
…n disk
Edge case: if a previous build completed successfully, then someone
manually 'rm target/release/agentkeys-broker-server' (e.g. trying to force
a rebuild), cargo's incremental cache says 'nothing changed' and emits no
compiler-artifact line. The previous logic treated that as a pass and
proceeded to install, which then failed with 'install: cannot stat /path:
No such file or directory' instead of something actionable.
Add a one-liner: when ENABLED_FEATURES is empty (no artifact line), check
that the binary actually exists at the expected path. If not, return 1 so
the self-heal path kicks in (cargo clean -p + rebuild).
Cheap (-x test, ~ms) and shores up the only remaining hole in the
incremental-build trust model.
… SES v2
Pass-2 broker (auth-email-link) hits AccessDeniedException at runtime
because the broker calls 'ses SendEmail' (SES v2 API) with its OWN
instance-profile credentials, but cloud-setup.md only granted SES
permission to the per-user-assumed agentkeys-data-role.
Two layered fixes:
1. cloud-setup.md §3.4 (agentkeys-broker-host instance profile): add
a second put-role-policy call attaching 'BrokerSendEmail' with
ses:SendEmail on both the domain identity and any per-address
identity at that domain. The runbook had only sts:AssumeRole on
this role, which was sufficient pre-Pass-2 but not anymore.
2. stage7-demo-and-verification.md §0.4 prereqs: add a troubleshooting
block for the exact error string the operator sees:
'broker rejected /v1/auth/email/request: status=502
body={"error":"backend_unreachable",
"message":"... ses SendEmail: unhandled error
(AccessDeniedException)"}'
with the one-shot fix command + explanation of WHY ses:SendEmail
(not ses:SendRawEmail — different IAM action for sesv1 vs sesv2).
The IAM update propagates ~instantly; no broker restart needed (sesv2
picks up creds per-call).
Per CLAUDE.md runbook-fix-fold-back: every operator-encountered
failure makes the runbook strictly more robust before we move on.
… hardcoded name
Applied ses:SendEmail to the broker's actual runtime role
(S3-full-access — discovered via 'aws ec2 describe-instances' on
the live broker host). The existing docs assumed the canonical role
name 'agentkeys-broker-host' from §3.4 fresh setup, but legacy
deploys (this one included) use an ad-hoc legacy name from initial
provisioning that predates the broker.
Two doc changes:
1. cloud-setup.md — moved the SES grant out of §3.4 (where it was
wrong: §3.4 is a clean-slate role-creation block, and operators
running through it would get the grant for the wrong reasons).
Added new §3.4a 'ses:SendEmail grant on the broker's runtime role
(Pass 2 prereq)' with explicit two-step flow:
Step 1: discover the actual role attached via the broker's EC2 IP
ROLE=$(aws ec2 describe-instances --filters Name=ip-address,...)
ROLE=$(aws iam get-instance-profile --instance-profile-name "$ROLE" ...)
Step 2: aws iam put-role-policy --role-name "$ROLE" --policy-name BrokerSendEmail
Both steps reference $ROLE (variable, set by discovery), NOT a
hardcoded role name. Includes the verify command operators should
run after.
2. stage7-demo-and-verification.md §0.4 troubleshooting block —
updated to use the discovery-then-grant pattern instead of
hardcoding 'agentkeys-broker-host'. Cross-links to §3.4a for the
full flow.
Verified end-to-end: ran the discovery + grant against the live
broker host (i-0c0b739bd35643fd3 / S3-full-access role, elastic IP
54.164.117.252). The inline policy 'BrokerSendEmail' now grants
ses:SendEmail on:
- arn:aws:ses:us-east-1:429071895007:identity/bots.litentry.org
- arn:aws:ses:us-east-1:429071895007:identity/*@bots.litentry.org
No broker restart needed — sesv2 picks up the grant per-call.
Three related fixes addressing the user-encountered blocker (CLI polls
forever because alice@demo.example uses an RFC 2606 reserved example
domain, so there is no inbox to click from):
1. NEW scripts/agentkeys-init-email-demo.sh — fully automated demo:
• Picks demo-1@bots.litentry.org or demo-2@... by parity of unix
epoch seconds (so consecutive runs don't collide on the broker's
single-use token TTL).
• Snapshots existing inbound/ keys BEFORE SendEmail so we only
inspect arrivals NEW to this run (vs scanning 400+ stale objects).
• Spawns 'agentkeys init --email' in background; polls S3 for the
magic-link email; QP-decodes the body to extract
'$OIDC_ISSUER/auth/email/landing#t=<token>'.
• Lifts the token out of the URL fragment and POSTs
{token: <t>} to /v1/auth/email/verify — replicating exactly
what the browser-side JS in /auth/email/landing does (curling
the landing URL alone wouldn't work; fragments don't ride in
HTTP requests).
• Cleans up the consumed S3 object on success.
• Waits for agentkeys init to complete; dumps log + dies on
timeout. Includes preflight that rejects wrong AWS profile
(agentkey-broker user lacks ListBucket).
2. cloud-setup.md §3.4a:
• Step 2: grant now includes BOTH ses:SendEmail (per-request) AND
ses:GetEmailIdentity (verify_sender_ready startup probe).
Previously the broker BOOT_FAILED on GetEmailIdentity for any
fresh deploy with this section's recommended grant.
• NEW Step 3 'security audit': explicit warning + commands to
detach AmazonS3FullAccess and similar over-broad managed
policies. The broker process at runtime ONLY uses aws-sdk-sts +
aws-sdk-sesv2; per-user S3 access is via JWT-assumed
agentkeys-data-role, NEVER via the broker's runtime role. A
compromised broker with S3FullAccess could read every magic
link in the inbound bucket.
3. stage7-demo-and-verification.md §0.4: replaced
'agentkeys init --email alice@demo.example' (undeliverable) with
the new auto-click helper as the RECOMMENDED path; kept manual
alternative for operators with a real inbox they control. Explicit
warning to not use example.com / demo.example.
Live broker IAM (i-0c0b739bd35643fd3, role 'S3-full-access'):
• Inline 'BrokerSendEmail': ses:SendEmail + ses:GetEmailIdentity
on identity/bots.litentry.org + identity/*@bots.litentry.org
• Detached: AmazonS3FullAccess (was: full read/write on all account
buckets, including the verification-token bucket)
• Final state: 1 inline policy, 0 attached policies, all least-
privilege.
The script's auto-click flow is also a useful regression-test loop —
the user wanted '1 or 2 emails for test' so we can drive a full
auth round-trip without a human in the loop.
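Two of the helper's small steps, lifting the token out of the URL fragment and alternating between the two demo addresses, can be sketched as below. The sample URL, the function name pick_demo_address, and which parity maps to which address are assumptions; the commit only says the choice is made by parity of the epoch second.

```shell
# Sample landing URL (host and token are made up for illustration).
landing_url='https://broker.example.test/auth/email/landing#t=abc123'

# Fragments are never sent to the server, so the token has to be cut out
# of the URL locally: drop everything up to and including '#t='.
token=${landing_url#*#t=}

# JSON body in the shape described above: {token: <t>}.
body=$(printf '{"token":"%s"}' "$token")

# Alternate between demo-1 and demo-2 by parity of the epoch second.
pick_demo_address() {   # $1 = epoch seconds, $2 = mail domain
  echo "demo-$(( $1 % 2 + 1 ))@$2"
}

echo "$body"
pick_demo_address "$(date +%s)" "bots.litentry.org"
```

The body would then be POSTed to $OIDC_ISSUER/v1/auth/email/verify, for example with curl -d "$body"; the exact headers the broker expects are not stated in the commit text.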
The polling loop waited the full 2-min budget for an email that would never
arrive if 'agentkeys init' had already exited (broker rejection, signer
unauthorized, etc.).
Add a 'kill -0 $init_pid' check at the top of each iteration: if init is
gone, dump its log and die. Cuts the failure-mode latency from 2 min to ~5s
and surfaces the actual error from init's stdout/stderr.
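The liveness check can be sketched with a file standing in for the awaited email. The function name wait_for_file_or_exit and its return codes are hypothetical; `kill -0` sends no signal and only reports whether the process still exists.

```shell
# Hypothetical sketch: poll for a result, but stop early if the background
# worker (standing in for `agentkeys init`) has already exited.
wait_for_file_or_exit() {   # $1 = worker pid, $2 = file to wait for, $3 = attempts
  pid=$1; file=$2; attempts=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ -e "$file" ]; then
      return 0                              # result arrived
    fi
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "worker $pid exited before $file appeared" >&2
      return 2                              # no point waiting any longer
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1                                  # timed out, worker still running
}
```

Distinguishing "worker died" from "timed out" lets the caller dump the worker's log only in the case where that log actually explains the failure.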
User hit: 'REGION env var required (source operator-workstation.env)'
even after sourcing the env file. Root cause: they ran the script
with sudo, which (per most distros' default sudoers) strips env to
PATH/USER/HOME/TERM/MAIL only — REGION/MAIL_DOMAIN/MAIL_BUCKET/
OIDC_ISSUER/BACKEND_URL all vanish in the child process and the
script dies on the first ${VAR:?...} guard.
The script doesn't need root: AWS calls use the operator's profile
(in shell env), and 'agentkeys init' writes the session JWT to the
USER's OS keychain. Running under sudo would actually break things
even if env was preserved (keychain lookup would target root's
keychain, not the operator's).
Two changes:
1. scripts/agentkeys-init-email-demo.sh: detect SUDO_USER at start
and die loud with the exact re-run command, before the cryptic
env-var guard fires.
2. docs/stage7-demo-and-verification.md §0.4: explicit
'Do NOT prefix sudo' note next to the recommended invocation,
explaining why (env stripping + wrong keychain).
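A guard of this kind can be as small as the sketch below. The function name refuse_sudo and the message wording are hypothetical; SUDO_USER is the variable sudo sets to the invoking user's name.

```shell
# Hypothetical guard: refuse to run under sudo, with an actionable message.
refuse_sudo() {
  if [ -n "${SUDO_USER:-}" ]; then
    echo "do not run with sudo: re-run as $SUDO_USER without the sudo prefix" >&2
    return 1
  fi
  return 0
}

# A script would call it before any ${VAR:?...} guard, for example:
#   refuse_sudo || exit 1
```

Running the guard first means the operator sees the real cause (sudo) instead of the downstream symptom (a missing REGION variable).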
…-true)
Bug: 'aws --output text' returns keys TAB-separated. The previous
substring check 'case " $pre_keys " in *" $k "*' looked for
SPACE-surrounded matches, so every key in current_keys missed and
every poll attempt reported all 415+ pre-existing keys as 'new'.
Functionally correct (the per-key body grep still narrows down to
the magic-link email) but ~415 needless 'aws s3 cp' calls per
attempt — slow.
Fix: build a bash associative array (pre_set[$k]=1) at snapshot
time. O(1) membership check per key in the polling loop. Switch
new_keys from a space-separated string to a proper array so it
works regardless of key contents.
Verified locally: bash -n syntax ok; empty-array iteration safe
under 'set -euo pipefail' (declare -a + "${new_keys[@]}").
User error: 'declare: -A: invalid option'. macOS ships /bin/bash 3.2
forever (Apple GPLv3 freeze) and the script's shebang resolves there.
'declare -A' (associative arrays) requires bash 4+.
Replace the associative-array set with a string-based set:
PRE_KEYS_SET=' $pre_keys_text ' # leading + trailing spaces
case "$PRE_KEYS_SET" in *" $k "*) continue ;; esac
Bash-3.2 compatible. SES-generated S3 keys are alphanumeric (no
spaces), so the space delimiter is exact-match safe. 'tr \t \ '
normalizes the tab-separated 'aws --output text' output upfront.
Verified locally under /bin/bash 3.2.57:
- syntax check passes
- isolated dry-run: 5 pre-existing keys, 1 new arrival → set
difference correctly returns just the new key
Indexed arrays + array+= and "${arr[@]}" iteration are bash 3.1+,
so the rest of the script (new_keys array) still works.
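A sketch of the string-based set difference described above, written to stay within bash 3.2 features. The function name `new_keys_since` is hypothetical, and the unquoted word-splitting relies on the stated assumption that keys contain no whitespace or glob characters.

```shell
# Print every key in "$2" that is absent from "$1".
# Both arguments are whitespace- or tab-separated key lists.
new_keys_since() {
  local pre_set k
  # Normalize tabs/newlines to spaces and pad both ends so that
  # *" $k "* is an exact-match test, not a substring test.
  pre_set=" $(printf '%s' "$1" | tr '\t\n' '  ') "
  for k in $(printf '%s' "$2" | tr '\t' ' '); do
    case "$pre_set" in
      *" $k "*) continue ;;   # already present at snapshot time
    esac
    printf '%s\n' "$k"
  done
}
```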
…r ran on success
Bash precedence: '|' binds tighter than '||'. So
cmd1 2>/dev/null || true | tr '\t' ' '
parses as
cmd1 2>/dev/null || (true | tr '\t' ' ')
meaning tr ONLY runs if aws fails. On success (the common path)
pre_keys_text remained tab-separated, the case-pattern '*" $k "*'
looked for space-surrounded matches, every key missed, every poll
attempt reported all 417 keys as 'new'.
The earlier '/bin/bash isolated dry-run' didn't reproduce because it
used a different invocation form (printf piped to tr) that wasn't
subject to this precedence trap.
Fix: group with braces so the pipe gets the output of either branch.
Verified live against the actual 417-object inbound bucket under
/bin/bash 3.2.57:
- pre_keys_text now space-separated (no tabs detected)
- same-list comparison correctly returns 0 new keys
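The precedence trap and the brace-grouped fix can be reproduced in isolation. Here `list_keys` is a hypothetical stand-in for the aws call; it always succeeds and prints tab-separated keys.

```shell
# Hypothetical stand-in for the aws listing; succeeds with tab-separated output.
list_keys() { printf 'k1\tk2\tk3\n'; }

# Buggy: parsed as `list_keys || (true | tr ...)`. On success the right-hand
# side never runs, so the tabs survive.
buggy=$(list_keys 2>/dev/null || true | tr '\t' ' ')

# Fixed: braces group the `||` list, so tr receives the output of whichever
# branch ran. Note the required `;` before the closing brace.
fixed=$({ list_keys 2>/dev/null || true; } | tr '\t' ' ')
```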
Magic-link demo (scripts/agentkeys-init-email-demo.sh) was failing
after the broker accepted the click ({"ok":true}) but before
returning the derived wallet. The error was 'signer error:
unauthorized: missing Authorization: Bearer <jwt> header'.
Root cause: in crates/agentkeys-core/src/init_flow.rs, two HTTP signer
calls used HttpSignerClient::new() WITHOUT chaining .with_session_jwt():
- derive_via_signer (line 261): creates client without JWT, /dev/derive-address fails 401
- siwe_round_trip (line 314): creates client without JWT, /dev/sign-message fails 401
The standalone agentkeys signer derive / signer sign CLI commands DO
chain .with_session_jwt(session.token) from the keychain (lib.rs:1169),
but the in-flow init_via_email_link path also has the identity-session
JWT in hand (just minted by the broker after the magic-link click), so
it just needs to be threaded through. Fixed both call sites + added
#[allow(clippy::too_many_arguments)] on finish_init (which was already
at 8 args — pre-existing clippy warning that surfaced after the audit).
Doc fold-back: stage7-demo-and-verification.md §3 'Mint OIDC JWT for
STS' previously assumed $SESSION_JWT_A was already populated, but the
§2.0 path ('agentkeys init --email') leaves the JWT in the keychain
or file fallback with no CLI extraction wrapper. Added explicit
instructions for both §2.0 (file fallback / macOS Keychain) and
§2.1-2.4 (manual SIWE response capture) paths.
Self-check (all 5 steps green against live broker.litentry.org):
1. agentkeys signer derive → 0x885904faf3d5624a30b0427078015d0072f604ea
2. agentkeys signer sign → 132-char sig
3. broker /healthz → 200
4. /v1/mint-oidc-jwt → 692-char OIDC JWT with correct
aws.amazon.com/tags claims
5. AssumeRoleWithWebIdentity → assumed-role/agentkeys-data-role/...
Stage 7 demo flow validated end-to-end through §4.1 (STS exchange).
§4.2-4.3 (S3 isolation probe) requires writing to the production
bucket and is left to explicit operator authorization.
…tion
The broker's SES outbound mails are pure-ASCII so the parts are
7bit-encoded, and the magic-link URL appears in the body with a LITERAL
'=' between 't' and the base64url token:
https://broker.litentry.org/auth/email/landing#t=Kwm1lO8z...
The previous regex looked only for 't=3D' (QP-encoded form). It never
matched on production emails, so the script timed out polling even
though the email had arrived in S3.
Fix: alternation '#t=(3D)?[A-Za-z0-9_-]+' matches both forms, then
'sed s/#t=3D/#t=/' normalizes to literal-'='.
Verified by extracting against an actual stored email: token came out
clean and POSTs to /v1/auth/email/verify succeed with {"ok":true}.
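A sketch of the two-form extraction, using the regex and sed normalization quoted in the commit. The function name `extract_magic_token` is hypothetical, and returning the bare token (rather than the `#t=` fragment) is a choice made here for illustration.

```shell
# Read an email body on stdin and print the first magic-link token.
# Handles both the 7bit form (#t=TOKEN) and the quoted-printable form (#t=3DTOKEN).
extract_magic_token() {
  grep -Eo '#t=(3D)?[A-Za-z0-9_-]+' \
    | head -n 1 \
    | sed -e 's/^#t=3D/#t=/' -e 's/^#t=//'
}
```

One caveat of this approach: a 7bit-encoded token that happens to begin with the characters `3D` would lose them during normalization, and quoted-printable soft line breaks inside the token are not handled.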
…rofile defaults to us-west-2
The agentkeys-admin local profile defaults to us-west-2 (verified via
`aws --profile agentkeys-admin configure get region`), while every
broker-side resource (EC2, S3 mail bucket, SES identity) lives in
us-east-1. Without an explicit `--region "$REGION"` on every regional
AWS CLI call, the agentkeys-admin profile silently searches the wrong
region: describe-instances returns empty (no error, exit 0), and the
downstream `iam put-role-policy --role-name ""` silently no-ops.
Real symptom (this session): operator ran the §0.4 ROLE discovery
snippet under awsp agentkeys-admin → ROLE came back empty → SES grant
never landed. Diagnosis took two rounds because there's no stderr signal.
Changes:
- CLAUDE.md: new "AWS local-profile ↔ remote-IAM mapping" section
documenting (a) the three-profile table, (b) the per-profile region
divergence trap (agentkeys-admin=us-west-2, others=us-east-1), and
(c) case-insensitive caller-arn matching since the remote IAM user is
agentKeys-admin (capital K) vs local agentkeys-admin (lowercase).
- docs/stage7-demo-and-verification.md §0.4: ROLE discovery now passes
--region "$REGION" + fail-loud guard on empty INSTANCE_PROFILE_ARN.
Plus 5x s3api lines (§4.2 + §16) gain --region.
- docs/cloud-setup.md §3.4a: ROLE discovery rewritten with --region +
fail-loud guard. Plus 5x s3api lines (bucket-policy + lifecycle +
delete-bucket + access-block) gain --region.
- scripts/inspect-inbound-email.sh: require REGION up-front (loud-fail
guard); pass --region "$REGION" on all 4 aws calls.
- scripts/ses-verify-sender.sh: case-insensitive caller-arn match
(`tr [:upper:] [:lower:]`, portable to /bin/bash 3.2) so
agentKeys-admin (capital K) no longer triggers the bogus "caller is
not agentkeys-admin" warning.
Verified end-to-end under AWS_PROFILE=agentkeys-admin (profile region
us-west-2): ROLE discovery now returns S3-full-access correctly;
inspect-inbound-email.sh runs cleanly; ses-verify-sender.sh no longer
emits the spurious warning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
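The case-insensitive caller-arn match could be sketched as below. The function name `arn_is_user` and the example account ID are hypothetical; only the `tr`-based lowercasing is taken from the commit text.

```shell
# Succeed if the IAM ARN in $1 ends in "/<user>", compared case-insensitively.
# tr is used instead of ${var,,} so this also works under /bin/bash 3.2.
arn_is_user() {
  local arn_lc want_lc
  arn_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  want_lc=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$arn_lc" in
    */"$want_lc") return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller would typically feed it the Arn field from `aws sts get-caller-identity` and warn only when the function returns non-zero.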
…eted in prod
Production broker EC2 (i-0c0b739bd35643fd3) was migrated 2026-05-12
from legacy `S3-full-access` instance profile to canonical
`agentkeys-broker-host`. Migration steps executed:
1. Created `agentkeys-broker-host` role + instance profile via
`aws iam create-role` + `create-instance-profile` (matches
cloud-setup.md §3.4 conventions).
2. Attached complete `BrokerSendEmail` inline policy on new role:
`ses:SendEmail` AND `ses:GetEmailIdentity` (the latter folds in the
perm gap that prevented `verify_sender_ready` from succeeding).
3. Atomically swapped EC2 instance profile via
`aws ec2 replace-iam-instance-profile-association` (no creds gap).
4. Verified broker /healthz=200 + sent two test emails through the new
role (HTTP 200, request_id eml-bf4e..., eml-2aff...).
5. Cleaned up legacy artifacts: removed role from old profile, deleted
inline policy + role + instance profile, revoked the temporary
`ec2:Describe/ReplaceIamInstanceProfileAssociations` grant on
`agentKeys-admin` IAM user.
Doc updates:
- cloud-setup.md §3.4a: drops "may use ad-hoc S3-full-access from
initial provisioning" framing (fully retired). Discovery snippet
retained because it's robust against any future drift.
- stage7-demo-and-verification.md §0.4 troubleshooting block: same.
Drops the `legacy/fresh` distinction that no longer applies.
Known follow-up (separate scope, spawned task): `/readyz` still returns
503 with "SES verification cache absent at
/var/lib/agentkeys/.agentkeys/broker/ses-verify.json". This is a
pre-existing bug independent of IAM. Production code never calls
`verify_sender_ready()` and never invokes `SesVerifyCache::save()`, so
the cache file is never populated. The IAM permission is now in place
(this commit's `agentkeys-broker-host` role has `ses:GetEmailIdentity`),
so once the boot path wires `verify_sender_ready()` + `cache.save()`
/readyz will turn green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The email-link plug-in's `Readiness::ready()` reads `SesVerifyCache`
from disk and reports `auth/email_link: SES verification cache absent`
when the file is missing. No production code path called
`verify_sender_ready()` or `SesVerifyCache::save()`, so /readyz was
permanently 503-degraded on this check even when SES was configured
correctly and email-link auth worked end-to-end.
Add a Tier-2 probe spawned alongside the existing backend probe: calls
`sender.verify_sender_ready()`, writes the cache on success, flips
`Tier2State::ses_verified`. Exponential backoff up to 5min on failure
(non-blocking; honors BROKER_REFUSE_TO_BOOT_STRICT). After a success,
re-verifies every 12h so the cache stays well under the plug-in's 24h
freshness TTL.
…tEmailIdentity grant + 722a990 verify-probe note
§0.4 troubleshooting block updated for the post-rename world:
- Lead with the canonical role: "Broker IAM role:
`agentkeys-broker-host`" (was: "the role name varies by deployment ...
legacy may use S3-full-access").
- Document the **complete** BrokerSendEmail policy: BOTH
`ses:SendEmail` AND `ses:GetEmailIdentity`. Previously the grant
snippet only granted SendEmail; the missing GetEmailIdentity perm was
why /readyz reported `auth/email_link: SES verification cache absent`
even when SES was working. Both actions now in the put-role-policy
snippet AND in the copy-paste verify command
(`aws iam get-role-policy ...`).
- Reframe AccessDeniedException troubleshooting: from "find the unknown
role name" → "verify it's still agentkeys-broker-host (defensive
against future drift)". The discovery snippet stays (robust against
future instance-profile churn), but the verify expected output now
references the canonical name explicitly.
- Add the restart-needed nuance for the verify probe: SendEmail picks
up creds per-call (no restart needed), but the Tier-2 verify probe
(commit 722a990) runs once at boot then every 12h, so adding
GetEmailIdentity requires a broker restart for /readyz to reflect it.
Production verified: `aws iam get-role-policy ... BrokerSendEmail`
returns `[["ses:SendEmail","ses:GetEmailIdentity"]]` exactly as the doc
claims.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Keychain-free CLI
Two operator-blocking traps surfaced while walking §0.4 against the
live broker; both fixed end-to-end.
Trap 1: signer rejects derive with "JWT omni_account claim does not
match request body". §0.4 used to call `signer derive --omni-account
$OMNI_A` where `$OMNI_A = sha("agentkeys","email","alice@demo.example")`
from §0.3 — but the session JWT minted by `agentkeys-init-email-demo.sh`
is for `demo-1@bots.litentry.org` (or demo-2 on rotation). After issue
#74 step 1b's strict JWT-omni check, the signer requires
`JWT.omni_account == request.omni_account` exactly. The arbitrary
alice/bob omni never matches.
Fix:
- §0.3 reframed as "math reference only" — the helper recomputes the
broker's omni formula so the operator can verify the algorithm,
but the actual `OMNI_A` / `OMNI_B` come from the live session JWTs
in §0.4 below.
- §0.4 adds a `decode_jwt_payload()` helper that pulls
`agentkeys.omni_account` and `agentkeys.wallet_address` directly
from `~/.agentkeys/master/session.json` (no signature verify — just
base64-decoding the body for our local read).
- For the §4 isolation proof we now run `init-email-demo.sh` TWICE
(the script's epoch-parity rotation between demo-1 and demo-2 gives
two distinct sessions automatically; consecutive runs naturally
yield two distinct (omni, wallet) pairs).
- Drops the wrong `ADDR_A == JWT.wallet_address` assertion. The
signer derive returns the EVM-omni's wallet (post-SIWE-promoted
identity), which is a *different* keypair from the email-omni's
wallet stored in `JWT.agentkeys.wallet_address`. Both are real,
both are derived by the same signer; they play different roles
in the demo (the JWT's wallet_address was the SIWE signing key
that bootstrapped the session; ADDR_A is the EVM-identity wallet
used downstream for S3 path scoping).
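The commit names a `decode_jwt_payload()` helper but does not show its body. A plausible sketch follows; it assumes a `base64` that accepts `-d` (GNU coreutils and recent macOS; older macOS releases spell it `-D`) and performs no signature verification, matching the "local read only" intent above.

```shell
# Print the JSON payload of a compact JWT (header.payload.signature).
# No signature check: this only base64url-decodes the middle segment.
decode_jwt_payload() {
  local payload
  # Take the second dot-separated field and map base64url to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits '=' padding; restore it to a multiple of 4 chars.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

Piping the result through `jq -r '.agentkeys.omni_account'` would then pull out the claim the fix refers to, assuming the claim layout described in the bullets above.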
Trap 2: even with matching omni, `agentkeys signer derive` returned
`SIGNER_UNAUTHORIZED: invalid session JWT: InvalidToken` while a raw
`curl` with the same JWT succeeded. Root cause: the CLI defaults to
`KeyringMode::Auto` (crates/agentkeys-core/src/session_store.rs:86) —
Keychain first, file fallback. A stale Keychain entry from earlier
dev runs gets picked up and fed to the signer, which rejects the
signature. The user-visible symptom is also keychain access prompts
on every CLI call.
Fix:
- `scripts/operator-workstation.env` exports `AGENTKEYS_SESSION_STORE=file`,
which forces `KeyringMode::FileOnly`. The demo is now Keychain-free
end-to-end. Comment explains the trade-off (fresh-machine users can
comment the line out to re-enable Keychain).
- §0.4 callout block documents the trap + the raw-curl fallback so an
operator can self-diagnose "is it the JWT or the CLI?" in one step.
End-to-end verified under AWS_PROFILE=agentkeys-admin with the new
env: OMNI_A extracted from session.json's JWT decodes to
`402d4bac…`; `agentkeys --json signer derive --omni-account $OMNI_A`
returns `0xcd936bf34d3156e84cd2e479e267cf39d15a85a6` (HTTP 200, no
Keychain prompts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4 key-topology rewrite
User pain (during §0.4 walk-through):
1. Each `init-email-demo.sh` run overwrites ~/.agentkeys/master/session.json,
so back-to-back inits for the §4 two-actor isolation proof can't coexist.
2. §0.4 forced operators to hand-decode the JWT in 6 lines of awk+base64 just
to learn OMNI / ADDR — once per session, twice per demo, no rich output.
3. The OMNI_B / ADDR_B / identity-omni / derived-wallet / evm-omni terminology
was opaque: §0.4 didn't reconcile its own vars with `architecture.md` §3+§4
(K3/K4, identity omni vs actor omni), so the operator couldn't tell which
wallet AWS actually sees at the PrincipalTag step in §4.
Changes:
- crates/agentkeys-cli: top-level `--session-id` flag (env AGENTKEYS_SESSION_ID),
plumbed through CommandContext to session_store. Defaults to "master" so
existing behavior is preserved. `with_session_id` ignores empty strings to
keep a forgotten `AGENTKEYS_SESSION_ID=` shell-export from silently writing
to ~/.agentkeys//session.json.
- scripts/agentkeys-init-email-demo.sh: accepts `--session-id <name>` flag
and exports AGENTKEYS_SESSION_ID so the background `agentkeys init` writes
under ~/.agentkeys/<name>/. Two back-to-back runs with distinct ids leave
both sessions live for the §4 proof — no need to re-init to switch. Auto-
invokes scripts/agentkeys-demo-show.sh at the end so the operator sees the
(omni, wallet) pair without a follow-up command.
- scripts/agentkeys-demo-show.sh (new): one-shot rich-output inspector. Reads
~/.agentkeys/<id>/session.json, decodes the JWT body, prints
• identity (type, value, locally-recomputed identity_omni)
• actor (actor_omni, master_wallet)
• signer-wire smoke test (HKDF(K3, actor_omni) — a SECOND wallet,
flagged NOT-used-for-AWS in the output)
• JWT TTL remaining
Supports --json, --no-derive, and positional session-id. Bash-3.2 portable
(no `${var,,}`, no `mapfile`, jq+awk+base64 only).
- docs/stage7-demo-and-verification.md §0.3: corrected the "both omnis end up
in the JWT's `agentkeys` claim" line — the FINAL JWT carries only the EVM
actor omni (the identity-omni is transient and consumed at SIWE-verify).
Cross-linked the truth to crates/agentkeys-broker-server/src/handlers/auth/
wallet_verify.rs:51.
- docs/stage7-demo-and-verification.md §0.4: new "Key topology" subsection
that names the three wallets the demo conflates today —
identity_omni → SHA256("agentkeys"||"email"||email), transient, NOT in JWT
MASTER_WALLET → HKDF(K3, identity_omni_email), the SIWE-linked wallet, JWT.wallet_address
ADDR (= W2) → HKDF(K3, actor_omni), what §2's SIWE round-trip uses and
what §4's S3 isolation actually tags via §2.3's fresh JWT
Both wallets are real, signable, and deterministic; §2.2's `signer sign`
only works for ADDR because the strict JWT-omni check forces the signed
omni to match the JWT's actor_omni. Updated the §0.4 capture block to use
the new demo-show.sh JSON output for both OMNI_A/ADDR_A and an explicit
MASTER_WALLET_A side-channel for cross-reference. Cross-linked
crates/agentkeys-broker-server/src/handlers/oidc.rs:106 (the line that
decides which wallet AWS sees).
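The empty-string guard on the session id has a direct shell analogue, shown here as a sketch. The real logic lives in the Rust CLI (`with_session_id`); the function name `session_file` is hypothetical.

```shell
# Resolve the session file path for the current session id.
# ${VAR:-default} treats both unset AND empty as "use the default", so a
# stray `export AGENTKEYS_SESSION_ID=` cannot yield ~/.agentkeys//session.json.
session_file() {
  local id="${AGENTKEYS_SESSION_ID:-master}"
  printf '%s/.agentkeys/%s/session.json\n' "$HOME" "$id"
}
```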
End-to-end verified locally:
bash scripts/agentkeys-demo-show.sh --no-derive master → rich text
bash scripts/agentkeys-demo-show.sh --no-derive --json master → JSON shape
bash scripts/agentkeys-demo-show.sh --no-derive nonexistent → loud fail
cargo run -p agentkeys-cli -- --help | grep session-id → exposed
AGENTKEYS_SESSION_ID=alice cargo run -p agentkeys-cli -- --help → env wired
cargo test -p agentkeys-cli --lib → green
… prereqs
User hit a silent-failure trap walking §0.4 today: ran
`bash scripts/agentkeys-init-email-demo.sh --session-id alice`, the
script reported success ("Initialized via email-link..."), but the
session landed at ~/.agentkeys/master/session.json instead of
~/.agentkeys/alice/session.json — and demo-show.sh then failed with
"no session file at ~/.agentkeys/alice/session.json".
Root cause: the `agentkeys` binary on $PATH was built before today
(2026-05-12). The `--session-id` flag (and its env=AGENTKEYS_SESSION_ID
binding) is a clap declaration in the binary — an older binary silently
ignores the env var, falls back to the hardcoded "master" default, and
writes to ~/.agentkeys/master/.
Diagnose-before-edit verified by:
command -v agentkeys → /Users/<you>/.local/bin/agentkeys (May 11 21:01)
agentkeys --help | grep session-id → empty (no flag)
ls -la ~/.agentkeys/master/session.json → freshly written
ls ~/.agentkeys/alice/ → no such directory
Fix lands in THREE places (per runbook-fix-fold-back):
1. scripts/agentkeys-init-email-demo.sh — preflight that `agentkeys
--help` exposes `--session-id`. Dies loud with the exact rebuild
command (`cargo install --path crates/agentkeys-cli --force`) and
the verify-after command. Catches the trap BEFORE the script burns
2 minutes polling for an email + writing to the wrong session-id.
2. scripts/agentkeys-demo-show.sh — same capability check inside the
signer-derive branch. Without it, a stale binary feeding the
wrong --session-id to `signer derive` would silently re-derive
against the master session's omni, masking the real diagnosis.
3. docs/stage7-demo-and-verification.md §0 prereqs — step 6 after the
existing `agentkeys --version` check that re-runs the same grep
and dies if absent. Folds the diagnosis inline so the next
operator catches the stale binary at the moment they're already
looking at install output — no need to discover the trap by
watching init-email-demo.sh "succeed" first.
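The capability preflight shared by all three places boils down to grepping the help text for the flag. A sketch, with the hypothetical helper name `help_has_flag`:

```shell
# Succeed if the help text in $1 mentions the flag in $2.
# -F matches the flag literally; -e keeps a leading "--" from being
# parsed as a grep option.
help_has_flag() {
  printf '%s\n' "$1" | grep -qF -e "$2"
}
```

A script would call it as `help_has_flag "$(agentkeys --help 2>&1)" --session-id` and print the rebuild command before exiting when it fails.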
Verified locally:
REGION=u MAIL_DOMAIN=t MAIL_BUCKET=t OIDC_ISSUER=https://t BACKEND_URL=https://t \
bash scripts/agentkeys-init-email-demo.sh --session-id alice
→ "stale 'agentkeys' binary at /Users/agent-jojo/.local/bin/agentkeys
— missing --session-id flag. Rebuild + reinstall from this worktree:
cargo install --path crates/agentkeys-cli --force"
→ exit 1 (no S3 polling, no SES SendEmail)
… --export mode
Three operator-blockers landed today walking §0.4:
1. `--session-id alice` and `--session-id bob` produced the SAME wallet
because the legacy default recipient rotated demo-1/demo-2 by epoch
parity — two back-to-back runs hit the same parity, got the same
recipient, derived the same identity_omni (HKDF deterministic),
thus the same MASTER_WALLET. The §4 isolation proof becomes
vacuous (same actor → same prefix → trivially "allowed both
reads"; demo doesn't prove anything).
2. The `init-email-demo.sh` log + demo-show.sh output named the
identity_omni hex but did NOT show the SHA256 inputs (type, value),
so the operator couldn't reproduce the math by hand or diagnose
why two different sessions collided.
3. §0.4 had three `jq -r` extractions per session to pull OMNI / ADDR
/ MASTER_WALLET out of `--json` — 6 lines for two sessions, with
the field paths hand-typed and easy to mis-name. The doc + the
show script weren't a single source of truth.
Fixes:
- scripts/agentkeys-init-email-demo.sh — new recipient precedence:
$RECIPIENT > positional arg > $SESSION_ID-derived (when not "master")
> legacy demo-1/demo-2 rotation. With `--session-id alice` the
recipient is now alice@$MAIL_DOMAIN deterministically, NOT a
rotating demo-N. The log now prints the computed identity_omni and
the SHA256 formula inline so collisions are visible BEFORE SES
SendEmail fires.
- scripts/agentkeys-demo-show.sh — new `--export <prefix>` mode emits
eval-able shell assignments:
SESSION_ID_<P>=… OMNI_<P>=… ADDR_<P>=… MASTER_WALLET_<P>=…
IDENTITY_TYPE_<P>=… IDENTITY_VALUE_<P>=… IDENTITY_OMNI_<P>=…
so the doc / an operator script can capture all seven fields with
one `eval "$(bash scripts/agentkeys-demo-show.sh --export A alice)"`.
Values are `printf %q`-escaped — survives eval with arbitrary
content. The human-readable output now shows the full
`= SHA256("agentkeys" || "<type>" || "<value>")` formula under the
identity_omni line so the math is reproducible at a glance.
- docs/stage7-demo-and-verification.md §0.4 — replaced the 12-line
`--json | jq -r` extraction block with two `eval` calls + a new
collision-diagnostic that explains exactly why MASTER_WALLET_A ==
MASTER_WALLET_B can happen (same recipient → same identity_omni)
and what the fix is.
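The recipient precedence from the first fix can be sketched as a single function. The name `pick_recipient` and the argument order are hypothetical, and which epoch parity maps to demo-1 versus demo-2 in the legacy branch is an assumption, since the commit does not say.

```shell
# Choose the demo recipient.
# $1 = positional recipient (may be empty), $2 = session id, $3 = mail domain.
# Precedence: $RECIPIENT > positional > session-id-derived > legacy rotation.
pick_recipient() {
  local positional=${1:-} session_id=${2:-master} domain=${3:-}
  if [ -n "${RECIPIENT:-}" ]; then
    printf '%s\n' "$RECIPIENT"
  elif [ -n "$positional" ]; then
    printf '%s\n' "$positional"
  elif [ "$session_id" != "master" ]; then
    # Deterministic: each distinct session id gets a distinct mailbox.
    printf '%s@%s\n' "$session_id" "$domain"
  else
    # Legacy fallback: rotate demo-1/demo-2 by epoch parity (mapping assumed).
    printf 'demo-%s@%s\n' "$(( $(date +%s) % 2 + 1 ))" "$domain"
  fi
}
```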
Verified locally:
eval "$(bash scripts/agentkeys-demo-show.sh --no-derive --export A master)"
echo "$OMNI_A $IDENTITY_OMNI_A $MASTER_WALLET_A $IDENTITY_TYPE_A $IDENTITY_VALUE_A"
→ all seven vars populated, identity values match what shasum -a 256 computes
bash scripts/agentkeys-init-email-demo.sh --session-id alice
→ Recipient: alice@bots.litentry.org (not demo-N)
→ identity_omni (email) = dbcb6acd... (visible BEFORE SendEmail)
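The eval-safety of `--export` rests on `printf %q`, a bash builtin format. A small sketch of one emitted assignment follows; the helper name `emit_export` is hypothetical and the real script's internals may differ.

```shell
# Emit one eval-able assignment of the form NAME_PREFIX=value.
# $1 = prefix (e.g. A), $2 = field name (e.g. OMNI), $3 = raw value.
# %q quotes the value so that eval reproduces it byte-for-byte.
emit_export() {
  printf '%s_%s=%q\n' "$2" "$1" "$3"
}
```

For example, `eval "$(emit_export A OMNI "$value")"` sets `OMNI_A` without interpreting spaces, `$`, or quotes inside the value.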
Why this is the fix and not a workaround: HKDF(K3, omni) is the
contractual signer derive — same omni in, same wallet out is the WHOLE
point of the deterministic-derive design. The bug was the demo's
recipient rotation, NOT the signer. Two operators with literally the
same email address WILL get the same wallet, by design. The fix
guarantees each --session-id maps to a distinct recipient so the §4
proof actually exercises two distinct actors.
Summary
Lands plan steps 0–9 of `docs/spec/plans/issue-74-dev-key-service-plan.md`:
- `docs/spec/signer-protocol.md`: v0 wire contract for the signer edge (request/response shapes, error envelope, versioned HKDF derivation byte, future TEE attestation handshake).
- `agentkeys-mock-server::dev_key_service`: HKDF-SHA256 + secp256k1 + EIP-191 signer, gated by `DEV_KEY_SERVICE_MASTER_SECRET`; 10 unit tests.
- `/dev/derive-address` + `/dev/sign-message` handlers wired into router; 503 `signer_disabled` when env unset; 8 integration tests.
- `scripts/setup-broker-host.sh` auto-generates the master secret into `/etc/agentkeys/dev-key-service.env` (mode 0600), wires it via `EnvironmentFile=` in the backend systemd unit. Idempotent (no `--upgrade` flag): preserves the secret on re-run, since rotating it would invalidate every previously-derived wallet.
- `agentkeys-daemon` main.rs adds `--init-email` / `--init-oauth2-google` / `--signer-url`, drives the email/OAuth2 → omni → derive → `/v1/wallet/link` → SIWE → EVM-session chain on first start; emits a structured `tracing::info!(target = "agentkeys.daemon.init", …)` audit row on success.
- `agentkeys-cli` cmd_init rewritten as `InitMode::{Email, Oauth2Google, ImportLegacyMock(test-only)}`. `--mock-token` flag hard-cut from the user-facing CLI surface per CEO-review §8 ("no deprecation runway, clean slate this PR"). All 9 `cli_tests.rs` call sites migrated to `InitMode::ImportLegacyMock`.
- `agentkeys whoami` (read-only; surfaces signer-derived wallet via `--signer-url --omni-account`).
- `docs/stage7-demo-and-verification.md` rewritten end-to-end for the new flow: drops `cast wallet sign`, drops operator-held private keys, adds the `agentkeys init --email` recommended path.
Shared plumbing in `agentkeys-core`:
- `signer_client`: typed `SignerClient` trait + `HttpSignerClient`, the daemon's swap-point abstraction.
- `init_flow`: broker email/OAuth2 → derive → link → SIWE chain helpers, used by both CLI and daemon.
`CLAUDE.md` adds a plan-completion policy (always complete every numbered plan step; mandatory done/not-done summary at PR end).
Pre-Stage-7 docs moved to `docs/archived/` (`operator-runbook` → `operator-runbook-pre-stage7.md`; `contradictions` → `contradictions-stage4-2026-04.md`; `field-name-translation`); inbound references in `dev-setup.md`, `stage7-wip.md`, `threat-model-key-custody.md`, and the CLI's runbook URL repointed to `operator-runbook-stage7.md`.
Verification
- … `cargo test --workspace`), 0 failing.
- … `agentkeys-mock-server` (port 18091, `DEV_KEY_SERVICE_MASTER_SECRET` set): two distinct omnis derive two distinct addresses; `agentkeys signer sign` returns the canonical 65-byte signature whose recovered address matches the derived address; the legacy 503 path returns the typed `SIGNER_DISABLED` error.
- `bash -n scripts/setup-broker-host.sh` syntax-check clean.
Reviewer notes
- `/v1/auth/exchange` is now zero-caller in-tree. Mint endpoints already verify session JWTs locally via `verify_session_jwt`, and this PR removes the last live caller path (`agentkeys init --mock-token`). The route + `validate_bearer_token` + `BROKER_BACKEND_URL` env vars stay only for backward-compat with any out-of-tree clients; a separate cleanup PR will delete them once external migration completes.
- `/etc/agentkeys/dev-key-service.env` is intentionally pinned across re-runs of `setup-broker-host.sh`. Issue #74 ("Replace dev_key_service with TEE worker for omni-anchored EVM keypair derivation") step 2 (TEE worker) defines the formal rotation runbook; today's `dev_key_service` has no rotation knob beyond `key_version` (currently `0x01`).
- … `git`, not `jj` (CLAUDE.md says jj). The session's jj working-copy pointer (`@`) was on a stale change line (`/healthz` fix) while git's HEAD was on the actual branch tip; rebasing `@` would have reset the working copy and lost the work. `jj git import` re-syncs after merge.
What did NOT land, and why
- … `ssh agentkey@$BROKER_HOST` → `bash scripts/setup-broker-host.sh --yes` → walk §16 of `docs/stage7-demo-and-verification.md`.
- … `init_flow` helpers via inline tests, `cmd_init` test-mode via CLI test suite) but doesn't have a hermetic integration test that exercises the full `email/request → status poll → derive → link → SIWE` chain. Adding one needs either an in-memory email/OAuth2 provider fixture or wiring up SES/Google mocks; left as follow-up.
Test plan
- `cargo test --workspace`: must show 386 passing, 0 failing
- `cargo clippy --workspace --tests`: no errors
- `bash scripts/setup-broker-host.sh --yes` round-trips cleanly; `journalctl -u agentkeys-backend` shows `dev_key_service ENABLED` after first run, preserves the secret on re-run
- `curl -sS -X POST http://127.0.0.1:8090/dev/derive-address -d '{"omni_account":"<64hex>"}'` returns `{"address":"0x…","key_version":1}`
- `curl -sS -X POST <broker>/v1/auth/wallet/start -d '{"address":"<derived>","chain_id":84532}'` followed by `agentkeys signer sign` and `/v1/auth/wallet/verify` mints an EVM session JWT
- … `docs/stage7-demo-and-verification.md` still passes
- `agentkeys init --email <addr> --broker-url … --signer-url …` saves an EVM session JWT end-to-end (interactive; operator clicks magic link)
🤖 Generated with Claude Code