Stage 7: pluggable broker live deploy + OIDC-only auto-provision (issue #64, #71 Option A) #73
Merged
…env-var module
Implement plan §5: single source of truth for every BROKER_* environment
variable name. Per user rule 11, no other module may declare a raw env-var
literal — all reads go through these constants.
- crates/agentkeys-broker-server/src/env.rs (new): const &str declarations
for all 51 env vars (Phase 0 + planned A/B/C/D/E + legacy aliases),
Group enum (Core/Oidc/SessionJwt/Audit/AuditEvm/Auth/AuthEmail/AuthOAuth2/
Limits/Legacy), all() registry returning (name, doc, group), print_table()
for the operator runbook auto-generator. 5 unit tests cover uniqueness,
non-empty docs, required-Phase-0 presence, table render row count, and
Group exhaustiveness.
- crates/agentkeys-broker-server/src/lib.rs: register pub mod env.
- crates/agentkeys-broker-server/src/config.rs: replace every raw BROKER_*
string literal with env::* constants. grep -E '"(BROKER_|DAEMON_|ACCOUNT_ID|REGION)' src/config.rs returns zero hits. Adds parse_int_env_with_default<T> helper to
collapse three near-duplicate parse blocks.
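A minimal sketch of the registry shape described above, assuming a reduced two-variant Group and three of the variable names that appear in this PR. The doc strings, the group assignments, and the `names_are_unique` and `read` helpers are illustrative and are not taken from env.rs.

```rust
use std::collections::HashSet;

/// Purpose grouping for the runbook table (only two of the variants named above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Group {
    Core,
    SessionJwt,
}

// The variable names appear in this PR; doc strings and groups are illustrative.
pub const BROKER_BACKEND_URL: &str = "BROKER_BACKEND_URL";
pub const BROKER_SESSION_KEYPAIR_PATH: &str = "BROKER_SESSION_KEYPAIR_PATH";
pub const BROKER_SESSION_JWT_TTL_SECONDS: &str = "BROKER_SESSION_JWT_TTL_SECONDS";

/// Registry rows: (name, doc, group).
const REGISTRY: &[(&str, &str, Group)] = &[
    (BROKER_BACKEND_URL, "Backend base URL used to validate legacy bearers", Group::Core),
    (BROKER_SESSION_KEYPAIR_PATH, "Path to the session signing keypair", Group::SessionJwt),
    (BROKER_SESSION_JWT_TTL_SECONDS, "Session JWT lifetime in seconds", Group::SessionJwt),
];

/// Single source of truth that other modules (and the runbook table) iterate.
pub fn all() -> &'static [(&'static str, &'static str, Group)] {
    REGISTRY
}

/// Hypothetical helper mirroring the "uniqueness" unit test.
pub fn names_are_unique() -> bool {
    let mut seen = HashSet::new();
    all().iter().all(|(name, _, _)| seen.insert(*name))
}

/// Call sites pass a constant, never a raw string literal.
pub fn read(name: &'static str) -> Option<String> {
    std::env::var(name).ok()
}
```

A caller would write `read(BROKER_BACKEND_URL)` rather than `std::env::var("BROKER_BACKEND_URL")`, which is what keeps the grep invariant on config.rs at zero hits.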
Plan home: docs/spec/plans/issue-64/{PLAN.md (mirror), DECISIONS.md,
AMBIGUITIES.md, V0.1-FOLLOWUPS.md, prd.json (PRD-driven ralph)}.
Acceptance criteria (US-001):
- env.rs exists with const &str for every plan §5 BROKER_* var ✓
- Group enum with required variants ✓
- all() returns slice of (name, doc, Group), all docs non-empty ✓
- src/config.rs: grep zero hits for raw BROKER_/DAEMON_/ACCOUNT_ID/REGION ✓
- cargo build -p agentkeys-broker-server succeeds ✓
- cargo test -p agentkeys-broker-server env:: 5/5 pass ✓
Refs: issue #64 plan §1 rule 11, §5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement plan §3 + §3.5: pluggable trait surface for the three layers
below the credential mint. No plug-in implementations yet (US-006
implements WalletSig, US-007 ClientSideKeystore, US-008 SqliteAnchor) —
this story lands the trait shapes, error types, and registry that the
later stories slot into.
- crates/agentkeys-broker-server/src/plugins/mod.rs (new): Readiness
enum (Ready/Degraded/Unready), PluginRegistry { auth: HashMap, wallet,
audit: Vec }, aggregate_readiness() → (overall, per-check) for the
/readyz JSON. Trait re-exports.
- crates/agentkeys-broker-server/src/plugins/auth.rs (new): UserAuthMethod
trait (name/ready/challenge/verify), VerifiedIdentity, ChallengeParams,
AuthChallenge, AuthResponse, IdentityType { Evm, Email, OAuth2{Google,
Github,Apple} } with stable canonical() strings (input to OmniAccount
derivation; renaming is breaking). AuthError enum.
- crates/agentkeys-broker-server/src/plugins/wallet.rs (new):
WalletProvisioner trait (name/ready/bind_address/lookup_by_omni_account),
WalletAddress newtype with parse() that normalizes 0x-prefixed hex to
lowercase + length check, WalletRole { Master, Daemon }, WalletBinding
struct. WalletError enum.
- crates/agentkeys-broker-server/src/plugins/audit.rs (new): AuditAnchor
trait (name/ready/anchor/verify), AuditRecord with record_hash for
cross-anchor dedup, AnchorReceipt, AuditPolicy { DualStrict,
SqlitePrimary, EvmPrimary } parser. AuditError enum.
- crates/agentkeys-broker-server/src/lib.rs: register pub mod plugins.
- crates/agentkeys-broker-server/Cargo.toml: feature-gate scaffold per
plan §3. default = [auth-wallet-sig, wallet-keystore, audit-sqlite].
Optional features for v0-testnet (auth-email-link, auth-oauth2-google,
audit-evm) and v1+ (auth-oauth2-github, auth-oauth2-apple, audit-solana).
External deps land in implementation stories (US-006: k256+sha3;
Phase A.1: lettre+aws-sdk-sesv2; Phase C: alloy-*).
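The WalletAddress newtype above can be sketched as follows. This assumes a 20-byte EVM address (40 hex characters after the prefix) and a lowercase `0x` prefix only; the error enum and its variant names are hypothetical.

```rust
/// Normalized wallet address: always "0x" followed by 40 lowercase hex characters.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct WalletAddress(String);

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum WalletParseError {
    MissingPrefix,
    BadLength(usize),
    NonHex,
}

impl WalletAddress {
    /// Validates the prefix, the length, and the hex alphabet, then lowercases.
    pub fn parse(input: &str) -> Result<Self, WalletParseError> {
        let hex = input
            .strip_prefix("0x")
            .ok_or(WalletParseError::MissingPrefix)?;
        if hex.len() != 40 {
            return Err(WalletParseError::BadLength(hex.len()));
        }
        if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
            return Err(WalletParseError::NonHex);
        }
        Ok(WalletAddress(format!("0x{}", hex.to_ascii_lowercase())))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

Normalizing at parse time means two spellings of the same address compare equal everywhere downstream, including as storage keys.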
Acceptance criteria (US-002):
- Readiness enum with Ready/Degraded/Unready ✓
- UserAuthMethod / WalletProvisioner / AuditAnchor traits ✓
- PluginRegistry struct + aggregate_readiness ✓
- Per-trait thiserror error enums (AuthError, WalletError, AuditError) ✓
- Cargo features: auth-wallet-sig, auth-email-link, auth-oauth2,
auth-oauth2-google, wallet-keystore, audit-sqlite, audit-evm, test-stub ✓
- cargo build with default features ✓
- cargo test plugins:: 8/8 pass ✓
- cargo clippy -D warnings clean ✓
Per-trait `ready()` MUST NOT default to Ready — implementations check
their own dependencies. Documented in trait doc comments. The first
implementations (US-006/007/008) demonstrate the pattern.
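A rough sketch of the "no default `ready()`" rule and the worst-status aggregation. It collapses the three plug-in traits into a single hypothetical `Plugin` trait, and reporting Unready for an empty registry is an assumption of this sketch rather than documented behavior.

```rust
/// Declaration order gives the severity ordering: Ready < Degraded < Unready.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Readiness {
    Ready,
    Degraded,
    Unready,
}

/// Stand-in for the three plug-in traits (hypothetical, for illustration).
pub trait Plugin {
    fn name(&self) -> &'static str;
    /// Deliberately has no default body: each implementation checks its own dependencies.
    fn ready(&self) -> Readiness;
}

/// Returns the worst status overall plus the per-check list.
pub fn aggregate_readiness(
    plugins: &[&dyn Plugin],
) -> (Readiness, Vec<(&'static str, Readiness)>) {
    let checks: Vec<(&'static str, Readiness)> =
        plugins.iter().map(|p| (p.name(), p.ready())).collect();
    // Assumption: an empty registry is reported as Unready rather than Ready.
    let overall = checks
        .iter()
        .map(|(_, r)| *r)
        .max()
        .unwrap_or(Readiness::Unready);
    (overall, checks)
}

/// Example implementation whose readiness reflects a real dependency.
pub struct DbBacked {
    pub writable: bool,
}

impl Plugin for DbBacked {
    fn name(&self) -> &'static str {
        "sqlite"
    }
    fn ready(&self) -> Readiness {
        if self.writable {
            Readiness::Ready
        } else {
            Readiness::Unready
        }
    }
}
```

Because the trait method has no default, forgetting to implement `ready()` is a compile error instead of a silently healthy plug-in.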
Refs: issue #64 plan §3, §3.5, §1 rule 8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…liteAnchor port
Bundles two stories that became coupled when the agentkeys-types::AgentIdentity
extension forced match-arm updates across four crates and the audit/ module
restructure required relocating both the trait file and the SqliteAnchor
implementation in the same change.
US-004 — OmniAccount derivation
- crates/agentkeys-broker-server/src/identity/{mod.rs,omni_account.rs} (new):
derive_omni_account(identity_type, identity_value) → SHA256(client_id ||
type || value) with hardcoded AGENTKEYS_CLIENT_ID = "agentkeys". Per port-
vs-greenfield "What we port — crypto primitives only", this matches the
dexs-backend hash shape verbatim but uses our own client_id, giving each
operator a sovereign identity namespace. derive_with_client_id(...) is
exposed for reproducing dexs reference vectors in tests.
- crates/agentkeys-types/src/lib.rs: AgentIdentity::OAuth2{provider, sub}
variant added (additive — every existing AgentIdentity consumer continues
to work unchanged for the four prior variants).
- Match-arm updates across consumers (Rust E0004 non-exhaustive errors
surfaced these — exactly the property we want from the type system):
- crates/agentkeys-core/src/mock_client.rs (open_auth_request +
session_recover): map OAuth2{provider,sub} → ("oauth2_<provider>", sub)
matching the broker's IdentityType::canonical() naming.
- crates/agentkeys-core/src/auth_request.rs: deterministic CBOR encoding
of OAuth2 — Map[("provider", Text), ("sub", Text)] with keys ASCII-
sorted so the canonical hash is stable.
- crates/agentkeys-cli/src/lib.rs: rich-error human-readable form
"oauth2_<provider>:<sub>".
- crates/agentkeys-mock-server/src/test_client.rs: same mapping as
mock_client (auth-request and session-recover paths).
- 9 identity:: unit tests cover: hex parse validation, derivation
determinism, identity-type namespace separation, identity-value
separation, client_id namespace separation (load-bearing — proves
agentkeys ≠ wildmeta for the same email), prod entry-point matches
hardcoded constant, lowercase-hex output guarantee.
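To illustrate why the canonical strings and the client_id matter, here is a sketch of the hash preimage only; SHA256 itself is omitted to keep the example dependency-free. The `"email"` string, the plain concatenation without separators, and the `derivation_preimage` helper are assumptions. Only `"evm"` and the `oauth2_<provider>` naming are stated in this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OAuth2Provider {
    Google,
    Github,
    Apple,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IdentityType {
    Evm,
    Email,
    OAuth2(OAuth2Provider),
}

impl IdentityType {
    /// Stable strings: they feed the OmniAccount hash, so renaming one is breaking.
    pub fn canonical(&self) -> &'static str {
        match self {
            IdentityType::Evm => "evm",
            IdentityType::Email => "email", // assumed spelling
            IdentityType::OAuth2(OAuth2Provider::Google) => "oauth2_google",
            IdentityType::OAuth2(OAuth2Provider::Github) => "oauth2_github",
            IdentityType::OAuth2(OAuth2Provider::Apple) => "oauth2_apple",
        }
    }
}

/// Bytes that would be fed to SHA256: client_id || type || value.
/// Assumption: plain concatenation with no separators.
pub fn derivation_preimage(client_id: &str, ty: IdentityType, value: &str) -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(client_id.as_bytes());
    bytes.extend_from_slice(ty.canonical().as_bytes());
    bytes.extend_from_slice(value.as_bytes());
    bytes
}
```

Since the client_id is the first input, the same email under two different client_ids produces two different preimages and therefore two unrelated OmniAccounts.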
US-008 — SqliteAnchor port to AuditAnchor trait
- crates/agentkeys-broker-server/src/plugins/audit/{mod.rs,sqlite.rs}
restructured: trait file `audit.rs` merged into `audit/mod.rs` so the
feature-gated `audit-sqlite` submodule can live alongside it. (Previous
layout had `audit.rs` + `audit/mod.rs` which Rust E0761'd.)
- src/plugins/audit/sqlite.rs (new): SqliteAnchor implementing AuditAnchor.
Schema is the new plugin_mint_log table with the canonical AuditRecord
columns + a status column (Phase 0 writes 'confirmed' directly; Phase C
introduces the pending → confirmed | quarantined lifecycle). Indexes on
minted_at, omni_account, record_hash, status. WAL+FULL pragma preserved
from the legacy crate::audit::AuditLog.
- Readiness::Ready when DB writable; Unready otherwise.
- 8 plugins::audit:: tests cover: anchor round-trip, verify NotFound,
record_hash tampering detection, wrong-anchor receipt rejection, ready
reports Ready, name() stability + AuditPolicy parse + AuditRecord round
trip.
Acceptance criteria (US-004):
- src/identity/omni_account.rs derive_omni_account(...) ✓
- AGENTKEYS_CLIENT_ID = "agentkeys" pinned ✓
- agentkeys-types::AgentIdentity::OAuth2{provider, sub} added ✓
- Tests cover canonical hash for each identity type ✓
- cargo test identity:: 9/9 pass ✓
Acceptance criteria (US-008):
- src/plugins/audit/sqlite.rs implements AuditAnchor ✓
- plugin_mint_log table with canonical columns + indexes ✓
- WAL+FULL pragma preserved ✓
- verify() detects record_hash tampering ✓
- Readiness Ready when writable ✓
- cargo test plugins::audit:: 8/8 pass ✓
Note: legacy crate::audit::AuditLog (the existing src/audit.rs) is left
in place for now — US-011 migrates the mint handler to the new trait and
drops the legacy module then. Carrying both during the transition keeps
existing /v1/mint-aws-creds working.
Refs: issue #64 plan §3.5 (OmniAccount), §3 (AuditAnchor trait), §Phase 0
deliverables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h purpose tagging
Implement plan §3.5.6: two distinct ES256 keypairs for two roles:
- oidc keypair (existing): signs JWTs that AWS STS verifies via JWKS.
- session keypair (NEW): signs broker-internal session JWTs.
Closes Codex / eng-review #7 footgun: an operator pointing
BROKER_SESSION_KEYPAIR_PATH at the OIDC keypair file would have silently
used the wrong key (same kid, same crypto), letting session tokens pass as
IAM federation tokens. Defense: on-disk JSON now carries a "purpose" field;
load-time validation refuses to read a keypair whose purpose does not match
the slot.
- crates/agentkeys-broker-server/src/jwt/{mod,session,issue,verify}.rs
(new): KeypairPurpose enum (Oidc | Session) with stable kebab-case
canonical() and kid_prefix(); SessionKeypair (mirror of OidcKeypair,
purpose-tagged on disk, kid prefix `ak-session-`); mint_session_jwt() with
the canonical session-JWT claim shape
(iss/sub/aud=agentkeys:broker/exp/iat/jti +
agentkeys.{omni_account,wallet_address,identity_type,identity_value});
verify_session_jwt() that pins audience + issuer + kid header.
- crates/agentkeys-broker-server/src/oidc.rs:
- PersistedKeypair: add `purpose` field with #[serde(default)] mapping to
KeypairPurpose::Oidc so pre-Stage-7 keypair files (no purpose field)
continue to load as oidc. New keypairs always include the field.
- load() refuses any keypair whose purpose ≠ Oidc.
- generate_and_persist() writes purpose=oidc.
- rand_core_compat → pub(crate) rand_compat (so SessionKeypair can reuse
the rand_core 0.6 → OS RNG bridge).
- set_owner_only → pub(crate) set_owner_only_inner (same reason).
- crates/agentkeys-broker-server/src/lib.rs: register pub mod jwt.
Acceptance criteria (US-005):
- src/jwt/mod.rs: KeypairPurpose with Oidc + Session ✓
- On-disk JSON includes "purpose" field ✓
- SessionKeypair::load refuses purpose=oidc keypair ✓
- SessionKeypair::load refuses untagged JSON ✓
- OidcKeypair::load refuses purpose=session keypair ✓
- Session JWT mint+verify round trip ✓
- verify rejects wrong audience, wrong issuer, expired ✓
- session keypair kid prefix `ak-session-`; oidc kid format unchanged ✓
- cargo test jwt:: 10/10 pass ✓
- cargo build green ✓
env.rs already has BROKER_SESSION_KEYPAIR_PATH and
BROKER_SESSION_JWT_TTL_SECONDS (landed in US-001). Wiring config.rs +
boot.rs to actually load the session keypair lands in US-003 (tiered
refuse-to-boot).
Refs: issue #64 plan §3.5.6, codex review finding #7, eng review
#code-structure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
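The load-time purpose check can be sketched without any JSON parsing. Here `tag` stands for the value of the on-disk "purpose" field (None when the field is absent), and `check_purpose` is a hypothetical helper, not the broker's actual load() code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeypairPurpose {
    Oidc,
    Session,
}

impl KeypairPurpose {
    /// Stable on-disk spelling.
    pub fn canonical(self) -> &'static str {
        match self {
            KeypairPurpose::Oidc => "oidc",
            KeypairPurpose::Session => "session",
        }
    }

    pub fn parse(tag: &str) -> Option<Self> {
        match tag {
            "oidc" => Some(KeypairPurpose::Oidc),
            "session" => Some(KeypairPurpose::Session),
            _ => None,
        }
    }
}

/// Refuses a keypair file whose purpose does not match the slot it is loaded into.
pub fn check_purpose(slot: KeypairPurpose, tag: Option<&str>) -> Result<(), String> {
    let found = match tag {
        Some(t) => KeypairPurpose::parse(t)
            .ok_or_else(|| format!("unknown keypair purpose {:?}", t))?,
        // Untagged files predate purpose tagging and are read as oidc,
        // so the session slot rejects them through the mismatch below.
        None => KeypairPurpose::Oidc,
    };
    if found == slot {
        Ok(())
    } else {
        Err(format!(
            "keypair purpose is {}, but the slot expects {}",
            found.canonical(),
            slot.canonical()
        ))
    }
}
```

Defaulting an absent tag to oidc gives the two behaviors listed in the acceptance criteria with one rule: legacy OIDC files keep loading, and the session slot never accepts an untagged file.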
…sioner + WalletStore
Implement plan §3.5 + §Phase 0 wallet layer: the MetaMask model. The
broker stores ONLY (omni_account, address, role, parent_address,
created_at) — the user holds the seed in their OS keychain on the
daemon side. The broker has no key material it could leak.
Storage layer:
- crates/agentkeys-broker-server/src/storage/{mod.rs, wallets.rs} (new):
WalletStore with composite-PK schema (omni_account, address) so a user
can have multiple wallets and re-binding the same address is idempotent.
WAL+NORMAL for throughput (audit log gets FULL elsewhere).
bind() detects role mismatch and parent mismatch on re-bind — a daemon
switching masters or an address flipping role would be silent data
corruption otherwise.
list_for_omni_account() returns every wallet bound to the OmniAccount.
writable() probe used by the plugin's ready().
Plugin layer:
- crates/agentkeys-broker-server/src/plugins/wallet/{mod.rs,keystore.rs}:
module restructure from sibling-file `wallet.rs` to `wallet/mod.rs +
wallet/keystore.rs` (same E0761 fix as US-008's audit module).
ClientSideKeystoreProvisioner implements WalletProvisioner. name() =
"client_keystore". ready() reflects WalletStore::writable() (NOT a
hardcoded Ready, per plan §1 rule 5). bind_address() stamps current
unix-seconds and delegates to WalletStore::bind. lookup_by_omni_account
delegates to WalletStore::list_for_omni_account.
- crates/agentkeys-broker-server/src/lib.rs: register pub mod storage.
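The bind() rules above (composite key, idempotent re-bind, mismatch rejection) can be illustrated with an in-memory stand-in. The real store is SQLite-backed; `InMemoryWalletStore`, `BindError`, and the sorted lookup order are assumptions made for this sketch.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalletRole {
    Master,
    Daemon,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WalletBinding {
    pub omni_account: String,
    pub address: String,
    pub role: WalletRole,
    pub parent_address: Option<String>,
    pub created_at: u64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum BindError {
    RoleMismatch,
    ParentMismatch,
}

/// In-memory stand-in keyed by the composite (omni_account, address) pair.
#[derive(Default)]
pub struct InMemoryWalletStore {
    rows: HashMap<(String, String), WalletBinding>,
}

impl InMemoryWalletStore {
    /// Inserts a new row, or returns the existing row when role and parent match.
    pub fn bind(&mut self, candidate: WalletBinding) -> Result<WalletBinding, BindError> {
        let key = (candidate.omni_account.clone(), candidate.address.clone());
        if let Some(existing) = self.rows.get(&key) {
            if existing.role != candidate.role {
                return Err(BindError::RoleMismatch);
            }
            if existing.parent_address != candidate.parent_address {
                return Err(BindError::ParentMismatch);
            }
            // Idempotent re-bind: the original created_at is preserved.
            return Ok(existing.clone());
        }
        self.rows.insert(key, candidate.clone());
        Ok(candidate)
    }

    /// Every wallet bound to the OmniAccount, sorted by address for stable output.
    pub fn list_for_omni_account(&self, omni_account: &str) -> Vec<WalletBinding> {
        let mut out: Vec<WalletBinding> = self
            .rows
            .values()
            .filter(|b| b.omni_account == omni_account)
            .cloned()
            .collect();
        out.sort_by(|a, b| a.address.cmp(&b.address));
        out
    }
}
```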
Acceptance criteria (US-007):
- src/plugins/wallet/keystore.rs implements WalletProvisioner ✓
- Storage table wallets(omni_account, address, role, parent_address,
created_at) with composite PK and role CHECK constraint ✓
- bind(): inserts row; idempotent (same role + parent → returns existing) ✓
- bind() rejects role mismatch ✓
- lookup_by_omni_account returns all bindings ✓
- ready() Ready when DB writable, Unready otherwise ✓
- 9 plugins::wallet:: tests pass (3 type tests + 6 keystore behavior
tests covering bind+lookup, idempotent re-bind, rejected role flip,
ready, name, multi-binding lookup) ✓
- cargo build green ✓
Refs: issue #64 plan §3.5 (wallet layer), §Phase 0 deliverables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update progress.txt with full Phase 0 session log (6 of 16 stories
complete: US-001/002/004/005/007/008). Update prd.json passes flags +
commit refs. Append commit-log table to DECISIONS.md.
Phase 0 remaining (10 stories) for next ralph iteration:
- US-003 boot.rs + main.rs wiring
- US-006 WalletSig SIWE (largest remaining; needs k256+sha3 deps)
- US-009/010/011 auth + mint endpoints
- US-012 broker_status /readyz aggregator
- US-013 invariant load-bearing test (all 6 cases)
- US-014 smoke + done.sh
- US-015 operator runbook
- US-016 codex round 1
Suggested next-iteration commit order: 6 → 3 → 9/10/11 → 12 → 13 → 14 →
15 → 16.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…json passes:true + commit refs for US-001, US-002, US-004, US-005,
US-007, US-008. Remaining 10 Phase 0 stories still passes:false.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nceStore
Phase 0 wallet-sig auth method per plan §3.5.1: SIWE-wrapped EIP-191.
Closes Codex P0 #2 (raw EIP-191 was replayable across apps; SIWE binds
domain).
Storage:
- crates/agentkeys-broker-server/src/storage/auth_nonces.rs (new):
AuthNonceStore with single-use semantics. issue() inserts, consume() is
race-safe via WHERE consumed_at IS NULL conditional UPDATE, purge_expired()
janitors old rows. ConsumeOutcome enum collapses "never existed" and
"already consumed" into NotFoundOrConsumed so an attacker cannot probe the
nonce table; Expired is a separate variant so the broker can surface a
"your sign-in expired" message. 7/7 tests pass.
Plugin:
- crates/agentkeys-broker-server/src/plugins/auth/{mod.rs ⟵ ex auth.rs,
wallet_sig.rs} (restructure + new): Same E0761 module-conflict fix as
US-007/008. SiweWalletAuth implements UserAuthMethod. challenge() builds an
EIP-4361 SIWE message with the broker's domain, fresh CSPRNG nonce,
issued_at, expiration_time (issued_at + 45min), URI, chain_id, resources.
verify() looks up the pending challenge, atomically consumes the nonce,
runs k256 ecrecover via the EIP-191 envelope
(`\x19Ethereum Signed Message:\n<len><msg>` → keccak256 →
recover_from_prehash), and asserts the recovered address matches the SIWE
message's claimed address. ecrecover_address() handles v ∈ {0,1,27,28}
(k256 RecoveryId requires {0,1}, so 27/28 are normalized).
Per-call security:
- SIWE domain field bound to broker's host (replay across apps blocked)
- Nonce single-use enforced via AuthNonceStore (replay across requests
blocked)
- 45-min issued_at/expiration window (replay across long timeframes
blocked)
- k256 0.13 enforces canonical signatures (low-s) by default
- Chain-ID bound into the SIWE message (replay across chains blocked)
Pending challenges live in tokio::sync::Mutex<HashMap> keyed by request_id;
removed on first verify() attempt to prevent in-memory replay even if the
on-disk nonce check is flaky. Multi-process deployments would move this to
SQLite (out of scope for v0).
Custom ISO8601 formatter (no chrono dep). Howard-Hinnant civil_from_days
valid 1970+. Tests pin format shape.
Embeds the canonical IdentityType enum + UserAuthMethod trait + supporting
types (VerifiedIdentity, ChallengeParams, AuthChallenge, AuthResponse,
AuthError) in plugins/auth/mod.rs, preserved verbatim from the previous
plugins/auth.rs file with feature-gated re-export of SiweWalletAuth.
Cargo:
- agentkeys-broker-server/Cargo.toml: k256 + sha3 added as optional deps
gated by auth-wallet-sig feature. Default features compile them in.
- storage/mod.rs: re-export AuthNonceStore + ConsumeOutcome.
Acceptance criteria (US-006):
- src/plugins/auth/wallet_sig.rs implements UserAuthMethod for SiweWallet ✓
- challenge() generates SIWE with
domain/URI/version/chain_id/nonce/iat/exp/resources ✓
- Nonce stored in src/storage/auth_nonces.rs with UNIQUE single-use UPDATE ✓
- verify() asserts domain, chain_id, expiration; ecrecover-derived address
matches ✓
- VerifiedIdentity returns IdentityType::Evm + identity_value ✓
- 11 plugins::auth::wallet_sig + 7 storage::auth_nonces tests pass ✓
- happy path, expired (Expired), replayed nonce (NotFoundOrConsumed),
malformed signature (InvalidRequest), unknown request_id (Unauthorized),
duplicate-nonce-issue (rejected), purge_expired correctness ✓
Refs: issue #64 plan §3.5.1, codex P0 #2 (SIWE adopted), §Phase 0
deliverables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
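The single-use nonce semantics can be sketched with an in-memory map. The real store uses a conditional SQLite UPDATE; the `Consumed` variant name, the `now >= expires_at` boundary, and leaving an expired row unconsumed until purge are assumptions of this sketch.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum ConsumeOutcome {
    Consumed,
    Expired,
    /// "Never existed" and "already consumed" are deliberately indistinguishable.
    NotFoundOrConsumed,
}

struct NonceRow {
    expires_at: u64,
    consumed_at: Option<u64>,
}

#[derive(Default)]
pub struct InMemoryNonceStore {
    rows: HashMap<String, NonceRow>,
}

impl InMemoryNonceStore {
    /// Returns false when the nonce already exists (duplicate issue is rejected).
    pub fn issue(&mut self, nonce: &str, expires_at: u64) -> bool {
        if self.rows.contains_key(nonce) {
            return false;
        }
        self.rows.insert(
            nonce.to_string(),
            NonceRow { expires_at, consumed_at: None },
        );
        true
    }

    /// Mirrors `UPDATE ... WHERE consumed_at IS NULL`: only the first caller wins.
    pub fn consume(&mut self, nonce: &str, now: u64) -> ConsumeOutcome {
        match self.rows.get_mut(nonce) {
            Some(row) if row.consumed_at.is_none() => {
                if now >= row.expires_at {
                    return ConsumeOutcome::Expired;
                }
                row.consumed_at = Some(now);
                ConsumeOutcome::Consumed
            }
            _ => ConsumeOutcome::NotFoundOrConsumed,
        }
    }

    /// Drops expired rows and returns how many were removed.
    pub fn purge_expired(&mut self, now: u64) -> usize {
        let before = self.rows.len();
        self.rows.retain(|_, row| row.expires_at > now);
        before - self.rows.len()
    }
}
```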
… after US-006
Mark US-006 passes:true with commit ref 51a5191. Append commit-log row in
DECISIONS.md. List remaining 9 Phase 0 stories in priority order.
Phase 0 status: 7 of 16 stories complete. ~71 unit tests passing.
Foundation locked: env vars centralized, plugin traits + Readiness +
PluginRegistry, OmniAccount derivation, dual ES256 keypairs with purpose
tagging, ClientSideKeystoreProvisioner + WalletStore, SqliteAnchor port,
SiweWalletAuth + AuthNonceStore (single-use SIWE-wrapped EIP-191).
Next priority: US-003 (boot.rs wiring) → US-009/010/011 (endpoints) →
US-012 (broker_status) → US-013 (invariant test) → US-014/015 (smoke +
runbook) → US-016 (codex round 1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… plugin-registry wiring
Implement plan §6 tiered refuse-to-boot. Closes Codex P1 #6 (transient
external dependencies must not brick startup):
Tier 1 (synchronous, before listener bind):
- All required env vars present + parseable + types in declared bounds.
- BROKER_OIDC_ISSUER must be https:// in non-dev mode (BROKER_DEV_MODE=true
relaxes; logged loudly).
- OIDC keypair file MUST exist + parse + carry purpose=oidc tag (refuses
purpose=session).
- Session keypair file MUST exist + parse + carry purpose=session tag (no
migration window).
- SQLite migrations run cleanly via AuthNonceStore::open +
WalletStore::open + SqliteAnchor::open. Each CREATE TABLE IF NOT EXISTS is
the v0 migration.
- BROKER_AUTH_METHODS / BROKER_WALLET_PROVISIONER / BROKER_AUDIT_ANCHORS
resolve against the plug-ins compiled into the binary (every name must map
to an enabled feature; unknown names → boot fail with anchor
`auth-method-not-compiled` etc.).
- BROKER_AUDIT_POLICY parses to {dual_strict, sqlite_primary, evm_primary}.
- Failure: exit code 1 with single-line
`BOOT_FAIL: <var>=<value>: <reason>; see runbook §<anchor>`.
Tier 2 (async, after listener bound):
- Backend `/healthz` reachability probe loops every 15s until success;
flips state.tier2.backend_reachable.
- /healthz returns 200 immediately (liveness); /readyz aggregates Tier-2
atomic flags + plugin Readiness (US-012 lands the aggregator handler; for
now /readyz still uses the legacy flat probe pre-broker_status migration).
- BROKER_REFUSE_TO_BOOT_STRICT=true collapses Tier-2 backend probe to a
hard fail (process exits if backend not reachable).
- SES + EVM probes deferred to Phase A.1 + Phase C respectively, behind
their feature gates. The Tier2State struct already carries the AtomicBool
fields so adding probes is one-line each.
Files:
- crates/agentkeys-broker-server/src/boot.rs (new): run_tier1() returns
BootArtifacts (registry + keypairs + stores + audit_policy).
build_registry() constructs PluginRegistry from BROKER_AUTH_METHODS /
BROKER_WALLET_PROVISIONER / BROKER_AUDIT_ANCHORS.
Tier2Profile::from_config() probes which Tier-2 checks are enabled. 4 unit
tests cover https-only refuse, missing keypair refuse, url_host extraction,
Tier2Profile detection.
- crates/agentkeys-broker-server/src/state.rs (extended): AppState now
carries session_keypair, registry, audit_policy, wallet_store, nonce_store,
tier2 (Arc<Tier2State> with 4 AtomicBool fields). Legacy `audit: AuditLog`
preserved through US-011.
- crates/agentkeys-broker-server/src/main.rs (rewritten): calls run_tier1()
→ BootArtifacts before STS check. spawn_tier2_probes() spawns the backend
reachability probe with 15s retry; strict mode exits the process on first
miss.
- crates/agentkeys-broker-server/src/lib.rs: pub mod boot.
- crates/agentkeys-broker-server/tests/{oidc_flow,mint_flow}.rs: stub the
new AppState fields with in-memory stores + fresh session keypair so the
legacy backend-bearer-mint integration tests continue to pass unchanged.
Acceptance criteria (US-003):
- src/boot.rs with run_tier1() (sync) + Tier2Profile::from_config() (Tier-2
spawn) ✓
- Tier-1 validates env vars present + paths readable + OIDC https in
non-dev ✓
- Plugin registry validates: every name in BROKER_AUTH_METHODS / etc.
resolves ✓
- Tier-1 runs SQLite migrations cleanly ✓
- Keypair load: refuse-to-boot if path absent or purpose tag mismatch ✓
- Tier-2 reachability checks marked async ✓
- BOOT_FAIL message format with runbook anchor ✓
- 4 boot:: tests pass ✓
- Full broker test suite 94 tests pass (79 lib + 9 mint_flow + 6
oidc_flow) ✓
- cargo build green ✓
Refs: issue #64 plan §6 (tiered refuse-to-boot), §3 (PluginRegistry),
§Phase 0 deliverables. Closes codex review finding P1 #6 (refuse-to-boot vs
Unready).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
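A small sketch of the Tier-1 plug-in name check and the single-line BOOT_FAIL format. The comma-separated encoding of BROKER_AUTH_METHODS, the `BootFail` struct, the `resolve_auth_methods` helper, and the reason wording are assumptions. The message template and the `auth-method-not-compiled` anchor come from the commit text above.

```rust
use std::fmt;

/// Hypothetical carrier for one Tier-1 failure.
#[derive(Debug)]
pub struct BootFail {
    pub var: &'static str,
    pub value: String,
    pub reason: String,
    pub anchor: &'static str,
}

impl fmt::Display for BootFail {
    /// Renders `BOOT_FAIL: <var>=<value>: <reason>; see runbook §<anchor>` on one line.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "BOOT_FAIL: {}={}: {}; see runbook §{}",
            self.var, self.value, self.reason, self.anchor
        )
    }
}

/// Resolves each configured name against the plug-ins compiled into the binary.
/// Assumption: the env var holds a comma-separated list.
pub fn resolve_auth_methods(raw: &str, compiled: &[&str]) -> Result<Vec<String>, BootFail> {
    let mut resolved = Vec::new();
    for name in raw.split(',').map(str::trim).filter(|n| !n.is_empty()) {
        if !compiled.contains(&name) {
            return Err(BootFail {
                var: "BROKER_AUTH_METHODS",
                value: raw.to_string(),
                reason: format!("`{}` is not compiled into this binary", name),
                anchor: "auth-method-not-compiled",
            });
        }
        resolved.push(name.to_string());
    }
    Ok(resolved)
}
```

A caller in main would print the error with `eprintln!("{}", err)` and exit with code 1, so the operator gets one greppable line that points at a runbook heading.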
…ggregator
Per plan §7 + Designer review #status-shape: /readyz now aggregates
PluginRegistry::aggregate_readiness() across every loaded plug-in PLUS
the four Tier-2 reachability AtomicBool flags (set asynchronously by
spawn_tier2_probes in main.rs).
Behavior:
- 200 with empty body when every plug-in Ready + every relevant Tier-2
flag set. Operators tailing curl see no noise on the happy path.
- 200 with `{"status":"degraded","degraded":true,"checks":[...],
"ready":[...]}` when any plug-in reports Degraded. Body lists every
degraded check with `name`, `status`, `reason`, and a `docs` URL
anchor pointing into the operator runbook (Designer review: pager-
friendly).
- 503 with `{"status":"unready",...}` when any plug-in is Unready or
any relevant Tier-2 flag is still false.
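The three outcomes above reduce to a small decision function. This sketch assumes a Tier-2 flag that is still false has already been folded in as an Unready check; the `Check` struct and `readyz_outcome` name are hypothetical, and the JSON body construction is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckStatus {
    Ready,
    Degraded,
    Unready,
}

pub struct Check {
    pub name: &'static str,
    pub status: CheckStatus,
}

/// Returns (HTTP status, value of the "status" field). None means an empty body.
pub fn readyz_outcome(checks: &[Check]) -> (u16, Option<&'static str>) {
    if checks.iter().any(|c| c.status == CheckStatus::Unready) {
        (503, Some("unready"))
    } else if checks.iter().any(|c| c.status == CheckStatus::Degraded) {
        (200, Some("degraded"))
    } else {
        (200, None)
    }
}
```

Unready is tested first, so a mix of Degraded and Unready checks still yields 503.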
Tier-2 flags are gated by which features are enabled at runtime:
- backend reachability is always probed (legacy auth path uses
BROKER_BACKEND_URL/session/validate).
- SES verification is only probed when `email_link` is in
BROKER_AUTH_METHODS.
- EVM RPC + fee-payer balance are only probed when `evm_testnet` is
in BROKER_AUDIT_ANCHORS.
Files:
- crates/agentkeys-broker-server/src/handlers/broker_status.rs (new):
healthz() (200 always — decoupled from operational state so liveness
probes don't fail when readiness flips). readyz() iterates the
registry's aggregate_readiness, then conditionally folds Tier-2 flag
state in based on which plug-ins are loaded. Per-check JSON shape:
{name, status, reason|detail, docs}.
- crates/agentkeys-broker-server/src/handlers/mod.rs: pub mod broker_status.
- crates/agentkeys-broker-server/src/lib.rs: route /healthz +
/readyz to handlers::broker_status::{healthz, readyz}. Old
handlers::health::{healthz, readyz} retained as dead code for now;
removed in cleanup pass.
- crates/agentkeys-broker-server/tests/mint_flow.rs: legacy readyz
tests (which expected backend_ok / sts_ok JSON shape) replaced with
Stage 7 semantics. Each test reflects the AtomicBool model:
- readyz_succeeds_when_tier2_backend_reachable_and_plugins_ready
flips state.tier2.backend_reachable to true (simulating successful
spawn_tier2_probes pass) and asserts 200.
- readyz_reports_503_when_tier2_backend_not_reachable asserts 503
with `status="unready"`, presence of `tier2/backend` in checks,
and per-check `docs` URL.
- readyz_503_remains_when_dead_backend_url_configured.
Acceptance criteria (US-012):
- src/handlers/broker_status.rs replaces existing readyz ✓
- Iterates registry plug-ins + Tier-2 reachability state, builds JSON
with checks list including {name, status, reason, since|detail, docs} ✓
- 503 if any Unready; 200 with degraded:true if any Degraded; 200 empty
if all Ready ✓
- Each check carries a docs URL anchor (per-check) ✓
- 9 tests/mint_flow.rs tests pass (3 readyz cases) ✓
- 6 tests/oidc_flow.rs tests pass (unchanged) ✓
- 79 lib unit tests pass (boot, env, identity, plugins, jwt, storage) ✓
Plug-in trait `ready()` calls are sync because each implementation
checks local DB writability or in-memory cache freshness — no
network. Tier-2 reachability is the async path; it lives in main.rs's
spawn_tier2_probes (US-003) and only flips atomics, not Readiness.
Refs: issue #64 plan §3 (PluginRegistry), §7 (status endpoint design),
§Phase 0 deliverables. Closes Designer review #status-shape and
#observability concerns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n prd.json
Phase 0 status: 9 of 16 stories complete. ~94 tests passing.
Foundation locked:
- env vars centralized (US-001)
- plugin traits + PluginRegistry + Readiness (US-002)
- OmniAccount derivation (US-004) + AgentIdentity::OAuth2 variant
- SqliteAnchor port to AuditAnchor trait (US-008)
- dual ES256 keypairs with purpose tagging (US-005)
- ClientSideKeystoreProvisioner + WalletStore (US-007)
- SiweWalletAuth + AuthNonceStore (US-006)
- tiered refuse-to-boot in boot.rs + main.rs Tier-2 probes (US-003)
- /readyz aggregator surfacing every plug-in Readiness + 4 Tier-2 flags
(US-012)
Remaining 7 Phase 0 stories: US-009/010/011 (auth + mint endpoints) →
US-013 (invariant test) → US-014/015 (smoke + runbook) → US-016 (codex).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dpoints + auth/exchange shim
Stage 7 §3.5.1 + §3.5.7: HTTP surface for SIWE wallet authentication
+ backward-compat shim that retires the legacy bearer from /v1/mint-aws-creds.
US-009 — POST /v1/auth/wallet/{start,verify}
- handlers/auth/wallet_start.rs: extracts address+chain_id from body,
delegates to PluginRegistry.auth["wallet_sig"].challenge(), returns
request_id + siwe_message + nonce + expires_at_iso. Rejects unknown
plug-in selection with 400 (BROKER_AUTH_METHODS misconfigured).
- handlers/auth/wallet_verify.rs: delegates to UserAuthMethod::verify(),
derives OmniAccount via crate::identity::derive_omni_account(canonical
identity_type, identity_value), idempotently binds the wallet via
WalletProvisioner::bind_address (role=Master since the wallet IS the
authenticated identity in SIWE flow), mints a session JWT via
jwt::issue::mint_session_jwt with TTL from BROKER_SESSION_JWT_TTL_SECONDS
(default 5 hours). Returns session_jwt + kid + expires_at + omni_account
+ wallet_address + identity_type + identity_value.
US-010 — POST /v1/auth/exchange (closes Codex P0 #14)
- handlers/auth/exchange.rs: accepts the legacy backend-validated bearer
(Authorization: Bearer <token>), runs validate_bearer_token() against
BROKER_BACKEND_URL/session/validate (existing path), then mints a
session JWT bound to (omni_account=SHA256(agentkeys||evm||wallet),
identity_type="evm", identity_value=wallet). Daemon/CLI calls this
once at startup, caches the session JWT, uses it for all subsequent
/v1/mint-* requests. Removed at v1.0 along with the legacy bearer.
No dual-accept on the mint endpoint after US-011 lands.
Plumbing:
- handlers/auth/mod.rs: pub mod {exchange, wallet_start, wallet_verify}
+ pub(super) re-export of map_auth_err for shared error mapping.
- handlers/mod.rs: pub mod auth.
- lib.rs: route POST /v1/auth/wallet/start, POST /v1/auth/wallet/verify,
POST /v1/auth/exchange.
- oidc.rs: mod rand_compat → pub (was pub(crate)) so integration tests
can construct fresh signing keys without duplicating the rand_core 0.6
bridge.
Tests:
- tests/auth_wallet_flow.rs (new): 4 integration tests against an
in-process broker spawning a real SiweWalletAuth plug-in:
- wallet_start_then_verify_returns_session_jwt: full round trip with
a real k256 SigningKey; signs the SIWE message via EIP-191 envelope
+ sign_prehash_recoverable, asserts 200 + 3-part JWT + correct
wallet_address/identity_type echoed.
- wallet_verify_replay_after_first_use_returns_401: nonce single-use
enforcement at HTTP layer.
- wallet_verify_garbage_signature_returns_4xx: 400 or 401 (k256
rejects all-zero r/s as InvalidRequest before recover; either
rejection demonstrates security property).
- wallet_start_rejects_malformed_address: 400 on bad address shape.
Acceptance criteria (US-009):
- handlers/auth/{wallet_start,wallet_verify}.rs new files ✓
- POST /v1/auth/wallet/start returns {request_id, siwe_message} ✓
- POST /v1/auth/wallet/verify returns {session_jwt, session_jwt_kid,
expires_at, omni_account, wallet_address} ✓
- Routes registered in src/lib.rs ✓
- tests/auth_wallet_flow.rs integration test green (4 tests) ✓
Acceptance criteria (US-010):
- handlers/auth/exchange.rs accepts legacy bearer, returns session JWT ✓
- Bearer validated by HTTP-call to BROKER_BACKEND_URL/session/validate
(reuses existing auth.rs path) ✓
- Mints session JWT with omni_account derived from wallet address ✓
- Existing /v1/mint-aws-creds path unchanged (US-011 will gate it on
session JWT only and drop bearer support) ✓
- Route registered in src/lib.rs ✓
Refs: issue #64 plan §3.5.1 (wallet-sig wire format), §3.5.7 (backward-
compat shim), codex review P0 #14 closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h + operator runbook draft
US-014 — harness/stage-7-issue-64-{phase0-smoke, done}.sh
- stage-7-issue-64-phase0-smoke.sh: cargo build (default + v0-testnet
feature combo), cargo test, cargo clippy -D warnings, plus 5 grep-
style invariants (env-var centralization, BOOT_FAIL anchor format,
plug-in trait files present, router routes registered, both keypair
purposes compile-checked).
- stage-7-issue-64-done.sh: per-phase orchestration. Today wires only
Phase 0 (smoke + runbook drift check + prd.json passes count). Phases
A.1, A.2, B, C, D append their assertions when each ships.
- Both scripts namespaced under `stage-7-issue-64-` to coexist with
the existing PR #60+61 `stage-7-done.sh`.
US-015 — docs/operator-runbook-stage7.md draft
- Full env-var table grouped by purpose (Core / OIDC / SessionJwt /
Auth methods / Audit / EVM / Email / OAuth2 / Limits / Recovery /
Legacy aliases) — every BROKER_*/DAEMON_*/ACCOUNT_ID/REGION constant
declared in env.rs is present. Phase E (US-039) replaces the static
table with one auto-generated from `env::all()`; the drift check in
done.sh today emits a non-fatal warning.
- Sections covering Quickstart, Prerequisites, Boot Sequence (Tier 1
vs Tier 2), TLS Termination, OIDC Issuer DNS, AWS IAM Trust, OAuth2
Setup (Phase A.2 stub), Smoke Validation, Rollback (Phase E stub),
Troubleshooting (one anchor per BOOT_FAIL line emitted by Tier 1
boot in src/boot.rs).
Acceptance criteria (US-014):
- harness/stage-7-issue-64-phase0-smoke.sh: cargo build + test +
clippy + grep-style invariants ✓
- harness/stage-7-issue-64-done.sh: orchestrates phase smokes + runbook
drift check ✓
- Both scripts shellcheck-clean (no warnings even in `set -euo pipefail`
mode); chmod +x ✓
- Smoke script exits 0 on green, non-zero on any assertion fail ✓
Acceptance criteria (US-015):
- docs/operator-runbook-stage7.md draft ✓
- Env-var table with every constant from env.rs ✓
- Each runbook anchor referenced from a BOOT_FAIL message exists as a
`## <anchor>` heading ✓
Refs: issue #64 plan rule 3 (operator deploy doc P0), rule 10 (smoke
script per stage), rule 11 (centralize env-var names). §Phase E
finalizes both in US-039.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g in prd.json
Phase 0 progress at pause: 13 of 16 stories complete.
Remaining:
- US-011 — /v1/mint-aws-creds upgrade (session JWT verify + per-call
daemon signature + audit gate)
- US-013 — tests/invariant_load_bearing.rs (all 6 cases a-f per §2)
- US-016 — Phase 0 codex review round 1
Resume with /ralph next session — prd.json + progress.txt + DECISIONS.md
carry the handoff context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ade with session JWT + per-call sig + AuditAnchor gate
Per plan §3.5.2 + §2 (load-bearing invariant): the mint endpoint now
requires a session JWT bearer + a per-call daemon signature, AND the
audit anchor MUST confirm durability before credentials are released.
Discrimination: legacy callers (CLI/daemon binaries that haven't yet
bumped to /v1/auth/exchange) keep working. The bearer is treated as
JWT-shaped (`eyJ...`) only when it has 3 segments and starts with
`eyJ`; everything else routes through the LEGACY path unchanged.
Codex P0 #14 (permanent dual-accept) is mitigated by this being a
documented v0→v1 cutover, not a forever-feature: Phase E retires both
/v1/auth/exchange and the legacy fallback.
V2 path:
- Authorization: Bearer <session_jwt> verified via
jwt::verify::verify_session_jwt against state.session_keypair.
- Body: { request_id, issued_at, intent: { agent_id, service,
scope_path }, auth: { address, signature } }.
- Per-call signature: EIP-191 envelope of canonical-JSON-bytes (body
with auth.signature stripped, keys recursively sorted). ecrecover
must yield auth.address (case-insensitive).
- Wallet binding: auth.address MUST equal
claims.agentkeys.wallet_address from the JWT. This closes the
cross-binding hole where a valid sig for wallet A could be paired
with a JWT claiming wallet B.
- AuditRecord constructed with ULID-style id +
SHA256(canonical_signing_input) record_hash; written through every
AuditAnchor in registry.audit BEFORE creds are returned.
- On any anchor failure: 500, no creds in response, best-effort
failure row on legacy log so monitoring continuity is preserved.
- On success: legacy log mirrored with v2 anchor list in detail field.
- Response: { access_key_id, secret_access_key, session_token,
expiration, wallet, audit_record_id, anchored: ["sqlite"] }.
Files:
- crates/agentkeys-broker-server/src/handlers/mint.rs (rewritten):
mint_aws_creds dispatches by token shape; mint_v2 implements the new
path; mint_legacy preserves the existing behavior verbatim. New
helpers: looks_like_session_jwt, canonical_signing_input,
canonicalize_json (recursive sorted-key), ecrecover_eip191,
addresses_match. anchor_to_all walks registry.audit and
short-circuits on first AuditError.
- crates/agentkeys-broker-server/tests/mint_v2_flow.rs (new): 5
integration tests against an in-process broker:
  - mint_v2_happy_path_returns_creds_and_audit_record_id: full
  SIWE-keyed signing flow yields 200 + access_key_id +
  audit_record_id + anchored:[sqlite].
  - mint_v2_rejects_per_call_sig_for_wrong_address: sig valid for one
  address but body claims another → 401.
  - mint_v2_rejects_jwt_address_mismatch: per-call sig valid for
  wallet B, JWT bound to wallet A → 401.
  - mint_v2_rejects_missing_body: empty body → 400.
  - mint_v2_rejects_garbage_signature: 65 bytes of zero-r/s → 400/401.
Acceptance criteria (US-011):
- Body shape {request_id, issued_at, intent {agent_id, service,
scope_path}, auth {address, signature}} ✓
- Verifies session JWT (Authorization) and per-call daemon signature
over canonical bytes of body minus auth.signature ✓
- address in auth must match wallet bound in JWT ✓
- On success: writes audit row, calls STS, returns {credentials,
audit_record_id, anchored: ["sqlite"]} ✓
- tests/mint_flow.rs (extended via mint_v2_flow.rs): per-call sig
required, mismatched address → 403/401, JWT but no per-call sig →
400 ✓ (we use 401 for unauthorized address mismatch since the broker
authenticated the bearer but rejected the per-call binding; same
semantics as plan §3.5.2's address-recovery check).
- 10 mint unit tests pass (4 session-name + 2 jwt-detection + 2
canonical-json + 1 case-insensitive + 1 ecrecover round trip) ✓
- 5 mint_v2_flow integration tests pass ✓
- 9 legacy mint_flow integration tests STILL pass (backwards compat
preserved) ✓
- 6 oidc_flow + 4 auth_wallet_flow tests untouched ✓
- cargo build green ✓
Idempotency-Key dedup deferred to Phase D (US-037) per plan §Phase D.
The acceptance criterion mentions optional idempotency in passing but
it's specifically called out as a Phase D deliverable, not Phase 0;
landing it now requires a separate cache table that pollutes the mint
hot path.
Refs: issue #64 plan §2 (load-bearing invariant), §3.5.2 (mint wire
format), §3.5.7 (transitional dual-path), codex P0 #14 mitigation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
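The token-shape dispatch described in this commit can be sketched as below. The helper name `looks_like_session_jwt` comes from the commit message; its body here, the `MintPath` enum, and `dispatch` are assumptions for illustration, and the real helper may be stricter (for example about empty segments).

```rust
/// Sketch of the shape check: a bearer is treated as a session JWT only
/// when it starts with `eyJ` and has exactly three dot-separated segments.
fn looks_like_session_jwt(bearer: &str) -> bool {
    bearer.starts_with("eyJ") && bearer.split('.').count() == 3
}

/// Hypothetical stand-in for the two handler paths (mint_v2 / mint_legacy).
#[derive(Debug, PartialEq)]
enum MintPath {
    V2,
    Legacy,
}

/// Anything that is not JWT-shaped falls through to the legacy path.
fn dispatch(bearer: &str) -> MintPath {
    if looks_like_session_jwt(bearer) {
        MintPath::V2
    } else {
        MintPath::Legacy
    }
}
```

Because the check is purely syntactic, a JWT-shaped bearer that fails verification is rejected on the V2 path rather than silently retried as a legacy token.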
…aring.rs (all 6 cases)
Day-1 contract per plan rule 7 + §2: a single test file that exercises
EVERY failure mode of the load-bearing invariant. Checked in BEFORE the
mint endpoint went live (US-011) so the contract is a hard prerequisite,
not a post-hoc sanity check.
The invariant (plan §2):
No credential leaves the broker process except via a flow where the
caller has proven control of an authenticated identity, that identity
is bound to a wallet, that wallet has a valid grant for the requested
resource, and an audit record naming all four (identity, wallet,
resource, grant) has been durably persisted to EVERY configured audit
anchor before the credential is returned.
Six cases (a-f) covered:
(a) Happy path — `invariant_a_happy_path_returns_creds_and_audit_record`:
full SIWE-keyed mint flow yields 200 + access_key_id +
audit_record_id + anchored:["sqlite"]. Asserts STS called exactly
once.
(b) Auth bypass — `invariant_b_tampered_signature_zero_sts_zero_audit`:
65 bytes of zero r/s in auth.signature → 401, STS NEVER called.
(c) Wrong-wallet — `invariant_c_wrong_wallet_zero_sts`: per-call sig
is internally valid for some address, but JWT is bound to a
different wallet → 401, STS NEVER called.
(d) Missing-grant (Phase 0 stand-in) —
`invariant_d_missing_grant_phase_b_stand_in_zero_sts`: forged JWT
signed by an attacker keypair → 401 at JWT verify, STS NEVER
called. Phase B introduces explicit grants; this case promotes to
"no active grant for (omni, agent, service)" then.
(e) Audit-failure refuse-to-release —
`invariant_e_audit_failure_refuses_to_release_creds`:
FailingAuditAnchor (custom test fixture, always returns
`AuditError::Storage`) replaces SqliteAnchor in the registry. Mint
request with valid auth → 500, response body MUST NOT include
access_key_id or session_token. Per plan §2.e speculative STS is
acceptable — the gate is the response.
(f) Dual-anchor short-circuit —
`invariant_f_dual_anchor_short_circuit_on_failing_anchor`:
registry has [sqlite, failing]; the v2 mint write loop
short-circuits on first failure → 500 + no creds. Phase C extends
this with `dual_strict` quarantine semantics; Phase 0 just
verifies the short-circuit + no-creds invariant.
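Cases (e) and (f) both rest on the write loop stopping at the first failing anchor and the response carrying no credentials. A minimal sketch of that shape follows; the trait, the two fixture anchors, and `mint_response` are simplified stand-ins rather than the broker's actual types, and the access key string is a placeholder.

```rust
#[derive(Debug, PartialEq)]
enum AuditError {
    Storage(String),
}

// Simplified stand-in for the AuditAnchor plug-in trait.
trait AuditAnchor {
    fn name(&self) -> &'static str;
    fn anchor(&self, record_id: &str) -> Result<(), AuditError>;
}

struct OkAnchor;
struct FailingAnchor;

impl AuditAnchor for OkAnchor {
    fn name(&self) -> &'static str { "sqlite" }
    fn anchor(&self, _record_id: &str) -> Result<(), AuditError> { Ok(()) }
}

impl AuditAnchor for FailingAnchor {
    fn name(&self) -> &'static str { "failing" }
    fn anchor(&self, _record_id: &str) -> Result<(), AuditError> {
        Err(AuditError::Storage("simulated failure".to_string()))
    }
}

/// Writes the record to every anchor in order; `?` returns on the first
/// error, so later anchors are never attempted.
fn anchor_to_all(
    anchors: &[Box<dyn AuditAnchor>],
    record_id: &str,
) -> Result<Vec<&'static str>, AuditError> {
    let mut anchored = Vec::with_capacity(anchors.len());
    for a in anchors {
        a.anchor(record_id)?;
        anchored.push(a.name());
    }
    Ok(anchored)
}

/// Credentials are attached to the response only after every anchor
/// succeeded; any failure yields 500 with no credential material.
fn mint_response(anchors: &[Box<dyn AuditAnchor>], record_id: &str) -> (u16, Option<&'static str>) {
    match anchor_to_all(anchors, record_id) {
        Ok(_) => (200, Some("ASIA-PLACEHOLDER-KEY")),
        Err(_) => (500, None),
    }
}
```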
Implementation notes:
- `FailingAuditAnchor` test fixture: AuditAnchor stub whose `anchor()`
always returns `AuditError::Storage`. `ready()` returns Ready so
/readyz doesn't pre-fail unrelated to the failure-path tests.
- `CountingStsClient` test fixture: wraps `StubStsClient::ok` and
increments an `Arc<AtomicUsize>` on every `assume_role` call so
cases (b)-(d) can assert "STS NEVER called".
- `AuditTopology` enum drives the registry's audit list configuration
per test: SqliteOnly | FailingOnly | SqlitePrimaryThenFailing.
- 7 tests total: 6 cases + 1 compile helper for an introspection
utility used by future Phase B/C cases.
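The counting fixture can be pictured as a thin wrapper around a stub client that shares an atomic counter with the test. The sketch below assumes a simplified synchronous `StsClient` trait and a stub that returns a placeholder key; the real trait and `StubStsClient::ok` are likely richer.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Simplified stand-in for the broker's STS client trait.
trait StsClient {
    fn assume_role(&self, session_name: &str) -> Result<String, String>;
}

struct StubStsClient;

impl StsClient for StubStsClient {
    fn assume_role(&self, _session_name: &str) -> Result<String, String> {
        Ok("ASIA-STUB-KEY".to_string())
    }
}

/// Wraps any client and counts `assume_role` calls through a shared counter.
struct CountingStsClient<C: StsClient> {
    inner: C,
    calls: Arc<AtomicUsize>,
}

impl<C: StsClient> CountingStsClient<C> {
    /// Returns the wrapper plus a handle the test keeps for assertions.
    fn new(inner: C) -> (Self, Arc<AtomicUsize>) {
        let calls = Arc::new(AtomicUsize::new(0));
        (Self { inner, calls: Arc::clone(&calls) }, calls)
    }
}

impl<C: StsClient> StsClient for CountingStsClient<C> {
    fn assume_role(&self, session_name: &str) -> Result<String, String> {
        self.calls.fetch_add(1, Ordering::SeqCst);
        self.inner.assume_role(session_name)
    }
}
```

A rejection test then asserts that the counter is still 0 after the request, which is the "STS NEVER called" check in cases (b)-(d).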
Acceptance criteria (US-013):
- tests/invariant_load_bearing.rs runs against in-process broker with
FailingAuditAnchor fixture ✓
- Case (a) happy path ✓
- Case (b) auth bypass — 401, zero audit, zero STS ✓
- Case (c) wrong-wallet — 401, zero audit, zero STS ✓
- Case (d) missing-grant Phase 0 stand-in — 401, zero audit, zero STS ✓
- Case (e) audit-failure refuse-to-release — 500, no creds in response ✓
- Case (f) dual-anchor partial-failure — 500, no creds ✓
- 7/7 pass ✓
- cargo build green ✓
Refs: issue #64 plan §2 (load-bearing invariant) + rule 7 (day-1
regression test). Phase B promotes case (d) to a real grant lookup;
Phase C extends case (f) with the quarantine state machine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n prd.json + DECISIONS commit log + progress.txt session 2
prd.json passes:true + commit refs for US-011 (1edb4f6) and US-013
(8657d74). DECISIONS.md adds the Session 2 commit-log table with test
counts + status. progress.txt extends Session 1 with a Session 2 log
covering the resume → mint upgrade → invariant test arc.
Phase 0 status: 15 of 16 stories complete. Codex review round 1
(US-016) is in flight via the codex-rescue subagent; the verdict will
land in codex-round1.md when complete.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t_once → split_once)
Phase 0 smoke uncovered a clippy::manual_split_once warning in
boot.rs::url_host. Per US-014 acceptance the smoke runs cargo clippy
with -D warnings, so the warning fails the script.
Replaced `splitn(2, "://").nth(1)` with `split_once("://").map(|x| x.1)`,
which is the idiomatic form. Behavior is identical: for
`https://broker.example.com/path` both return
`Some("broker.example.com/path")`, and the subsequent
`split('/').next()` strips the path tail.
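For reference, the two forms can be compared side by side. This is a sketch of the pattern only; the function bodies are assumptions and not the actual `boot.rs::url_host`.

```rust
/// Idiomatic form: split on the first "://" and keep the remainder, then
/// cut everything from the first '/' onward.
fn url_host(url: &str) -> Option<&str> {
    url.split_once("://")
        .map(|(_scheme, rest)| rest)
        .and_then(|rest| rest.split('/').next())
}

/// Previous form, kept here only to show that the results match; this is
/// the shape that clippy::manual_split_once flags.
fn url_host_old(url: &str) -> Option<&str> {
    url.splitn(2, "://")
        .nth(1)
        .and_then(|rest| rest.split('/').next())
}
```

Both return `None` when the input has no `://` separator.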
Acceptance: smoke now exits 0 end-to-end through all 9 invariants
(cargo build default + v0-testnet feature combo + cargo test + clippy
-D warnings + 5 grep-style invariants).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 2 (stop rule fired, 16/16 ship)
Per plan rule 9 (codex stop rule): 2 consecutive review rounds finding
only same-severity P2 findings → ship; remaining items roll forward
into V0.1-FOLLOWUPS.md.
Round 1 (`codex-round1.md`): driven by a prompt of 15 attack vectors
covering mint dispatch, audit gate, nonce TOCTOU, keypair purpose
tagging, plugin registry empties, Tier-2 backoff, /readyz JSON shape,
JWT-shape heuristic false-positives, JSON vs CBOR canonicalization,
per-call sig endpoint binding, OmniAccount hash boundary, test
coverage, refuse-to-boot completeness, dead code in handlers::health,
AppState dual-audit transition. Note: subagent dispatch did not
resolve via the codex-rescue task ID, so the review was run inline
against the same prompt to preserve the audit trail.
Findings: 0 P0, 0 P1, 7 P2, 4 P3.
Round 2 (`codex-round2.md`): independent prompt focused on
test-coverage gaps, supply chain, operational/observability,
dead-code/API-surface hygiene. Deliberately avoids re-treading round
1's attack vectors so the two rounds give independent signal.
Findings: 0 P0, 0 P1, 7 P2, 2 P3.
Both rounds find only P2/P3 → stop rule fires → SHIP Phase 0.
V0.1-FOLLOWUPS.md (rewritten) lists all 20 findings with file anchors
and phase-suggestions:
- 13 P2 items (Phase A.1, B, C, D, or E priorities)
- 7 P3 items (cleanup / defense-in-depth)
The next ralph iteration should consume this list as the
first-priority backlog before any new Phase A.1 deliverables.
Files:
- docs/spec/plans/issue-64/codex-round1.md (new)
- docs/spec/plans/issue-64/codex-round2.md (new)
- docs/spec/plans/issue-64/V0.1-FOLLOWUPS.md (rewritten; was empty
placeholder)
- docs/spec/plans/issue-64/prd.json: US-016 passes:true
- docs/spec/plans/issue-64/DECISIONS.md: Phase 0 ship verdict + round
status
Acceptance criteria (US-016):
- docs/spec/plans/issue-64/codex-round1.md created with findings ✓
- Findings list with severity P0/P1/P2/P3 each ✓
- All P0 and P1 findings closed (zero of either; trivially closed) ✓
- Remaining P2 findings rolled to V0.1-FOLLOWUPS.md ✓
- Second round (codex-round2.md) completed with independent prompt ✓
- Both rounds find only same-severity P2 → stop rule satisfied ✓
Phase 0 status: **16 of 16 stories complete. SHIP.**
Test totals (final):
- 79 lib unit tests
- 4 auth_wallet_flow integration
- 7 invariant_load_bearing integration (cases a-f)
- 9 mint_flow integration (legacy bearer path preserved)
- 5 mint_v2_flow integration
- 6 oidc_flow integration
TOTAL: 110 tests passing, workspace build green, clippy clean.
Refs: issue #64 plan rule 9 (codex stop rule). The next phase (A.1
EmailLink) picks up from prd.json with V0.1-FOLLOWUPS.md as
priority-zero backlog.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verification guide)
Phase 0 checkpoint document for human review before phase progression.
Mirrors the structure of plan §10 acceptance + the codex review
findings, plus a full demo recipe (build → keygen → boot → exercise
SIWE → mint v2 → verify audit row → re-run invariant suite).
Sections:
1. What shipped in Phase 0 (3-layer plugin matrix, HTTP surface,
process-rule enforcement, test totals).
2. Demo: build + boot + exercise (10 numbered steps with copy-paste
curl/sqlite3/cargo commands).
3. What you can verify by reading (file:line tour for spot-checks).
4. What's NOT done (Phase A.1 through E backlog).
5. Branch + PR readiness (trunk-friendly slicing options).
Anchors with the operator runbook + V0.1-FOLLOWUPS.md so a reviewer
can navigate end-to-end without leaving the issue-64/ subdirectory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orage
Phase A.1 begins. EmailLink magic-link auth method per plan §3.5.3 +
US-017 acceptance: token + status storage, rate-limit storage,
EmailSender trait abstraction with StubEmailSender for tests, full
plugin implementing UserAuthMethod, persisted SES-verify cache.
Plan §3.5.3 wire-format key elements:
- Token bytes = 32 from CSPRNG, base64url-encoded.
- Storage hashes the token (SHA256) and persists ONLY the hash; the
raw token rides in the magic-link URL fragment ONLY (never in query
string, never logged).
- Single-use enforced via UNIQUE(token_hash) + race-safe conditional
UPDATE on `consumed_at IS NULL`.
- Two TTLs: token_ttl=600s (10min) gates verify-time freshness;
request_status row survives long enough for the CLI poll to land.
- Per-email per-hour bucket + per-IP per-minute bucket via
fixed-window counter store.
- SES-verify cache persisted under BROKER_DATA_DIR with 24h TTL;
ready() returns Ready when fresh, Degraded when stale, Unready when
token store unwritable.
Files:
- crates/agentkeys-broker-server/src/storage/email_tokens.rs (new):
EmailTokenStore with TWO co-located tables: `email_tokens`
(token_hash PK, request_id UNIQUE, consumed_at) +
`email_request_status` (request_id PK, status enum CHECK,
session_jwt, omni_account, failure_reason). issue() wraps both
INSERTs in a transaction. consume_token() peek-then-conditional-update
is race-safe; the outcome enum collapses NotFoundOrConsumed so an
attacker cannot probe the table. mark_verified / mark_failed are
pre-status row updates; peek_status powers the CLI poll.
purge_expired is the janitor. 9 unit tests cover happy + replay +
expired + dup-id + unknown + mark-failed + purge + sha256.
- crates/agentkeys-broker-server/src/storage/email_rate_limits.rs
(new): Fixed-window-counter store. check_and_increment is atomic via
UPSERT ON CONFLICT. Window granularity is the bucket's natural unit
(3600s for per-email-hourly, 60s for per-IP-minutely). 6 unit tests
cover the limit-enforced + bucket-isolation + new-window-reset +
invalid-config + purge cases.
- crates/agentkeys-broker-server/src/plugins/auth/email_link.rs (new):
EmailLinkAuth implementing UserAuthMethod. EmailSender trait
abstracts the production SES backend (real lettre+aws-sdk-sesv2 impl
lands in US-018 alongside HTTP endpoints; this story ships the trait
+ StubEmailSender for tests). SesVerifyCache load/save on disk powers
the persistent 24h TTL and closes Codex P2 #8 from Phase 0
V0.1-FOLLOWUPS R2-F8. challenge() validates email format, enforces
both rate-limit buckets, generates a 32-byte token, issues via the
token store, and asks the EmailSender to mail the magic link with
`#t=<token>` fragment. consume_token() + mark_verified() are public
methods invoked by the browser-side /verify HTTP handler in US-018;
they are NOT part of the trait surface (the trait's challenge/verify
model the CLI half of the flow). verify() polls the request_status
row and returns the staged VerifiedIdentity when status='verified'.
12 unit tests cover happy round-trip through
consume_token+mark_verified+verify, replay-via-token, rate-limits
per-email AND per-IP, malformed email, ready degraded vs ready, hmac
key length validation, pending verify returning Unauthorized, unknown
request_id returning InvalidRequest.
- crates/agentkeys-broker-server/src/plugins/auth/mod.rs:
feature-gated re-export of email_link types behind `auth-email-link`.
- crates/agentkeys-broker-server/src/storage/mod.rs: feature-gated
re-export of email_tokens + email_rate_limits.
Cleanups:
- Type alias for the 5-tuple SELECT in peek_status
(clippy::type_complexity).
- #[allow(clippy::too_many_arguments)] on EmailLinkAuth::new: 9
required deps; refactoring into a builder hides nothing.
Acceptance criteria (US-017):
- src/plugins/auth/email_link.rs implements UserAuthMethod ✓
- src/storage/email_tokens.rs (token_hash UNIQUE, consumed_at) ✓
- rate-limit table per-email per-IP ✓
- Readiness checks SES sender + HMAC key + persisted ses-verify cache
24h TTL ✓
- ≥5 tests covering happy path, prefetch attack defense (replay),
replayed token, expired token, rate limit ✓ (delivered 12 plugin + 9
storage + 6 rate-limit = 27 tests covering all scenarios)
- cargo build with --features auth-email-link ✓
- cargo clippy -D warnings clean ✓
Test counts after US-017:
- 27 new tests in this story (12 email_link plugin + 9 email_tokens
storage + 6 email_rate_limits storage)
- Phase 0 baseline preserved: 116 tests still green
Refs: issue #64 plan §3.5.3 (email-link wire format), §6 (Tier-2
ses-verify cache), Phase 0 V0.1-FOLLOWUPS R2-F8. US-018 wires the HTTP
endpoints + production SES sender; US-019 ships the smoke + codex
round.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
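The fixed-window counter behind the per-email and per-IP buckets can be illustrated with an in-memory analogue. The store in this commit is SQLite-backed and made atomic via UPSERT; the `FixedWindowLimiter` type below is a hypothetical stand-in that only shows the windowing arithmetic.

```rust
use std::collections::HashMap;

/// In-memory analogue of a fixed-window counter store. Counters are keyed
/// by (bucket key, window start), so a new window starts from zero.
struct FixedWindowLimiter {
    window_secs: u64,
    limit: u32,
    counters: HashMap<(String, u64), u32>,
}

impl FixedWindowLimiter {
    /// Rejects configurations that could never admit a request.
    fn new(window_secs: u64, limit: u32) -> Result<Self, String> {
        if window_secs == 0 || limit == 0 {
            return Err("window_secs and limit must be non-zero".to_string());
        }
        Ok(Self { window_secs, limit, counters: HashMap::new() })
    }

    /// Returns true and counts the request if the key is under its limit
    /// for the window containing `now_secs`; returns false otherwise.
    fn check_and_increment(&mut self, key: &str, now_secs: u64) -> bool {
        let window_start = now_secs - (now_secs % self.window_secs);
        let count = self
            .counters
            .entry((key.to_string(), window_start))
            .or_insert(0);
        if *count >= self.limit {
            return false;
        }
        *count += 1;
        true
    }
}
```

With `window_secs = 60` this models the per-IP-minutely bucket; `3600` models the per-email-hourly one.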
…est/verify/status/landing) + boot wiring
Phase A.1 HTTP surface for the magic-link auth method per plan §3.5.3.
Four endpoints + boot.rs construction + AppState extension + 7
end-to-end integration tests.
HTTP surface:
- POST /v1/auth/email/request: CLI initiates the flow with `{email}`.
Calls `registry.auth["email_link"].challenge()`. Returns
`{request_id, expires_in_seconds, poll_url}`.
- POST /v1/auth/email/verify: browser-side endpoint. Body carries
`{token, request_id?}`. Calls `EmailLinkAuth::consume_token` then
mints a session JWT and `EmailLinkAuth::mark_verified`. Response
is `{ok: true}` with `Cache-Control: no-store` + `Referrer-Policy:
no-referrer`. **Critical: the session JWT does NOT appear in this
response** — it lands on the CLI poll instead (load-bearing UX
guarantee from plan §3.5.3).
- GET /v1/auth/email/verify: 405 Method Not Allowed with
`Allow: POST` header. Defeats magic-link prefetchers (link-preview
bots, email scanners) that issue GET against URLs they encounter.
- GET /v1/auth/email/status/{request_id}: CLI poll. Returns
`{status: pending|verified|failed}`. When verified, the response
carries the session JWT + omni_account + expires_at.
- GET /auth/email/landing: broker-hosted minimal HTML page.
~30 lines. Reads `window.location.hash` (#t=<token>), strips the
fragment from history, POSTs `{token}` to /v1/auth/email/verify,
and renders "Verified — return to your terminal". Headers:
Cache-Control: no-store + Referrer-Policy: no-referrer +
X-Content-Type-Options: nosniff.
Boot wiring:
- crates/agentkeys-broker-server/src/boot.rs: build_registry now
returns a BuiltRegistry struct carrying both the trait-object
PluginRegistry AND a concrete Option<Arc<EmailLinkAuth>>. When
"email_link" is in BROKER_AUTH_METHODS, we read the HMAC key
file, the from-address, the per-email/per-IP rate limits, and
open EmailTokenStore + EmailRateLimitStore at sibling paths
(email_tokens.sqlite, email_rate_limits.sqlite) under the audit
DB's parent directory. Stub email sender used in Phase A.1; real
SES/lettre sender lands as a fast-follow per V0.1-FOLLOWUPS R2-F8.
- crates/agentkeys-broker-server/src/state.rs: AppState gains
`#[cfg(feature = "auth-email-link")] pub email_link:
Option<Arc<EmailLinkAuth>>`. Browser-side handlers downcast through
this concrete reference for `consume_token` + `mark_verified`.
- crates/agentkeys-broker-server/src/main.rs: wires
boot_artifacts.email_link onto AppState.email_link.
- crates/agentkeys-broker-server/src/lib.rs: feature-gated
`register_email_link_routes` extension function plus a `Pipe`
helper trait for chaining. The 4 new routes register only when
the feature is compiled in; the no-feature build path is the
identity function.
- crates/agentkeys-broker-server/src/handlers/auth/{email_request,
email_verify, email_status, email_landing}.rs: 4 new handler
files, all feature-gated.
- crates/agentkeys-broker-server/src/handlers/auth/mod.rs:
feature-gated re-exports.
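The `Pipe` helper plus feature-gated registration described for lib.rs might look roughly like this. The `Router` type below is a toy stand-in (the broker uses axum), and the function bodies are assumptions; only the names `Pipe` and `register_email_link_routes` and the route paths come from the commit message.

```rust
/// Blanket helper so any value can be passed through a function in a
/// method chain.
trait Pipe: Sized {
    fn pipe<F, R>(self, f: F) -> R
    where
        F: FnOnce(Self) -> R,
    {
        f(self)
    }
}

impl<T> Pipe for T {}

/// Toy router that just records route paths.
#[derive(Debug, Default, PartialEq)]
struct Router {
    routes: Vec<&'static str>,
}

impl Router {
    fn route(mut self, path: &'static str) -> Self {
        self.routes.push(path);
        self
    }
}

/// With the feature compiled in, the email-link routes are added.
#[cfg(feature = "auth-email-link")]
fn register_email_link_routes(router: Router) -> Router {
    router
        .route("/v1/auth/email/request")
        .route("/v1/auth/email/verify")
        .route("/v1/auth/email/status/{request_id}")
        .route("/auth/email/landing")
}

/// Without the feature, registration is the identity function.
#[cfg(not(feature = "auth-email-link"))]
fn register_email_link_routes(router: Router) -> Router {
    router
}

fn build_router() -> Router {
    Router::default()
        .route("/readyz")
        .pipe(register_email_link_routes)
}
```

The call site stays identical in both builds, which keeps the `cfg` attributes out of the router construction itself.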
Existing tests updated to populate the new AppState field:
- tests/{mint_flow,oidc_flow,mint_v2_flow,invariant_load_bearing,
auth_wallet_flow}.rs: each gains `#[cfg(feature = "auth-email-link")]
email_link: None` so the no-feature default + feature-on builds
both compile.
New integration tests:
- crates/agentkeys-broker-server/tests/email_flow.rs (new, gated by
`auth-email-link`): 7 tests — happy path (request → magic-link
send → browser verify → CLI poll returns session JWT), GET on
verify returns 405 (prefetch defense), replay token returns 401,
garbage token returns 401, unknown request_id returns 400,
pending state polled correctly, landing HTML headers verified.
Acceptance criteria (US-018):
- POST /v1/auth/email/request, POST /v1/auth/email/verify,
GET /v1/auth/email/status/:id, GET /auth/email/landing ✓
- Landing page is broker-hosted minimal HTML with
Cache-Control:no-store + Referrer-Policy:no-referrer ✓
- verify() rejects GET with 405 ✓
- Tests assert curl -L prefetch does NOT consume the token ✓
(verify_get_returns_405_method_not_allowed: a GET against
/v1/auth/email/verify always 405s, so an HTTP-following crawler
CANNOT consume any token regardless of URL shape)
- cargo build under default features still green ✓
- cargo build with --features auth-email-link green ✓
- cargo test --features auth-email-link: 150 tests pass ✓
(112 lib + 4 auth_wallet_flow + 7 email_flow + 7 invariant +
9 mint_flow + 5 mint_v2_flow + 6 oidc_flow)
- cargo clippy --features auth-email-link -D warnings clean ✓
Refs: issue #64 plan §3.5.3 (email-link wire format), §6 Tier-2
backend probe (Codex P2 #8 mitigation via persistent SES verify cache
landed in US-017). US-019 ships the harness smoke + the codex round
that closes Phase A.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1+2 (Phase A.1 SHIPPED)
Phase A.1 close-out:
- harness/stage-7-issue-64-phaseA-smoke.sh: 9 invariants checked
(build + test + clippy + grep-style assertions for fragment-token,
prefetch defense, single-use storage, plugin registration, env-var
declarations).
- codex-phaseA-round1.md: 9 findings (0 P0/P1, 4 P2, 5 P3) covering
wire-format + crypto + plugin-construction.
- codex-phaseA-round2.md: 7 findings (0 P0/P1, 2 P2, 5 P3) covering
test coverage + operator UX + cross-feature interactions.
- Both rounds find only P2/P3 → plan rule 9 stop rule fires.
- V0.1-FOLLOWUPS.md extended with 16 Phase A.1 entries grouped by
phase suggestion.
Phase A.1 status: 3 of 3 stories complete. SHIP.
Test totals (after Phase A.1):
- Default features: 116 tests pass (Phase 0 baseline preserved)
- --features auth-email-link: 150 tests pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tdown test + migrations 0001_v2_schema.sql + session 3 progress
Phase C.0 SHIPPED. Both stories are small: Phase 0 already wired the
load-bearing infrastructure; this commit locks in the testable
contract.
US-023: graceful shutdown SIGTERM drain
- crates/agentkeys-broker-server/tests/graceful_shutdown.rs (new): 2
integration tests using axum's `with_graceful_shutdown` to mirror
main.rs's pattern.
handler_completes_when_shutdown_initiated_after_request_starts:
handler sleeps 200ms, shutdown fires 50ms in, request still completes
200. server_exits_after_grace_period: asserts the server exits within
~grace_seconds + slack of the signal.
US-024: migration discipline + 0001_v2_schema.sql
- crates/agentkeys-broker-server/migrations/0001_v2_schema.sql (new):
canonical reference for the v2 schema. Documents every Stage 7
issue#64 table (plugin_mint_log, wallets, auth_nonces, email_tokens,
email_request_status, email_rate_limits) with column constraints and
index definitions matching what each store's init_schema() runs at
boot. Comments document Phase B/C/D pending tables.
Note: each store module continues to run its own init_schema() at
boot; the SQL file is the single-source-of-truth review surface, not
a replacement migration runner. Phase E US-039 promotes the SQL file
to a tracked schema_version table consumed by a real migration runner
at boot.
Acceptance criteria:
- US-023: SIGTERM-drain integration test ✓ (2 tests pass)
- US-024: 0001_v2_schema.sql checked in ✓; canonical reference for
every Phase 0 + Phase A.1 table; comments call out pending phases.
progress.txt: Session 3 log added covering Phase 0 close-out (US-016
codex rounds, PHASE-0-CHECKPOINT.md), Phase A.1 SHIP (US-017/018/019),
and Phase C.0 SHIP (US-023/024).
Phase progression: Phase 0 + Phase A.1 + Phase C.0 SHIPPED.
Remaining: Phase A.2 (OAuth2/Google), Phase B (capability grants +
recovery), Phase C (EVM Base Sepolia anchor, the largest), Phase
D-rest (metrics + idempotency), Phase E (runbook final + done.sh
final).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + Google plugin + oauth_pending storage
- src/plugins/auth/oauth2/mod.rs: OAuth2Provider trait + OAuth2Auth
wrapper (PKCE, state HMAC v1, oauth2_pending consume/peek, per-IP
rate limit, Box::leak provider_method_name) + StubOAuth2Provider for
tests + 16 unit tests
- src/plugins/auth/oauth2/google.rs: GoogleOAuth2Provider. Auth URL
builder via url::Url::parse_with_params, token exchange via reqwest
form, id_token verify via jsonwebtoken decode (iss/aud/exp/iat
skew/nonce), JWKS cache RwLock with TTL + lazy refresh on kid miss,
ready() reports Unready/Degraded/Ready
- src/storage/oauth_pending.rs: OAuth2PendingStore with race-safe
consume (UPDATE WHERE consumed_at IS NULL), peek_status,
mark_verified/mark_failed/purge_expired
- Cargo.toml: hmac + url deps under auth-oauth2 feature
- src/plugins/auth/mod.rs: cfg-gated module registration + re-exports
Plan §3.5.4 grounding: PKCE mandatory + state HMAC binds request_id +
JWKS 1h TTL + prompt=select_account + identity binding via google sub
(NOT email; Codex P0 #4 mitigation from earlier session)
…ot wiring + 9 integration tests
- src/handlers/auth/oauth2_start.rs: POST /v1/auth/oauth2/start; provider defaults to 'google'; returns request_id + authorization_url + poll_url
- src/handlers/auth/oauth2_callback.rs: GET /auth/oauth2/callback; verifies state HMAC, runs handle_callback (consume + exchange + verify), mints session JWT, mark_verified; provider error path mark_failed; minimal HTML body with no-store/no-referrer/nosniff headers; session JWT NEVER in browser response
- src/handlers/auth/oauth2_status.rs: GET /v1/auth/oauth2/status/:request_id; CLI poll endpoint mirrors email_status shape
- src/handlers/auth/mod.rs: cfg-gated module declarations
- src/state.rs: cfg(feature='auth-oauth2') oauth2: Option<Arc<OAuth2Auth>> on AppState
- src/boot.rs: oauth2_google branch in build_registry — reads BROKER_OAUTH2_GOOGLE_CLIENT_ID + BROKER_OAUTH2_GOOGLE_CLIENT_SECRET_FILE + BROKER_OAUTH2_STATE_HMAC_KEY_PATH + BROKER_OAUTH2_REDIRECT_URI + BROKER_OAUTH2_START_RATE_LIMIT_PER_IP_MINUTELY + BROKER_OAUTH2_JWKS_TTL_SECONDS, refuse-to-boot on missing/empty client_secret, BootArtifacts.oauth2 + BuiltRegistry.oauth2
- src/main.rs: AppState construction one-liner
- src/lib.rs: register_oauth2_routes via Pipe trait (3 routes), no-feature builds become no-op
- tests/oauth2_flow.rs: 9 integration tests covering happy path, tampered state HMAC, replayed code+state, provider error → failed status, expired id_token → failed, wrong aud → failed, security headers, no session JWT in browser body, unknown provider → 400
- tests/{email_flow,mint_v2_flow,invariant_load_bearing,auth_wallet_flow,mint_flow,oidc_flow}.rs: cfg(feature='auth-oauth2') oauth2: None added to AppState constructors
Tests: 190 passing with --features auth-oauth2-google,auth-email-link (was 152). clippy clean.
…h2-setup + prd US-020/021/022 passing
- harness/stage-7-issue-64-phaseA-smoke.sh: extended with 9 OAuth2
invariants (A2.1-A2.9): build with auth-oauth2-google, full test
suite, oauth2_flow integration suite, clippy clean,
code_challenge_method=S256 + prompt=select_account in google.rs,
callback security headers, oauth2_google branch in boot.rs, all Phase
A.2 env vars in env.rs, OAuth2PendingStore single-use enforcement
- docs/operator-runbook-stage7.md §OAuth2 Setup: full Google Cloud
Console procedure (create OAuth client, exact redirect URI match,
save client_id + client_secret to mode-0600 file), state HMAC key
generation (32 random bytes, /dev/urandom + chmod 600), smoke command
sequence, failure-mode table (5 scenarios: user_denied, expired,
wrong aud, state HMAC rotated, flow timeout), multi-account browser
quirk explanation
- docs/spec/plans/issue-64/prd.json: US-020/021/022 marked passes:true
with commit refs
Phase A.2 complete: 3 stories shipped; codex review round 1 dispatched
in parallel for stop-rule satisfaction.
…+ P2/P3 wins
Codex round 1 verdict: 0 P0, 1 P1, 2 P2, 3 P3.
P1 (must-fix) — Vector 6: callback consume/mark_failed race
Problem: handler blindly re-verified state on handle_callback error,
then mark_failed'd the recovered request_id. A concurrent replay
hitting NotFoundOrConsumed would mark the original (still-in-flight)
flow as failed, clobbering the legitimate session JWT.
Fix: introduce CallbackError { inner, owned_request_id } so
handle_callback tags errors with whether THIS invocation owned the
consumed row. Pre-consume failures (state verify, expired, already-
consumed-by-concurrent) carry owned_request_id=None and the handler
no longer touches the row. Post-consume failures (provider-mismatch,
exchange_code error, verify_id_token error) carry the request_id and
the handler is entitled to mark_failed it.
Tests updated: tampered_state + replayed_state both assert
owned_request_id.is_none(); expired + wrong_aud assert
owned_request_id.is_some().
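The ownership tagging can be sketched as follows. `CallbackError`, `owned_request_id`, and `OAuth2Error::InvalidIdToken` are named in this commit; the other variants, the boolean-driven `handle_callback` signature, and `request_id_to_mark_failed` are simplifications invented for the example.

```rust
#[derive(Debug, PartialEq)]
enum OAuth2Error {
    StateInvalid,        // hypothetical variant: state HMAC did not verify
    NotFoundOrConsumed,  // row missing or already consumed by another caller
    InvalidIdToken(String),
}

/// Error tagged with whether THIS invocation consumed the pending row.
#[derive(Debug, PartialEq)]
struct CallbackError {
    inner: OAuth2Error,
    owned_request_id: Option<String>,
}

/// Toy model of the callback: the three booleans stand in for state
/// verification, the single-use consume, and id_token verification.
fn handle_callback(
    state_ok: bool,
    consumed_by_us: bool,
    id_token_ok: bool,
    request_id: &str,
) -> Result<(), CallbackError> {
    if !state_ok {
        // Pre-consume failure: the row was never touched.
        return Err(CallbackError { inner: OAuth2Error::StateInvalid, owned_request_id: None });
    }
    if !consumed_by_us {
        // A concurrent or replayed callback already owns the row.
        return Err(CallbackError { inner: OAuth2Error::NotFoundOrConsumed, owned_request_id: None });
    }
    if !id_token_ok {
        // Post-consume failure: this invocation owns the row.
        return Err(CallbackError {
            inner: OAuth2Error::InvalidIdToken("expired".to_string()),
            owned_request_id: Some(request_id.to_string()),
        });
    }
    Ok(())
}

/// The handler may mark a request failed only when the error carries an
/// owned request_id.
fn request_id_to_mark_failed(err: &CallbackError) -> Option<&str> {
    err.owned_request_id.as_deref()
}
```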
Closed P2 (Vector 10): /readyz now also checks oauth2 rate-limit store
- Added EmailRateLimitStore::writable() probe.
- OAuth2Auth::ready() returns Unready when oauth2_rate_limits.sqlite
is corrupt/unwritable.
Closed P3 (Vector 13): JWK kty/use validation in lookup_jwk()
- jwk_matches() now rejects non-RSA / non-sig keys with matching kid.
- Defense-in-depth — Google publishes only sig keys today.
Closed P3 (Vector 14): InvalidIssuer mapping in id_token verify
- jsonwebtoken ErrorKind::InvalidIssuer now maps to
OAuth2Error::InvalidIdToken('wrong issuer (iss claim)') rather
than the catch-all.
Rolled forward to V0.1-FOLLOWUPS.md:
- PA2-R1-F4 (P2): JWKS thundering-herd on kid miss → Phase D reliability.
- PA2-R1-F12 (P3): verify_state runs twice on callback error path → Phase D refactor.
cargo test -p agentkeys-broker-server --features auth-oauth2-google,auth-email-link: 190 passing (unchanged)
clippy -D warnings: clean
codex round 1 output: docs/spec/plans/issue-64/codex-phaseA2-round1.md
…026/027
Codex round 2 verdict: 1 P1 (Phase B preview) + 1 new P2 (Phase A.2) + 2 closures.
Phase A.2 round-2 closures (this commit):
- Vector 1 P1 CLOSED (CallbackError ownership tagging — verified by codex round 2).
- Vector 2 P2 CLOSED (rate-limit store readyz probe non-destructive).
Phase A.2 round-2 P2 fix (this commit):
- Vector 3: jwk_matches() now requires kty == 'RSA' exactly; empty kty
is rejected. Round 1 originally accepted empty kty for forward-compat
but round 2 escalated to fail-closed.
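A fail-closed key match of the kind described for Vector 3 might look like the sketch below. The `Jwk` struct is a hypothetical, trimmed-down type (field names follow RFC 7517), and rejecting a key whose `use` is absent is an assumption of this sketch; the commit message only states that non-RSA, non-sig, and empty-`kty` keys are rejected.

```rust
/// Hypothetical minimal JWK; the real type carries more fields (n, e, alg, ...).
pub struct Jwk {
    pub kid: String,
    pub kty: String,
    /// The JWK `use` member ("sig" or "enc"); `None` when absent.
    pub key_use: Option<String>,
}

/// Returns true only for an RSA signing key with the wanted kid.
pub fn jwk_matches(jwk: &Jwk, wanted_kid: &str) -> bool {
    if jwk.kid != wanted_kid {
        return false;
    }
    // Fail closed: an empty or unknown kty never matches.
    if jwk.kty != "RSA" {
        return false;
    }
    // Assumption: a missing `use` is also rejected in this sketch.
    matches!(jwk.key_use.as_deref(), Some("sig"))
}
```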
Phase B US-025: storage layer
- src/storage/grants.rs: GrantStore with create/revoke/list/lookup +
ATOMIC try_consume() (codex round-2 Vector 5 P1 fix — single SQL
UPDATE … WHERE grant_id = (SELECT … LIMIT 1) AND used_count <
max_uses RETURNING grant_id, audit_proof — no Rust-level peek-then-
update race window).
- 9 unit tests + 6 integration tests covering create→list→revoke,
cross-master rejection, expired/exhausted classification, atomic
increment ordering, most-recent-grant-wins.
Phase B US-026: HTTP endpoints
- src/handlers/grant/{create,revoke,list,mod}.rs:
- POST /v1/grant/create — master JWT required, mints audit_proof JWT,
rejects past expires_at + invalid daemon_address + max_uses<1.
- POST /v1/grant/revoke — master-scoped revoke, idempotent (re-revoke
returns 400 with collapsed not-found-or-not-owned message).
- GET /v1/grant/list — caller-owned grants only.
- require_session_jwt() helper extracts + verifies session bearer.
- src/jwt/issue.rs::mint_grant_audit_proof — ES256-signed JWT over
canonical grant content. iss/aud/iat/exp claims plus full
agentkeys.{kind,grant_id,master_omni_account,daemon_address,service,
scope_path,granted_at,expires_at,max_uses}. JSON now → CBOR Phase E
(V0.1-FOLLOWUPS R1-F3).
Phase B US-027: mint integration
- src/handlers/mint.rs::mint_v2 now calls grant_store.try_consume()
before STS. NoGrant → legacy implicit-grant fallback (Phase 0 mints
continue to work; Phase E flips to fail-closed). Revoked/Expired/
Exhausted → 401 Unauthorized, no STS call. Consumed → grant_id
written into AuditRecord.
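The outcome-to-action mapping above can be summarised as a single match. The variant names come from the commit message; the `grant_id` payload on `Consumed`, the `MintDecision` type, and `decide_mint` are hypothetical names introduced only for this sketch.

```rust
/// Variant names follow the commit message; the payload is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum GrantConsumeOutcome {
    Consumed { grant_id: String },
    NoGrant,
    Revoked,
    Expired,
    Exhausted,
}

/// Hypothetical decision type: either proceed to STS or reject up front.
#[derive(Debug, PartialEq)]
pub enum MintDecision {
    CallSts { audit_grant_id: Option<String> },
    Reject { http_status: u16 },
}

pub fn decide_mint(outcome: GrantConsumeOutcome) -> MintDecision {
    match outcome {
        // A consumed grant is recorded in the audit record.
        GrantConsumeOutcome::Consumed { grant_id } => {
            MintDecision::CallSts { audit_grant_id: Some(grant_id) }
        }
        // Legacy implicit-grant fallback: mint proceeds without a grant id.
        GrantConsumeOutcome::NoGrant => MintDecision::CallSts { audit_grant_id: None },
        // Explicitly dead grants are rejected before any STS call is made.
        GrantConsumeOutcome::Revoked
        | GrantConsumeOutcome::Expired
        | GrantConsumeOutcome::Exhausted => MintDecision::Reject { http_status: 401 },
    }
}
```

Keeping `NoGrant` as its own arm makes the planned Phase E flip to fail-closed a one-line change in this sketch.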
Boot wiring:
- src/boot.rs: GrantStore opened at /grants.sqlite alongside
wallets/auth_nonces. BootArtifacts.grant_store + main.rs AppState wiring.
- src/state.rs: pub grant_store: Arc<GrantStore>.
- src/storage/mod.rs: re-exports Grant + GrantConsumeOutcome + GrantStore.
Tests + 7 test-file AppState constructors patched: 205 passing
(was 190 in commit d37532a; +15 covers grant unit + 6 grant_flow + 9
fail_closed-related sub-flows in the existing suites).
clippy -D warnings: clean.
Codex round 1 + 2 outputs: docs/spec/plans/issue-64/codex-phaseA2-round{1,2}.md.
V0.1-FOLLOWUPS.md updated with PA2-R1-F4 (thundering-herd) + PA2-R1-F12
(duplicate verify_state) + PA2-R2-F3 (kty fail-closed → CLOSED in this commit).
…rade
Pre-Stage-7 → Stage-7 upgrades reliably refuse to boot with
`BOOT_FAIL: BROKER_SESSION_KEYPAIR_PATH=…/.agentkeys/broker/session-keypair.json: session keypair file does not exist`.
Plan §3.5.6 added a second ES256 keypair (purpose=session) and Plan §6
disables silent generation, so the operator was supposed to mint it
manually. However, the runbook and the boot error message both told them
to run `agentkeys-broker-server keygen`, which did not exist as a CLI
subcommand until d9bf541. Hosts upgraded in that window land in a crash
loop with no obvious recovery path.
This change adds an idempotent `ensure_broker_keypairs` helper that mints
whatever is missing under /var/lib/agentkeys/.agentkeys/broker/ as the
agentkeys system user (so files are owned correctly and chmodded 0600 by
the binary itself). Called in both code paths:
- upgrade mode: after the new binary is installed, before
'systemctl start agentkeys-broker', so a
Stage-7-binary-on-pre-Stage-7-keypairs upgrade self-heals.
- bootstrap mode: after the binary install + agentkeys user creation,
before 'systemctl enable --now', so first boot on a fresh host doesn't
depend on the operator remembering keygen at all.
Existing keypairs are left in place (the helper checks file presence
before minting). The OIDC keypair's pre-Stage-7 untagged JSON shape is
still accepted by OidcKeypair::load (legacy migration path), so we don't
trample it.
Smoke (manual): bash -n passes; helper exits early with a clear message if
the agentkeys user doesn't exist yet, so calling order is enforced.
PHASE-0-CHECKPOINT.md covers Phase 0 in isolation against localhost. This guide is the production equivalent — full Stage 7 (Phases 0 + A.1 + A.2 + B + C-structural + D-rest + E) running on a real EC2 broker host with the AWS account from cloud-setup.md. Sections walk an operator through: - Two-machine layout (operator workstation vs broker host) with inline === ON … === banners on every command block. - Prerequisites checklist (cloud-setup.md §0–4 done, broker host bootstrapped, two cast-generated test wallets). - /healthz + /readyz + OIDC discovery + JWKS + IAM-side OIDC provider cross-checks (with the byte-for-byte issuer match invariant). - SIWE wallet auth round-trip for both wallets, signing with cast wallet sign (no --no-hash). - /v1/mint-oidc-jwt → AssumeRoleWithWebIdentity manual path, decoding the https://aws.amazon.com/tags claim. - Cloud-enforced isolation proof (the climax): wallet A reads its own prefix; wallet B's prefix returns AccessDenied from S3 itself, not app code. Includes the diagnostic-state runbook for both failure modes (own-prefix denied → JWT missing tag claim; other-prefix succeeds → cloud-setup.md §4.4.1 not applied; this is the silent-pass bug PR #69 fixed at the broker layer). - /v1/mint-aws-creds the daemon path with audit_record_id + anchored fields. - Capability grants (create / list / revoke), wallet linking + unauthenticated recover/lookup, email-link + OAuth2/Google flows. - Audit log inspection (sqlite plugin_mint_log columns explained). - Phase C EVM anchor (structural-only in v0; live alloy lands in V0.1-FOLLOWUPS hardening). - Prometheus metrics + Idempotency-Key (hit/miss/422 cases). - harness/stage-7-issue-64-done.sh as the programmatic gate. - Failure-mode walk-through: BOOT_FAIL anchor table, InvalidIdentityToken triage, AccessDenied-on-own-prefix, 24h-clean-exit + Restart=always. 
- 'What's intentionally not yet live' section pointing at V0.1-FOLLOWUPS.md so operators know which structural features ship as stubs (live EVM anchor, TEE signer, fail-closed grants default, latency histograms). 860 lines. All 6 cross-referenced files exist (verified).
…71 Option B)
Pre-fix, both mint paths called `state.sts.assume_role(...)`, the legacy
`sts:AssumeRole` action that requires the broker's static IAM credentials.
cloud-setup.md §4.2 swaps the role's trust policy from
`Principal: {AWS: agentkeys-daemon}` to
`Principal: {Federated: oidc-provider}` (replace, not append), so on every
cloud account that has actually run §4 the mint endpoint returned 502
`sts_error` / `AccessDenied`. The §4.5 'End-to-end proof' silently bypassed
this by going /v1/mint-oidc-jwt → manual
`aws sts assume-role-with-web-identity`. That path worked, but the
integrated daemon path didn't, leaving Phase B (grants) / Phase C (audit +
rate limit + EVM anchor) / Phase D-rest (idempotency) unreachable on
federated deployments.
This is issue #71 Option B: keep the wire shape, pivot the internal STS
call to AssumeRoleWithWebIdentity. The mint endpoint now:
1. Authenticates the caller (session JWT or legacy bearer): unchanged.
2. Resolves Phase B grant: unchanged.
3. Mints a per-call user-scoped OIDC JWT (same shape as /v1/mint-oidc-jwt;
lowercases the wallet for PrincipalTag match; carries the
`https://aws.amazon.com/tags` claim).
4. Calls `sts:AssumeRoleWithWebIdentity` with that JWT.
5. Writes audit anchor: unchanged.
6. Returns creds: unchanged response shape.
Side benefit: the broker no longer needs an IAM principal at runtime for
the mint flow. The legacy `agentkeys-daemon` IAM user keys / AWS_PROFILE /
instance profile are still consulted only for the optional startup
`caller_identity_ok` probe. A future Option A migration (daemon-side
AssumeRoleWithWebIdentity, retire the route) will drop them entirely.
Code changes: - sts.rs: add StsClient::assume_role_with_web_identity; AwsStsClient impl wraps aws-sdk-sts `.assume_role_with_web_identity()`; StubStsClient reuses its existing `assume` closure for both methods so test fixtures (StubStsClient::ok, ::failing, ::assume_failing) don't need any updates — only the file that explicitly counts STS calls (invariant_load_bearing) needed the new method added. - handlers/oidc.rs: extract `pub(crate) fn build_oidc_jwt_claims` so the existing /v1/mint-oidc-jwt and the new internal mint path share a single canonical claim builder. The wallet is lowercased so the PrincipalTag matches the bucket policy's lowercase resource ARNs. - handlers/mint.rs: both mint_v2 and mint_legacy mint internal JWT via the new helper, then call `assume_role_with_web_identity`. - tests/invariant_load_bearing.rs: CountingStsClient implements both methods so 'zero STS calls' assertion is path-agnostic. Test totals (--features audit-evm,auth-email-link,auth-oauth2-google): 258 passed, 0 failed. Harness gate: bash harness/stage-7-issue-64-done.sh exits 0. Clippy clean with -D warnings. Doc updates land alongside (operator-runbook-stage7.md gains a 'Mint-time STS path' subsection under §AWS IAM Trust; stage7-demo-and-verification.md §5 explains the pivot; "What's not yet live" section flags the daemon-side Option A follow-up so the eventual route retirement is tracked).
…umeRole/static-IAM-user paths (issue #71 Option A) Migrate the auto-provision pipeline from /v1/mint-aws-creds (server-side aggregator) to /v1/mint-oidc-jwt + client-side AssumeRoleWithWebIdentity, and strip the legacy code surfaces issue #71 made redundant. CALLER-SIDE MIGRATION - crates/agentkeys-provisioner/src/aws_creds.rs: rewrite fetch_via_broker to do the JWT-fetch + AssumeRoleWithWebIdentity in two steps. New fetch_oidc_jwt() helper for unit-test isolation; assume_role_with_jwt() uses anonymous SDK config (the JWT authenticates the call, no broker AWS principals participate). New fetch_via_broker_default_ttl() convenience overload (3600s). - crates/agentkeys-provisioner/Cargo.toml: add aws-config, aws-credential-types, aws-sdk-sts deps. - crates/agentkeys-mcp/src/lib.rs: thread AGENTKEYS_DATA_ROLE_ARN + AWS_REGION through McpHandler. Updated broker_env_for_provision to call fetch_via_broker_default_ttl. Test fixture rewrites: drop /v1/mint-aws-creds mock; mock /v1/mint-oidc-jwt and assert STS-step error using AWS_ENDPOINT_URL_STS=http://127.0.0.1:1. - crates/agentkeys-cli/src/lib.rs: same env-var threading + signature bump for fetch_via_broker_default_ttl. LEGACY CODE REMOVAL - crates/agentkeys-broker-server/src/handlers/mint.rs: drop mint_legacy handler + looks_like_session_jwt dispatcher. mint_aws_creds always routes through mint_v2 (session-JWT path). Drop validate_bearer_token import (no longer used by any mint path). - crates/agentkeys-broker-server/tests/mint_flow.rs: deleted (legacy- only tests). mint_v2_flow.rs remains for the surviving aggregator. - crates/agentkeys-broker-server/src/sts.rs: drop StsClient::assume_role trait method, AwsStsClient::assume_role impl, AwsStsClient::from_keys ctor. Trait now only has assume_role_with_web_identity + caller_identity_ok. Simplify StubStsClient (single closure + identity). 
- crates/agentkeys-broker-server/src/env.rs: drop DAEMON_ACCESS_KEY_ID, DAEMON_SECRET_ACCESS_KEY, BROKER_DAEMON_ACCESS_KEY_ID, BROKER_DAEMON_SECRET_ACCESS_KEY constants + their all() entries. - crates/agentkeys-broker-server/src/config.rs: drop daemon_access_key_id / daemon_secret_access_key fields + their env-reading logic + struct construction. - crates/agentkeys-broker-server/src/main.rs: drop static-IAM-user branch. Always use AwsStsClient::with_default_chain. Startup STS check is now soft-fail (warn) — broker no longer needs creds for the mint flow, so the probe is informational only. - crates/agentkeys-broker-server/src/boot.rs + 7 test files: strip daemon_* fields from BrokerConfig fixtures. - crates/agentkeys-broker-server/tests/invariant_load_bearing.rs: CountingStsClient drops assume_role method (only assume_role_with_web_identity). DOC UPDATES - docs/operator-runbook-stage7.md: drop DAEMON_* rows from Legacy aliases table. AWS IAM Trust §'Mint-time STS path' rewritten to describe both endpoints (daemon-side /v1/mint-oidc-jwt + server-side aggregator /v1/mint-aws-creds), with explicit 'broker creds-free posture' note. - docs/stage7-demo-and-verification.md §5 rewritten to show both paths. New §5.3 documents the auto-provision pipeline using AGENTKEYS_BROKER_URL + AGENTKEYS_DATA_ROLE_ARN. New §16 'Live walkthrough on broker.litentry.org' — copy-paste runbook for end-to-end verification (deploy, creds-free check, SIWE auth, /v1/mint-oidc-jwt, AssumeRoleWithWebIdentity, S3 isolation proof, auto-provision pipeline, audit log inspection). §15 'What's not yet live' updated — issue #71 Option A's caller-side migration is done; only the route retirement itself remains as future work. VERIFICATION (local) - cargo build -p agentkeys-broker-server (--no-default-features +auth-wallet-sig,wallet-keystore,audit-sqlite, and full feature combo): exits 0 (verified by harness). 
- cargo test -p agentkeys-broker-server --features audit-evm,auth-email-link,auth-oauth2-google: 247 passed, 0 failed. - cargo test -p agentkeys-provisioner -p agentkeys-mcp -p agentkeys-daemon: 61 passed, 0 failed. - cargo clippy --workspace --all-features -- -D warnings: clean. - bash harness/stage-7-issue-64-done.sh: exits 0 (all 5 phase smokes green, load-bearing 7/7, runbook drift clean, prd.json 41/41). - npm test --prefix provisioner-scripts: 42/45 passing. The 3 failing tests in src/lib/email.test.ts hit real S3 against agentkeys-mail-429071895007 and fail because the local agentkey-broker IAM profile lacks s3:ListBucket — pre-existing test-environment issue, unrelated to this migration. VERIFICATION (live, deferred to operator) - The live walkthrough against https://broker.litentry.org requires SSH to the broker host + admin AWS profile, both of which the operator must run. Documented as docs/stage7-demo-and-verification.md §16 copy-paste runbook.
…+m2) Critic on commit b0c6515 returned ACCEPT-WITH-RESERVATIONS with two MAJOR + four MINOR findings. This commit addresses M1, M2, m1, m2. M1 — `build_session_name` mismatch between provisioner and broker. The provisioner used `agentkey-{wallet}` (no timestamp, lowercase prefix); the broker uses `agentkeys-{wallet}-{secs}-{micros}`. The comment claimed they mirrored each other, but they didn't. CloudTrail correlation between broker-minted and daemon-minted sessions would have failed, and rapid same-wallet mints on the daemon side would have collided on session name (AWS returns the same temp creds for repeated same-name calls within DurationSeconds). Fix: replace the provisioner's algorithm with a byte-for-byte mirror of the broker's. Imports SystemTime + UNIX_EPOCH. Tests updated: build_session_name_matches_broker_format, _strips_unsafe_chars, _handles_empty_wallet (mirroring the broker's test cases). M2 — `scripts/setup-broker-host.sh` still emitted DAEMON_* env vars. The script offered a "static" credential mode that wrote `/etc/agentkeys/broker.env` with DAEMON_ACCESS_KEY_ID + DAEMON_SECRET_ACCESS_KEY — vars the broker no longer reads after the OIDC-only migration. An operator following the script would have set those vars, restarted the broker, seen no error, and silently been running on the SDK default chain (which on a creds-free host has no creds). Confusing failure mode. Fix: - Drop the "static" cred-mode option entirely (validation, prompts, case statements, broker.env emission, post-install instructions). - Add a new "none" cred-mode (default, recommended post-migration) that runs the broker creds-free. - Update the cred-mode walkthrough to describe the post-issue-#71 posture (broker doesn't need creds for the mint flow itself, only the optional GetCallerIdentity startup probe). - Update the systemd CRED_LINE case statement. 
- Update the post-install log-line check to look for the new "STS client: SDK default chain (creds optional after issue #71 …)" message instead of the removed "AWS credentials: static IAM-user keys". - Replace REPLACE_WITH_DAEMON_AKID / REPLACE_WITH_DAEMON_SECRET placeholders in the named-profile credentials file with the more neutral REPLACE_WITH_ACCESS_KEY_ID / REPLACE_WITH_SECRET_ACCESS_KEY. m1 — `docs/operator-runbook.md` (the pre-Stage-7 runbook, separate from operator-runbook-stage7.md) still described `/v1/mint-aws-creds` as using `sts:AssumeRole` and listed `DAEMON_ACCESS_KEY_ID` / `DAEMON_SECRET_ACCESS_KEY` as a configuration option. Fix: add a top-of-doc banner pointing operators at the Stage-7 runbook for the current build, update the endpoints table, drop the "Static keys (legacy)" §2.3 content, and remove the DAEMON_* row from the env table. m2 — `crates/agentkeys-broker-server/src/handlers/oidc.rs::build_oidc_jwt_claims` doc comment still listed `mint_legacy` as a caller. Removed. Verification: - cargo build --workspace clean. - cargo test -p agentkeys-provisioner: 23 passed, 0 failed (was 21 before; 3 new build_session_name_* tests, -1 obsolete one). - bash harness/stage-7-issue-64-done.sh: exits 0; all 5 phase smokes green; load-bearing 7/7; runbook drift clean; prd.json 41/41. - bash -n scripts/setup-broker-host.sh: syntax clean. Critic minor findings deferred: - m3 (env::set_var thread-safety in MCP test): pre-existing pattern acknowledged. Tracked for a future cargo-nextest migration. - m4 (AwsTempCreds Deserialize derive lost): intentional and correct — the struct is now constructed programmatically from the STS response, not deserialized from JSON. - m5 (AnonymousCredentials TODO for SDK bump): added to comment. 
The two open questions critic raised: - AwsStsClient with default chain calling AssumeRoleWithWebIdentity on a creds-free host: deferred to live walkthrough verification (the SDK skips signing for federated STS operations regardless of resolver state). - 3 failing npm tests in src/lib/email.test.ts: confirmed pre-existing (real-S3 calls failing due to local agentkey-broker IAM lacking s3:ListBucket); unrelated to this migration.
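For the M1 fix, a session-name builder in the `agentkeys-{wallet}-{secs}-{micros}` format could be sketched as follows. Only the format string comes from the commit message. The sanitisation set (the characters AWS permits in `RoleSessionName`), the 64-character cap, the `unknown` placeholder for an empty wallet, and the `build_session_name_at` helper are assumptions of this sketch and may differ from what the broker and provisioner actually do.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// AWS caps RoleSessionName at 64 characters.
const MAX_SESSION_NAME_LEN: usize = 64;
const PREFIX: &str = "agentkeys-";

/// Deterministic core: takes the time since the Unix epoch as a parameter
/// so that it can be unit-tested. Hypothetical helper, not from the source.
pub fn build_session_name_at(wallet: &str, since_epoch: Duration) -> String {
    // Keep only characters AWS accepts in a session name: [\w+=,.@-].
    let safe: String = wallet
        .chars()
        .filter(|c| c.is_ascii_alphanumeric() || "_+=,.@-".contains(*c))
        .collect();
    // Assumption: an empty wallet falls back to a fixed placeholder.
    let safe = if safe.is_empty() { "unknown".to_string() } else { safe };
    let suffix = format!("-{}-{}", since_epoch.as_secs(), since_epoch.subsec_micros());
    // Assumption: truncate the wallet part so the whole name fits the cap.
    let budget = MAX_SESSION_NAME_LEN.saturating_sub(PREFIX.len() + suffix.len());
    let wallet_part: String = safe.chars().take(budget).collect();
    format!("{}{}{}", PREFIX, wallet_part, suffix)
}

/// Wall-clock wrapper; the timestamp keeps rapid same-wallet mints distinct.
pub fn build_session_name(wallet: &str) -> String {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default();
    build_session_name_at(wallet, now)
}
```

Splitting out the deterministic core is one way to let both sides of a byte-for-byte mirror share identical test vectors.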
Ralph step 7.5 mandatory deslop pass on the changed-file scope. -33 net
LOC of redundant prose; behavior unchanged.
- crates/agentkeys-provisioner/src/aws_creds.rs: collapse 27-line file
header ("Why client-side STS?" multi-paragraph) to 8 lines pointing
at issue #71. Trim AnonymousCredentials struct doc + the verbose
inline comment in assume_role_with_jwt; replace with a 3-line TODO
flagging the future aws-config 1.5+ no_credentials() helper (critic
m5 follow-up).
- crates/agentkeys-broker-server/src/handlers/mint.rs: trim 5-line
preamble inside mint_aws_creds dispatch to a 3-line note. Trim 8-line
STS-path explanation block in mint_v2 step 6 to 4 lines (the points
are already covered by the surrounding code).
- crates/agentkeys-broker-server/src/main.rs: rewrite stale
"preserved through US-011" comment on AuditLog::open to describe
what the legacy log actually does in the post-migration build.
Verification post-deslop:
- cargo build --workspace: clean.
- cargo test -p agentkeys-provisioner: 23 passed, 0 failed.
- bash harness/stage-7-issue-64-done.sh: exits 0; all phases green;
41/41 PRD stories; runbook drift clean.
…ess scope only Operators reported that scripts/broker.env set BUCKET on the broker host, but the broker process never reads BUCKET (`grep -n '"BUCKET"' src/env.rs` — zero hits). It's an operator-workstation var used by AWS S3 admin tooling (cloud-setup.md §4.5 isolation proof, scripts/stage6-demo-env.sh) that shouldn't leak onto the broker host. Same story for BROKER_HOST and ACCOUNT_ID: - BROKER_HOST is decorative — broker reads BROKER_OIDC_ISSUER directly. - ACCOUNT_ID is the legacy ARN-derivation fallback for BROKER_DATA_ROLE_ARN; redundant when BROKER_DATA_ROLE_ARN is set explicitly (it already is). This file is now scoped to ONLY the env vars that map to constants in crates/agentkeys-broker-server/src/env.rs. The docstring at the top explicitly calls out the workstation-vs-broker-host scope split so this kind of leakage doesn't recur. scripts/setup-broker-host.sh required no change — it has zero BUCKET references already (verified).
…tion-side companion to broker.env)
Three things:
1. **Archive Stage 6 scripts.** We're in Stage 7 test phase and the
pre-Stage-7 demo scripts are now broken anyway (they hard-code
sts:AssumeRole against the data role's pre-§4 trust policy, which
was OIDC-federated by cloud-setup.md §4.2). Move them out of the
active tree:
- scripts/stage6-demo-env.sh → scripts/archived/
- scripts/stage6-demo-run.sh → scripts/archived/
- scripts/stage6-inspect-email.sh → scripts/archived/
- provisioner-scripts/scripts/weekly-live-test.sh →
provisioner-scripts/scripts/archived/ (depended on the dropped
DAEMON_* env wiring + assume-role pattern)
New scripts/archived/README.md cross-references the Stage 7
replacements (operator-workstation.env, agentkeys-cli provision,
inspect-inbound-email.sh).
2. **Add scripts/operator-workstation.env.** Workstation-side companion
to scripts/broker.env (broker-host scope). Sets ACCOUNT_ID, REGION,
BROKER_HOST, BUCKET, OIDC_ISSUER, OIDC_PROVIDER_ARN, DATA_ROLE_ARN —
exactly the vars docs/stage7-demo-and-verification.md §0 expects.
Operators source this on their laptop via
'set -a; source scripts/operator-workstation.env; set +a' before
running the §16 walkthrough or any AWS admin command. Replaces the
inline export block that was at §0 of the demo guide.
3. **Add scripts/inspect-inbound-email.sh.** Stage 7 replacement for
stage6-inspect-email.sh. Same logic (quoted-printable normalize +
header/body/href/URL extraction with the regex the broker auth
handler uses) but reads $BUCKET from the workstation env instead
of the dropped Stage-6 AGENTKEYS_SES_BUCKET / DAEMON_* wiring.
Now referenced from the new §8.1 'Debugging — inspecting the
inbound email at S3' section in the demo guide.
Doc updates:
- docs/stage7-demo-and-verification.md: §0 prerequisites now points
at scripts/operator-workstation.env instead of inlining the
exports; §16.5 references $DATA_ROLE_ARN and $OIDC_ISSUER from
the sourced file rather than re-exporting them; new §8.1 'Debugging
— inspecting the inbound email at S3' subsection.
- docs/dev-setup.md: drop two stage6-demo-env.sh references
(the §4.1 'no env scripting' line and §4.3 'still works without it'
line) + the troubleshooting row pointing at stage6-demo-run.sh.
- scripts/broker.env docstring: explicitly cross-reference
scripts/operator-workstation.env so the workstation-vs-host scope
split is documented in both files.
Source updates:
- crates/agentkeys-cli/src/lib.rs (×2): drop dead 'stage6-demo-env.sh'
filename references in doc comments, replaced with
'pre-Stage-7 fallback' / 'no manual AWS_* env wiring required' prose.
- crates/agentkeys-cli/src/main.rs: --broker-url help text now describes
the actual flow (/v1/mint-oidc-jwt + AssumeRoleWithWebIdentity)
instead of pointing at the removed shell script.
- crates/agentkeys-mcp/src/lib.rs: same prose cleanup on broker_url field.
- crates/agentkeys-daemon/src/main.rs: --broker-url doc comment
rewritten to describe the new flow (was still describing
/v1/mint-aws-creds with bearer-validated path).
Verification:
- env -i bash -c 'source scripts/operator-workstation.env; echo $BUCKET'
→ agentkeys-mail-429071895007 (clean load, no leaks).
- env -i bash -c 'source scripts/broker.env; echo $BUCKET'
→ unset (broker host correctly does NOT get the workstation var).
- bash -n scripts/inspect-inbound-email.sh: syntax clean.
- cargo build --workspace: clean.
- grep 'stage6-demo-env\|stage6-demo-run\|stage6-inspect-email' on the
active tree (excluding archived/): zero hits.
…ivate_key
Operator hit
`jq: error (at /tmp/wallet-A.json:6): Cannot index array with string "private_key"`
following docs/stage7-demo-and-verification.md §0.
`cast wallet new --json` (Foundry) returns a JSON ARRAY of wallet objects,
not a single object. The wallet metadata is at `.[0]`, not the document
root. Same fix applies to `address` extraction.
… setup-broker-host.sh
Drop the early-return --upgrade code path. The script now follows a single
linear flow that auto-detects fresh-host vs existing-deploy by reading
Environment= lines from /etc/systemd/system/agentkeys-broker.service when
present. Same invocation works in both states.
Concrete changes:
1. Delete the if $UPGRADE_MODE; then ... exit 0; fi block (~130 LOC). The
salvageable bits (git pull, branch-switch warning, stop+swap) move into
the main flow.
2. Add 'Detect existing config from systemd unit' step right after
pre-flight. Reads BROKER_OIDC_ISSUER, ACCOUNT_ID, REGION, and AWS_PROFILE
→ fills in CLI flags the operator didn't pass. After first install, every
subsequent run can be 'bash setup-broker-host.sh --yes' with no other
flags.
3. --ref / --skip-pull are now opt-in. Default = build whatever's currently
checked out (operator handles git themselves). Pass --ref <branch-or-tag>
to opt into a fetch+checkout+pull step (useful for unattended CI
redeploys). Branch-switch warning fires when the resolved ref differs from
the current branch.
4. --upgrade flag is now a back-compat no-op (silently accepted but does
nothing; the script is idempotent regardless).
5. Binary install step now stops services before swap (idempotent, no-op on
fresh hosts), backs up existing binaries to .bak (skip on fresh hosts),
then installs new ones. Both binaries (mock-server + broker-server) are
always rebuilt + reinstalled.
6. Final step uses 'enable + restart' instead of 'enable --now'. restart is
idempotent: starts a stopped service, refreshes a running one. Picks up
unit-file changes from step 5 + any binary change in step 3.
7. Add post-install verification: tail journalctl, probe loopback /healthz
on both ports, so the operator sees immediate success/failure without an
extra command.
Header comment rewritten to reflect single-flow design.
CLAUDE.md gains a 2-line 'Remote broker host (single entry point)' section:
all remote-host changes MUST go through this script, with no ad-hoc
systemctl edits and no hand-built scp. This is the convention for every
future remote change in the project.
Net: -58 LOC, +1 idempotent flow, +1 doc rule. bash -n syntax clean.
…d` under set -e
Operator on broker.litentry.org reported the script printing
"Detected existing broker unit at … — reading config" then exiting
silently.
Cause: the previous detection block used the `[[ test ]] && cmd` pattern at
the top level. Under `set -e`, when the test is false, the whole compound
returns 1 and the script exits. Specifically:
    [[ -n "$EXISTING_REGION" ]] && REGION="$EXISTING_REGION"
When the existing systemd unit didn't have an `Environment=REGION=…` line
(common after the post-issue-#71 deploy that drops legacy aliases),
$EXISTING_REGION was empty, the test failed, the && short-circuited, the
line returned 1, and set -e killed the script. Note that bash only applies
errexit to the final command of an && list, so the failure surfaces where
the list's status 1 propagates (for example as a function's return status
or a command-substitution result).
Fix:
- Convert all four detection conditionals to explicit `if`/`fi` blocks.
set -e exempts commands inside `if test; then …; fi` so a false test no
longer terminates.
- Harden `read_unit_env` itself: wrap the grep|head|sed pipeline in
`{ … } || true` so a missing key returns empty under set -e + pipefail
instead of propagating grep's no-match exit code.
- Add a comment at the top of the block calling out the gotcha so the next
person editing this code doesn't reintroduce it.
Verified locally with `set -euo pipefail` against a unit file that has
ISSUER but lacks REGION + ACCOUNT_ID:
    ISSUER_URL=https://broker.litentry.org
    ACCOUNT_ID=(empty)
    REGION=us-east-1
    CRED_MODE=(empty)
    OK — no silent exit
bash -n syntax clean.
Operator on broker.litentry.org reported the script still asking
unnecessary questions on a re-run. The host already has OIDC enabled, nginx
in place, and the post-issue-#71 creds-free posture, so all four remaining
prompts (cred-mode, region, nginx, certbot) were noise.
Three changes make the silent re-deploy actually silent:
1. Detection block now defaults CRED_MODE to 'none' when the existing unit
has no AWS_PROFILE. Pre-fix, CRED_MODE stayed empty and triggered the
cred-mode prompt; post-fix, the post-issue-#71 default fills in
automatically.
2. Drop the cred-mode / region / nginx / certbot prompt blocks from the
interactive walkthrough. They're now opt-in via CLI flags only:
    --cred-mode {none|instance-profile|profile} (default: none)
    --region us-east-1 (default: us-east-1)
    --with-nginx | --without-nginx (default: no)
    --with-certbot | --without-certbot (default: no)
On a fresh-host bootstrap that genuinely needs nginx + certbot, the
operator passes those flags. On the common remote-host re-deploy case, no
prompts fire.
3. Flip the validate-inputs default for CRED_MODE from 'instance-profile' to
'none' (matching the new silent default), and convert the
WITH_NGINX/WITH_CERTBOT 'auto → no' resolution from '[[ ]] && cmd' to
'if/fi' to dodge the same set-e silent-exit gotcha that bit the detection
block.
Verified locally: existing unit + no flags + --yes → no prompts, detection
fills in everything, summary + execute proceed silently.
    detected: ISSUER_URL=https://broker.litentry.org ACCOUNT_ID=429071895007 REGION=us-east-1 CRED_MODE=none
    final: WITH_NGINX=no WITH_CERTBOT=no
    OK — would proceed silently to summary + execute, no prompts
…k8s-style name
The broker's Tier-2 reachability probe (spawn_tier2_probes in
agentkeys-broker-server/src/main.rs) hits BROKER_BACKEND_URL/healthz, the
Kubernetes convention. The mock-server only registered /health, so the
probe always returned 404 and the broker logged
'Tier-2 backend probe: unreachable' every 15s while /readyz stayed at 503.
Operator on broker.litentry.org saw this in journalctl plus an empty
'curl -sf .../healthz; echo' (with -sf, curl printed nothing: -f suppresses
the body of a non-2xx response and -s hides the error message).
Add /healthz as a parallel route. Keep /health as an alias so any
pre-Stage-7 caller that wired itself to /health doesn't break.
After this commit + a redeploy via setup-broker-host.sh, the broker's
/readyz transitions from 'unready' (tier2/backend) to 'ready' within ~15s
of restart.
cargo build -p agentkeys-mock-server: clean.
cargo test -p agentkeys-mock-server: 5 + 56 = 61 passed, 0 failed.
…url probes informative
Two related cleanups for the endpoint name + UX:
1. **Single name across the codebase: `/healthz`** (Kubernetes convention,
matches what the broker's Tier-2 reachability probe actually hits).
- mock-server: drop the `/health` alias added in 77fbce2. Only `/healthz`
remains. Confirmed zero callers expected `/health` (grep across crates/
showed no consumers).
- broker-server handlers/health.rs (dead code per V0.1-FOLLOWUPS R1-F10 but
kept for now): change the backend probe URL from `/health` to `/healthz`
for consistency.
2. **Make `curl … /healthz` probes self-explanatory.** The `curl -sf`
pattern prints nothing on a non-2xx response (-f suppresses the body, -s
hides the error message) and only prints the body on success. When
operators hit a 404 or the wrong port, they see nothing, which is the
failure mode that prompted this fix on broker.litentry.org. Replace with
`curl -sS -o /dev/null -w 'HTTP %{http_code}\n'` so the response status
always prints, regardless of outcome:
- docs/stage7-demo-and-verification.md §0 healthz curl
- scripts/setup-broker-host.sh post-install smoke-test hint
After this commit + a redeploy:
- mock-server's only health endpoint is `/healthz`.
- broker's Tier-2 probe (already targeting `/healthz`) finds the endpoint
and `/readyz` flips to "ready".
- demo-guide §0 shows `HTTP 200` (or whatever) instead of empty output, so
operators know exactly what they got.
cargo build -p agentkeys-mock-server -p agentkeys-broker-server: clean.
cargo test (both crates): 222 passed, 0 failed.
…-describing
- Delete crates/agentkeys-broker-server/src/handlers/health.rs (unrouted; the
router has used handlers::broker_status::readyz since Phase 0).
- /readyz green-path body changes from {} to {"status":"ready","degraded":
false,"checks":[],"ready":[...]}. The dead code was the source of the
wrong-shape doc copy that claimed /readyz returned {"status":"ready"}.
- docs/stage7-demo-and-verification.md §1 + §16.3 updated to show the actual
three-shape response and use 'jq -r .status' as the green-path verdict.
- CLAUDE.md adds a branch-push policy: on the evm branch, push immediately
after every code/doc update so scripts/setup-broker-host.sh --upgrade
doesn't silently pick up a stale revision.
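The `jq -r .status` verdict mentioned above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `readyz_is_green` is made up for illustration, and treating only the exact string `ready` as green is an interpretation of "green-path verdict", not something the commit spells out.

```shell
# readyz_is_green is a hypothetical helper, not part of the repo.
# It reads a /readyz response body on stdin and succeeds only when the
# status field is exactly "ready"; any other value is reported on stderr.
readyz_is_green() {
  verdict=$(jq -r '.status')
  if [ "$verdict" = "ready" ]; then
    printf 'ready\n'
  else
    printf 'not green: status=%s\n' "$verdict" >&2
    return 1
  fi
}

# Example with a body in the green-path shape (requires jq, so it is left
# commented out; the "ready" array is shown empty for brevity):
# printf '%s' '{"status":"ready","degraded":false,"checks":[],"ready":[]}' \
#   | readyz_is_green
```

The variable is named `verdict` rather than `status` because zsh reserves `status` as a read-only alias for `$?`.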
zsh's builtin echo interprets \n (two ASCII chars '\' + 'n') as a literal 0x0A newline. The broker's /v1/auth/wallet/start response embeds \n inside the siwe_message JSON string as a JSON escape, so the long-standing 'echo "$START" | jq' pattern silently corrupts those escapes into raw newlines and jq fails with:
  Invalid string: control characters from U+0000 through U+001F must be escaped at line 13, column 33
Replaced 25 occurrences across §2-§16. printf '%s' is portable across bash and zsh and never re-interprets escapes. Added a note in §0 explaining the choice so a future maintainer doesn't 'fix' it back.
Verified live against https://broker.litentry.org/v1/auth/wallet/start:
- echo $START | jq → parse error (zsh)
- printf '%s' "$START" | jq → siwe-d437073077a2792b327836eac893fd83 ✓
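The behaviour described in this commit can be checked offline with a stand-in payload. The JSON below is a hypothetical sample, not a real broker response; only the `siwe_message` field name comes from the commit message.

```shell
# Hypothetical stand-in for the /v1/auth/wallet/start response: the
# siwe_message value contains the two-character JSON escape \n, not a newline.
START='{"siwe_message":"example.org wants you to sign in\nNonce: abc123"}'

# printf '%s' writes its argument byte-for-byte, so the \n escape reaches the
# next command intact. wc -l counts newline bytes; none were introduced, so
# the count is 0.
printf '%s' "$START" | wc -l

# Piping to jq then behaves the same in bash and zsh (requires jq):
# printf '%s' "$START" | jq -r .siwe_message
```

Using `wc -l` as the check keeps the demonstration independent of jq and of which shell's `echo` is in play.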
Reproduce reported failures locally and isolate the layer (shell, tooling, doc, code) before editing. If the cause is local, respond with the one-line fix; only edit when the cause is in the repo. Keep responses concise.
…0 checkpoint
Same echo→printf '%s' fix as b80ec39, applied to the 5 remaining occurrences in cloud-setup.md (3), stage7-wip.md (1), PHASE-0-CHECKPOINT.md (1).
The previous bulk fix (b80ec39, 8b50c1d) used a Python raw-string regex replacement that left literal backslashes around the quotes:
  printf '%s' \"$START\" | jq ← was committed
  printf '%s' "$START" | jq ← what users actually need
The shell sees \" as literal " plus the surrounding quoting, producing "<JSON>" which jq can't parse ("Invalid numeric literal").
Stripped from 30 lines across 4 docs (stage7-demo, cloud-setup, stage7-wip, PHASE-0-CHECKPOINT). Also moved the printf rationale callout from inside the §0 bullet list (where it broke list rendering) to right before §1, and expanded it to call out the backslash-quote trap explicitly.
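The backslash-quote trap can be reproduced without jq or a broker. In this sketch the payload is a hypothetical sample chosen to contain no whitespace, so word splitting does not add a second source of corruption.

```shell
START='{"nonce":"abc123"}'   # hypothetical sample payload, no spaces

# Mistaken form: \" is an escaped literal quote character, so $START is
# expanded unquoted and wrapped in two extra quote characters.
broken=$(printf '%s' \"$START\")

# Intended form: plain double quotes are shell syntax and are not emitted.
fixed=$(printf '%s' "$START")

printf 'broken: %s\n' "$broken"   # broken: "{"nonce":"abc123"}"
printf 'fixed:  %s\n' "$fixed"    # fixed:  {"nonce":"abc123"}
```

The broken form also leaves `$START` unquoted, so a payload containing spaces would additionally be split into several `printf` arguments.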
…owing them
curl -sf returns exit 22 on 4xx/5xx but DISCARDS the response body and prints nothing to stderr. Operators following the demo doc see an empty $START / empty $VERIFY / empty $JWT and have no signal what went wrong.
--fail-with-body (curl >=7.76, ships in macOS curl 8.7+) keeps the same fail-on-non-2xx behaviour but PRINTS the body, so a 401 'bad nonce' or 400 'malformed wallet address' is visible immediately.
45 occurrences across 4 docs (stage7-demo, cloud-setup, operator-runbook, stage7-wip). The single `curl -sf … && echo` reference in the §1 comment is intentional; it's documenting the anti-pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
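One way to use `--fail-with-body` when capturing a response into a variable is sketched below. The helper name `capture_or_report` is hypothetical; it passes its arguments straight to curl, so the URL, method, and payload remain the caller's choice.

```shell
# capture_or_report is a hypothetical helper, not part of the repo.
# Requires curl >= 7.76 for --fail-with-body.
capture_or_report() {
  # On 4xx/5xx curl still writes the body to stdout but exits non-zero, so
  # the captured text is the server's error message rather than an empty string.
  if body=$(curl -sS --fail-with-body "$@"); then
    printf '%s' "$body"
  else
    rc=$?
    printf 'curl exit %s, response body: %s\n' "$rc" "$body" >&2
    return 1
  fi
}

# Example (needs network access; BROKER_URL is an assumed variable name):
# START=$(capture_or_report "$BROKER_URL/v1/auth/wallet/start") || exit 1
```

With this shape, a failed request stops the script at the capture step and shows the broker's error text, instead of continuing with an empty variable.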
Previously fell back to a hardcoded https://oidc.agentkeys.dev when the env var was missing. Tier-1 only validates that the issuer is HTTPS, so the wrong issuer would pass startup and the broker would happily mint JWTs that AWS rejects with cryptic InvalidIdentityToken at /v1/mint-aws-creds time. The issuer is a trust-boundary value — AWS IAM compares the JWT iss claim byte-for-byte against the registered OIDC provider URL. There is no safe default; the deployment owner must set it explicitly. Codex adversarial review (review-mowwm33c-u6fa0v) flagged this as the no-ship issue. Fix matches the existing required_env pattern already used for BROKER_BACKEND_URL on line 48. scripts/broker.env line 46 and scripts/setup-broker-host.sh line 552 already emit this env var, so the live broker.litentry.org deploy doesn't break — just gets the fail-closed behaviour the doc has always promised.
…backend
Root cause of the live-broker §3 401 'session not found':
/v1/auth/wallet/verify returns a broker-signed session JWT (kid 'ak-session-…')
/v1/mint-oidc-jwt was still calling validate_bearer_token, which round-
trips to BROKER_BACKEND_URL/session/validate
The broker signs SIWE/email/oauth2 sessions itself; the legacy mock
backend never sees them. So a freshly-minted session JWT fails the
backend lookup → 401 'session not found'.
/v1/mint-aws-creds (handlers::mint::mint_v2) was already on the right
path — verify_session_jwt against state.session_keypair, no backend
round-trip. /v1/mint-oidc-jwt was a half-completed migration.
Fix: oidc.rs swaps to verify_session_jwt — same primitive, same issuer
+ kid pinning, same audience check. wallet now comes from
session_claims.agentkeys.wallet_address. /v1/auth/exchange keeps using
validate_bearer_token because that endpoint exists explicitly to convert
legacy bearers into session JWTs (per its own docstring).
Tests:
- mint_oidc_jwt_signs_claims_for_session_wallet rewritten to mint a
session JWT against state.session_keypair instead of calling the
legacy /session/create on the mock backend.
- mint_session_against_backend helper deleted (was the only caller).
- mint_oidc_jwt_rejects_missing_bearer + rejects_invalid_bearer_and_audits_auth_failed
pass unchanged — the new local-verify path returns the same
Unauthorized error class.
124 unit + 31 integration tests green.
SELECTIVE EXPANSION mode. 6 of 8 surfaced expansions accepted:
- Signer protocol design doc (#1)
- Versioned HKDF derivation (#3)
- Audit-log row on init (#5)
- agentkeys whoami CLI (#6)
- TEE-stub integration test (#7)
- Hard cut --mock-token flag (#8, stronger than recommended deprecation runway)
Skipped:
- Feature-flag gating (#2, env-var gating retained)
- Session JWT refresh flow (#4, long TTL acceptable for demo)
Revised effort: 600 -> 830 LOC, +1 design doc, +1 CLI command, +1 test infrastructure (TEE-stub conformance).
hanwencheng pushed a commit that referenced this pull request May 9, 2026
…th) + step 1c plan + arch doc
Lands the architectural follow-up to PR #75:
PR #75 shipped the dev_key_service signer with no HTTP-layer auth (loopback assumption per signer-protocol.md §"What's intentionally out of scope at v0"). This commit:
- DEPLOYS signer.litentry.org as an independent backend listener (issue #74 step 1b). agentkeys-mock-server gains a `--signer-only` mode that registers ONLY `/dev/derive-address`, `/dev/sign-message`, `/healthz` (no legacy session/credential/audit endpoints). Bound to 127.0.0.1:8092; nginx fronts it at https://signer.<zone> with its own cert. Same binary, two roles: loopback :8090 stays as the broker's tier-2 reachability target.
- ADDS JWT bearer verification to /dev/* handlers. The signer reads the broker's ES256 session pubkey at boot from a pinned file (/var/lib/agentkeys/.agentkeys/broker/session-keypair.pub.pem) written by the broker's new --export-session-pubkey-to flag. Every /dev/* request must carry Authorization: Bearer <jwt> with claims.agentkeys.omni_account matching body.omni_account; otherwise 401 unauthorized. No SIGNER_ACCESS_TOKEN. No HMAC. No device-key signing; those land in step 1c.
- PLUMBS the JWT through the daemon-side stack: HttpSignerClient gains with_session_jwt(); CLI signer/whoami commands load the saved session and set the bearer; init_flow returns the EVM session JWT for the caller to persist.
- AUTOMATES setup-broker-host.sh to provision the new agentkeys-signer.service systemd unit and the nginx server block for signer.<zone>. Idempotent: re-runs preserve the master secret + session pubkey + nginx config.
PLAN DOCS:
- docs/spec/plans/issue-74-step-1c-device-key-auth.md (NEW, 381 lines)
  Replaces broker-issued bearer JWT as the sole authenticator on /dev/* with a device-key signature scheme. Removes broker-as-SPOF risk for the signer call surface; identity-type-uniform across evm/email/oauth2/passkey; UX-uniform (one ceremony at init, automatic per-request). Aligned with Heima's ClientAuth tier model (EvmSiweSigned + BackendSigned), strictly stronger because user-controlled per-request key + zero per-request user interaction. See gh issue #76.
- docs/spec/architecture.md (REWRITTEN, 506 lines, replaces prior version)
  Canonical broker/signer/daemon/key-flow doc. Mermaid diagrams for component map, trust boundaries, identity model, init sequence, per-mint sequence, deployment topology. Full K1–K10 key inventory table designed for direct Figma reuse. Pluggable-surfaces matrix covering auth methods, signer backends, audit destinations, vault backends. stage7-wip.md absorbed into §1, §6, §7, §11; archived.
- docs/spec/heima-gaps-vs-desired-architecture.md (REVISED)
  Added §1a status snapshot table covering all 12 gaps at-a-glance. §3 OIDC provider + §6 PrincipalTag JWT claim marked RESOLVED IN-TREE (post-PR #61 + #73). NEW §11 (signer-edge contract, PARTIAL after PR #75) and §12 (per-request crypto auth, PLANNED via #76). Resolution log under §10.
- docs/stage7-demo-and-verification.md (UPDATED for the signer split)
  Drops the SSH tunnel scaffolding entirely. Single demo path uses the public signer hostname. Trust-model diagram + two-machine layout + §0.2 reach-the-signer + §14.3 troubleshooting + §16.4 live walkthrough + §16.7 auto-provision + §17 cleanup all updated.
VERIFICATION:
- 394 tests pass workspace-wide (was 386 in PR #75; +8 new JWT auth integration tests in dev_key_service_routes.rs).
- 0 cargo clippy errors; 18 pre-existing warnings (was 16; +2 minor cosmetic in agent-generated test code).
WHAT DID NOT LAND:
- Live broker host redeploy + signer.<zone> certbot issuance: operator step. The script that makes it work shipped here. To land: ssh broker host → bash scripts/setup-broker-host.sh --yes → sudo certbot --nginx -d signer.<zone> → smoke per docs/stage7-demo-and-verification.md §16.
- Device-key auth (issue #74 step 1c): separate issue #76, plan doc shipped in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Lands the full Stage 7 pluggable broker (issue #64) running live at https://broker.litentry.org, completes the OIDC-only auto-provision migration (issue #71 Option A: drops `mint_legacy` + static IAM user + `AssumeRole` trait), and includes the operator demo + verification guide validated end-to-end against the live broker on AWS.
What changed
Architecture
- `crates/agentkeys-{provisioner,mcp,cli}` now fetch `POST /v1/mint-oidc-jwt` then do `AssumeRoleWithWebIdentity` client-side. Server-side `/v1/mint-aws-creds` aggregator path is gone (legacy `mint_legacy` handler + `looks_like_session_jwt` heuristic deleted).
- `BROKER_OIDC_ISSUER` is now refuse-to-boot if unset (no silent fallback to a hardcoded URL; Codex adversarial-review M1).
- `/v1/mint-oidc-jwt` now verifies the session JWT locally against the broker's session keypair instead of round-tripping to backend `/session/validate` (matches `/v1/mint-aws-creds` post-migration; closes the §3 demo 401).
- `StsClient::assume_role` + `AwsStsClient::from_keys` removed. Broker holds zero AWS principals at runtime: `AssumeRoleWithWebIdentity` happens client-side with the daemon's OIDC JWT.
- `DAEMON_ACCESS_KEY_ID` + `BROKER_DAEMON_*` env vars dropped. Static-IAM-user branch in `main.rs` deleted.
- `BROKER_AGENT_ROLE_ARN` / `ACCOUNT_ID` / `REGION` legacy aliases stay (still used by `setup-broker-host.sh`).
Live broker deploy
- `scripts/setup-broker-host.sh` is now one idempotent script. Bootstrap + upgrade detection auto-runs based on whether a unit file already exists. Reads existing config from `/etc/systemd/system/agentkeys-broker.service` `Environment=` lines.
- `/healthz` (Kubernetes convention) across mock-server, broker, and docs. `/health` alias dropped.
- `/readyz` body always self-describing: `{"status":"ready"|"degraded"|"unready", "degraded": bool, "checks":[…], "ready":[…]}`. Empty `{}` reply removed (Codex review). Operator probes via `jq -r .status`.
- Branch-push policy on `evm` (deploy script pulls `origin/evm`); diagnose-before-edit; land-the-fix-everywhere.
Demo + verification guide
- `docs/stage7-demo-and-verification.md` (new, 1192 lines). End-to-end live demo against `broker.litentry.org`: SIWE wallet auth → `/v1/mint-oidc-jwt` → `AssumeRoleWithWebIdentity` → S3 isolation proof. Each silent capture has an explicit echo confirmation.
- `curl -sf` swapped to `curl -sS --fail-with-body` across docs (4 docs, 45 occurrences). `-sf` silently swallows error bodies; the new form prints them, so operators see real errors instead of empty `$VAR`s.
- `echo "$VAR" | jq` swapped to `printf '%s' "$VAR" | jq` (5 docs, 30 occurrences). zsh's `echo` interprets `\n` as 0x0A, corrupting JSON-string escapes inside SIWE messages.
Operator runbook
- `docs/operator-runbook-stage7.md` updated for the simpler post-migration env-var surface.
- `docs/cloud-setup.md` walks operator-workstation env setup; companion `scripts/operator-workstation.env` lives next to broker-side `scripts/broker.env`.
- `scripts/archived/` with README.
Repo stats
Test plan
- `cargo test -p agentkeys-broker-server`: 124 unit + 31 integration passing
- `cargo test -p agentkeys-provisioner` (post-migration provisioner using `/v1/mint-oidc-jwt`)
- `cargo test -p agentkeys-mcp` + `-p agentkeys-daemon`
- `bash harness/stage-7-issue-64-done.sh` exits 0
- `https://broker.litentry.org`: wallet A reads own S3 prefix; wallet B's prefix returns AccessDenied from S3 (cloud-enforced via PrincipalTag)
- `bash scripts/setup-broker-host.sh --upgrade` on the live broker host applies a clean redeploy
What's intentionally not in this PR
- `/v1/auth/exchange` legacy bearer shim removal (waits on TEE signer; daemon will migrate to email/OAuth2 + TEE-managed wallet)
🤖 Generated with Claude Code