feat(AGX1-263): agent_api_keys route migration to FGAC (404 collapse + two-factor)#252
Merged
Merged
Conversation
8 tasks
b94f0cb to
d15bc88
Compare
…AGENT_API_KEYS_DUAL_WRITE flag Mirrors the AGX1-274 task dual-write pattern (PR #246) for agent_api_keys. - Adds creator_user_id / creator_service_account_id / spark_authz_zedtoken columns to agent_api_keys, with CHECK constraint and concurrent indexes. - On create, when FGAC_AGENT_API_KEYS_DUAL_WRITE is enabled for the caller's account, calls authorization_service.grant(AgentexResource.api_key(id)) BEFORE the Postgres write. Grant failure aborts the create. - On delete, best-effort revoke after the Postgres delete. Failures are logged but do not block the delete. - Adds AgentexResourceType.api_key and AgentexResource.api_key(...) factory. - Creates src/utils/feature_flags.py with both FGAC_TASKS_DUAL_WRITE and FGAC_AGENT_API_KEYS_DUAL_WRITE (file does not exist on main yet; if PR #246 lands first this becomes a rebase concern). Structural divergence from tasks: agent_api_keys have no service layer, so the dual-write logic lives in AgentAPIKeysUseCase rather than a separate service. This keeps the call site simple and avoids inventing a new layer. Route layer (read-side auth checks) is out of scope; that's PR B (AGX1-273). agentex-auth spark_mapping.py update is a sibling-repo concern. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the parent_agent cascade gap surfaced on scaleapi/agentex#354. The api_key dual-write (AGX1-272, PR #248) currently calls grant() which writes the owner edge in SpiceDB but NOT the parent_agent edge. The agent_api_key schema requires `read = ... & parent_agent->read & ...`, so every downstream read/update fails closed without that edge. This PR adds register_resource/deregister_resource (Port + adapter + service) and swaps the api_keys use case from grant→register_resource with parent=AgentexResource.agent(agent_id). Now the owner edge and parent_agent edge are written atomically. Stack: - scaleapi/scaleapi#144657 — sgp-authz 0.7.0 (parent_resource kwarg). - scaleapi/agentex#355 — agentex-auth Port + adapter + HTTP routes. - #248 — original AGX1-272 dual-write (this stacks on it). - THIS PR — extends #248 to use the parent-aware path. Changes: - Port: abstract register_resource(resource, parent=None) and deregister_resource(resource). - Adapter proxy: POST /v1/authz/register and /v1/authz/deregister. - Service: mirror existing grant/revoke pattern (principal_context override, _bypass support, parent in log line for cascade debugging). - Use case: swap grant→register_resource passing parent=agent; swap revoke→deregister_resource. except Exception wrappers preserved (fail-closed on register, best-effort on deregister). - Tests: rename mocks to register_resource/deregister_resource; assert the parent edge is passed correctly. Test plan: - pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py → 8 / 8 pass. - New test ``test_create_api_key_calls_grant_when_flag_on`` asserts parent.type == AgentexResourceType.agent and parent.selector == agent.id. Other resource types' grant→register_resource swap is out of scope. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9668f1a to
e72df68
Compare
d15bc88 to
0cbba3d
Compare
…L_WRITE
Per team discussion: rather than maintain a parallel env-var flag system
in scale-agentex, route api_key dual-write flag checks through
egp-api-backend's existing flag service. One source of truth across services,
single flip surface for ops, fewer per-env env-var allowlists to keep in sync.
Changes:
- EnvVarKeys.EGP_API_BACKEND_URL — new env var for the egp-api-backend
base URL. Used by the new HTTP-backed flag provider.
- FeatureFlagProvider rewritten as an HTTP client of egp-api-backend's
GET /feature-flag/{id} endpoint:
* Forwards x-api-key / x-user-id / x-service-account-id /
x-selected-account-id from the caller's principal_context so the
endpoint's REQUIRE_IDENTITY_AND_OPTIONAL_ACCOUNT policy admits the
request.
* Coerces the response's `value` field to bool.
* Fails closed to False on any error (config missing, no identity,
non-2xx, transport failure, JSON parse failure) — the legacy
no-Spark code path is the safe default.
* `is_enabled` is now async (HTTP call). Signature is
`is_enabled(name, *, principal_context, account_id)`.
- AgentAPIKeysUseCase: both call sites now await is_enabled and pass
principal_context. _deregister grabs principal_context from
self.authorization_service.
- Test fixtures: mock FeatureFlagProvider directly (Mock with
is_enabled = AsyncMock(return_value=flag_on)) so dual-write tests stay
hermetic. The pre-existing FeatureFlagProvider() no-arg constructions
in test_agents_api_keys_use_case.py and integration_client.py now pass
egp_api_backend_url=None (provider returns False without it, matching
the prior "flag never enabled in unit tests" behavior).
Out of scope:
- Migrating Asher's FGAC_TASKS_DUAL_WRITE flag check off env vars.
That's task-team-owned and we leave their existing pattern alone per
the team discussion (new-work-only).
- Caching the flag response. Each is_enabled is a fresh HTTP call.
Egp-api-backend's flag endpoint is fast and the caller paths are
already crossing the network for the actual register/deregister, so
one extra round-trip is acceptable for now. Add caching later if
load profiling shows it matters.
Test plan:
- uv run pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py — 8/8 pass.
- Existing 4 unrelated test_agents_api_keys_use_case.py docker-fixture
errors predate this commit (verified via `git stash`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
0cbba3d to
4335ca7
Compare
Three rounds of Harvey's review feedback together pointed at the same concern: scale-agentex (OSS) should not be coupled to egp-api-backend (proprietary) feature-flag service or Spark-specific schema concepts. Changes: - Drop FeatureFlagProvider and the egp-api-backend HTTP query entirely. scale-agentex now calls register_resource/deregister_resource unconditionally. agentex-auth's per-account routing (FGAC_AGENTEX_AUTH_SPARK, scaleapi/agentex#353) decides whether the call lands on Spark AuthZ or falls back to legacy SGP for the caller's account. - Drop the creator_user_id / creator_service_account_id columns from the agent_api_keys table. They were only used as a runtime guard inside the auth-registration helper. Moved that check inline. - Drop the spark_authz_zedtoken column. The name leaked SpiceDB-specific terminology into the OSS schema, and the column was always None. - Delete the Alembic migration entirely. migration_history.txt re-pointed to head=6c942325c828. - Rename helper methods: _register_api_key_in_spark_authz -> _register_api_key_in_auth, same for deregister. Test plan: 6/6 dual-write integration tests pass. Pre-existing docker-fixture errors in test_agents_api_keys_use_case.py unrelated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…se and two-factor mutations Mirrors AGX1-275 (PR #249) for agent_api_keys. Wires Spark AuthZ checks into every api_key route, collapses denials to 404 (so name/id probes can't distinguish "present in another tenant" from "absent"), and relies on SpiceDB's transitive expansion of api_key.{update,delete} (= editor & parent_agent->update & tenant_gate) for two-factor mutations rather than issuing two explicit checks at the route layer. - src/utils/agent_api_key_authorization.py (new): _check_api_key_or_collapse_to_404 — catches AuthorizationError, raises ItemDoesNotExist. Same shape as Asher's task helper. - src/utils/authorization_shortcuts.py: DAuthorizedId routes AgentexResourceType.api_key through the wrap. (DAuthorizedName isn't used for api_keys; the name lookup is (agent_id, name, api_key_type), not a single globally-unique path param — the route handlers call the collapse helper inline instead.) - src/api/routes/agent_api_keys.py: * POST: explicit agent.update on parent (no api_key resource yet). * GET list: DAuthorizedResourceIds + filter; None passes through. * GET /name/{name}: inline collapse helper. * GET /{id}: DAuthorizedId(api_key, read). * DELETE /{id}: DAuthorizedId(api_key, delete). Two-factor via SpiceDB schema (api_key.delete expands to parent_agent.update); no second route-layer check. * DELETE /name/{api_key_name}: inline collapse helper. - tests/unit/api/test_agent_api_keys_authz.py (new): 12 tests, all pass. Stacked on dhruv/agx1-272-agent-api-keys-dual-write (PR A). Does NOT touch dual-write logic. Does NOT modify agentex-auth. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4335ca7 to
d394ab9
Compare
…api-keys-route-migration # Conflicts: # agentex/src/api/routes/agent_api_keys.py # agentex/src/api/schemas/authorization_types.py # agentex/src/domain/use_cases/agent_api_keys_use_case.py # agentex/src/utils/authorization_shortcuts.py # agentex/tests/integration/services/test_agent_api_key_service_dual_write.py # agentex/tests/unit/use_cases/test_agents_api_keys_use_case.py
…ination
The route previously fetched up to ``limit`` rows from the repo, then
intersected with ``authorized_api_key_ids`` in Python. Two bugs:
1. Pagination broken: fetching limit=50 then filtering can yield <50
rows even when more authorized api_keys exist further in the table.
Subsequent pages skip the wrong offset.
2. Wasted DB work: unauthorized rows are loaded just to be discarded.
Match tasks.py's pattern: forward ``authorized_ids`` into the use case
as an ``id`` kwarg, which the base postgres adapter translates into a
``WHERE id IN (...)`` clause. Filter + limit + offset all apply at the
SQL layer.
Edge case: an empty list of authorized ids must short-circuit to ``[]``,
otherwise the base ``create_where_clauses_from_filters`` skips the
empty-Sequence filter and returns an unfiltered result.
Unit tests updated to assert the use case is called with the expected
``id=`` kwarg rather than testing post-fetch filtering.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The merge from main brought in #255's TaskChildResourceType.message, which threaded a new ``message_repository`` parameter through the ``_ensure_authorized_id`` closure (between ``state_repository`` and ``resource_id``). Three TestDAuthorizedIdApiKeyWrap cases called the closure with only 4 positional args, so ``resource_id`` was bound to its ``Path(...)`` default and Pydantic rejected the resulting AgentexResource. Add a MagicMock for ``message_repository`` at each call site. The api_key branch doesn't read it (it's only used in the task-child parent-resolution path), but FastAPI's dep system requires the positional slot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rpatel-scale
approved these changes
Jun 2, 2026
Greptile P1 / security: both name-route handlers previously produced two
structurally different 404 bodies depending on whether the row existed.
- Absent row -> HTTPException(404, f"Agent api_key '{name}' not found
for agent ID {agent.id}")
- Present + denied -> ItemDoesNotExist(f"Item with id '{api_key_id}'
does not exist.")
A cross-tenant caller who can resolve an agent_id could diff the two
response bodies and probe which (agent, name, type) tuples exist on
another tenant -- exactly the existence-leak the 404 collapse was meant
to prevent.
Fix:
- Add identifier-free constant API_KEY_NOT_FOUND_MESSAGE in
agent_api_key_authorization.py; use it from both the helper's
denied-resource raise and from both name routes' absent-row raise.
- Both paths now raise ItemDoesNotExist(API_KEY_NOT_FOUND_MESSAGE) --
byte-for-byte identical 404 bodies.
Test: new test_absent_and_denied_404_bodies_are_identical locks in the
property by exercising both paths through get_agent_api_key_by_name and
asserting str(exc) == API_KEY_NOT_FOUND_MESSAGE for both.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TOCTOU fix (Greptile P2): the delete-by-name handler resolved name -> id for the auth check but then issued the delete by name. If the row were replaced between check and delete, the auth check would have evaluated the old id but the mutation would land on the new one. Switch the delete to use the resolved ``existing.id`` so check and delete target the same row. Comment cleanup: - Helper docstring trimmed from 22 lines to 4; defer rationale to its sibling helper. - Route file: four multi-line block comments collapsed to one line each. - Test file: module docstring + 9 class/method docstrings tightened to single declarative lines; intent stays, prose drops ~60%.
harvhan
approved these changes
Jun 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wires Spark AuthZ into all six
agent_api_keysroutes. Mirrors Asher's task-route pattern from #249.404so callers can't probe cross-tenant existence.GETlist filters to api_keys the caller can read (id filter pushed into the repo for correct pagination).POSTdoes an explicitagent.updatecheck on the parent — only enforcement surface at create time, since no api_key resource exists yet.delete) rely on SpiceDB's transitiveapi_key.delete = editor & parent_agent->update & tenant_gateexpansion. Theparent_agentedge is populated by #248'sregister_resource(parent=agent).Stack
api_keymappingLinear: AGX1-263.
Tests
tests/unit/api/test_agent_api_keys_authz.py— 12 tests covering the collapse helper,DAuthorizedIdrouting, name-route inline collapse, list filtering, and create-parent-check.Out of scope
Greptile Summary
This PR wires Spark AuthZ (FGAC) into all six
agent_api_keysroutes, mirroring the task-route pattern from #249. All authorization denials collapse to404to prevent cross-tenant key-name probing, and the delete-by-name route now resolves name→id before checking auth (closing the TOCTOU window flagged in a prior review).POSTchecksagent.updateon the parent agent (no api_key resource exists yet);GET /{id}andDELETE /{id}useDAuthorizedId;GET /name/{name}andDELETE /name/{name}use the inline_check_api_key_or_collapse_to_404helper;GET /usesDAuthorizedResourceIdswith the id list pushed into the SQL filter for correct pagination.ItemDoesNotExist(API_KEY_NOT_FOUND_MESSAGE), making the bodies byte-for-byte identical and closing the existence-leak vector.AgentAPIKeysUseCase.listgains an optionalidfilter with an explicit empty-list short-circuit to prevent the base repo from convertingid=[]into an unfiltered query.Confidence Score: 5/5
The FGAC wiring is correct across all six routes; the 404-collapse and TOCTOU fixes flagged in prior reviews have been addressed.
The authorization logic is sound: denials are consistently collapsed to 404 with identical bodies, delete-by-name now deletes by resolved ID rather than by name, and the empty-list short-circuit prevents an accidental unfiltered query. The only findings are a misleading (but vacuous) test assertion and an untyped parameter — neither affects runtime behavior.
The test assertion in test_delete_by_name_handler_collapses_denial_to_404 checks a removed method rather than the one now used for deletion; worth a quick fix before the tests are relied upon for regression coverage.
Important Files Changed
Sequence Diagram
sequenceDiagram participant Client participant Route as agent_api_keys route participant SpiceDB as SpiceDB (AuthZ) participant DB as Postgres (repo) Note over Route,DB: GET /{id} / DELETE /{id} — DAuthorizedId dep Client->>Route: "GET /agent_api_keys/{id}" Route->>SpiceDB: "check(api_key:{id}, read)" alt denied SpiceDB-->>Route: AuthorizationError Route-->>Client: 404 ItemDoesNotExist else allowed SpiceDB-->>Route: OK Route->>DB: get(id) DB-->>Route: entity Route-->>Client: 200 AgentAPIKey end Note over Route,DB: GET /name/{name} — inline collapse Client->>Route: "GET /agent_api_keys/name/{name}" Route->>DB: get_by_agent_id_and_name(...) alt absent DB-->>Route: None Route-->>Client: 404 Agent api_key not found. else present DB-->>Route: entity (with id) Route->>SpiceDB: "check(api_key:{id}, read)" alt denied SpiceDB-->>Route: AuthorizationError Route-->>Client: 404 Agent api_key not found. else allowed SpiceDB-->>Route: OK Route-->>Client: 200 AgentAPIKey end end Note over Route,DB: DELETE /name/{name} — resolve→check→delete Client->>Route: "DELETE /agent_api_keys/name/{name}" Route->>DB: get_by_agent_id_and_name(...) alt absent DB-->>Route: None Route-->>Client: 404 Agent api_key not found. else present DB-->>Route: entity (with id) Route->>SpiceDB: "check(api_key:{id}, delete)" alt denied SpiceDB-->>Route: AuthorizationError Route-->>Client: 404 Agent api_key not found. else allowed SpiceDB-->>Route: OK Route->>DB: "delete(id=entity.id)" Route-->>Client: 200 deleted end end Note over Route,DB: POST / — parent agent.update check Client->>Route: POST /agent_api_keys Route->>DB: agent_use_case.get(...) Route->>SpiceDB: "check(agent:{id}, update)" alt denied SpiceDB-->>Route: AuthorizationError 403 Route-->>Client: 403 else allowed SpiceDB-->>Route: OK Route->>DB: create api_key Route-->>Client: 201 CreateAPIKeyResponse endComments Outside Diff (2)
agentex/src/api/routes/agent_api_keys.py, line 163-179 (link)The name routes produce two structurally different 404 responses depending on whether the key exists in the database. If the key is absent,
HTTPExceptionis raised with"Agent api_key '{name}' not found for agent ID {agent.id}". If the key exists but the caller is denied,ItemDoesNotExistis raised with"Item with id '{api_key_id}' does not exist.". Both return HTTP 404, but a cross-tenant caller who can resolve anagent_idcan probe for key-name existence by comparing the response body — exactly the information leak the collapse is meant to prevent. The same issue exists indelete_agent_api_key_by_name(line ~264). To close it, both "not found" paths should produce an identical detail string (e.g. both raiseItemDoesNotExistwith the same generic message).Prompt To Fix With AI
agentex/src/api/routes/agent_api_keys.py, line 260-278 (link)The auth check resolves the key to an
idand passes that ID to SpiceDB, but the subsequent delete is issued by name (delete_by_agent_id_and_key_name). If the key is deleted and a new key with the same name is created between the lookup and the delete, the auth check will have evaluated the old key's ID (which the caller may have legitimately had delete rights on) but the mutation will affect the new key. The practical risk is low, but it is a real race between a non-atomic read-auth-mutate sequence. Deleting by the resolvedidinstead of by name would close it.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (4): Last reviewed commit: "refactor(AGX1-263): close delete-by-name..." | Re-trigger Greptile