Wire AiGateway telemetry to real changes #71
Co-Authored-By: Codex <noreply@openai.com>
🤖 Codex PR Review: Please ensure a human reviewer checks this PR before merging.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eee6dedc02
    except BridgeError as exc:
        return str(exc)
Catch telemetry transport failures
This is intended to be best-effort, but request_codex_service_json() can raise non-BridgeError exceptions such as urllib.error.URLError/timeouts from urlopen() or JSONDecodeError for a bad gateway response. When CODEX_AUDIT_SERVICE_URL is configured and the telemetry endpoint is unreachable or returns invalid JSON, that exception escapes after the PR has been created and before the issue comment is posted, turning an otherwise successful remediation into a failed run. Catch the transport/parse failures here and return them as warnings too.
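One way the suggested handling could look is sketched below. This is a minimal sketch, not the repository's code: `register_change_best_effort` and its `send` parameter are hypothetical names, `send` stands in for `request_codex_service_json()` (whose real signature is not shown here), and `BridgeError` is redefined locally only so the example runs on its own.

```python
import json
import urllib.error


class BridgeError(Exception):
    """Local stand-in for the project's BridgeError (assumption)."""


def register_change_best_effort(send, payload):
    """Call send(payload); return a warning string on failure, None on success.

    `send` is a hypothetical stand-in for request_codex_service_json().
    """
    try:
        send(payload)
    except BridgeError as exc:
        # Existing behaviour: bridge errors are reported as warnings.
        return str(exc)
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError) as exc:
        # Transport and parse failures are downgraded to warnings as well,
        # so they cannot fail a run whose PR was already created.
        return f"telemetry registration failed: {exc}"
    return None
```

The caller can then append any non-`None` return value to its warnings and continue on to posting the issue comment.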
    get_health_monitor().record(
        "/v1/ai/execute/jobs/run",
        time.time() - started,
        job["status"] == "succeeded",
Keep async job runtime out of endpoint latency
When a normal async Codex job runs longer than the health monitor's latency thresholds, this records the entire background job duration as endpoint latency, so /v1/ai/health can mark the service degraded/unhealthy after successful long-running jobs even though the HTTP endpoint is healthy. The async execute contract permits long jobs, so this should be tracked as a separate job-duration metric or excluded from the health latency status.
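A rough illustration of the separation being suggested follows. `HealthMonitorSketch`, its method names, and the 2-second threshold are all hypothetical; the real object returned by `get_health_monitor()` is not shown in this PR and may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class HealthMonitorSketch:
    """Hypothetical monitor that keeps job runtime apart from endpoint latency."""

    latency_threshold_s: float = 2.0  # assumed threshold, for illustration only
    endpoint_latencies: list = field(default_factory=list)
    job_durations: list = field(default_factory=list)

    def record_endpoint(self, seconds: float) -> None:
        # HTTP request/response time; this is what drives health status.
        self.endpoint_latencies.append(seconds)

    def record_job(self, seconds: float, succeeded: bool) -> None:
        # Background job runtime; stored separately and ignored by status().
        self.job_durations.append((seconds, succeeded))

    def status(self) -> str:
        if not self.endpoint_latencies:
            return "healthy"
        worst = max(self.endpoint_latencies)
        return "degraded" if worst > self.latency_threshold_s else "healthy"
```

With this shape, a successful ten-minute job recorded through `record_job` leaves `status()` unchanged, while a slow HTTP response recorded through `record_endpoint` can still degrade it.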
    before_metrics={str(k): float(v) for k, v in payload.get("before_metrics", {}).items()},
    source_repo=str(payload.get("source_repository", "")),
    source_repo=source_repo,
    external_url=str(payload.get("external_url", "")),
Reject unsafe external change URLs
When an allowed workflow registers a change with an external_url using a non-HTTP scheme, this value is persisted and later rendered directly as an anchor href in the dashboard. Since the field is intended to link to a PR, validate it here as https:// (or at least HTTP(S)) before storing it so the internal dashboard does not become a clickable unsafe-scheme/script sink.
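A minimal sketch of such a check, assuming the stricter `https://`-only policy, is shown below. The helper name `safe_external_url` is hypothetical, and returning an empty string for rejected values is an assumption; the endpoint could equally respond with a validation error.

```python
from urllib.parse import urlsplit


def safe_external_url(raw: object) -> str:
    """Return raw as a string if it is an absolute https:// URL, else ""."""
    url = str(raw or "").strip()
    # Reject embedded whitespace and control characters outright.
    if any(ord(ch) < 0x21 for ch in url):
        return ""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError on malformed input such as bad IPv6 hosts.
        return ""
    # urlsplit lowercases the scheme, so "HTTPS://..." is accepted too.
    if parts.scheme != "https" or not parts.netloc:
        return ""
    return url
```

The registration code could then pass `safe_external_url(payload.get("external_url", ""))` instead of the raw string, so values such as `javascript:alert(1)` never reach the stored record or the dashboard's `href`.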
Summary
Why
The dashboard was reachable, but after deployment it had no records because no producer was writing autonomous change events, and async job execution did not report runtime health. This connects the existing monthly remediation workflow to the dashboard's feedback endpoints without blocking the main remediation path if telemetry registration fails.
Validation
- python3 -m ruff check .
- python3 -m pytest tests -q
- npx -y node@22 --experimental-default-type=module --test cloudflare/codex-audit-proxy/tests/index.test.mjs
- npx -y node@22 --experimental-default-type=module --test cloudflare/ai-gateway-dash/tests/index.test.mjs
- git diff --check