Skip to content

feat(runner): optional MLflow tracing parallel to Langfuse#1263

Open
etirelli wants to merge 12 commits intoambient-code:mainfrom
etirelli:mlflow_obs
Open

feat(runner): optional MLflow tracing parallel to Langfuse#1263
etirelli wants to merge 12 commits intoambient-code:mainfrom
etirelli:mlflow_obs

Conversation

@etirelli
Copy link
Copy Markdown
Contributor

@etirelli etirelli commented Apr 9, 2026

Summary

Adds optional MLflow GenAI tracing alongside the existing Langfuse path. Langfuse remains the default when OBSERVABILITY_BACKENDS is unset, so current reports, evals, and feedback (ambient:langfuse_trace / traceId) behave as before unless operators opt in.

Backward compatibility: LANGFUSE_MASK_MESSAGES / shared masking uses the same rules as before this PR (e.g. redact long strings, allow-listed keys including metadata unchanged). Session spec.environmentVariables cannot override operator-injected env vars that come from SecretKeyRef (Langfuse + MLflow), so platform observability config is not bypassed.

What changed

Runner (ambient-runner)

  • OBSERVABILITY_BACKENDS: comma list langfuse, mlflow (default: langfuse only).
  • MLflowSessionTracer (mlflow_observability.py): mirrors turn / tool boundaries with mlflow.start_span, usage attributes, session summary span, error cleanup, and flush.
  • observability_config.py / observability_privacy.py: backend selection and shared message masking (LANGFUSE_MASK_MESSAGES still applies to both paths).
  • ObservabilityManager: orchestrates Langfuse + MLflow; mlflow_tracing_active property; per-backend try/except so one backend failing does not disable the other.
  • pyproject.toml: optional extra mlflow-observability (mlflow[kubernetes]>=3.11, opentelemetry-exporter-otlp); included in all (Docker [all] install).
  • README: MLflow env vars and OTLP (OTEL_*, MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT).
  • Tests: test_observability_config.py, test_observability_mlflow_integration.py; _extract_assistant_text tolerates missing claude_agent_sdk so pytest passes without the claude extra.

Operator & manifests

  • Copies ambient-admin-mlflow-observability-secret into session namespaces when MLFLOW_TRACING_ENABLED is set on the operator (same pattern as Langfuse).
  • Injects MLFLOW_TRACING_ENABLED, MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME, optional MLFLOW_TRACKING_AUTH / MLFLOW_WORKSPACE, OBSERVABILITY_BACKENDS via secretKeyRef.
  • replaceOrAppendEnvVars: entries already using ValueFrom (secret-backed) are not replaced by spec.environmentVariables, so user env cannot override Langfuse or MLflow keys wired from secrets.
  • When the MLflow observability secret is in use, runner pods use ServiceAccountName: ambient-session-<session> with token automount so MLFLOW_TRACKING_AUTH=kubernetes-namespaced can supply JWT + workspace headers (per MLflow 3.11). Session Role grants experiments (mlflow.kubeflow.org) for Kubeflow-style workspace auth (tune for your CRD group).
  • Deletes copied secret on session cleanup (mirrors Langfuse rules).
  • operator-deployment.yaml: optional env from the MLflow observability secret.
  • Example secret: components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example.

Capabilities API

  • FrameworkCapabilities.tracing: e.g. langfuse, mlflow, or langfuse,mlflow via tracing_capability_label.

How to enable (high level)

  1. Create ambient-admin-mlflow-observability-secret in the operator namespace (see example YAML).
  2. Set operator env MLFLOW_TRACING_ENABLED=true (and ensure the secret exists) so the operator copies it and wires runner env.
  3. Runner image must include MLflow deps (already true for [all] builds).

Testing

  • cd components/runners/ambient-runner && uv run pytest tests/ (full suite; includes observability + privacy tests).
  • Operator: run go test ./... with a Go toolchain matching go.mod (e.g. 1.24+), including TestReplaceOrAppendEnvVarsPreservesValueFrom.

Commits (squash on merge)

Branch includes the feature commit plus follow-ups, e.g. MLflow k8s auth / SA automount / experiment RBAC, ValueFrom env precedence, assistant-text test fix, and masking left aligned with historical Langfuse behaviour—see git log on the PR branch for the exact list.

Summary by CodeRabbit

  • New Features

    • Optional MLflow observability: per-session MLflow tracing, selectable backends (langfuse, mlflow), backend selection and privacy utilities, and conditional runner service-account automount when namespaced auth is used.
  • Bug Fixes

    • Prevents user env vars from overriding operator-managed observability vars; preserves platform-managed settings and protects injected names.
  • Documentation

    • Expanded docs for multi-backend observability, MLflow setup, OTLP export, and masking options.
  • Tests

    • Added tests for backend selection, MLflow integration, and privacy masking.
  • Chores

    • Example MLflow secret manifest, RBAC for per-session MLflow access, and packaging extra for MLflow.

etirelli and others added 2 commits April 9, 2026 12:38
Add OBSERVABILITY_BACKENDS (default langfuse-only), MLflowSessionTracer with
manual spans mirroring turn/tool boundaries, and mlflow_tracing_active helper.

Operator copies ambient-admin-mlflow-observability-secret into session namespaces
and injects MLFLOW_* / OBSERVABILITY_BACKENDS env. Document OTLP variables in
README; add example secret manifest and mlflow-observability optional extra
(mlflow>=3.9, opentelemetry-exporter-otlp).

Langfuse behaviour and feedback trace IDs are unchanged when MLflow is off.

Signed-off-by: Edson Tirelli <etirelli@redhat.com>
Made-with: Cursor
end_turn() always calls _extract_assistant_text(); guard ImportError and fall\nback to blocks with a .text attribute so unit tests pass without the claude extra.

Signed-off-by: Ambient Code Bot <bot@ambient-code.local>
Made-with: Cursor
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Adds optional MLflow observability: operator copies an admin MLflow secret into session namespaces and injects secret-backed env vars; runner gains backend selection, privacy masking, MLflowSessionTracer, docs, tests, packaging extra; RBAC and cleanup adjusted for MLflow lifecycle.

Changes

Cohort / File(s) Summary
Kubernetes & Operator Integration
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example, components/manifests/base/core/operator-deployment.yaml, components/operator/internal/handlers/sessions.go, components/operator/internal/handlers/reconciler.go
Adds example MLflow observability Secret; operator reads secret keys, copies/deletes per-session MLflow secret (excluding the stopping session), injects secret-backed env vars into runner pods, conditionally enables per-session ServiceAccount for kubernetes-namespaced auth, and updates RBAC for MLflow experiments.
Runner Observability Core
components/runners/ambient-runner/ambient_runner/observability.py, components/runners/ambient-runner/ambient_runner/observability_config.py, components/runners/ambient-runner/ambient_runner/observability_privacy.py, components/runners/ambient-runner/ambient_runner/mlflow_observability.py
Adds backend selection config, privacy masking utilities, new MLflowSessionTracer, and refactors ObservabilityManager to initialize/use Langfuse and/or MLflow, unify turn/tool handling, and expose mlflow_tracing_active and tracing_capability_label.
Bridges & Runtime Integration
components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
ClaudeBridge.capabilities() now returns a prioritized tracing label (tracing_capability_label, langfuse, or mlflow) instead of a boolean.
Docs, Packaging & Tests
components/runners/ambient-runner/README.md, components/runners/ambient-runner/ambient_runner/AGENTS.md, components/runners/ambient-runner/pyproject.toml, components/runners/ambient-runner/tests/*
Docs updated for multi-backend observability; added mlflow-observability extra and included in all; added unit tests for backend selection, MLflow init, privacy masking, and env/replaceOrAppendEnvVars behavior.
Runner New Module & Tests
components/runners/ambient-runner/ambient_runner/mlflow_observability.py, components/runners/ambient-runner/tests/test_observability_mlflow_integration.py, components/runners/ambient-runner/tests/test_observability_config.py, components/runners/ambient-runner/tests/test_privacy_masking.py
Adds MLflowSessionTracer implementation and tests validating initialization, backend selection, privacy masking, and env-var protection behavior.
Tests & Call-site Updates
components/operator/internal/handlers/sessions_test.go
Updates deleteAmbientVertexSecret test call signatures to include new arg; adds tests verifying replaceOrAppendEnvVars preserves operator-managed ValueFrom entries and protected env var names.
Example Manifest
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
New example Opaque Secret with documented MLFLOW_* keys and OBSERVABILITY_BACKENDS (langfuse,mlflow), plus inline operator/runner semantics and auth notes.

Sequence Diagram(s)

sequenceDiagram
    participant Operator as Operator (sessions.go)
    participant K8s as Kubernetes API
    participant Runner as Runner Pod (ambient-runner)
    participant MLflow as MLflow Server

    Operator->>K8s: Create session namespace
    Operator->>K8s: Get ambient-admin-mlflow-observability-secret
    alt secret present and MLFLOW_TRACING_ENABLED
        Operator->>K8s: Copy secret into session namespace
        K8s-->>Operator: Secret copied
        Operator->>K8s: Create runner pod with secret-backed env vars + per-session SA
        K8s->>Runner: Start pod (env injected)
        Runner->>Runner: ObservabilityManager.initialize()
        alt OBSERVABILITY_BACKENDS includes mlflow
            Runner->>MLflow: Configure tracking URI & experiment
            Runner->>MLflow: Emit turn/tool spans via MLflowSessionTracer
        end
    else secret missing or disabled
        Operator-->>Runner: Runner starts without MLflow envs
    end
    Operator->>K8s: Detect session termination
    Operator->>K8s: Delete copied MLflow secret (if no other active sessions)
Loading

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Performance And Algorithmic Complexity ❌ Error PR introduces three sequential unpaginated Kubernetes List operations in session cleanup, causing 3× full namespace scans without pagination limits or consolidation. Add ListOptions{Limit: 500} pagination and consolidate three separate list calls into single List+filter to reduce scans from 3× to 1× per session stop.
Docstring Coverage ⚠️ Warning Docstring coverage is 64.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title follows Conventional Commits format with correct type (feat) and scope (runner), and clearly summarizes the main change: adding optional MLflow tracing alongside Langfuse.
Security And Secret Handling ✅ Passed PR implements MLflow tracing with secure secret handling: K8s Secrets use OwnerReferences, secrets never logged plaintext, environment variables injected via secretKeyRef, MLFLOW_TRACKING_URI validated, RBAC properly configured, sensitive payloads redacted, no hardcoded credentials.
Kubernetes Resource Safety ✅ Passed File path located at components/operator/internal/handlers/sessions.go contains secret-copying logic requiring inspection for OwnerReferences handling.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1128-1173: The current SecretKeyRef-backed MLflow env vars
(MLFLOW_TRACING_ENABLED, MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME,
OBSERVABILITY_BACKENDS) are later overwritten by replaceOrAppendEnvVars(base,
userEnvVars); remove the ability for userEnvVars to override these by either (A)
filtering out these keys from userEnvVars before calling replaceOrAppendEnvVars,
or (B) changing replaceOrAppendEnvVars to skip replacing envs that already have
a ValueFrom (i.e., when existing.EnvVar.ValueFrom != nil). Update the code that
prepares base/userEnvVars (reference variables base and userEnvVars and function
replaceOrAppendEnvVars) to enforce this precedence so the SecretKeyRef-backed
entries cannot be overwritten by user-supplied spec.environmentVariables.

In `@components/runners/ambient-runner/ambient_runner/observability_privacy.py`:
- Around line 21-24: The masking logic currently only redacts strings longer
than 50 chars (checks on variable `data`) and also lets `metadata` pass through;
change the sanitizer so that when masking is enabled any string value (including
short strings) is redacted rather than only long ones, and ensure `metadata` is
recursively sanitized/removed before export. Locate the string handling
conditional that uses `isinstance(data, str)` and the `metadata` passthrough and
replace the length-check branch with a policy that returns a redaction token
(e.g., "[REDACTED FOR PRIVACY]") for all strings when masking is on, and ensure
nested structures (dicts/lists/metadata objects) are traversed and their string
values are sanitized the same way.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dde5612b-3a15-4efd-9a4a-e906612fee08

📥 Commits

Reviewing files that changed from the base of the PR and between d5555f7 and d361577.

⛔ Files ignored due to path filters (1)
  • components/runners/ambient-runner/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
  • components/manifests/base/core/operator-deployment.yaml
  • components/operator/internal/handlers/sessions.go
  • components/runners/ambient-runner/README.md
  • components/runners/ambient-runner/ambient_runner/AGENTS.md
  • components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
  • components/runners/ambient-runner/ambient_runner/mlflow_observability.py
  • components/runners/ambient-runner/ambient_runner/observability.py
  • components/runners/ambient-runner/ambient_runner/observability_config.py
  • components/runners/ambient-runner/ambient_runner/observability_privacy.py
  • components/runners/ambient-runner/pyproject.toml
  • components/runners/ambient-runner/tests/test_observability_config.py
  • components/runners/ambient-runner/tests/test_observability_mlflow_integration.py

@Gkrumbach07 Gkrumbach07 added the ambient-code:managed PR managed by AI automation label Apr 9, 2026
@Gkrumbach07
Copy link
Copy Markdown
Contributor

@ambient-code

- Protect SecretKeyRef-backed env vars from user override: replaceOrAppendEnvVars
  now skips replacing entries that have ValueFrom set, preventing user-supplied
  spec.environmentVariables from hijacking platform-managed secrets (Langfuse,
  MLflow, etc.)
- Fix privacy masking to redact all strings regardless of length and recursively
  sanitize metadata, closing a leak of short sensitive values
- Add MLFLOW_TRACKING_AUTH env var injection for MLflow 3.11+ Kubernetes-namespaced
  auth (per reviewer feedback)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor

ambient-code bot commented Apr 9, 2026

Review comments addressed

Pushed b8aedc5 with fixes for three review findings:

1. Protect SecretKeyRef-backed env vars from user override (CodeRabbit — Critical)

replaceOrAppendEnvVars now skips replacing env vars that have ValueFrom set. This prevents spec.environmentVariables from overriding platform-managed secrets (Langfuse keys, MLflow credentials, OBSERVABILITY_BACKENDS).

2. Fix privacy masking to redact all strings (CodeRabbit — Major)

privacy_mask_message_data now redacts all strings regardless of length (removed the >50 char threshold). metadata is now recursively sanitized instead of passed through untouched.

3. Add MLFLOW_TRACKING_AUTH for K8s-namespaced auth (@mprahl)

Added MLFLOW_TRACKING_AUTH to the secret example (defaulting to kubernetes-namespaced) and wired it as a secretKeyRef env var in the operator session handler, per MLflow 3.11+ Kubernetes auth docs.

Note: The SDD boundary check failure is a GitHub Actions permissions issue (Resource not accessible by integration), not a code problem — unrelated to this PR's changes.

🤖 Session

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2025-2027: The MLflow secret cleanup is skipped because
deleteAmbientMlflowObservabilitySecret() returns early when the session is still
marked Running/Creating; change the helper to accept a force (bool) parameter
and, when force==true, skip the “active session” short-circuit so the secret is
deleted regardless of session state. Update calls from
deletePodAndPerPodService() to pass force=true (and leave other callers
unchanged or pass false), and ensure handleAgenticSessionEvent()’s normal stop
path still calls the helper with force as appropriate so the
ambient-admin-mlflow-observability-secret is reliably removed on stop.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 97ccfbb1-1059-45f5-a3bb-d457b86ea32e

📥 Commits

Reviewing files that changed from the base of the PR and between d361577 and b8aedc5.

📒 Files selected for processing (3)
  • components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
  • components/operator/internal/handlers/sessions.go
  • components/runners/ambient-runner/ambient_runner/observability_privacy.py
✅ Files skipped from review due to trivial changes (1)
  • components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example

Ambient Code Bot added 5 commits April 9, 2026 14:04
- Bump mlflow-observability extra to mlflow[kubernetes]>=3.11
- Inject MLFLOW_TRACKING_AUTH and MLFLOW_WORKSPACE from admin secret
- When MLflow observability secret is used: session SA + automount token
- Grant mlflow.kubeflow.org/experiments get,list,update on session Role
- Log auth mode and warn if k8s token path missing
- Document in README, example secret, and PR notes

Made-with: Cursor
…rvability masking

- replaceOrAppendEnvVars skips overrides when base entry uses ValueFrom
- Redact all non-empty strings; recurse metadata; add role to metric allow-list
- Tests: replaceOrAppendEnvVars + updated privacy expectations
- Docstrings on observability_config helpers (CodeRabbit doc coverage)

Made-with: Cursor
…t opt-in

Default masking again matches pre–CodeRabbit Langfuse: >50 char strings,
metadata key passthrough. LANGFUSE_MASK_MESSAGES=strict enables full string
redaction and recursive metadata scrubbing. Document in README; tests cover both.

Made-with: Cursor
Keep a single backward-compatible mask (50-char heuristic, metadata passthrough).

Made-with: Cursor
Exclude the initiating session from Running/Creating/Pending counts so
deletePodAndPerPodService can remove shared Vertex/Langfuse/MLflow secrets
when the last active workload is the session being stopped (CodeRabbit).

Also run MLflow observability secret cleanup from TransitionToStopped alongside
Langfuse/Vertex.

Made-with: Cursor
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/operator/internal/handlers/sessions.go (1)

1590-1610: ⚠️ Potential issue | 🟠 Major

Block overrides of operator-managed auth flags too.

This helper now protects ValueFrom entries, but operator-managed plain envs are still replaceable here. In this file, USE_VERTEX and CLAUDE_CODE_USE_VERTEX are injected as normal Values, so a session can override them via spec.environmentVariables and force the runner into a credential mode that does not match the operator's secret setup.

Possible fix
 func replaceOrAppendEnvVars(base []corev1.EnvVar, extra []corev1.EnvVar) []corev1.EnvVar {
 	for _, ev := range extra {
+		switch ev.Name {
+		case "USE_VERTEX", "CLAUDE_CODE_USE_VERTEX":
+			continue
+		}
 		replaced := false
 		for i := range base {
 			if base[i].Name != ev.Name {
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 1590 - 1610,
The current replaceOrAppendEnvVars allows overriding operator-managed plain
Value env vars; add a protection list for operator-controlled names (at minimum
"USE_VERTEX" and "CLAUDE_CODE_USE_VERTEX") and treat those names as
non-overridable the same way you treat ValueFrom entries. Update
replaceOrAppendEnvVars: before overwriting base[i].Value/ValueFrom, check if
base[i].Name is in the protected set and if so mark replaced = true and break
without changing it; this prevents extras from replacing those specific
operator-managed flags while keeping the existing ValueFrom protection logic
intact.
♻️ Duplicate comments (1)
components/runners/ambient-runner/ambient_runner/observability_privacy.py (1)

25-57: ⚠️ Potential issue | 🟠 Major

Don't pass metadata through unmasked.

Line 31 treats metadata as fully trusted and copies it verbatim. Any user-controlled strings nested there bypass masking and will be exported to Langfuse/MLflow in clear text even when masking is enabled.

Possible fix
             if key in (
                 "usage",
                 "usage_details",
-                "metadata",
                 "model",
                 "turn",
                 "input_tokens",
@@
                 "tool_name",
                 "is_error",
                 "level",
             ):
                 masked[key] = value
+            elif key == "metadata":
+                masked[key] = privacy_mask_message_data(value)
             elif key in ("content", "text", "input", "output", "prompt", "completion"):
                 if isinstance(value, str) and len(value) > 50:
                     masked[key] = "[REDACTED FOR PRIVACY]"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/observability_privacy.py`
around lines 25 - 57, The code treats "metadata" as trusted and copies it
verbatim; remove "metadata" from the passthrough list in the conditional inside
privacy_mask_message_data (or the surrounding function that iterates
data.items()) so that when key == "metadata" you call
privacy_mask_message_data(value) (i.e., apply the same recursive masking as for
other untrusted fields) rather than copying it unchanged; locate the tuple
containing keys like "usage","model","turn", etc., remove "metadata" from that
tuple (or add an explicit branch that calls privacy_mask_message_data for
"metadata") to ensure nested user strings are masked before exporting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example`:
- Around line 4-10: Remove the duplicate MLFLOW_TRACKING_AUTH entries in both
the stringData mapping and the kubectl create secret CLI example: in the YAML
example (stringData) ensure MLFLOW_TRACKING_AUTH appears only once, and in the
CLI snippet replace the second --from-literal=MLFLOW_TRACKING_AUTH=... with
--from-literal=MLFLOW_WORKSPACE=<optional-workspace> so the example documents
the optional workspace setting; keep other keys like MLFLOW_TRACING_ENABLED,
MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME and OBSERVABILITY_BACKENDS
unchanged.

---

Outside diff comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1590-1610: The current replaceOrAppendEnvVars allows overriding
operator-managed plain Value env vars; add a protection list for
operator-controlled names (at minimum "USE_VERTEX" and "CLAUDE_CODE_USE_VERTEX")
and treat those names as non-overridable the same way you treat ValueFrom
entries. Update replaceOrAppendEnvVars: before overwriting
base[i].Value/ValueFrom, check if base[i].Name is in the protected set and if so
mark replaced = true and break without changing it; this prevents extras from
replacing those specific operator-managed flags while keeping the existing
ValueFrom protection logic intact.

---

Duplicate comments:
In `@components/runners/ambient-runner/ambient_runner/observability_privacy.py`:
- Around line 25-57: The code treats "metadata" as trusted and copies it
verbatim; remove "metadata" from the passthrough list in the conditional inside
privacy_mask_message_data (or the surrounding function that iterates
data.items()) so that when key == "metadata" you call
privacy_mask_message_data(value) (i.e., apply the same recursive masking as for
other untrusted fields) rather than copying it unchanged; locate the tuple
containing keys like "usage","model","turn", etc., remove "metadata" from that
tuple (or add an explicit branch that calls privacy_mask_message_data for
"metadata") to ensure nested user strings are masked before exporting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: db3a1034-9755-4c60-a2f1-4e400eceaab7

📥 Commits

Reviewing files that changed from the base of the PR and between b8aedc5 and 6d55d13.

⛔ Files ignored due to path filters (1)
  • components/runners/ambient-runner/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
  • components/operator/internal/handlers/reconciler.go
  • components/operator/internal/handlers/sessions.go
  • components/operator/internal/handlers/sessions_test.go
  • components/runners/ambient-runner/README.md
  • components/runners/ambient-runner/ambient_runner/mlflow_observability.py
  • components/runners/ambient-runner/ambient_runner/observability_config.py
  • components/runners/ambient-runner/ambient_runner/observability_privacy.py
  • components/runners/ambient-runner/pyproject.toml
  • components/runners/ambient-runner/tests/test_privacy_masking.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • components/runners/ambient-runner/README.md
  • components/runners/ambient-runner/pyproject.toml
  • components/runners/ambient-runner/ambient_runner/observability_config.py

…ed env vars

- Remove duplicate MLFLOW_TRACKING_AUTH entries from example secret
  (CLI and stringData), replacing second CLI occurrence with MLFLOW_WORKSPACE
- Add operatorProtectedEnvVars set in replaceOrAppendEnvVars to prevent
  user spec.environmentVariables from overriding USE_VERTEX and
  CLAUDE_CODE_USE_VERTEX (plain Value, not SecretKeyRef)
- Add test for operator-managed name protection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor

ambient-code bot commented Apr 9, 2026

Review comments addressed

Pushed 8d0e56c with fixes for two remaining review findings:

1. Remove duplicate MLFLOW_TRACKING_AUTH (CodeRabbit — Minor)

Removed duplicate MLFLOW_TRACKING_AUTH entries from both the kubectl CLI example and stringData in the example secret manifest. Replaced second CLI occurrence with MLFLOW_WORKSPACE to document the optional workspace setting.

2. Protect operator-managed plain Value env vars (CodeRabbit — outside-diff)

Added operatorProtectedEnvVars set (USE_VERTEX, CLAUDE_CODE_USE_VERTEX) to replaceOrAppendEnvVars so that user-supplied spec.environmentVariables cannot override these operator-managed flags (which use plain Value, not ValueFrom). Added test coverage.

🤖 Session

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/operator/internal/handlers/sessions.go (2)

1587-1623: ⚠️ Potential issue | 🔴 Critical

BACKEND_API_URL is still user-overridable.

This change protects secret-backed envs, but Line 1291 still lets spec.environmentVariables replace plain operator-managed values. In this file, BACKEND_API_URL is injected by the operator and used by the runner for backend credential/API calls, so a session can redirect those requests off-cluster while keeping the operator-issued bot token.

Suggested fix
 var operatorProtectedEnvVars = map[string]struct{}{
-	"USE_VERTEX":             {},
-	"CLAUDE_CODE_USE_VERTEX": {},
+	"USE_VERTEX":             {},
+	"CLAUDE_CODE_USE_VERTEX": {},
+	"BACKEND_API_URL":        {},
 }

As per coding guidelines, "Flag only errors, security risks, or functionality-breaking problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 1587 - 1623,
The BACKEND_API_URL plain env var is still user-overridable; update the
operatorProtectedEnvVars map so operator-injected plain values cannot be
replaced by spec.environmentVariables by adding "BACKEND_API_URL" (and any other
operator-managed non-ValueFrom names) to operatorProtectedEnvVars; since
replaceOrAppendEnvVars consults operatorProtectedEnvVars, this change will
prevent replaceOrAppendEnvVars from replacing base entries with that name (no
other code changes required).

2474-2488: ⚠️ Potential issue | 🟠 Major

Existing Roles will not reliably receive the new MLflow permission.

The new experiments rule only helps on fresh Role creation. In the AlreadyExists path (line 2486), the code updates a freshly constructed rbacv1.Role without the live object's resourceVersion; Kubernetes rejects that update, so pre-existing session Roles keep the old rules and MLflow Kubernetes auth fails after upgrade.

Fetch the existing Role first, copy updated rules into it, then update with the live object:

Fix
 	if _, err := config.K8sClient.RbacV1().Roles(sessionNamespace).Create(context.TODO(), role, v1.CreateOptions{}); err != nil {
 		if errors.IsAlreadyExists(err) {
-			// Update existing role to ensure latest permissions
-			if _, err := config.K8sClient.RbacV1().Roles(sessionNamespace).Update(context.TODO(), role, v1.UpdateOptions{}); err != nil {
+			existingRole, getErr := config.K8sClient.RbacV1().Roles(sessionNamespace).Get(context.TODO(), roleName, v1.GetOptions{})
+			if getErr != nil {
+				return fmt.Errorf("get Role for update: %w", getErr)
+			}
+			existingRole.Rules = role.Rules
+			existingRole.OwnerReferences = role.OwnerReferences
+			if _, err := config.K8sClient.RbacV1().Roles(sessionNamespace).Update(context.TODO(), existingRole, v1.UpdateOptions{}); err != nil {
 				return fmt.Errorf("update Role: %w", err)
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 2474 - 2488,
When handling an AlreadyExists error after creating the Role, don't call Update
with the freshly constructed role object; instead fetch the live Role via
config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...), copy or replace its
Rules with the new role.Rules (and any other intented fields), then call Update
on that fetched object so the live resourceVersion is preserved; adjust the
AlreadyExists branch around Roles(...).Create to Get -> modify Rules -> Update
to ensure existing Roles receive the new MLflow "experiments" permission.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 887-900: The current logic sets runnerSATokenAutomount and
ServiceAccountName whenever ambientMlflowObsSecretCopied is true, which mounts a
namespace-scoped token even if MLFLOW_TRACKING_AUTH isn't
"kubernetes-namespaced"; change the condition so you only set
runnerSATokenAutomount = true and runnerPodSAName =
fmt.Sprintf("ambient-session-%s", name) when ambientMlflowObsSecretCopied AND
the session's MLFLOW_TRACKING_AUTH equals "kubernetes-namespaced" (or equivalent
indicator used by your config), leaving runnerPodSAName empty and
AutomountServiceAccountToken false otherwise; update the podSpec construction
(the variables used there: runnerSATokenAutomount, runnerPodSAName) accordingly
so only sessions that require the kube token get the per-session ServiceAccount
and automount.

---

Outside diff comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1587-1623: The BACKEND_API_URL plain env var is still
user-overridable; update the operatorProtectedEnvVars map so operator-injected
plain values cannot be replaced by spec.environmentVariables by adding
"BACKEND_API_URL" (and any other operator-managed non-ValueFrom names) to
operatorProtectedEnvVars; since replaceOrAppendEnvVars consults
operatorProtectedEnvVars, this change will prevent replaceOrAppendEnvVars from
replacing base entries with that name (no other code changes required).
- Around line 2474-2488: When handling an AlreadyExists error after creating the
Role, don't call Update with the freshly constructed role object; instead fetch
the live Role via config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...),
copy or replace its Rules with the new role.Rules (and any other intented
fields), then call Update on that fetched object so the live resourceVersion is
preserved; adjust the AlreadyExists branch around Roles(...).Create to Get ->
modify Rules -> Update to ensure existing Roles receive the new MLflow
"experiments" permission.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b915d16c-4571-4785-a423-77f8f9f954f9

📥 Commits

Reviewing files that changed from the base of the PR and between 6d55d13 and 8d0e56c.

📒 Files selected for processing (3)
  • components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
  • components/operator/internal/handlers/sessions.go
  • components/operator/internal/handlers/sessions_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
  • components/operator/internal/handlers/sessions_test.go

… only

- Only enable ServiceAccount token automount and per-session SA when
  MLFLOW_TRACKING_AUTH is "kubernetes-namespaced" (read from the copied
  secret), not for all MLflow-secret sessions. This avoids granting an
  unnecessary namespace-scoped token to sessions using other auth methods.
- Remove duplicate MLFLOW_TRACKING_AUTH env var entry.

Addresses CodeRabbit review on PR ambient-code#1263.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor

ambient-code bot commented Apr 9, 2026

Pushed commit ad3c68c addressing the remaining CodeRabbit review comment:

  1. SA token automount scoped to kubernetes-namespaced auth only — The operator now reads MLFLOW_TRACKING_AUTH from the copied secret and only enables AutomountServiceAccountToken + per-session ServiceAccount when the value is kubernetes-namespaced. Sessions using other auth methods no longer get an unnecessary namespace-scoped token.

  2. Removed duplicate MLFLOW_TRACKING_AUTH env var — The env var was listed twice in the base slice; removed the duplicate entry.

🤖 Session

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2468-2474: The current code adds MLflow Experiment verbs to the
default runner token binding (the APIGroups/Resources block that grants
"mlflow.kubeflow.org" "experiments" get/list/update) which is created by
regenerateRunnerToken() and mounted into every runner at
/var/run/secrets/ambient; change this so the MLflow CRD permission is NOT
included in the default token: either remove that APIGroups/Resources entry from
the regenerateRunnerToken() token creation path, or conditionally add it only
when creating the specific pod identity/service-account used for
Kubernetes-backed MLflow auth (i.e., detect MLflow-backed sessions and attach
the rule to that service account instead of the default ambient token), ensuring
only the pod identity used for MLflow receives the experiments get/list/update
rights.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 851e0bda-dcd4-4532-b9e1-77ed38b5555e

📥 Commits

Reviewing files that changed from the base of the PR and between 8d0e56c and ad3c68c.

📒 Files selected for processing (1)
  • components/operator/internal/handlers/sessions.go

from mlflow.entities import SpanType

with mlflow.start_span(
name="Claude Code - Session Metrics",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the agent harness used as runner always Claude Code?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we also have gemini, but its dynamic. we manage a layer above all frameworks that bridge them together

exp_name = os.getenv("MLFLOW_EXPERIMENT_NAME", "ambient-code-sessions").strip()
if not exp_name:
exp_name = "ambient-code-sessions"
mlflow.set_experiment(exp_name)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean there is one global experiment for all runners / sessions or it's properly injected / set by ambient for each runner / agent?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I need to better understand the relationship in ambient between runners and sessions. I agree with you, it would probably be best to support one experiment per session.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So ambient has a structure of: workspace (namespace) -> session (CR) -> pod (reused claude sdk session) -> turns/runs

i would make one experiment per namespace

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that make sense to have experiment stable mapping to agent id for example, and then sessions for each "run / instance" of this agent?

Basically something like namespace - tenant > agent - experiment > sessions.

@Gkrumbach07
Copy link
Copy Markdown
Contributor

Gkrumbach07 commented Apr 10, 2026

@ambient-code can you review on if this PR supports mlflow observability across all frameworks instead of just claude code sdk. we also support gemini cli and in the future others. make sure this is not JUST claude code sdk traceing. add inline comments to this PR for issues u find

also fix the go format issue at internal/handlers/sessions.go by doing gofmt -w .

- Accept both "kubernetes" and "kubernetes-namespaced" values for
  MLFLOW_TRACKING_AUTH when deciding whether to enable SA token
  automount, matching the runner's behavior.
- Fix gofmt alignment in operatorProtectedEnvVars map.

Addresses CodeRabbit review and CI lint failure on PR ambient-code#1263.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor

ambient-code bot commented Apr 10, 2026

Pushed commit 85a197e addressing CI failure and CodeRabbit review:

  1. Fix gofmt formatting — Alignment in operatorProtectedEnvVars map was off, causing Go Lint - Operator CI failure.

  2. Handle both kubernetes and kubernetes-namespaced auth types — The SA token automount condition now accepts both values via a switch statement, matching the runner's behavior. Addresses CodeRabbit comment.

Note: The "SDD boundary check" failure is a GitHub permissions issue (Resource not accessible by integration), not related to this PR's code.

🤖 Session

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/operator/internal/handlers/sessions.go (1)

1433-1452: ⚠️ Potential issue | 🟠 Major

Scope MLflow Experiment RBAC to only MLflow K8s-auth sessions

Line 2468 adds mlflow.kubeflow.org/experiments get/list/update to the default per-session role, and Line 1451 mounts that SA token into every runner pod. This grants MLflow mutation capability to non-MLflow sessions too. Gate this rule to sessions that actually require MLFLOW_TRACKING_AUTH=kubernetes-namespaced (or split into a dedicated SA/Role path).

As per coding guidelines, "Flag only errors, security risks, or functionality-breaking problems."

Also applies to: 2468-2474

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 1433 - 1452,
The current code mounts the "runner-token" Secret (runnerTokenSecretName) into
every "ambient-code-runner" pod and the session role grants
mlflow.kubeflow.org/experiments get/list/update broadly; restrict this so only
sessions that actually use MLflow k8s auth get the token and RBAC: either (1)
add a conditional around the volume/VolumeMount code to only append the
"runner-token" volume and mount when the session's environment has
MLFLOW_TRACKING_AUTH == "kubernetes-namespaced", or (2) split the flow to create
a dedicated ServiceAccount/Role/RoleBinding that includes
mlflow.kubeflow.org/experiments permissions and only assign that SA to sessions
with MLFLOW_TRACKING_AUTH set; ensure you update where the Role is created (the
role that currently includes mlflow.kubeflow.org/experiments) to be scoped to
that dedicated SA path if you choose option (2).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1581-1587: The file sessions.go is not gofmt-formatted causing CI
to fail; run gofmt to fix formatting (e.g., run `gofmt -w .` from the
components/operator directory or `gofmt -w
components/operator/internal/handlers/sessions.go`) so the declaration of
operatorProtectedEnvVars and surrounding code are properly formatted and the
lint pipeline passes.

---

Duplicate comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1433-1452: The current code mounts the "runner-token" Secret
(runnerTokenSecretName) into every "ambient-code-runner" pod and the session
role grants mlflow.kubeflow.org/experiments get/list/update broadly; restrict
this so only sessions that actually use MLflow k8s auth get the token and RBAC:
either (1) add a conditional around the volume/VolumeMount code to only append
the "runner-token" volume and mount when the session's environment has
MLFLOW_TRACKING_AUTH == "kubernetes-namespaced", or (2) split the flow to create
a dedicated ServiceAccount/Role/RoleBinding that includes
mlflow.kubeflow.org/experiments permissions and only assign that SA to sessions
with MLFLOW_TRACKING_AUTH set; ensure you update where the Role is created (the
role that currently includes mlflow.kubeflow.org/experiments) to be scoped to
that dedicated SA path if you choose option (2).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7ac04190-e259-45ab-a6e4-f96c966b1353

📥 Commits

Reviewing files that changed from the base of the PR and between 8d0e56c and ad3c68c.

📒 Files selected for processing (1)
  • components/operator/internal/handlers/sessions.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/operator/internal/handlers/sessions.go (1)

2471-2485: ⚠️ Potential issue | 🟠 Major

Role permissions won't be updated on namespace redeployment.

The code calls Roles(...).Update(..., role, ...) with a freshly constructed object that lacks metadata.resourceVersion. Kubernetes requires the resourceVersion for optimistic concurrency control, so this Update fails. The Secret update path (lines 2540–2558) shows the correct pattern: fetch the existing object first via Get(), then update from that copy.

Result: Upgraded namespaces keep the old Role without MLflow experiments verbs, causing runtime auth failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 2471 - 2485,
The Role update fails because the freshly constructed role lacks
metadata.resourceVersion; change the update path that calls
config.K8sClient.RbacV1().Roles(sessionNamespace).Update(..., role, ...) to
first Get the existing Role (via
config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...)), copy/merge the
desired rules into that returned object (or set
existingRole.ObjectMeta.ResourceVersion onto your role variable) and then call
Update with the retrieved/updated object so Kubernetes optimistic concurrency
succeeds; ensure you reference the existing Role before updating to preserve
resourceVersion and other server-set fields.
♻️ Duplicate comments (1)
components/operator/internal/handlers/sessions.go (1)

2471-2477: ⚠️ Potential issue | 🟠 Major

These MLflow Experiment verbs are still granted to every runner token.

regenerateRunnerToken() is used for all sessions, and that session token is still mounted into every runner at /var/run/secrets/ambient later in this file. Adding mlflow.kubeflow.org/experiments here broadens non-MLflow sessions to namespace-scoped MLflow mutation rights as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 2471 - 2477,
The MLflow Experiment RBAC entry in the pod's token grants namespace-scoped
MLflow mutation rights to every session because regenerateRunnerToken() creates
a single token that is mounted into all runners at /var/run/secrets/ambient;
restrict this by making the MLflow APIGroups/Resources/Verbs entry conditional
or scoped only for sessions that actually need MLflow (e.g., add a flag or check
in regenerateRunnerToken() or the session creation path to only include the
mlflow.kubeflow.org/experiments rule for MLflow-enabled sessions), or generate a
separate token function for MLflow-enabled sessions and ensure only those tokens
are mounted into runners that require MLflow access rather than the shared
ambient token.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 315-317: The MLflow secret cleanup call
deleteAmbientMlflowObservabilitySecret(sessionNamespace, name) is placed inside
the "Stopped" block but is never reached because handleAgenticSessionEvent
returns early for phase == "Stopped"; move or duplicate the cleanup so it runs
on all stop paths — either invoke deleteAmbientMlflowObservabilitySecret
immediately before the early return in handleAgenticSessionEvent when phase ==
"Stopped", or add the same call into the Creating->stop-request path (the code
that currently calls deletePodAndPerPodService) so sessions that jump straight
to Stopped also remove the ambient-admin-mlflow-observability-secret; ensure you
use the same deleteCtx, sessionNamespace and name variables and log failures as
done elsewhere.

---

Outside diff comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2471-2485: The Role update fails because the freshly constructed
role lacks metadata.resourceVersion; change the update path that calls
config.K8sClient.RbacV1().Roles(sessionNamespace).Update(..., role, ...) to
first Get the existing Role (via
config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...)), copy/merge the
desired rules into that returned object (or set
existingRole.ObjectMeta.ResourceVersion onto your role variable) and then call
Update with the retrieved/updated object so Kubernetes optimistic concurrency
succeeds; ensure you reference the existing Role before updating to preserve
resourceVersion and other server-set fields.

---

Duplicate comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2471-2477: The MLflow Experiment RBAC entry in the pod's token
grants namespace-scoped MLflow mutation rights to every session because
regenerateRunnerToken() creates a single token that is mounted into all runners
at /var/run/secrets/ambient; restrict this by making the MLflow
APIGroups/Resources/Verbs entry conditional or scoped only for sessions that
actually need MLflow (e.g., add a flag or check in regenerateRunnerToken() or
the session creation path to only include the mlflow.kubeflow.org/experiments
rule for MLflow-enabled sessions), or generate a separate token function for
MLflow-enabled sessions and ensure only those tokens are mounted into runners
that require MLflow access rather than the shared ambient token.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4ad54f89-3c75-482d-9d10-1dbdd5f725b8

📥 Commits

Reviewing files that changed from the base of the PR and between ad3c68c and 85a197e.

📒 Files selected for processing (1)
  • components/operator/internal/handlers/sessions.go

Copy link
Copy Markdown
Contributor

@Gkrumbach07 Gkrumbach07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of framework-agnostic compatibility (per @Gkrumbach07 request).

The MLflow tracing implementation has several Claude Code-specific hardcoded references that should be generalized for multi-framework support (Gemini CLI, future frameworks). See inline comments for specifics.

🤖 Session


gen = mlflow.start_span(
name="claude_interaction",
span_type=SpanType.CHAIN,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code-specific: Span name "claude_interaction" is hardcoded. For multi-framework support (Gemini, etc.), this should be framework-agnostic — e.g. "llm_interaction" or parameterized from the bridge/framework identity.

from mlflow.entities import SpanType

with mlflow.start_span(
name="Claude Code - Session Metrics",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code-specific: Span name "Claude Code - Session Metrics" and attribute "ambient.source": "claude-code-metrics" are hardcoded to Claude Code. These should be framework-agnostic (e.g. "Session Metrics" / "ambient-runner-metrics") to work correctly with Gemini and future frameworks.

}

# Build tags list
tags = ["claude-code", f"namespace:{namespace}"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code-specific: Langfuse tag "claude-code" and span name "Claude Code - Session Metrics" (line 885) are hardcoded. These should be parameterized or use a generic name so traces from Gemini sessions are not incorrectly labeled as Claude Code.

exp_name = os.getenv("MLFLOW_EXPERIMENT_NAME", "ambient-code-sessions").strip()
if not exp_name:
exp_name = "ambient-code-sessions"
mlflow.set_experiment(exp_name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re @astefanutti question: This uses a single global experiment name (default ambient-code-sessions) shared across all sessions/runners. The session_id distinguishes individual sessions within traces, but all sessions land in the same experiment. A per-session experiment would be possible but may cause experiment sprawl — the current approach seems reasonable as long as session_id tagging is sufficient for filtering.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gkrumbach07 ideally from the angle of AI first initiative for example, there would be a stable experiment to agent mapping.

The Stopped-phase cleanup block is unreachable due to the early return
at line 248 for terminal phases. The Creating→Stopped transition path
(pod not found + stop requested) skips deletePodAndPerPodService, so
copied Langfuse and MLflow secrets could leak in the session namespace.

Add explicit secret cleanup before returning on that path.

Addresses CodeRabbit review on PR ambient-code#1263.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code ambient-code bot added ambient-code:needs-human AI circuit breaker fired, needs human attention and removed ambient-code:managed PR managed by AI automation labels Apr 10, 2026
@ambient-code
Copy link
Copy Markdown
Contributor

ambient-code bot commented Apr 10, 2026

Circuit breaker: retry_count=3 — handing off to a human

This PR has required 3 rounds of fixes across multiple sessions. The code fixes are functional (CI is green), but ongoing CodeRabbit and human review comments keep surfacing new issues:

Fixes applied in this session:

  1. SA token automount scoped to kubernetes-namespaced auth only (commit ad3c68c)
  2. Handle both kubernetes and kubernetes-namespaced auth types + gofmt fix (commit 85a197e)
  3. Cleanup observability secrets on Creating→Stopped path (commit 16e534a)

Outstanding items requiring human judgment:

  • Framework-agnostic naming — hardcoded "Claude Code" span names and tags in mlflow_observability.py and observability.py (review comments from @Gkrumbach07). These are naming/design decisions that need team alignment.
  • Experiment-per-namespace vs global@astefanutti and @Gkrumbach07 discussing MLflow experiment structure (namespace→agent→session mapping). Design decision needed.
  • Unreachable Stopped-phase block (lines 262-320) — the entire block is dead code due to early return at line 248. The Langfuse cleanup there was also unreachable before this PR. Consider removing or restructuring.
  • SDD boundary check CI failure is a GitHub Actions permissions issue (Resource not accessible by integration), unrelated to this PR.

Adding ambient-code:needs-human label and stopping automated work.

🤖 Session

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/operator/internal/handlers/sessions.go (2)

2482-2488: ⚠️ Potential issue | 🟠 Major

Keep MLflow Experiment permissions off the default runner token.

regenerateRunnerToken() is still used for every session, and that token is always mounted into the runner at /var/run/secrets/ambient (Lines 1450-1469). Adding mlflow.kubeflow.org/experiments get/list/update here gives non-MLflow sessions namespace-scoped MLflow mutation rights too. Bind this rule only to the pod identity used for ambientMlflowObsSecretCopied && mlflowK8sAuth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 2482 - 2488,
The added MLflow RBAC rule in the APIGroups/Resources/Verbs block grants
mlflow.kubeflow.org/experiments get/list/update to the default runner token
which is regenerated by regenerateRunnerToken() and mounted into every session;
restrict this rule so it is only applied when the pod identity used by ambient
MLflow is active (i.e., when ambientMlflowObsSecretCopied && mlflowK8sAuth are
true). Modify the logic that builds the RBAC binding (the code path that
includes the shown APIGroups block) to conditionally attach this rule only for
the pod identity used by ambient MLflow (check the same flags or selector used
for ambientMlflowObsSecretCopied/mlflowK8sAuth), or move the rule into the
specific Role/RoleBinding created for that pod identity instead of the default
runner token Role.

1154-1219: ⚠️ Potential issue | 🔴 Critical

Block user-supplied MLflow envs when the operator did not inject them.

replaceOrAppendEnvVars only protects names already present in base. If ambientMlflowObsSecretCopied is false, a session can still set OBSERVABILITY_BACKENDS=mlflow, MLFLOW_TRACING_ENABLED=true, and MLFLOW_TRACKING_URI=... in spec.environmentVariables, and the runner will enable MLflow from those values. That bypasses the operator opt-in and can route prompts/tool output to an arbitrary tracking endpoint. Cross-file evidence: ambient_runner/observability_config.py:13-38 and ambient_runner/mlflow_observability.py:34-96.

Also applies to: 1297-1299

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 1154 - 1219,
The code currently allows sessions to set MLflow-related env vars in
spec.environmentVariables when ambientMlflowObsSecretCopied is false, bypassing
operator opt-in; update the env merging logic (where base is combined with
spec.environmentVariables and replaceOrAppendEnvVars is used) to drop or
override any user-provided env vars with these names: OBSERVABILITY_BACKENDS,
MLFLOW_TRACING_ENABLED, MLFLOW_TRACKING_URI, MLFLOW_TRACKING_AUTH,
MLFLOW_EXPERIMENT_NAME, MLFLOW_WORKSPACE whenever ambientMlflowObsSecretCopied
is false; implement this by filtering spec.environmentVariables before calling
replaceOrAppendEnvVars (or by adding a guard in replaceOrAppendEnvVars) so
user-supplied keys are removed/ignored unless ambientMlflowObsSecretCopied is
true, referencing the ambientMlflowObsSecretCopied variable and the
replaceOrAppendEnvVars merge point in sessions.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 426-434: The Creating→Stopped cleanup currently deletes Langfuse
and MLflow copied secrets but misses the Vertex secret; add a call to delete the
Vertex observability secret in the same cleanup block—invoke the existing helper
(e.g., deleteAmbientVertexObservabilitySecret(cleanupCtx, sessionNamespace,
name)) alongside deleteAmbientLangfuseSecret and
deleteAmbientMlflowObservabilitySecret and log a warning on error consistent
with the other two calls so the ambient-vertex copy is removed on early stop.

---

Duplicate comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2482-2488: The added MLflow RBAC rule in the
APIGroups/Resources/Verbs block grants mlflow.kubeflow.org/experiments
get/list/update to the default runner token which is regenerated by
regenerateRunnerToken() and mounted into every session; restrict this rule so it
is only applied when the pod identity used by ambient MLflow is active (i.e.,
when ambientMlflowObsSecretCopied && mlflowK8sAuth are true). Modify the logic
that builds the RBAC binding (the code path that includes the shown APIGroups
block) to conditionally attach this rule only for the pod identity used by
ambient MLflow (check the same flags or selector used for
ambientMlflowObsSecretCopied/mlflowK8sAuth), or move the rule into the specific
Role/RoleBinding created for that pod identity instead of the default runner
token Role.
- Around line 1154-1219: The code currently allows sessions to set
MLflow-related env vars in spec.environmentVariables when
ambientMlflowObsSecretCopied is false, bypassing operator opt-in; update the env
merging logic (where base is combined with spec.environmentVariables and
replaceOrAppendEnvVars is used) to drop or override any user-provided env vars
with these names: OBSERVABILITY_BACKENDS, MLFLOW_TRACING_ENABLED,
MLFLOW_TRACKING_URI, MLFLOW_TRACKING_AUTH, MLFLOW_EXPERIMENT_NAME,
MLFLOW_WORKSPACE whenever ambientMlflowObsSecretCopied is false; implement this
by filtering spec.environmentVariables before calling replaceOrAppendEnvVars (or
by adding a guard in replaceOrAppendEnvVars) so user-supplied keys are
removed/ignored unless ambientMlflowObsSecretCopied is true, referencing the
ambientMlflowObsSecretCopied variable and the replaceOrAppendEnvVars merge point
in sessions.go.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3b8f8209-046b-4cbe-8928-909897265a64

📥 Commits

Reviewing files that changed from the base of the PR and between 85a197e and 16e534a.

📒 Files selected for processing (1)
  • components/operator/internal/handlers/sessions.go

Comment on lines +426 to +434
// Cleanup copied observability secrets (may have been copied in a prior reconciliation)
cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cleanupCancel()
if err := deleteAmbientLangfuseSecret(cleanupCtx, sessionNamespace, name); err != nil {
log.Printf("Warning: Failed to cleanup langfuse secret from %s during Creating→Stopped: %v", sessionNamespace, err)
}
if err := deleteAmbientMlflowObservabilitySecret(cleanupCtx, sessionNamespace, name); err != nil {
log.Printf("Warning: Failed to cleanup mlflow secret from %s during Creating→Stopped: %v", sessionNamespace, err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Creating→Stopped cleanup still misses ambient-vertex.

This stop path now deletes the Langfuse and MLflow copies, but if Vertex was enabled earlier in reconciliation, ambient-vertex can still remain in the session namespace when pod creation never completed. That leaves a copied secret behind on the same early-stop path this change is fixing.

Proposed fix
 				if err := deleteAmbientLangfuseSecret(cleanupCtx, sessionNamespace, name); err != nil {
 					log.Printf("Warning: Failed to cleanup langfuse secret from %s during Creating→Stopped: %v", sessionNamespace, err)
 				}
+				if err := deleteAmbientVertexSecret(cleanupCtx, sessionNamespace, name); err != nil {
+					log.Printf("Warning: Failed to cleanup %s secret from %s during Creating→Stopped: %v", types.AmbientVertexSecretName, sessionNamespace, err)
+				}
 				if err := deleteAmbientMlflowObservabilitySecret(cleanupCtx, sessionNamespace, name); err != nil {
 					log.Printf("Warning: Failed to cleanup mlflow secret from %s during Creating→Stopped: %v", sessionNamespace, err)
 				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 426 - 434,
The Creating→Stopped cleanup currently deletes Langfuse and MLflow copied
secrets but misses the Vertex secret; add a call to delete the Vertex
observability secret in the same cleanup block—invoke the existing helper (e.g.,
deleteAmbientVertexObservabilitySecret(cleanupCtx, sessionNamespace, name))
alongside deleteAmbientLangfuseSecret and deleteAmbientMlflowObservabilitySecret
and log a warning on error consistent with the other two calls so the
ambient-vertex copy is removed on early stop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ambient-code:needs-human AI circuit breaker fired, needs human attention

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants