feat(runner): optional MLflow tracing parallel to Langfuse#1263
feat(runner): optional MLflow tracing parallel to Langfuse#1263etirelli wants to merge 12 commits intoambient-code:mainfrom
Conversation
Add OBSERVABILITY_BACKENDS (default langfuse-only), MLflowSessionTracer with manual spans mirroring turn/tool boundaries, and mlflow_tracing_active helper. Operator copies ambient-admin-mlflow-observability-secret into session namespaces and injects MLFLOW_* / OBSERVABILITY_BACKENDS env. Document OTLP variables in README; add example secret manifest and mlflow-observability optional extra (mlflow>=3.9, opentelemetry-exporter-otlp). Langfuse behaviour and feedback trace IDs are unchanged when MLflow is off. Signed-off-by: Edson Tirelli <etirelli@redhat.com> Made-with: Cursor
end_turn() always calls _extract_assistant_text(); guard ImportError and fall\nback to blocks with a .text attribute so unit tests pass without the claude extra. Signed-off-by: Ambient Code Bot <bot@ambient-code.local> Made-with: Cursor
📝 WalkthroughWalkthroughAdds optional MLflow observability: operator copies an admin MLflow secret into session namespaces and injects secret-backed env vars; runner gains backend selection, privacy masking, MLflowSessionTracer, docs, tests, packaging extra; RBAC and cleanup adjusted for MLflow lifecycle. Changes
Sequence Diagram(s)sequenceDiagram
participant Operator as Operator (sessions.go)
participant K8s as Kubernetes API
participant Runner as Runner Pod (ambient-runner)
participant MLflow as MLflow Server
Operator->>K8s: Create session namespace
Operator->>K8s: Get ambient-admin-mlflow-observability-secret
alt secret present and MLFLOW_TRACING_ENABLED
Operator->>K8s: Copy secret into session namespace
K8s-->>Operator: Secret copied
Operator->>K8s: Create runner pod with secret-backed env vars + per-session SA
K8s->>Runner: Start pod (env injected)
Runner->>Runner: ObservabilityManager.initialize()
alt OBSERVABILITY_BACKENDS includes mlflow
Runner->>MLflow: Configure tracking URI & experiment
Runner->>MLflow: Emit turn/tool spans via MLflowSessionTracer
end
else secret missing or disabled
Operator-->>Runner: Runner starts without MLflow envs
end
Operator->>K8s: Detect session termination
Operator->>K8s: Delete copied MLflow secret (if no other active sessions)
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
Show resolved
Hide resolved
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1128-1173: The current SecretKeyRef-backed MLflow env vars
(MLFLOW_TRACING_ENABLED, MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME,
OBSERVABILITY_BACKENDS) are later overwritten by replaceOrAppendEnvVars(base,
userEnvVars); remove the ability for userEnvVars to override these by either (A)
filtering out these keys from userEnvVars before calling replaceOrAppendEnvVars,
or (B) changing replaceOrAppendEnvVars to skip replacing envs that already have
a ValueFrom (i.e., when existing.EnvVar.ValueFrom != nil). Update the code that
prepares base/userEnvVars (reference variables base and userEnvVars and function
replaceOrAppendEnvVars) to enforce this precedence so the SecretKeyRef-backed
entries cannot be overwritten by user-supplied spec.environmentVariables.
In `@components/runners/ambient-runner/ambient_runner/observability_privacy.py`:
- Around line 21-24: The masking logic currently only redacts strings longer
than 50 chars (checks on variable `data`) and also lets `metadata` pass through;
change the sanitizer so that when masking is enabled any string value (including
short strings) is redacted rather than only long ones, and ensure `metadata` is
recursively sanitized/removed before export. Locate the string handling
conditional that uses `isinstance(data, str)` and the `metadata` passthrough and
replace the length-check branch with a policy that returns a redaction token
(e.g., "[REDACTED FOR PRIVACY]") for all strings when masking is on, and ensure
nested structures (dicts/lists/metadata objects) are traversed and their string
values are sanitized the same way.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dde5612b-3a15-4efd-9a4a-e906612fee08
⛔ Files ignored due to path filters (1)
components/runners/ambient-runner/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (13)
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.examplecomponents/manifests/base/core/operator-deployment.yamlcomponents/operator/internal/handlers/sessions.gocomponents/runners/ambient-runner/README.mdcomponents/runners/ambient-runner/ambient_runner/AGENTS.mdcomponents/runners/ambient-runner/ambient_runner/bridges/claude/bridge.pycomponents/runners/ambient-runner/ambient_runner/mlflow_observability.pycomponents/runners/ambient-runner/ambient_runner/observability.pycomponents/runners/ambient-runner/ambient_runner/observability_config.pycomponents/runners/ambient-runner/ambient_runner/observability_privacy.pycomponents/runners/ambient-runner/pyproject.tomlcomponents/runners/ambient-runner/tests/test_observability_config.pycomponents/runners/ambient-runner/tests/test_observability_mlflow_integration.py
components/runners/ambient-runner/ambient_runner/observability_privacy.py
Show resolved
Hide resolved
components/runners/ambient-runner/ambient_runner/mlflow_observability.py
Show resolved
Hide resolved
- Protect SecretKeyRef-backed env vars from user override: replaceOrAppendEnvVars now skips replacing entries that have ValueFrom set, preventing user-supplied spec.environmentVariables from hijacking platform-managed secrets (Langfuse, MLflow, etc.) - Fix privacy masking to redact all strings regardless of length and recursively sanitize metadata, closing a leak of short sensitive values - Add MLFLOW_TRACKING_AUTH env var injection for MLflow 3.11+ Kubernetes-namespaced auth (per reviewer feedback) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review comments addressedPushed 1. Protect SecretKeyRef-backed env vars from user override (CodeRabbit — Critical)
2. Fix privacy masking to redact all strings (CodeRabbit — Major)
3. Add MLFLOW_TRACKING_AUTH for K8s-namespaced auth (@mprahl)Added
🤖 Session |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2025-2027: The MLflow secret cleanup is skipped because
deleteAmbientMlflowObservabilitySecret() returns early when the session is still
marked Running/Creating; change the helper to accept a force (bool) parameter
and, when force==true, skip the “active session” short-circuit so the secret is
deleted regardless of session state. Update calls from
deletePodAndPerPodService() to pass force=true (and leave other callers
unchanged or pass false), and ensure handleAgenticSessionEvent()’s normal stop
path still calls the helper with force as appropriate so the
ambient-admin-mlflow-observability-secret is reliably removed on stop.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 97ccfbb1-1059-45f5-a3bb-d457b86ea32e
📒 Files selected for processing (3)
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.examplecomponents/operator/internal/handlers/sessions.gocomponents/runners/ambient-runner/ambient_runner/observability_privacy.py
✅ Files skipped from review due to trivial changes (1)
- components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
- Bump mlflow-observability extra to mlflow[kubernetes]>=3.11 - Inject MLFLOW_TRACKING_AUTH and MLFLOW_WORKSPACE from admin secret - When MLflow observability secret is used: session SA + automount token - Grant mlflow.kubeflow.org/experiments get,list,update on session Role - Log auth mode and warn if k8s token path missing - Document in README, example secret, and PR notes Made-with: Cursor
…rvability masking - replaceOrAppendEnvVars skips overrides when base entry uses ValueFrom - Redact all non-empty strings; recurse metadata; add role to metric allow-list - Tests: replaceOrAppendEnvVars + updated privacy expectations - Docstrings on observability_config helpers (CodeRabbit doc coverage) Made-with: Cursor
…t opt-in Default masking again matches pre–CodeRabbit Langfuse: >50 char strings, metadata key passthrough. LANGFUSE_MASK_MESSAGES=strict enables full string redaction and recursive metadata scrubbing. Document in README; tests cover both. Made-with: Cursor
Keep a single backward-compatible mask (50-char heuristic, metadata passthrough). Made-with: Cursor
Exclude the initiating session from Running/Creating/Pending counts so deletePodAndPerPodService can remove shared Vertex/Langfuse/MLflow secrets when the last active workload is the session being stopped (CodeRabbit). Also run MLflow observability secret cleanup from TransitionToStopped alongside Langfuse/Vertex. Made-with: Cursor
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/operator/internal/handlers/sessions.go (1)
1590-1610:⚠️ Potential issue | 🟠 MajorBlock overrides of operator-managed auth flags too.
This helper now protects
ValueFromentries, but operator-managed plain envs are still replaceable here. In this file,USE_VERTEXandCLAUDE_CODE_USE_VERTEXare injected as normalValues, so a session can override them viaspec.environmentVariablesand force the runner into a credential mode that does not match the operator's secret setup.Possible fix
func replaceOrAppendEnvVars(base []corev1.EnvVar, extra []corev1.EnvVar) []corev1.EnvVar { for _, ev := range extra { + switch ev.Name { + case "USE_VERTEX", "CLAUDE_CODE_USE_VERTEX": + continue + } replaced := false for i := range base { if base[i].Name != ev.Name { continue }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 1590 - 1610, The current replaceOrAppendEnvVars allows overriding operator-managed plain Value env vars; add a protection list for operator-controlled names (at minimum "USE_VERTEX" and "CLAUDE_CODE_USE_VERTEX") and treat those names as non-overridable the same way you treat ValueFrom entries. Update replaceOrAppendEnvVars: before overwriting base[i].Value/ValueFrom, check if base[i].Name is in the protected set and if so mark replaced = true and break without changing it; this prevents extras from replacing those specific operator-managed flags while keeping the existing ValueFrom protection logic intact.
♻️ Duplicate comments (1)
components/runners/ambient-runner/ambient_runner/observability_privacy.py (1)
25-57:⚠️ Potential issue | 🟠 MajorDon't pass
metadatathrough unmasked.Line 31 treats
metadataas fully trusted and copies it verbatim. Any user-controlled strings nested there bypass masking and will be exported to Langfuse/MLflow in clear text even when masking is enabled.Possible fix
if key in ( "usage", "usage_details", - "metadata", "model", "turn", "input_tokens", @@ "tool_name", "is_error", "level", ): masked[key] = value + elif key == "metadata": + masked[key] = privacy_mask_message_data(value) elif key in ("content", "text", "input", "output", "prompt", "completion"): if isinstance(value, str) and len(value) > 50: masked[key] = "[REDACTED FOR PRIVACY]"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/ambient_runner/observability_privacy.py` around lines 25 - 57, The code treats "metadata" as trusted and copies it verbatim; remove "metadata" from the passthrough list in the conditional inside privacy_mask_message_data (or the surrounding function that iterates data.items()) so that when key == "metadata" you call privacy_mask_message_data(value) (i.e., apply the same recursive masking as for other untrusted fields) rather than copying it unchanged; locate the tuple containing keys like "usage","model","turn", etc., remove "metadata" from that tuple (or add an explicit branch that calls privacy_mask_message_data for "metadata") to ensure nested user strings are masked before exporting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example`:
- Around line 4-10: Remove the duplicate MLFLOW_TRACKING_AUTH entries in both
the stringData mapping and the kubectl create secret CLI example: in the YAML
example (stringData) ensure MLFLOW_TRACKING_AUTH appears only once, and in the
CLI snippet replace the second --from-literal=MLFLOW_TRACKING_AUTH=... with
--from-literal=MLFLOW_WORKSPACE=<optional-workspace> so the example documents
the optional workspace setting; keep other keys like MLFLOW_TRACING_ENABLED,
MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME and OBSERVABILITY_BACKENDS
unchanged.
---
Outside diff comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1590-1610: The current replaceOrAppendEnvVars allows overriding
operator-managed plain Value env vars; add a protection list for
operator-controlled names (at minimum "USE_VERTEX" and "CLAUDE_CODE_USE_VERTEX")
and treat those names as non-overridable the same way you treat ValueFrom
entries. Update replaceOrAppendEnvVars: before overwriting
base[i].Value/ValueFrom, check if base[i].Name is in the protected set and if so
mark replaced = true and break without changing it; this prevents extras from
replacing those specific operator-managed flags while keeping the existing
ValueFrom protection logic intact.
---
Duplicate comments:
In `@components/runners/ambient-runner/ambient_runner/observability_privacy.py`:
- Around line 25-57: The code treats "metadata" as trusted and copies it
verbatim; remove "metadata" from the passthrough list in the conditional inside
privacy_mask_message_data (or the surrounding function that iterates
data.items()) so that when key == "metadata" you call
privacy_mask_message_data(value) (i.e., apply the same recursive masking as for
other untrusted fields) rather than copying it unchanged; locate the tuple
containing keys like "usage","model","turn", etc., remove "metadata" from that
tuple (or add an explicit branch that calls privacy_mask_message_data for
"metadata") to ensure nested user strings are masked before exporting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: db3a1034-9755-4c60-a2f1-4e400eceaab7
⛔ Files ignored due to path filters (1)
components/runners/ambient-runner/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (10)
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.examplecomponents/operator/internal/handlers/reconciler.gocomponents/operator/internal/handlers/sessions.gocomponents/operator/internal/handlers/sessions_test.gocomponents/runners/ambient-runner/README.mdcomponents/runners/ambient-runner/ambient_runner/mlflow_observability.pycomponents/runners/ambient-runner/ambient_runner/observability_config.pycomponents/runners/ambient-runner/ambient_runner/observability_privacy.pycomponents/runners/ambient-runner/pyproject.tomlcomponents/runners/ambient-runner/tests/test_privacy_masking.py
🚧 Files skipped from review as they are similar to previous changes (3)
- components/runners/ambient-runner/README.md
- components/runners/ambient-runner/pyproject.toml
- components/runners/ambient-runner/ambient_runner/observability_config.py
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
Show resolved
Hide resolved
…ed env vars - Remove duplicate MLFLOW_TRACKING_AUTH entries from example secret (CLI and stringData), replacing second CLI occurrence with MLFLOW_WORKSPACE - Add operatorProtectedEnvVars set in replaceOrAppendEnvVars to prevent user spec.environmentVariables from overriding USE_VERTEX and CLAUDE_CODE_USE_VERTEX (plain Value, not SecretKeyRef) - Add test for operator-managed name protection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review comments addressedPushed 1. Remove duplicate MLFLOW_TRACKING_AUTH (CodeRabbit — Minor)Removed duplicate 2. Protect operator-managed plain Value env vars (CodeRabbit — outside-diff)Added 🤖 Session |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/operator/internal/handlers/sessions.go (2)
1587-1623:⚠️ Potential issue | 🔴 Critical
BACKEND_API_URLis still user-overridable.This change protects secret-backed envs, but Line 1291 still lets
spec.environmentVariablesreplace plain operator-managed values. In this file,BACKEND_API_URLis injected by the operator and used by the runner for backend credential/API calls, so a session can redirect those requests off-cluster while keeping the operator-issued bot token.Suggested fix
var operatorProtectedEnvVars = map[string]struct{}{ - "USE_VERTEX": {}, - "CLAUDE_CODE_USE_VERTEX": {}, + "USE_VERTEX": {}, + "CLAUDE_CODE_USE_VERTEX": {}, + "BACKEND_API_URL": {}, }As per coding guidelines, "Flag only errors, security risks, or functionality-breaking problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 1587 - 1623, The BACKEND_API_URL plain env var is still user-overridable; update the operatorProtectedEnvVars map so operator-injected plain values cannot be replaced by spec.environmentVariables by adding "BACKEND_API_URL" (and any other operator-managed non-ValueFrom names) to operatorProtectedEnvVars; since replaceOrAppendEnvVars consults operatorProtectedEnvVars, this change will prevent replaceOrAppendEnvVars from replacing base entries with that name (no other code changes required).
2474-2488:⚠️ Potential issue | 🟠 MajorExisting Roles will not reliably receive the new MLflow permission.
The new
experimentsrule only helps on freshRolecreation. In theAlreadyExistspath (line 2486), the code updates a freshly constructedrbacv1.Rolewithout the live object'sresourceVersion; Kubernetes rejects that update, so pre-existing session Roles keep the old rules and MLflow Kubernetes auth fails after upgrade.Fetch the existing Role first, copy updated rules into it, then update with the live object:
Fix
if _, err := config.K8sClient.RbacV1().Roles(sessionNamespace).Create(context.TODO(), role, v1.CreateOptions{}); err != nil { if errors.IsAlreadyExists(err) { - // Update existing role to ensure latest permissions - if _, err := config.K8sClient.RbacV1().Roles(sessionNamespace).Update(context.TODO(), role, v1.UpdateOptions{}); err != nil { + existingRole, getErr := config.K8sClient.RbacV1().Roles(sessionNamespace).Get(context.TODO(), roleName, v1.GetOptions{}) + if getErr != nil { + return fmt.Errorf("get Role for update: %w", getErr) + } + existingRole.Rules = role.Rules + existingRole.OwnerReferences = role.OwnerReferences + if _, err := config.K8sClient.RbacV1().Roles(sessionNamespace).Update(context.TODO(), existingRole, v1.UpdateOptions{}); err != nil { return fmt.Errorf("update Role: %w", err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 2474 - 2488, When handling an AlreadyExists error after creating the Role, don't call Update with the freshly constructed role object; instead fetch the live Role via config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...), copy or replace its Rules with the new role.Rules (and any other intented fields), then call Update on that fetched object so the live resourceVersion is preserved; adjust the AlreadyExists branch around Roles(...).Create to Get -> modify Rules -> Update to ensure existing Roles receive the new MLflow "experiments" permission.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 887-900: The current logic sets runnerSATokenAutomount and
ServiceAccountName whenever ambientMlflowObsSecretCopied is true, which mounts a
namespace-scoped token even if MLFLOW_TRACKING_AUTH isn't
"kubernetes-namespaced"; change the condition so you only set
runnerSATokenAutomount = true and runnerPodSAName =
fmt.Sprintf("ambient-session-%s", name) when ambientMlflowObsSecretCopied AND
the session's MLFLOW_TRACKING_AUTH equals "kubernetes-namespaced" (or equivalent
indicator used by your config), leaving runnerPodSAName empty and
AutomountServiceAccountToken false otherwise; update the podSpec construction
(the variables used there: runnerSATokenAutomount, runnerPodSAName) accordingly
so only sessions that require the kube token get the per-session ServiceAccount
and automount.
---
Outside diff comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1587-1623: The BACKEND_API_URL plain env var is still
user-overridable; update the operatorProtectedEnvVars map so operator-injected
plain values cannot be replaced by spec.environmentVariables by adding
"BACKEND_API_URL" (and any other operator-managed non-ValueFrom names) to
operatorProtectedEnvVars; since replaceOrAppendEnvVars consults
operatorProtectedEnvVars, this change will prevent replaceOrAppendEnvVars from
replacing base entries with that name (no other code changes required).
- Around line 2474-2488: When handling an AlreadyExists error after creating the
Role, don't call Update with the freshly constructed role object; instead fetch
the live Role via config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...),
copy or replace its Rules with the new role.Rules (and any other intented
fields), then call Update on that fetched object so the live resourceVersion is
preserved; adjust the AlreadyExists branch around Roles(...).Create to Get ->
modify Rules -> Update to ensure existing Roles receive the new MLflow
"experiments" permission.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b915d16c-4571-4785-a423-77f8f9f954f9
📒 Files selected for processing (3)
components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.examplecomponents/operator/internal/handlers/sessions.gocomponents/operator/internal/handlers/sessions_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example
- components/operator/internal/handlers/sessions_test.go
… only - Only enable ServiceAccount token automount and per-session SA when MLFLOW_TRACKING_AUTH is "kubernetes-namespaced" (read from the copied secret), not for all MLflow-secret sessions. This avoids granting an unnecessary namespace-scoped token to sessions using other auth methods. - Remove duplicate MLFLOW_TRACKING_AUTH env var entry. Addresses CodeRabbit review on PR ambient-code#1263. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Pushed commit ad3c68c addressing the remaining CodeRabbit review comment:
🤖 Session |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2468-2474: The current code adds MLflow Experiment verbs to the
default runner token binding (the APIGroups/Resources block that grants
"mlflow.kubeflow.org" "experiments" get/list/update) which is created by
regenerateRunnerToken() and mounted into every runner at
/var/run/secrets/ambient; change this so the MLflow CRD permission is NOT
included in the default token: either remove that APIGroups/Resources entry from
the regenerateRunnerToken() token creation path, or conditionally add it only
when creating the specific pod identity/service-account used for
Kubernetes-backed MLflow auth (i.e., detect MLflow-backed sessions and attach
the rule to that service account instead of the default ambient token), ensuring
only the pod identity used for MLflow receives the experiments get/list/update
rights.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 851e0bda-dcd4-4532-b9e1-77ed38b5555e
📒 Files selected for processing (1)
components/operator/internal/handlers/sessions.go
| from mlflow.entities import SpanType | ||
|
|
||
| with mlflow.start_span( | ||
| name="Claude Code - Session Metrics", |
There was a problem hiding this comment.
Is the agent harness used as runner always Claude Code?
There was a problem hiding this comment.
no, we also have gemini, but its dynamic. we manage a layer above all frameworks that bridge them together
| exp_name = os.getenv("MLFLOW_EXPERIMENT_NAME", "ambient-code-sessions").strip() | ||
| if not exp_name: | ||
| exp_name = "ambient-code-sessions" | ||
| mlflow.set_experiment(exp_name) |
There was a problem hiding this comment.
Does that mean there is one global experiment for all runners / sessions or it's properly injected / set by ambient for each runner / agent?
There was a problem hiding this comment.
yeah, I need to better understand the relationship in ambient between runners and sessions. I agree with you, it would probably be best to support one experiment per session.
There was a problem hiding this comment.
So ambient has a structure of: workspace (namespace) -> session (CR) -> pod (reused claude sdk session) -> turns/runs
i would make one experiment per namespace
There was a problem hiding this comment.
Would that make sense to have experiment stable mapping to agent id for example, and then sessions for each "run / instance" of this agent?
Basically something like namespace - tenant > agent - experiment > sessions.
|
@ambient-code can you review on if this PR supports mlflow observability across all frameworks instead of just claude code sdk. we also support gemini cli and in the future others. make sure this is not JUST claude code sdk traceing. add inline comments to this PR for issues u find also fix the go format issue at internal/handlers/sessions.go by doing gofmt -w . |
- Accept both "kubernetes" and "kubernetes-namespaced" values for MLFLOW_TRACKING_AUTH when deciding whether to enable SA token automount, matching the runner's behavior. - Fix gofmt alignment in operatorProtectedEnvVars map. Addresses CodeRabbit review and CI lint failure on PR ambient-code#1263. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Pushed commit 85a197e addressing CI failure and CodeRabbit review:
Note: The "SDD boundary check" failure is a GitHub permissions issue ( 🤖 Session |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/operator/internal/handlers/sessions.go (1)
1433-1452:⚠️ Potential issue | 🟠 MajorScope MLflow Experiment RBAC to only MLflow K8s-auth sessions
Line 2468 adds
mlflow.kubeflow.org/experimentsget/list/updateto the default per-session role, and Line 1451 mounts that SA token into every runner pod. This grants MLflow mutation capability to non-MLflow sessions too. Gate this rule to sessions that actually requireMLFLOW_TRACKING_AUTH=kubernetes-namespaced(or split into a dedicated SA/Role path).As per coding guidelines, "Flag only errors, security risks, or functionality-breaking problems."
Also applies to: 2468-2474
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 1433 - 1452, The current code mounts the "runner-token" Secret (runnerTokenSecretName) into every "ambient-code-runner" pod and the session role grants mlflow.kubeflow.org/experiments get/list/update broadly; restrict this so only sessions that actually use MLflow k8s auth get the token and RBAC: either (1) add a conditional around the volume/VolumeMount code to only append the "runner-token" volume and mount when the session's environment has MLFLOW_TRACKING_AUTH == "kubernetes-namespaced", or (2) split the flow to create a dedicated ServiceAccount/Role/RoleBinding that includes mlflow.kubeflow.org/experiments permissions and only assign that SA to sessions with MLFLOW_TRACKING_AUTH set; ensure you update where the Role is created (the role that currently includes mlflow.kubeflow.org/experiments) to be scoped to that dedicated SA path if you choose option (2).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1581-1587: The file sessions.go is not gofmt-formatted causing CI
to fail; run gofmt to fix formatting (e.g., run `gofmt -w .` from the
components/operator directory or `gofmt -w
components/operator/internal/handlers/sessions.go`) so the declaration of
operatorProtectedEnvVars and surrounding code are properly formatted and the
lint pipeline passes.
---
Duplicate comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1433-1452: The current code mounts the "runner-token" Secret
(runnerTokenSecretName) into every "ambient-code-runner" pod and the session
role grants mlflow.kubeflow.org/experiments get/list/update broadly; restrict
this so only sessions that actually use MLflow k8s auth get the token and RBAC:
either (1) add a conditional around the volume/VolumeMount code to only append
the "runner-token" volume and mount when the session's environment has
MLFLOW_TRACKING_AUTH == "kubernetes-namespaced", or (2) split the flow to create
a dedicated ServiceAccount/Role/RoleBinding that includes
mlflow.kubeflow.org/experiments permissions and only assign that SA to sessions
with MLFLOW_TRACKING_AUTH set; ensure you update where the Role is created (the
role that currently includes mlflow.kubeflow.org/experiments) to be scoped to
that dedicated SA path if you choose option (2).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7ac04190-e259-45ab-a6e4-f96c966b1353
📒 Files selected for processing (1)
components/operator/internal/handlers/sessions.go
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/operator/internal/handlers/sessions.go (1)
2471-2485:⚠️ Potential issue | 🟠 MajorRole permissions won't be updated on namespace redeployment.
The code calls
Roles(...).Update(..., role, ...)with a freshly constructed object that lacksmetadata.resourceVersion. Kubernetes requires the resourceVersion for optimistic concurrency control, so this Update fails. The Secret update path (lines 2540–2558) shows the correct pattern: fetch the existing object first viaGet(), then update from that copy.Result: Upgraded namespaces keep the old Role without MLflow
experimentsverbs, causing runtime auth failures.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 2471 - 2485, The Role update fails because the freshly constructed role lacks metadata.resourceVersion; change the update path that calls config.K8sClient.RbacV1().Roles(sessionNamespace).Update(..., role, ...) to first Get the existing Role (via config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...)), copy/merge the desired rules into that returned object (or set existingRole.ObjectMeta.ResourceVersion onto your role variable) and then call Update with the retrieved/updated object so Kubernetes optimistic concurrency succeeds; ensure you reference the existing Role before updating to preserve resourceVersion and other server-set fields.
♻️ Duplicate comments (1)
components/operator/internal/handlers/sessions.go (1)
2471-2477:⚠️ Potential issue | 🟠 MajorThese MLflow Experiment verbs are still granted to every runner token.
regenerateRunnerToken()is used for all sessions, and that session token is still mounted into every runner at/var/run/secrets/ambientlater in this file. Addingmlflow.kubeflow.org/experimentshere broadens non-MLflow sessions to namespace-scoped MLflow mutation rights as well.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 2471 - 2477, The MLflow Experiment RBAC entry in the pod's token grants namespace-scoped MLflow mutation rights to every session because regenerateRunnerToken() creates a single token that is mounted into all runners at /var/run/secrets/ambient; restrict this by making the MLflow APIGroups/Resources/Verbs entry conditional or scoped only for sessions that actually need MLflow (e.g., add a flag or check in regenerateRunnerToken() or the session creation path to only include the mlflow.kubeflow.org/experiments rule for MLflow-enabled sessions), or generate a separate token function for MLflow-enabled sessions and ensure only those tokens are mounted into runners that require MLflow access rather than the shared ambient token.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 315-317: The MLflow secret cleanup call
deleteAmbientMlflowObservabilitySecret(sessionNamespace, name) is placed inside
the "Stopped" block but is never reached because handleAgenticSessionEvent
returns early for phase == "Stopped"; move or duplicate the cleanup so it runs
on all stop paths — either invoke deleteAmbientMlflowObservabilitySecret
immediately before the early return in handleAgenticSessionEvent when phase ==
"Stopped", or add the same call into the Creating->stop-request path (the code
that currently calls deletePodAndPerPodService) so sessions that jump straight
to Stopped also remove the ambient-admin-mlflow-observability-secret; ensure you
use the same deleteCtx, sessionNamespace and name variables and log failures as
done elsewhere.
---
Outside diff comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2471-2485: The Role update fails because the freshly constructed
role lacks metadata.resourceVersion; change the update path that calls
config.K8sClient.RbacV1().Roles(sessionNamespace).Update(..., role, ...) to
first Get the existing Role (via
config.K8sClient.RbacV1().Roles(sessionNamespace).Get(...)), copy/merge the
desired rules into that returned object (or set
existingRole.ObjectMeta.ResourceVersion onto your role variable) and then call
Update with the retrieved/updated object so Kubernetes optimistic concurrency
succeeds; ensure you reference the existing Role before updating to preserve
resourceVersion and other server-set fields.
---
Duplicate comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2471-2477: The MLflow Experiment RBAC entry in the pod's token
grants namespace-scoped MLflow mutation rights to every session because
regenerateRunnerToken() creates a single token that is mounted into all runners
at /var/run/secrets/ambient; restrict this by making the MLflow
APIGroups/Resources/Verbs entry conditional or scoped only for sessions that
actually need MLflow (e.g., add a flag or check in regenerateRunnerToken() or
the session creation path to only include the mlflow.kubeflow.org/experiments
rule for MLflow-enabled sessions), or generate a separate token function for
MLflow-enabled sessions and ensure only those tokens are mounted into runners
that require MLflow access rather than the shared ambient token.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4ad54f89-3c75-482d-9d10-1dbdd5f725b8
📒 Files selected for processing (1)
components/operator/internal/handlers/sessions.go
Gkrumbach07
left a comment
There was a problem hiding this comment.
Review of framework-agnostic compatibility (per @Gkrumbach07 request).
The MLflow tracing implementation has several Claude Code-specific hardcoded references that should be generalized for multi-framework support (Gemini CLI, future frameworks). See inline comments for specifics.
🤖 Session
|
|
||
| gen = mlflow.start_span( | ||
| name="claude_interaction", | ||
| span_type=SpanType.CHAIN, |
There was a problem hiding this comment.
Claude Code-specific: Span name "claude_interaction" is hardcoded. For multi-framework support (Gemini, etc.), this should be framework-agnostic — e.g. "llm_interaction" or parameterized from the bridge/framework identity.
| from mlflow.entities import SpanType | ||
|
|
||
| with mlflow.start_span( | ||
| name="Claude Code - Session Metrics", |
There was a problem hiding this comment.
Claude Code-specific: Span name "Claude Code - Session Metrics" and attribute "ambient.source": "claude-code-metrics" are hardcoded to Claude Code. These should be framework-agnostic (e.g. "Session Metrics" / "ambient-runner-metrics") to work correctly with Gemini and future frameworks.
| } | ||
|
|
||
| # Build tags list | ||
| tags = ["claude-code", f"namespace:{namespace}"] |
There was a problem hiding this comment.
Claude Code-specific: Langfuse tag "claude-code" and span name "Claude Code - Session Metrics" (line 885) are hardcoded. These should be parameterized or use a generic name so traces from Gemini sessions are not incorrectly labeled as Claude Code.
| exp_name = os.getenv("MLFLOW_EXPERIMENT_NAME", "ambient-code-sessions").strip() | ||
| if not exp_name: | ||
| exp_name = "ambient-code-sessions" | ||
| mlflow.set_experiment(exp_name) |
There was a problem hiding this comment.
Re @astefanutti question: This uses a single global experiment name (default ambient-code-sessions) shared across all sessions/runners. The session_id distinguishes individual sessions within traces, but all sessions land in the same experiment. A per-session experiment would be possible but may cause experiment sprawl — the current approach seems reasonable as long as session_id tagging is sufficient for filtering.
There was a problem hiding this comment.
@Gkrumbach07 ideally from the angle of AI first initiative for example, there would be a stable experiment to agent mapping.
The Stopped-phase cleanup block is unreachable due to the early return at line 248 for terminal phases. The Creating→Stopped transition path (pod not found + stop requested) skips deletePodAndPerPodService, so copied Langfuse and MLflow secrets could leak in the session namespace. Add explicit secret cleanup before returning on that path. Addresses CodeRabbit review on PR ambient-code#1263. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Circuit breaker: retry_count=3 — handing off to a humanThis PR has required 3 rounds of fixes across multiple sessions. The code fixes are functional (CI is green), but ongoing CodeRabbit and human review comments keep surfacing new issues: Fixes applied in this session:
Outstanding items requiring human judgment:
Adding 🤖 Session |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/operator/internal/handlers/sessions.go (2)
2482-2488:⚠️ Potential issue | 🟠 MajorKeep MLflow Experiment permissions off the default runner token.
regenerateRunnerToken()is still used for every session, and that token is always mounted into the runner at/var/run/secrets/ambient(Lines 1450-1469). Addingmlflow.kubeflow.org/experimentsget/list/updatehere gives non-MLflow sessions namespace-scoped MLflow mutation rights too. Bind this rule only to the pod identity used forambientMlflowObsSecretCopied && mlflowK8sAuth.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 2482 - 2488, The added MLflow RBAC rule in the APIGroups/Resources/Verbs block grants mlflow.kubeflow.org/experiments get/list/update to the default runner token which is regenerated by regenerateRunnerToken() and mounted into every session; restrict this rule so it is only applied when the pod identity used by ambient MLflow is active (i.e., when ambientMlflowObsSecretCopied && mlflowK8sAuth are true). Modify the logic that builds the RBAC binding (the code path that includes the shown APIGroups block) to conditionally attach this rule only for the pod identity used by ambient MLflow (check the same flags or selector used for ambientMlflowObsSecretCopied/mlflowK8sAuth), or move the rule into the specific Role/RoleBinding created for that pod identity instead of the default runner token Role.
1154-1219:⚠️ Potential issue | 🔴 CriticalBlock user-supplied MLflow envs when the operator did not inject them.
replaceOrAppendEnvVarsonly protects names already present inbase. IfambientMlflowObsSecretCopiedis false, a session can still setOBSERVABILITY_BACKENDS=mlflow,MLFLOW_TRACING_ENABLED=true, andMLFLOW_TRACKING_URI=...inspec.environmentVariables, and the runner will enable MLflow from those values. That bypasses the operator opt-in and can route prompts/tool output to an arbitrary tracking endpoint. Cross-file evidence:ambient_runner/observability_config.py:13-38andambient_runner/mlflow_observability.py:34-96.Also applies to: 1297-1299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/operator/internal/handlers/sessions.go` around lines 1154 - 1219, The code currently allows sessions to set MLflow-related env vars in spec.environmentVariables when ambientMlflowObsSecretCopied is false, bypassing operator opt-in; update the env merging logic (where base is combined with spec.environmentVariables and replaceOrAppendEnvVars is used) to drop or override any user-provided env vars with these names: OBSERVABILITY_BACKENDS, MLFLOW_TRACING_ENABLED, MLFLOW_TRACKING_URI, MLFLOW_TRACKING_AUTH, MLFLOW_EXPERIMENT_NAME, MLFLOW_WORKSPACE whenever ambientMlflowObsSecretCopied is false; implement this by filtering spec.environmentVariables before calling replaceOrAppendEnvVars (or by adding a guard in replaceOrAppendEnvVars) so user-supplied keys are removed/ignored unless ambientMlflowObsSecretCopied is true, referencing the ambientMlflowObsSecretCopied variable and the replaceOrAppendEnvVars merge point in sessions.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 426-434: The Creating→Stopped cleanup currently deletes Langfuse
and MLflow copied secrets but misses the Vertex secret; add a call to delete the
Vertex observability secret in the same cleanup block—invoke the existing helper
(e.g., deleteAmbientVertexObservabilitySecret(cleanupCtx, sessionNamespace,
name)) alongside deleteAmbientLangfuseSecret and
deleteAmbientMlflowObservabilitySecret and log a warning on error consistent
with the other two calls so the ambient-vertex copy is removed on early stop.
---
Duplicate comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 2482-2488: The added MLflow RBAC rule in the
APIGroups/Resources/Verbs block grants mlflow.kubeflow.org/experiments
get/list/update to the default runner token which is regenerated by
regenerateRunnerToken() and mounted into every session; restrict this rule so it
is only applied when the pod identity used by ambient MLflow is active (i.e.,
when ambientMlflowObsSecretCopied && mlflowK8sAuth are true). Modify the logic
that builds the RBAC binding (the code path that includes the shown APIGroups
block) to conditionally attach this rule only for the pod identity used by
ambient MLflow (check the same flags or selector used for
ambientMlflowObsSecretCopied/mlflowK8sAuth), or move the rule into the specific
Role/RoleBinding created for that pod identity instead of the default runner
token Role.
- Around line 1154-1219: The code currently allows sessions to set
MLflow-related env vars in spec.environmentVariables when
ambientMlflowObsSecretCopied is false, bypassing operator opt-in; update the env
merging logic (where base is combined with spec.environmentVariables and
replaceOrAppendEnvVars is used) to drop or override any user-provided env vars
with these names: OBSERVABILITY_BACKENDS, MLFLOW_TRACING_ENABLED,
MLFLOW_TRACKING_URI, MLFLOW_TRACKING_AUTH, MLFLOW_EXPERIMENT_NAME,
MLFLOW_WORKSPACE whenever ambientMlflowObsSecretCopied is false; implement this
by filtering spec.environmentVariables before calling replaceOrAppendEnvVars (or
by adding a guard in replaceOrAppendEnvVars) so user-supplied keys are
removed/ignored unless ambientMlflowObsSecretCopied is true, referencing the
ambientMlflowObsSecretCopied variable and the replaceOrAppendEnvVars merge point
in sessions.go.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3b8f8209-046b-4cbe-8928-909897265a64
📒 Files selected for processing (1)
components/operator/internal/handlers/sessions.go
| // Cleanup copied observability secrets (may have been copied in a prior reconciliation) | ||
| cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 30*time.Second) | ||
| defer cleanupCancel() | ||
| if err := deleteAmbientLangfuseSecret(cleanupCtx, sessionNamespace, name); err != nil { | ||
| log.Printf("Warning: Failed to cleanup langfuse secret from %s during Creating→Stopped: %v", sessionNamespace, err) | ||
| } | ||
| if err := deleteAmbientMlflowObservabilitySecret(cleanupCtx, sessionNamespace, name); err != nil { | ||
| log.Printf("Warning: Failed to cleanup mlflow secret from %s during Creating→Stopped: %v", sessionNamespace, err) | ||
| } |
There was a problem hiding this comment.
Creating→Stopped cleanup still misses ambient-vertex.
This stop path now deletes the Langfuse and MLflow copies, but if Vertex was enabled earlier in reconciliation, ambient-vertex can still remain in the session namespace when pod creation never completed. That leaves a copied secret behind on the same early-stop path this change is fixing.
Proposed fix
if err := deleteAmbientLangfuseSecret(cleanupCtx, sessionNamespace, name); err != nil {
log.Printf("Warning: Failed to cleanup langfuse secret from %s during Creating→Stopped: %v", sessionNamespace, err)
}
+ if err := deleteAmbientVertexSecret(cleanupCtx, sessionNamespace, name); err != nil {
+ log.Printf("Warning: Failed to cleanup %s secret from %s during Creating→Stopped: %v", types.AmbientVertexSecretName, sessionNamespace, err)
+ }
if err := deleteAmbientMlflowObservabilitySecret(cleanupCtx, sessionNamespace, name); err != nil {
log.Printf("Warning: Failed to cleanup mlflow secret from %s during Creating→Stopped: %v", sessionNamespace, err)
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/operator/internal/handlers/sessions.go` around lines 426 - 434,
The Creating→Stopped cleanup currently deletes Langfuse and MLflow copied
secrets but misses the Vertex secret; add a call to delete the Vertex
observability secret in the same cleanup block—invoke the existing helper (e.g.,
deleteAmbientVertexObservabilitySecret(cleanupCtx, sessionNamespace, name))
alongside deleteAmbientLangfuseSecret and deleteAmbientMlflowObservabilitySecret
and log a warning on error consistent with the other two calls so the
ambient-vertex copy is removed on early stop.
Summary
Adds optional MLflow GenAI tracing alongside the existing Langfuse path. Langfuse remains the default when
OBSERVABILITY_BACKENDSis unset, so current reports, evals, and feedback (ambient:langfuse_trace/traceId) behave as before unless operators opt in.Backward compatibility:
LANGFUSE_MASK_MESSAGES/ shared masking uses the same rules as before this PR (e.g. redact long strings, allow-listed keys includingmetadataunchanged). Sessionspec.environmentVariablescannot override operator-injected env vars that come from SecretKeyRef (Langfuse + MLflow), so platform observability config is not bypassed.What changed
Runner (
ambient-runner)OBSERVABILITY_BACKENDS: comma listlangfuse,mlflow(default:langfuseonly).MLflowSessionTracer(mlflow_observability.py): mirrors turn / tool boundaries withmlflow.start_span, usage attributes, session summary span, error cleanup, and flush.observability_config.py/observability_privacy.py: backend selection and shared message masking (LANGFUSE_MASK_MESSAGESstill applies to both paths).ObservabilityManager: orchestrates Langfuse + MLflow;mlflow_tracing_activeproperty; per-backend try/except so one backend failing does not disable the other.pyproject.toml: optional extramlflow-observability(mlflow[kubernetes]>=3.11,opentelemetry-exporter-otlp); included inall(Docker[all]install).OTEL_*,MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT).test_observability_config.py,test_observability_mlflow_integration.py;_extract_assistant_texttolerates missingclaude_agent_sdkso pytest passes without theclaudeextra.Operator & manifests
ambient-admin-mlflow-observability-secretinto session namespaces whenMLFLOW_TRACING_ENABLEDis set on the operator (same pattern as Langfuse).MLFLOW_TRACING_ENABLED,MLFLOW_TRACKING_URI,MLFLOW_EXPERIMENT_NAME, optionalMLFLOW_TRACKING_AUTH/MLFLOW_WORKSPACE,OBSERVABILITY_BACKENDSviasecretKeyRef.replaceOrAppendEnvVars: entries already usingValueFrom(secret-backed) are not replaced byspec.environmentVariables, so user env cannot override Langfuse or MLflow keys wired from secrets.ServiceAccountName: ambient-session-<session>with token automount soMLFLOW_TRACKING_AUTH=kubernetes-namespacedcan supply JWT + workspace headers (per MLflow 3.11). SessionRolegrantsexperiments(mlflow.kubeflow.org) for Kubeflow-style workspace auth (tune for your CRD group).operator-deployment.yaml: optional env from the MLflow observability secret.components/manifests/base/ambient-admin-mlflow-observability-secret.yaml.example.Capabilities API
FrameworkCapabilities.tracing: e.g.langfuse,mlflow, orlangfuse,mlflowviatracing_capability_label.How to enable (high level)
ambient-admin-mlflow-observability-secretin the operator namespace (see example YAML).MLFLOW_TRACING_ENABLED=true(and ensure the secret exists) so the operator copies it and wires runner env.[all]builds).Testing
cd components/runners/ambient-runner && uv run pytest tests/(full suite; includes observability + privacy tests).go test ./...with a Go toolchain matchinggo.mod(e.g. 1.24+), includingTestReplaceOrAppendEnvVarsPreservesValueFrom.Commits (squash on merge)
Branch includes the feature commit plus follow-ups, e.g. MLflow k8s auth / SA automount / experiment RBAC,
ValueFromenv precedence, assistant-text test fix, and masking left aligned with historical Langfuse behaviour—seegit logon the PR branch for the exact list.Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Chores