feat(v1-rc1): 2 detectors + 8 pattern specs + chart NetPol + rc1 audits #338
Merged
added 30 commits
June 1, 2026 00:01
NORTHSTARS O2 hero KPI (install-to-first-data) and O5 hero KPI (>=15 production orgs at M12) both pointed at undefined ground truth. O2 named a 'production-realistic 32-GPU' tier without binding any hardware spec; O5 had a methodology stub at docs/nps.md but no funnel mechanics, no tracking artifact, no public-counter rule consistent with O2 Op-Rule #5 / O5 Op-Rule #1 (never phone home). docs/reference-environments.md pins both tiers - Minimal (kind, k8s 1.32, DCGM 4.4.x stub, NCCL 2.30+, ubuntu-latest, the existing bench/install/ harness) and Production-realistic (32xH100 SXM5 across 4 nodes, NVLink+IB, Calico+Multus+RDMA, Lustre, Kueue v0.17.x). Names three cluster-access paths (partner-volunteered, rented, quarterly-drill fallback) and flags partner-volunteered as the v1.0 default expectation. docs/adoption-pipeline.md defines the S0->S3 funnel with explicit definition-of-done per stage, target pilot profile, the docs/followups/_pilots.md tracking artifact, the release-prep-PR public-counter cadence (no scraping, no auto-update), and one-line comms templates per transition. Cross-refs added in docs/nps.md and docs/README.md. Signed-off-by: Tri Lam <tri@maydow.com>
Three rc1-blocking operational gaps documented in docs/v1-rc1-operational-gaps.md with file:line evidence, falsifiable cut criteria, numbered remediation steps tagged by work-type (code/doc/ops/external-dep), S/M/L effort estimates, and explicit external blockers (slsa-go-generator OCB-submodule integration; healthcheckextension single-path upstream limitation; harden-runner audit-only on hosted runners). MILESTONES.md M21 cross-refs the new doc as a release-prep dependency. Per-step issues #314-#321 filed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Marks XidCorrelationVerdict.EvictedPod deprecated in v0.4; legacy field continues to emit alongside the k8s-semconv split PodName / PodNamespace (already populated since #270) through the v0.4-v0.5 window and is removed in v0.6 per ATTRIBUTES.md soft-lock policy. Adds TestXidCorrelationVerdict_DeprecatedEvictedPodCoEmits pinning that both legacy + successor names co-emit in the JSON shape, so a regression that drops either side fails the gate. Updates docs/ATTRIBUTES.md with a deprecation row + removal-target table. Signed-off-by: Tri Lam <tri@maydow.com>
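The co-emit pin described above can be sketched as a check on the marshalled JSON shape. This is a minimal illustration, not the repository's test: the struct below is a hypothetical stand-in for XidCorrelationVerdict, and the JSON key names and the "namespace/name" form of the legacy value are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// verdictSketch is a hypothetical, trimmed-down stand-in for
// XidCorrelationVerdict. The JSON key names are assumptions.
type verdictSketch struct {
	// Deprecated: legacy field, still emitted during the deprecation window.
	EvictedPod   string `json:"evicted_pod"`
	PodName      string `json:"pod_name"`
	PodNamespace string `json:"pod_namespace"`
}

// missingKeys marshals the verdict and returns every legacy or successor key
// that is absent from the JSON shape. An empty result means both sides co-emit.
func missingKeys(v verdictSketch) ([]string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var shape map[string]json.RawMessage
	if err := json.Unmarshal(raw, &shape); err != nil {
		return nil, err
	}
	var missing []string
	for _, key := range []string{"evicted_pod", "pod_name", "pod_namespace"} {
		if _, ok := shape[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}

func main() {
	v := verdictSketch{EvictedPod: "ml/trainer-0", PodName: "trainer-0", PodNamespace: "ml"}
	missing, err := missingKeys(v)
	if err != nil {
		panic(err)
	}
	fmt.Println("missing keys:", missing)
}
```

A regression that drops a field, or adds `omitempty` so an empty legacy value disappears, would show up as a non-empty `missing` slice.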
Eight 1-page pattern-design specs covering #2 IB link flap, #7 dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout, #10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss spike NaN, #13 silent data corruption. Each carries the standard detector-design shape (symptom, layers, signal sources, evaluation rule, verdict attrs, edge cases, status, open questions) so the next contributor can write a TDD red test directly off the spec. Status: all 8 marked planned. #10 already has issue #303; the spec frames the design alongside. NORTHSTARS Appendix A gains a Spec column; docs/README + patterns README link the new specs. Signed-off-by: Tri Lam <tri@maydow.com>
The drop-rate recording rule grouped `by (pipeline)`, but upstream
OTel obsreport never stamps that label — `processorhelper` only
stamps `processor` and `receiverhelper` stamps `receiver` +
`transport`. The `sum by (pipeline)` collapsed every series into a
single empty `pipeline=""` group, so the alert annotations rendered
empty.
Split the recording rule into per-instance numerator + denominator
and aggregate to a per-instance drop ratio (one pod = one tracecore
instance = one pipeline in our DaemonSet topology). Alert payload now
carries the actual instance label and the description points operators
at the per-component upstream series for localization.
The `TracecoreSelftelemetryDown` rule used
`up{...} == 0 unless on (instance) kube_pod_status_phase{phase="Failed"}`
to separate listener-wedge from pod-crash, but `up{}` and
`kube_pod_status_phase{}` share no `instance` label — the `unless`
clause was a silent no-op and the rule fired identically to
`TracecorePodDown`. Removed the rule; `TracecoreSelftelemetryEmpty`
(which uses a real join via `job`+`instance`) already covers the
listener-wedge case.
Verification: `promtool check rules` reports SUCCESS: 13 rules.
Closes #298
Signed-off-by: Tri Lam <tri@maydow.com>
The chart's `dashboards/slo-rules.yaml` queries `up{job="tracecore"}`,
but nothing in the chart wired a scrape config — operators had to
hand-author a Prometheus job for the SLO rules to light up. Two new
toggles converge on the canonical `job="tracecore"` label:
- `serviceMonitor.enabled` (default OFF — the
`monitoring.coreos.com/v1` CRD ships with kube-prometheus-stack /
prometheus-operator and is absent on bare clusters). Renders a
ServiceMonitor with `jobLabel: app.kubernetes.io/name`, which
resolves to `job="tracecore"` via the chart's selector labels.
Requires a new headless Service (clusterIP: None) over the
DaemonSet pods so the Operator's Service-targeted selector
resolves to per-pod telemetry endpoints.
- `prometheusScrape.enabled` (default ON). Stamps
`prometheus.io/scrape`, `prometheus.io/port`, `prometheus.io/path`
annotations on the DaemonSet pods for vanilla-Prometheus
kubernetes_sd_configs (role: pod) scrape jobs. Harmless on
Operator clusters; operators using ServiceMonitor should disable
to avoid double-scrape.
Chart README gains a worked example for each path, including the
vanilla `scrape_configs:` block with relabel chain.
Verification:
helm lint install/kubernetes/tracecore → OK
helm lint --values ci/all-receivers-off-values.yaml → OK
helm lint --values ci/one-receiver-on-values.yaml → OK
helm lint --values ci/pyspy-on-values.yaml → OK
helm template --set serviceMonitor.enabled=true → renders
valid ServiceMonitor with jobLabel: app.kubernetes.io/name and
matching headless Service.
Closes #296
Signed-off-by: Tri Lam <tri@maydow.com>
Measures NORTHSTARS O6 (velocity) + O7 (governance) supporting KPIs against current repo state. Six gap sections: CODEOWNERS coverage (0% vs 80% target), lint-enforced principles (4/16 vs 6 target), missing quarterly retros, RFC log + RFC-0013 still draft, `make ci` 148s vs 60s budget, maintainer count 1 vs >=3 by M9. Issues filed: #322 #323 #324 #325 #327 (cap of 5; gap #6 maintainer count is external, tracked via adoption pipeline). Sibling to v1-rc1-cut-criteria.md + v1-rc1-operational-gaps.md. Signed-off-by: Tri Lam <tri@maydow.com>
Read-only audit measuring moat coverage (82.9% module-wide), inventories the two existing fuzz harnesses, three chaos rows, three benchmark files, and zero property tests. Files five rc1-prep test-gap issues (#328-#332) covering the highest-leverage gaps: three sub-80% packages, missing ncclfrreceiver integration test, chaos matrix gap to shipped patterns, and pyspy framing nightly fuzz soak. Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Three references to a docs/integrations/cert-manager-mtls.md recipe that has never existed in the tree were blocking doc-check. Replaced with inline upstream pointers + 'per-operator until a dedicated recipe lands' wording so the operational guidance survives the link removal. Side cleanup blocking the chore/v1-rc1-knowledge-gaps push; root cause is that #291 (multi-cluster v0 federation) shipped forward-references to a recipe that was deferred. Signed-off-by: Tri Lam <tri@maydow.com>
mark hw.gpu.{throttle,nvlink,pci.bdf,index} + error.{subtype,
persistence} as tracecore-ext, point at #265 for upstream proposal.
pin all 4 pattern docs on gpu.id (PCI BDF, RFC-0013 §3) as the
customer-stable join key; PromQL examples + alerts switch from
hw_id to gpu_id.
closes #265
closes #276
Signed-off-by: Tri Lam <tri@maydow.com>
Adds an opt-in default-deny NetworkPolicy template to the chart:
- Ingress restricted to the telemetry + health ports, scrape sources
configurable via `networkPolicy.allowedScrapers` (default
`namespaceSelector: {}` = same-namespace).
- Egress restricted to cluster DNS + operator-declared
`allowedEgressEndpoints` (CIDR + port pairs for OTLP-out). Empty
egress list keeps the path closed until the operator declares
destinations explicitly — auditable by construction.
- Default OFF so the first-install path stays compatible with CNIs
that ignore NetworkPolicy (Flannel without canal). Enable on
Calico / Cilium / kube-router.
Adds docs/integrations/cert-manager-mtls.md: cert-manager
ClusterIssuer + Certificate shape, chart wiring through
config.exporters.otlphttp.tls.*, renewal contract
(reload_interval >= 2 * renewBefore), and verification steps
including the falsifier that catches a silently-downgraded
one-way-TLS aggregation listener.
Verification:
docker run --rm alpine/helm:3.16.4 lint install/kubernetes/tracecore -> OK
... --set networkPolicy.enabled=true -> OK
helm template ... --set networkPolicy.enabled=true -> renders
valid NetworkPolicy with policyTypes Ingress + Egress, scrape-in
on telemetry + health ports, DNS + OTLP-out egress allow-list.
Signed-off-by: Tri Lam <tri@maydow.com>
First written threat model for tracecore. Inventory assets (cluster topology, GPU error patterns, OTLP credentials, kube-apiserver token, host kernel reads, NCCL FR pickle streams). Walk five trust boundaries (hostPath, kube-api, OTLP egress, cgo, FR shared mem). Apply STRIDE per boundary with mitigation status. Top-10 risk ranking with residual work. Audit RFP scope (~17 person-days across 8 subscopes) and in-repo prep checklist that gates handing the work to a paid auditor before v1.0 GA per the Tier 2 prereq. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
#294: process_resident_memory_bytes is not emitted by tracecore; upstream service/telemetry emits otelcol_process_memory_rss. #295: verdict log records do not carry latency_seconds; drop the LogQL stop-gap that would return empty. Pin SLI 1 binding to v0.5+ when verdict_emit_seconds histogram lands. Note #261 scope excludes the latency histogram (verdicts_emitted_total + evidence_count + consume_logs_duration are #261's scope). #299: B12 is "v0.x → v1.0 one-shot upgrade guide", not verdict-ack; drop fabricated roadmap pointer. PRINCIPLES.md §1 link mismatch in RUNBOOK fixed to RFC-0013 §1 (the actual href target). closes #294 closes #295 closes #299 Signed-off-by: Tri Lam <tri@maydow.com>
Pattern #2 (InfiniBand link flap), per NORTHSTARS Appendix A row #2 and the design spec at docs/patterns/02-ib-link-flap.md.
Detector evaluation rule
- bucket IB port-state transitions by (node, HCA, port) within CorrelationWindow (default 2min)
- fire when transitions >= MinTransitions (default 2)
- promote Confidence to full when a stuck NCCL FR cohort (>= MinHangingRanks ranks, non-completed-state) lands on the same node within the same window; otherwise partial
Cross-rank correlation primitive
- groupStuckNCCLByNode lifted as an inline helper inside the ib_link_flap detector; same shape will recur in pattern #7 (dataloader-hang) and #9 (nccl-bootstrap-timeout). Refactor to a shared module follows in the next commit.
Wiring
- NCCLFRRecord.Node added so the cross-rank correlation can join on node identity (k8sattributes resource attr); existing nccl_hang detector ignores it (collective-scoped, not node-scoped)
- projectIBPortStateRecord reads hw.network.ib.port.state + hw.network.ib.device + hw.network.ib.port.num, the customer-stable namespace declared in docs/patterns/02-ib-link-flap.md
- appendIBLinkFlapVerdict promotes (k8s.node.name, hw.network.ib.device, hw.network.ib.port.num, tracecore.alert.ib_link_flap.transition_count, nccl.fr.collective_seq_id) per the issue #270 scalar-promotion contract; pattern.confidence is full|partial
- Config gains ib_link_flap_window + ib_link_flap_min_transitions with Validate floors (>=1s, >=2)
Tests
- 8 library tests (ib_link_flap_test.go): full correlation, partial on IB-alone, single-transition no-fire, transitions-outside-window no-fire, different-ports-do-not-combine, NCCL-on-different-node does not join, configurable transition threshold, deterministic ordering
- 5 processor tests (ib_link_flap_test.go): full verdict + promoted scalars, partial on IB-alone, partial-suppressed toggle, window validation floor, min-transitions validation floor
Cross-link to spec: docs/patterns/02-ib-link-flap.md (authored in parallel; lands first or same-PR).
Signed-off-by: Tri Lam <tri@maydow.com>
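The evaluation rule in this commit (bucket by port, count transitions inside the window, then grade confidence) can be sketched roughly as follows. The type and function names here are hypothetical, and the inclusive treatment of a transition landing exactly on the window edge is an assumption; the shipped detector in ib_link_flap.go may differ.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Defaults quoted from the commit message.
const (
	defaultWindow         = 2 * time.Minute
	defaultMinTransitions = 2
)

// portKey and transition are hypothetical record shapes for this sketch.
type portKey struct {
	Node string
	HCA  string
	Port int
}

type transition struct {
	Key portKey
	At  time.Time
}

// flappingPorts returns, in deterministic order, every port with at least
// minTransitions state transitions inside some span no longer than window.
func flappingPorts(events []transition, window time.Duration, minTransitions int) []portKey {
	byPort := make(map[portKey][]time.Time)
	for _, e := range events {
		byPort[e.Key] = append(byPort[e.Key], e.At)
	}
	var fired []portKey
	for key, times := range byPort {
		sort.Slice(times, func(i, j int) bool { return times[i].Before(times[j]) })
		start := 0
		for end := range times {
			// Shrink the window from the left until it spans <= window.
			for times[end].Sub(times[start]) > window {
				start++
			}
			if end-start+1 >= minTransitions {
				fired = append(fired, key)
				break
			}
		}
	}
	sort.Slice(fired, func(i, j int) bool {
		a, b := fired[i], fired[j]
		if a.Node != b.Node {
			return a.Node < b.Node
		}
		if a.HCA != b.HCA {
			return a.HCA < b.HCA
		}
		return a.Port < b.Port
	})
	return fired
}

// confidence grades a fired port: full only when a stuck NCCL cohort was
// seen on the same node in the same window, otherwise partial.
func confidence(stuckCohortOnNode bool) string {
	if stuckCohortOnNode {
		return "full"
	}
	return "partial"
}

// sampleEvents builds a small fixture: one flapping port, one single
// transition, and one port whose transitions are too far apart.
func sampleEvents() []transition {
	t0 := time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)
	flap := portKey{"node-a", "mlx5_0", 1}
	single := portKey{"node-a", "mlx5_1", 1}
	slow := portKey{"node-b", "mlx5_0", 1}
	return []transition{
		{flap, t0}, {flap, t0.Add(30 * time.Second)},
		{single, t0},
		{slow, t0}, {slow, t0.Add(5 * time.Minute)},
	}
}

func main() {
	for _, p := range flappingPorts(sampleEvents(), defaultWindow, defaultMinTransitions) {
		fmt.Println(p.Node, p.HCA, p.Port, confidence(false))
	}
}
```

Sorting the fired keys before returning mirrors the "deterministic ordering" test named above, since Go map iteration order is randomized.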
The "is there a stuck NCCL cohort on this node?" sieve will recur across pattern #7 (dataloader-hang) and #9 (nccl-bootstrap-timeout) — both join NCCL FR rank records to a node-scoped trigger using the same MinHangingRanks / non-completed-state / age-past-window cohort rule. Lift the inline groupStuckNCCLByNode from ib_link_flap.go into a new module/pkg/patterns/cross_rank.go as StuckNCCLCohortByNode, exposed for cross-pattern reuse. minRanks parameter lets future patterns tune the cohort floor when their semantics differ (zero falls back to the package-default MinHangingRanks). The nccl_hang detector keeps its own cohort logic in nccl_hang.go because it joins on (pg+collective), not node — the duplication is intentional, not accidental, and folding both into one parameterized helper would obscure the load-bearing cohort-key tradeoff. No behavior change; library + processor tests still green. Signed-off-by: Tri Lam <tri@maydow.com>
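A rough sketch of the node-keyed cohort helper with the zero-falls-back-to-default floor is shown below. The record shape, the function name, and the default value of 2 are assumptions for illustration (the commit does not state the value of MinHangingRanks), and the age-past-window part of the cohort rule is left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultMinHangingRanks is an assumed value; the source only names the
// package default, not its number.
const defaultMinHangingRanks = 2

// rankRecord is a hypothetical projection of an NCCL FR rank record.
type rankRecord struct {
	Node      string
	Rank      int
	Completed bool
}

// stuckCohortByNode groups non-completed ranks by node and keeps only nodes
// with at least minRanks stuck ranks. minRanks == 0 selects the default.
func stuckCohortByNode(records []rankRecord, minRanks int) map[string][]int {
	if minRanks == 0 {
		minRanks = defaultMinHangingRanks
	}
	byNode := make(map[string][]int)
	for _, r := range records {
		if r.Completed || r.Node == "" {
			continue // completed ranks and node-less records cannot join
		}
		byNode[r.Node] = append(byNode[r.Node], r.Rank)
	}
	for node, ranks := range byNode {
		if len(ranks) < minRanks {
			delete(byNode, node) // deleting during range is safe in Go
			continue
		}
		sort.Ints(ranks)
	}
	return byNode
}

// sampleRecords returns a fixture with two stuck ranks on node-a, one
// completed rank on node-a, and a single stuck rank on node-b.
func sampleRecords() []rankRecord {
	return []rankRecord{
		{Node: "node-a", Rank: 1},
		{Node: "node-a", Rank: 0},
		{Node: "node-a", Rank: 3, Completed: true},
		{Node: "node-b", Rank: 2},
	}
}

func main() {
	fmt.Println(stuckCohortByNode(sampleRecords(), 0))
	fmt.Println(stuckCohortByNode(sampleRecords(), 1))
}
```

Keeping the grouping key (node) fixed inside the helper is what distinguishes it from the (pg+collective) join that nccl_hang keeps for itself.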
Issue #261: the patterns Grafana dashboard drives every verdict-rate panel off Loki LogQL because no processor-emitted counter exists for PromQL queries. Adds `otelcol.processor.patterndetector.verdicts_emitted_total` (renders as `otelcol_processor_patterndetector_verdicts_emitted_total` via the Prometheus exporter — RFC-0013 namespace alignment), partitioned by `pattern_id` + `confidence` + `component_id`. Once the metric ships, the three LogQL panels in patterns.json can swap to native PromQL and the dashboard works on Prometheus-only stacks. TDD: red TestVerdictsEmittedCounter_PodEvicted + factory-wiring test first; impl plumbed selfTelemetry through the per-detector loop in ConsumeLogs so each appendXxxVerdict call site ticks once. Falls back to noop selfTelemetry on missing/broken MeterProvider so the data path stays alive when telemetry init fails — mirrors the ncclfrreceiver selftel convention. Closes #261. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
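The noop-fallback convention can be illustrated without the OTel SDK. In the sketch below, the interface, the in-memory counter, and the builder callback are all hypothetical stand-ins; the real processor plumbs an OTel meter, which this example deliberately does not reproduce.

```go
package main

import (
	"errors"
	"fmt"
)

// verdictCounter is a hypothetical minimal interface for the counter.
type verdictCounter interface {
	Add(patternID, confidence string)
}

// noopCounter keeps the data path alive when telemetry init fails.
type noopCounter struct{}

func (noopCounter) Add(string, string) {}

// memCounter is an in-memory stand-in for a real metric instrument.
// It is not safe for concurrent use; this is a sketch only.
type memCounter struct {
	counts map[string]int
}

func newMemCounter() *memCounter {
	return &memCounter{counts: make(map[string]int)}
}

func (c *memCounter) Add(patternID, confidence string) {
	c.counts[patternID+"/"+confidence]++
}

// errBrokenProvider simulates a missing or broken meter provider.
var errBrokenProvider = errors.New("meter provider unavailable")

// counterOrNoop runs the builder and falls back to a noop counter on any
// failure, so callers never need a nil check at the tick site.
func counterOrNoop(build func() (verdictCounter, error)) verdictCounter {
	if build == nil {
		return noopCounter{}
	}
	c, err := build()
	if err != nil || c == nil {
		return noopCounter{}
	}
	return c
}

// emitVerdicts ticks the counter once per verdict, as each append call
// site would, and returns how many verdicts were emitted.
func emitVerdicts(c verdictCounter, verdicts [][2]string) int {
	for _, v := range verdicts {
		c.Add(v[0], v[1])
	}
	return len(verdicts)
}

func main() {
	mem := newMemCounter()
	healthy := counterOrNoop(func() (verdictCounter, error) { return mem, nil })
	broken := counterOrNoop(func() (verdictCounter, error) { return nil, errBrokenProvider })

	verdicts := [][2]string{{"2", "full"}, {"2", "full"}, {"10", "partial"}}
	emitVerdicts(healthy, verdicts)
	emitVerdicts(broken, verdicts) // must not panic
	fmt.Println(mem.counts)
}
```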
Extends prometheus-scrape.md with the bridge attribute contract for the four metrics-derived patterns:
- pattern #1 NVLink (#260): the `hw.gpu.nvlink.io` OTTL transform already lands in commit 0baa557; this PR closes #260's recipe-half.
- pattern #3 HBM ECC (#273): `hw.errors.delta` + error.{type,subtype,persistence} + gpu.id contract.
- pattern #4 thermal throttle (#282): `hw.gpu.throttle.duration.delta` in integer seconds + reason=thermal + gpu.id contract.
- pattern #5 PCIe AER Layer 2 (#284): the `tracecore.alert.pcie_rate_collapse.*` namespace contract.
OTTL metrics->logs emission stays upstream-blocked at OTel-contrib v0.130 (RFC-0014): no contrib processor or connector emits log records from a metrics pipeline. The bridge contract documented here is the load-bearing wire format any future emitter (an upstream metricthresholdconnector OR the WithMetrics extension to patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector projections at module/processor/patterndetectorprocessor/patterndetector.go gate on this contract today.
last-verified marker bumped to 2026-06-01.
Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed under #285 in a prior commit; Layer 2 contract documented here).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md.
Detector evaluation rule
- per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow (default 2min, forward-only - fb.Timestamp <= oom.Timestamp)
- if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) -> kind=fragmentation (raise max_split_size_mb, empty_cache)
- if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard)
- if no FB sample joins -> kind=unknown, confidence=partial
Discriminator value
- fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM
- without DCGM cross-check the operator retries with same batch, hits same OOM, wastes a slot
- partial-confidence verdict surfaces the OOM even when DCGM scrape lags, so the operator branches on concurrent pod_evicted / xid_correlation rather than silence
Files
- module/pkg/patterns/cuda_oom.go - detector + verdict + records
- module/processor/patterndetectorprocessor/cuda_oom.go - projections, collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector
- module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring tests + 2 Validate guards
- module/processor/patterndetectorprocessor/example_config.yaml - cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold knobs
- docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries
Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*, cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes, cuda_oom.fb_free_ratio, pattern.confidence.
Window-edge fenced both sides per PR #255 lesson. Threshold-boundary fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors xid_correlation / pcie_aer / hbm_ecc.
Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per ADR-0001 (PR-B)
Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor suites green with -race; make check + make build green
Refs #303
Signed-off-by: Tri Lam <tri@maydow.com>
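The three-way discriminator described in this commit can be sketched as below. The type names, the helper functions, and the choice to report "full" confidence whenever an FB sample joins are assumptions for illustration, as is treating a sample that lands exactly on the window edge as in-window; the shipped logic lives in module/pkg/patterns/cuda_oom.go.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults quoted from the commit message.
const (
	defaultWindow    = 2 * time.Minute
	defaultThreshold = 0.05
)

// fbSample and oomEvent are hypothetical record shapes for this sketch.
type fbSample struct {
	GPUID      string
	At         time.Time
	FreeBytes  uint64
	TotalBytes uint64
}

type oomEvent struct {
	GPUID string
	At    time.Time
}

// classifyOOM joins the OOM to the most recent same-GPU framebuffer sample
// taken at or before the OOM and within the window, then branches on the
// free ratio. With no joinable sample it returns unknown/partial.
func classifyOOM(oom oomEvent, samples []fbSample, window time.Duration, threshold float64) (kind, confidence string) {
	var best *fbSample
	for i := range samples {
		s := &samples[i]
		if s.GPUID != oom.GPUID || s.TotalBytes == 0 {
			continue
		}
		if s.At.After(oom.At) || oom.At.Sub(s.At) > window {
			continue // forward-only join, bounded by the window
		}
		if best == nil || s.At.After(best.At) {
			best = s
		}
	}
	if best == nil {
		return "unknown", "partial"
	}
	ratio := float64(best.FreeBytes) / float64(best.TotalBytes)
	if ratio >= threshold { // boundary is inclusive
		return "fragmentation", "full"
	}
	return "true_oom", "full"
}

// oomTime anchors the fixtures below.
var oomTime = time.Date(2026, 6, 1, 0, 10, 0, 0, time.UTC)

// sampleBefore builds an FB sample taken secs seconds before oomTime, with
// sizes given in GiB.
func sampleBefore(gpu string, secs int, freeGiB, totalGiB uint64) fbSample {
	return fbSample{
		GPUID:      gpu,
		At:         oomTime.Add(-time.Duration(secs) * time.Second),
		FreeBytes:  freeGiB << 30,
		TotalBytes: totalGiB << 30,
	}
}

func main() {
	oom := oomEvent{GPUID: "0000:3b:00.0", At: oomTime}
	samples := []fbSample{sampleBefore("0000:3b:00.0", 30, 8, 80)}
	kind, conf := classifyOOM(oom, samples, defaultWindow, defaultThreshold)
	fmt.Println(kind, conf)
}
```

In the fixture, 8 GiB free out of 80 GiB is a ratio of 0.10, above the 0.05 threshold, so the OOM is classified as fragmentation rather than genuine exhaustion.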
Adds bench/detectors/ with six benchmarks (one per pattern-library detector: pod_evicted, xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, pcie_aer), each driving a fixed 1024-event window so allocs/op is interpretable as allocations-per-evaluate-pass. Pre-built fixtures live outside b.ResetTimer() so the measured allocations are detector-side only.
Ratchet wiring:
- bench/detectors/baselines.json pins current allocs/op + B/op per detector; allocs/op is hardware-invariant so the value gates cleanly across CI and dev hardware.
- make bench-detectors / -check / -baseline targets for the run + compare + regenerate loop.
- scripts/bench-check-detectors.sh is the SOFT gate (issue #302): prints regression deltas but exits 0 today. Graduation criterion to hard-fail (N=10 PRs with alloc-CV < 1%) documented in bench/detectors/README.md; flipping the gate is a one-line change to the script's final exit + baselines.json's gate_mode field.
- .github/workflows/bench.yml runs the soft check on push-to-main and posts deltas to the job summary.
Baseline numbers (Apple M1 Max, darwin/arm64; allocs/op):
- PodEvictedDetector 15635
- XidCorrelationDetector 12699
- NCCLHangDetector 4088
- HBMECCDetector 1429
- ThermalThrottleDetector 780
- PCIeAERDetector 524
Signed-off-by: Tri Lam <tri@maydow.com>
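The measurement and soft-gate ideas above can be sketched with the standard `testing` package. The toy `countAbove` pass, the helper names, and the gate functions are hypothetical; the real harness lives under bench/detectors/ and compares against baselines.json through a shell script, which this sketch does not reproduce.

```go
package main

import (
	"fmt"
	"testing"
)

const windowSize = 1024 // fixed event window, as in the commit

// sink keeps the compiler from eliminating the measured call.
var sink int

// countAbove is a toy stand-in for a detector's evaluate pass.
func countAbove(events []int, threshold int) int {
	n := 0
	for _, e := range events {
		if e > threshold {
			n++
		}
	}
	return n
}

// allocsPerPass measures allocations per evaluate pass. The fixture is
// built before b.ResetTimer so only the pass itself is counted.
func allocsPerPass() int64 {
	res := testing.Benchmark(func(b *testing.B) {
		events := make([]int, windowSize)
		for i := range events {
			events[i] = i
		}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			sink = countAbove(events, windowSize/2)
		}
	})
	return res.AllocsPerOp()
}

// compareAllocs returns the delta against the pinned baseline and whether
// it counts as a regression.
func compareAllocs(baseline, current int64) (delta int64, regressed bool) {
	return current - baseline, current > baseline
}

// gateExitCode models the soft gate: regressions are reported, but the exit
// code is non-zero only once the gate has graduated to hard-fail mode.
func gateExitCode(regressed, hardFail bool) int {
	if regressed && hardFail {
		return 1
	}
	return 0
}

func main() {
	testing.Init() // needed when calling testing.Benchmark outside `go test`
	current := allocsPerPass()
	delta, regressed := compareAllocs(0, current)
	fmt.Printf("allocs/op=%d delta=%d regressed=%v exit=%d\n",
		current, delta, regressed, gateExitCode(regressed, false))
}
```

Because allocation counts do not depend on clock speed, the same baseline number can be compared on a laptop and on a CI runner, which is the property the ratchet relies on.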
Resolve 5 conflicts post-PR #310 / #312 / #313:
- factory.go deleted on main (merged into patterndetector.go); port wave's selftel wiring (#261) into the merged createLogs
- VerdictAttr* unexported per #310; rename 16 wave-added consts + all callers across cuda_oom + ib_link_flap + pcie_aer tests
- docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content reconcile after MILESTONES.md moved to docs/
Address reviewer findings before PR:
- docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md (Linux CI is case-sensitive)
- pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`, code emits "2"/"10"/.../"13"; rewrite spec attribute tables to match shipped customer-stable namespace
- pattern.confidence: 8 specs said `high|partial`, code emits `full|partial`; rewrite
- 02-ib-link-flap.md attribute drift: spec said tracecore.alert.ib_link_flap.{hca_device,port}, code emits hw.network.ib.{device,port.num}; align spec to shipped code
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival ("6 patterns shipped" -> "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true with empty allowedEgressEndpoints (silently kills OTLP exporter) + warning when ServiceMonitor scraper in different namespace
- File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector consumes but recipe gap means it ships dark)
Tests: ./module/processor/patterndetectorprocessor/... + ./module/pkg/patterns/... both ok.
Signed-off-by: Tri Lam <tri@maydow.com>
Close attribute-namespace-check advisory gap surfaced by merge:
- tracecore.alert.pcie_rate_collapse.drop_ratio (was emitted by
appendPCIeAERVerdict, missing from inventory)
- tracecore.alert.ib_link_flap.transition_count
- hw.network.ib.{device,port.num,port.state}
(new hw.network.* section for IB/RDMA semconv)
Attribute-namespace-check now reports 67/67 documented (was 62/67).
Signed-off-by: Tri Lam <tri@maydow.com>
The audit docs were authored when NORTHSTARS.md + MILESTONES.md lived at the repo root. main moved them to docs/ in PR #313 just before this wave landed. Sibling docs reference these by relative path; 22 links were stale. Replaced ../{NORTHSTARS,MILESTONES}.md → {NORTHSTARS,MILESTONES}.md across three files. doc-check passes. Signed-off-by: Tri Lam <tri@maydow.com>
Sibling docs at docs/ top level (adoption-pipeline, threat-model,
standards-roadmap, reference-environments) had the same ../ → same-dir
link drift as the v1-rc1-* siblings. Subdir refs under
docs/{research,rfcs,patterns,followups}/ use ../ correctly because
docs/X/../ resolves to docs/.
Signed-off-by: Tri Lam <tri@maydow.com>
After PR #313 moved NORTHSTARS.md into docs/, the Spec column links added in the pattern-spec commit kept the pre-move ../docs/patterns/ prefix; from docs/NORTHSTARS.md the correct relative path is just patterns/. 12 links fixed; doc-check clears. Signed-off-by: Tri Lam <tri@maydow.com>
The new v1-rc1-cut-criteria bullet added by this wave referenced PRINCIPLES.md as a sibling, but MILESTONES.md now lives in docs/ while PRINCIPLES.md stayed at repo root. Path is ../PRINCIPLES.md from docs/MILESTONES.md. Signed-off-by: Tri Lam <tri@maydow.com>
Authoring drift: the v1-rc1-cut-criteria bullets pre-existed the PR #313 MILESTONES.md → docs/MILESTONES.md move, so links carried docs/ prefix that now double-resolves to docs/docs/. Strip to sibling-relative. Signed-off-by: Tri Lam <tri@maydow.com>
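The link-drift commits above all come down to how a relative link resolves against the directory of the file that contains it. A minimal sketch using Go's standard `path` package makes the cases concrete; `resolveLink` is a hypothetical helper, not the repository's doc-check implementation.

```go
package main

import (
	"fmt"
	"path"
)

// resolveLink resolves a relative markdown link against the directory of
// the file that contains it, returning a repo-root-relative path.
func resolveLink(fromFile, link string) string {
	return path.Join(path.Dir(fromFile), link)
}

func main() {
	cases := []struct{ from, link string }{
		// Stale after the move: points at the repo root, where the file no longer lives.
		{"docs/v1-rc1-cut-criteria.md", "../NORTHSTARS.md"},
		// Fixed: sibling-relative link inside docs/.
		{"docs/v1-rc1-cut-criteria.md", "NORTHSTARS.md"},
		// Subdirectory docs keep ../ because docs/X/../ resolves to docs/.
		{"docs/patterns/02-ib-link-flap.md", "../NORTHSTARS.md"},
		// A docs/ prefix from inside docs/ double-resolves.
		{"docs/MILESTONES.md", "docs/v1-rc1-cut-criteria.md"},
		// PRINCIPLES.md stayed at the repo root, so it needs ../ from docs/.
		{"docs/MILESTONES.md", "../PRINCIPLES.md"},
	}
	for _, c := range cases {
		fmt.Printf("%s + %s -> %s\n", c.from, c.link, resolveLink(c.from, c.link))
	}
}
```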
This was referenced Jun 1, 2026
trilamsr added a commit that referenced this pull request on Jun 1, 2026:
## Summary

Closes #301 by adding a first-class `tls.*` chart surface so operators wire cert-manager-issued mTLS material without a custom DaemonSet patch overlay. The NetworkPolicy half of the issue (opt-in `networkPolicy.enabled`) landed earlier in #338; this PR closes the remaining gap.

- `tls.enabled` (bool, default false), `tls.certificateRef` (kubernetes.io/tls Secret name; required when enabled, and helm-template render fails closed with a clear error otherwise), `tls.mountPath` (absolute dir, schema-validated `^/`, default `/etc/tracecore/tls`).
- DaemonSet projects the Secret read-only (`defaultMode: 0400`); the chart does NOT inject `tls:` clauses into the rendered config. Operators wire `cert_file` / `key_file` / `ca_file` / `client_ca_file` via the free-form `config:` block referencing the projected file literals.
- `docs/integrations/cert-manager-mtls.md` loses the "requires a patch overlay" workaround and gains an aggregation-side example showing `client_ca_file` placement (the falsifier for silent one-way-TLS downgrade).

## Root cause

Issue #301 lists `tls.enabled` and `tls.certificateRef` as required values knobs. The chart never shipped them; the cert-manager mtls recipe instead carried prose telling operators to "patch overlay" the DaemonSet template, which is precisely the kind of friction the chart-surface knob exists to eliminate. This PR fixes the root cause (no typed knob) rather than refreshing the workaround prose.

## Test plan

- [x] `helm lint install/kubernetes/tracecore`: clean.
- [x] `helm lint install/kubernetes/tracecore -f values-production.yaml`: clean.
- [x] `helm template` default render: zero `tls` volumes.
- [x] `helm template --set tls.enabled=true`: fails closed with operator-visible error naming `tls.certificateRef`.
- [x] `helm template --set tls.enabled=true --set tls.certificateRef=foo`: projects `tls` Secret volume + `volumeMount` at default `/etc/tracecore/tls`, readOnly true, mode 0400.
- [x] `helm template --set tls.mountPath=not-absolute`: schema rejects with `Does not match pattern '^/'`.
- [x] CI `.github/workflows/chart.yml` render-job has a five-step falsifier suite covering all of the above.
- [x] Pre-commit gates: `make lint`, `make vet`, `go mod verify`, attribute-namespace-check, hit-line-format-stable, and no-autoupdate-check all green at commit time.

```release-notes
- chart: typed `tls.{enabled,certificateRef,mountPath}` knob mounts a cert-manager-issued mTLS Secret into the DaemonSet read-only. Default off; required Secret reference is enforced at helm-template time so misconfiguration fails closed rather than silently disabling mTLS.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
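The fail-closed behaviour exercised by the test plan can be paraphrased in Go. This is only an illustration of the validation rules as the PR body states them; the chart enforces them through Helm templates and a values schema, and the struct, function, and error strings below are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// tlsValues is a hypothetical mirror of the chart's tls.* values.
type tlsValues struct {
	Enabled        bool
	CertificateRef string
	MountPath      string
}

const defaultMountPath = "/etc/tracecore/tls" // default quoted from the PR body

// validateTLS applies the two rules from the PR body: mountPath must be
// absolute (pattern ^/), and certificateRef is required once TLS is enabled.
func validateTLS(v tlsValues) error {
	mount := v.MountPath
	if mount == "" {
		mount = defaultMountPath
	}
	if !strings.HasPrefix(mount, "/") {
		return errors.New("tls.mountPath does not match pattern '^/'")
	}
	if v.Enabled && v.CertificateRef == "" {
		return errors.New("tls.certificateRef is required when tls.enabled=true")
	}
	return nil
}

func main() {
	cases := []tlsValues{
		{}, // default render: TLS off, nothing to validate
		{Enabled: true},
		{Enabled: true, CertificateRef: "foo"},
		{MountPath: "not-absolute"},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validateTLS(c))
	}
}
```

Rejecting the enabled-without-Secret case at render time, rather than rendering a DaemonSet with no TLS volume, is what keeps a misconfiguration from silently shipping without mTLS.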
This was referenced Jun 2, 2026
trilamsr added a commit that referenced this pull request on Jun 2, 2026:
…451)

## Summary

- Adds the `transform/cuda_oom` OTTL processor to `docs/integrations/examples/filelog-container.yaml`, stamping `cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) off PyTorch's canonical `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ...` stderr line.
- Closes the integration gap pattern #10's detector (PR #338) carried since merge: `projectCUDAOOMLogRecord` (`module/processor/patterndetectorprocessor/cuda_oom.go`) gates on `cuda_oom.tried_alloc_bytes` + `gpu.id` but no upstream recipe stamped them, so the compiled detector received no real input at runtime.

## Root cause

Issue #303's deliverable list included `projectCUDAOOMLogRecord` (shipped in PR #338) but explicitly deferred the filelog OTTL stanza to a sibling follow-up (issue #285 / #436). The detector compiled green and its wiring tests passed against synthetic plog input, but production stderr never carried the customer-stable attributes the projector reads. This PR is the missing link: a recipe-only change with zero detector-source edits.

## Recipe design

- **Per-unit-branch shape** (KiB / MiB / GiB / TiB) because OTTL has no capture-group-conditional dispatch; the multiplier must be a literal `int64` per stanza.
- **Unit normalization via OTTL Math Expressions**: `Int(whole)*UNIT + Int(frac)*(UNIT/100)` against PyTorch's `%.2f` `format_size` shape (verified against `c10/cuda/CUDACachingAllocator.cpp`). Integer-divide-by-100 floors per-frac-unit precision loss at <1% of the unit base, three orders of magnitude under the detector's 5% fragmentation threshold.
- **`gpu.id` is NOT stamped here**: the CUDA-runtime ordinal `cuda_oom.gpu_index` is not a PCI BDF. The recipe markdown documents two operator paths: (a) k8sattributesprocessor + `nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's resource-attr fallback reads `gpu.id` off the log resource either way.
- **Tight `where IsMatch` guard** on `CUDA out of memory\. Tried to allocate`: generic CUDA errors (illegal memory access, NCCL watchdog, DataLoader worker killed) do not trip the stanza.

## Tests

TDD red → green via three new tests in `module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:

- `TestRecipe_CUDAOOM_StanzaPinsWireContract`: pins 7 load-bearing tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`, KiB/MiB/GiB/TiB, `transform/cuda_oom`) + pipeline-wiring against the live projector.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict`: end-to-end gate; recipe-shaped log records flow through `CUDAOOMDetector` and emit a `kind=fragmentation` verdict with the expected scalar-promotion contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages`: 5 canonical positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the ≥3-positive A-tier acceptance criterion from #436.

## Self-grade: **A+**

- B: YAML syntactically valid OTel (`tracecore validate` exit 0); regex extracts bytes + GPU index with unit normalization; documented. ✓
- A: integration test green; `make validator-recipe` covers this file; regex tested against ≥3 canonical messages (5 positives total); negative cases verified. ✓
- A+: edge cases handled (multi-line traceback flattening via filelog container parser, mixed-unit messages, OOM without GPU index via tight `IsMatch` guard); cross-linked from `docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" + Open Question #2; new §`cuda_oom.*` attribute stanza in `docs/integrations/filelog-container.md` with unit-normalization arithmetic table, two `gpu.id` source paths, and a Failure-modes row. ✓

## Cross-references

- Detector source (untouched per hard rule): `module/processor/patterndetectorprocessor/cuda_oom.go`.
- Sibling DCGM metric-side recipe: PR #337 / `docs/integrations/examples/prometheus-scrape.yaml`.
- Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` (Open Q#2 resolved).
- Convention: PR #431 (recipe stanzas placement under `docs/integrations/examples/<target>.yaml`).

## Test plan

- [x] `go test ./processor/patterndetectorprocessor/ -run TestRecipe_CUDAOOM -count=1 -v`: PASS (3 tests, 8 sub-tests)
- [x] `go test ./processor/patterndetectorprocessor/ -count=1`: PASS (no regressions)
- [x] `make build`: `_build/tracecore` compiles via OCB
- [x] `./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml`: exit 0
- [x] `make validator-recipe`: 9 validated, 3 skipped (non-linux host) of 12 recipe(s)
- [x] `make doc-check`: PASS (new cross-link resolves)
- [x] `make ci-fast`: PASS (lint, vet, mod-verify, attribute-namespace-check, doc-check)

```release-notes
**Pattern #10 (CUDA OOM, deceptive allocator)**: filelogreceiver + OTTL recipe lands. The `transform/cuda_oom` stanza in `docs/integrations/examples/filelog-container.yaml` projects PyTorch's `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` stderr line onto `cuda_oom.tried_alloc_bytes` (unit-normalized to bytes across KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index`, closing the load-bearing input gap left by the v0.3 detector ship (PR #338).
```

Closes #436. Refs #338, #303, #337.
Signed-off-by: Tri Lam <tree@lumalabs.ai>
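The unit-normalization arithmetic `Int(whole)*UNIT + Int(frac)*(UNIT/100)` is easier to follow with concrete numbers. The Go sketch below mirrors that integer arithmetic; the regular expression and the function name are assumptions for illustration and are not the recipe's actual OTTL stanza, which uses one branch per unit.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomRe is an assumed pattern for PyTorch's "X.YY <unit>" message shape.
var oomRe = regexp.MustCompile(`CUDA out of memory\. Tried to allocate (\d+)\.(\d{2}) (KiB|MiB|GiB|TiB)`)

// unitBytes maps each unit to its literal int64 multiplier.
var unitBytes = map[string]int64{
	"KiB": 1 << 10,
	"MiB": 1 << 20,
	"GiB": 1 << 30,
	"TiB": 1 << 40,
}

// triedAllocBytes extracts the attempted allocation size in bytes using
// whole*UNIT + frac*(UNIT/100) with integer division, so the result can
// undershoot the exact value slightly. ok is false for non-matching lines.
func triedAllocBytes(line string) (bytes int64, ok bool) {
	m := oomRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	whole, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	frac, err := strconv.ParseInt(m[2], 10, 64)
	if err != nil {
		return 0, false
	}
	unit := unitBytes[m[3]]
	return whole*unit + frac*(unit/100), true
}

func main() {
	lines := []string{
		"RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB. GPU 0 has a total capacity of ...",
		"RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of ...",
		"RuntimeError: CUDA error: an illegal memory access was encountered",
	}
	for _, l := range lines {
		b, ok := triedAllocBytes(l)
		fmt.Println(b, ok)
	}
}
```

For `2.50 GiB` the sketch yields 2684354548 bytes, 12 bytes under the exact 2684354560, because `1073741824/100` floors to 10737418 before being multiplied by 50.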
Summary
15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing horizon backlog. 31 commits, 81 files, +8650/-180.
Code (5 detectors / features):
- `feat(iblinkflap)` pattern #2 IB link flap detector: 13 tests, cross-rank helper extracted for reuse by patterns #7/#9
- `feat(cudaoom)` pattern #10 CUDA OOM detector + fragmentation-vs-true-OOM discriminator: 35 tests, 0/6 false-positive rate on fixture corpus (#303 wiring; recipe gap tracked at #337)
- `feat(verdict)` deprecate `EvictedPod`, co-emit `PodName` + `PodNamespace` (#277) with regression-pinning test
- `feat(chart)` opt-in default-deny NetworkPolicy + cert-manager mTLS reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt UX warnings for empty-egress / cross-ns scraper traps
- `feat(bench)` per-detector allocs/event harness + soft ratchet gate, graduation criterion documented (#302)
- `feat(patterndetector)` verdict counter metric for dashboard panels (#261)
- `fix(slo-rules)` correct `otelcol_*` label set + drop silent-no-op `unless on (instance)` join (#298)

8 pattern design specs (`docs/patterns/{02,07-13}-*.md`):

9 v1.0-rc1 audit / knowledge-gap docs:

- `docs/v1-rc1-cut-criteria.md`: 12 falsifiable cut gates derived from O1-O7
- `docs/v1-rc1-operational-gaps.md`: SLSA L3 + air-gap + upgrade-rollback audit (8 issues filed, #314-#321)
- `docs/v1-rc1-governance-gaps.md`: CODEOWNERS 0%, lint-principles 4/16, retros, `make ci` 148s (5 issues, #322-#325, #327)
- `docs/v1-rc1-test-audit.md`: 82.9% coverage, fuzz harness inventory (5 issues, #328-#332)
- `docs/v1-rc1-simplification-audit.md`: top deletion candidates ~9.6K LOC (3 issues, #333-#335)
- `docs/threat-model.md`: STRIDE per trust boundary + audit RFP scope (#336)
- `docs/reference-environments.md`: Tier 1 kind + Tier 2 32×H100 binding spec for O2 hero KPI
- `docs/adoption-pipeline.md`: S0-S3 funnel + comms templates for O5 hero KPI
- `docs/standards-roadmap.md`: 10 `gen_ai.training.*` attributes proposed upstream (#326)

Doc-drift cleanup: 11 issues closed (#265, #268, #269, #276, #283, #287, #292-295, #299).
OTTL recipe wiring: 6 issues closed (#260, #261, #273, #282, #284, #285); #272 deferred to standards-roadmap.
Multi-cluster auth: bearer-token + mTLS examples (#297).
Merge resolution + reviewer fixes:
- `VerdictAttr*` → `verdictAttr*` per #310 convention
- `createLogs`
- `docs/THREAT-MODEL.md` → `docs/threat-model.md` (Linux CI is case-sensitive)
- `pattern.id` slug → numeric ("2", "7" ... "13"), `pattern.confidence` `high` → `full`
- `02-ib-link-flap.md` attribute drift: spec said `tracecore.alert.ib_link_flap.{hca_device,port}`, code emits `hw.network.ib.{device,port.num}`
- `v1-rc1-cut-criteria` criterion #1 status stale-on-arrival ("6 patterns shipped" → "8 patterns shipped, 4 remaining")
- `enabled=true` with empty `allowedEgressEndpoints` (silently kills OTLP) or cross-ns Prometheus
- `DCGM_FI_DEV_FB_*` → `hw.gpu.memory.{free,total}` (CUDA OOM detector consumes but recipe gap)
- `docs/`, `../`, `docs/docs/` drift after MILESTONES + NORTHSTARS moved to docs/)
- `attribute-namespace-check` now 67/67)

Test plan
- `go test ./module/processor/patterndetectorprocessor/... ./module/pkg/patterns/...`: ok
- `make lint` (golangci-lint via goreleaser-style gate): 0 issues
- `go vet ./...`: clean
- `make doc-check`: passes after stale-link sweep
- `scripts/attribute-namespace-check.sh`: 67/67 documented
- `helm lint install/kubernetes/tracecore`: 0 chart(s) failed
- `promtool check rules` on slo-rules.yaml: 13 rules / SUCCESS