feat(v1-rc1): 2 detectors + 8 pattern specs + chart NetPol + rc1 audits by trilamsr · Pull Request #338 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T08:07:01Z

Summary

15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing horizon backlog. 31 commits, 81 files, +8650/-180.

Code (5 detectors / features):

feat(iblinkflap) pattern #2 IB link flap detector: 13 tests, cross-rank helper extracted for reuse by patterns #7/#9
feat(cudaoom) pattern #10 CUDA OOM detector + fragmentation-vs-true-OOM discriminator: 35 tests, 0/6 false-positive rate on fixture corpus (#303 wiring; recipe gap tracked at #337)
feat(verdict) deprecate EvictedPod, co-emit PodName + PodNamespace (#277) with regression-pinning test
feat(chart) opt-in default-deny NetworkPolicy + cert-manager mTLS reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt UX warnings for empty-egress / cross-ns scraper traps
feat(bench) per-detector allocs/event harness + soft ratchet gate, graduation criterion documented (#302)
feat(patterndetector) verdict counter metric for dashboard panels (#261)
fix(slo-rules) correct otelcol_* label set + drop silent-no-op unless on (instance) join (#298)

8 pattern design specs (docs/patterns/{02,07-13}-*.md):

Per pattern: symptom, layers crossed, signal sources, detector evaluation rule, verdict attrs, edge cases, open questions.
7 load-bearing spec gaps flagged for future TDD red-test work (multi-vendor SDC signal, cohort grouping, processor metrics path, etc).

9 v1.0-rc1 audit / knowledge-gap docs:

docs/v1-rc1-cut-criteria.md: 12 falsifiable cut gates derived from O1-O7
docs/v1-rc1-operational-gaps.md: SLSA L3 + air-gap + upgrade-rollback audit (8 issues filed, #314-#321)
docs/v1-rc1-governance-gaps.md: CODEOWNERS 0%, lint-principles 4/16, retros, make ci 148s (5 issues, #322-#325 and #327)
docs/v1-rc1-test-audit.md: 82.9% coverage, fuzz harness inventory (5 issues, #328-#332)
docs/v1-rc1-simplification-audit.md: top deletion candidates ~9.6K LOC (3 issues, #333-#335)
docs/threat-model.md: STRIDE per trust boundary + audit RFP scope (#336)
docs/reference-environments.md: Tier 1 kind + Tier 2 32×H100 binding spec for O2 hero KPI
docs/adoption-pipeline.md: S0-S3 funnel + comms templates for O5 hero KPI
docs/standards-roadmap.md: 10 gen_ai.training.* attributes proposed upstream (#326)

Doc-drift cleanup: 11 issues closed (#265, #268, #269, #276, #283, #287, #292-295, #299).

OTTL recipe wiring: 6 issues closed (#260, #261, #273, #282, #284, #285); #272 deferred to standards-roadmap.

Multi-cluster auth: bearer-token + mTLS examples (#297).

Merge resolution + reviewer fixes:

Resolved 5 conflicts post-PR #310/#312/#313 (factory.go delete, VerdictAttr* unexport, MILESTONES.md → docs/, FOLLOWUPS, patterns README)
Adversarial reviewer found 1 BLOCKER + 6 MAJOR; all addressed before push:
- Renamed 16 VerdictAttr* → verdictAttr* per #310 convention
- Re-ported selftel wiring (#261) into main's merged createLogs
- Fixed case-mismatch docs/THREAT-MODEL.md → docs/threat-model.md (Linux CI is case-sensitive)
- 8 pattern specs schema drift: pattern.id slug → numeric ("2", "7"..."13"), pattern.confidence high → full
- 02-ib-link-flap.md attribute drift: spec said tracecore.alert.ib_link_flap.{hca_device,port}, code emits hw.network.ib.{device,port.num}
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival ("6 patterns shipped" → "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warns when enabled=true with empty allowedEgressEndpoints (silently kills OTLP) or cross-ns Prometheus
- Filed #337 for the missing OTTL recipe projecting DCGM_FI_DEV_FB_* → hw.gpu.memory.{free,total} (the CUDA OOM detector consumes these attributes, but no recipe emits them yet)
Post-merge stale-relative-path sweep: 6 wave docs + NORTHSTARS.md + MILESTONES.md (docs/, ../, docs/docs/ drift after MILESTONES + NORTHSTARS moved to docs/)
Documented 5 newly-emitted attributes in ATTRIBUTES.md (drop_ratio + IB tier — attribute-namespace-check now 67/67)

Test plan

go test ./module/processor/patterndetectorprocessor/... ./module/pkg/patterns/... — ok
make lint (golangci-lint via goreleaser-style gate) — 0 issues
go vet ./... — clean
make doc-check — passes after stale-link sweep
scripts/attribute-namespace-check.sh — 67/67 documented
helm lint install/kubernetes/tracecore — 0 chart(s) failed
promtool check rules on slo-rules.yaml — 13 rules / SUCCESS
CI compat-matrix (rc1 criterion #6): gated on next wave
manual smoke install on real cluster — owner clearance pending

Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM
fragmentation-vs-true discriminator), 8 pattern design specs for the
remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy
+ Prometheus Operator ServiceMonitor on the Helm chart, the
EvictedPod → PodName/PodNamespace verdict-attribute deprecation
co-emit, per-detector allocs/event bench harness, SLO-rules label
fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps,
governance gaps, test audit, simplification audit, threat model,
reference envs, adoption pipeline, standards roadmap).

NORTHSTARS O2 hero KPI (install-to-first-data) and O5 hero KPI (>=15 production orgs at M12) both pointed at undefined ground truth. O2 named a 'production-realistic 32-GPU' tier without binding any hardware spec; O5 had a methodology stub at docs/nps.md but no funnel mechanics, no tracking artifact, no public-counter rule consistent with O2 Op-Rule #5 / O5 Op-Rule #1 (never phone home). docs/reference-environments.md pins both tiers - Minimal (kind, k8s 1.32, DCGM 4.4.x stub, NCCL 2.30+, ubuntu-latest, the existing bench/install/ harness) and Production-realistic (32xH100 SXM5 across 4 nodes, NVLink+IB, Calico+Multus+RDMA, Lustre, Kueue v0.17.x). Names three cluster-access paths (partner-volunteered, rented, quarterly-drill fallback) and flags partner-volunteered as the v1.0 default expectation. docs/adoption-pipeline.md defines the S0->S3 funnel with explicit definition-of-done per stage, target pilot profile, the docs/followups/_pilots.md tracking artifact, the release-prep-PR public-counter cadence (no scraping, no auto-update), and one-line comms templates per transition. Cross-refs added in docs/nps.md and docs/README.md. Signed-off-by: Tri Lam <tri@maydow.com>

Three rc1-blocking operational gaps documented in docs/v1-rc1-operational-gaps.md with file:line evidence, falsifiable cut criteria, numbered remediation steps tagged by work-type (code/doc/ops/external-dep), S/M/L effort estimates, and explicit external blockers (slsa-go-generator OCB-submodule integration; healthcheckextension single-path upstream limitation; harden-runner audit-only on hosted runners). MILESTONES.md M21 cross-refs the new doc as a release-prep dependency. Per-step issues #314-#321 filed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>

Marks XidCorrelationVerdict.EvictedPod deprecated in v0.4; legacy field continues to emit alongside the k8s-semconv split PodName / PodNamespace (already populated since #270) through the v0.4-v0.5 window and is removed in v0.6 per ATTRIBUTES.md soft-lock policy. Adds TestXidCorrelationVerdict_DeprecatedEvictedPodCoEmits pinning that both legacy + successor names co-emit in the JSON shape, so a regression that drops either side fails the gate. Updates docs/ATTRIBUTES.md with a deprecation row + removal-target table. Signed-off-by: Tri Lam <tri@maydow.com>
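
A minimal sketch of the co-emit check described above, assuming a simplified verdict struct. The type `xidVerdict`, the helper `missingCoEmitKeys`, and the JSON key names are hypothetical stand-ins for illustration; the shipped XidCorrelationVerdict schema and TestXidCorrelationVerdict_DeprecatedEvictedPodCoEmits may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// xidVerdict is a hypothetical stand-in for XidCorrelationVerdict.
// The JSON key names below are assumptions, not the shipped schema.
type xidVerdict struct {
	// Deprecated: legacy combined field, co-emitted until removal.
	EvictedPod   string `json:"evicted_pod"`
	PodName      string `json:"pod_name"`
	PodNamespace string `json:"pod_namespace"`
}

// missingCoEmitKeys marshals v and returns any legacy or successor key that
// is absent from the JSON shape. An empty result means both sides co-emit.
func missingCoEmitKeys(v xidVerdict) ([]string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var shape map[string]any
	if err := json.Unmarshal(raw, &shape); err != nil {
		return nil, err
	}
	var missing []string
	for _, key := range []string{"evicted_pod", "pod_name", "pod_namespace"} {
		if _, ok := shape[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}

func main() {
	v := xidVerdict{EvictedPod: "train/worker-0", PodName: "worker-0", PodNamespace: "train"}
	missing, err := missingCoEmitKeys(v)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println("missing keys:", missing)
}
```

A regression test in this style would fail as soon as either the legacy key or one of the successor keys is dropped from the struct or tagged `omitempty`-away.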

Eight 1-page pattern-design specs covering #2 IB link flap, #7 dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout, #10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss spike NaN, #13 silent data corruption. Each carries the standard detector-design shape (symptom, layers, signal sources, evaluation rule, verdict attrs, edge cases, status, open questions) so the next contributor can write a TDD red test directly off the spec. Status: all 8 marked planned. #10 already has issue #303; the spec frames the design alongside. NORTHSTARS Appendix A gains a Spec column; docs/README + patterns README link the new specs. Signed-off-by: Tri Lam <tri@maydow.com>

drop goreleaser. tell truth: inline-shell pipeline + SLSA + cosign. pyspy delete deferred v0.4.0+ per #222, not v0.3.0. closes #268 closes #269 Signed-off-by: Tri Lam <tri@maydow.com>

The drop-rate recording rule grouped `by (pipeline)`, but upstream OTel obsreport never stamps that label — `processorhelper` only stamps `processor` and `receiverhelper` stamps `receiver` + `transport`. The `sum by (pipeline)` collapsed every series into a single empty `pipeline=""` group, so the alert annotations rendered empty. Split the recording rule into per-instance numerator + denominator and aggregate to a per-instance drop ratio (one pod = one tracecore instance = one pipeline in our DaemonSet topology). Alert payload now carries the actual instance label and the description points operators at the per-component upstream series for localization. The `TracecoreSelftelemetryDown` rule used `up{...} == 0 unless on (instance) kube_pod_status_phase{phase="Failed"}` to separate listener-wedge from pod-crash, but `up{}` and `kube_pod_status_phase{}` share no `instance` label — the `unless` clause was a silent no-op and the rule fired identically to `TracecorePodDown`. Removed the rule; `TracecoreSelftelemetryEmpty` (which uses a real join via `job`+`instance`) already covers the listener-wedge case. Verification: `promtool check rules` reports SUCCESS: 13 rules. Closes #298 Signed-off-by: Tri Lam <tri@maydow.com>

The chart's `dashboards/slo-rules.yaml` queries `up{job="tracecore"}`, but nothing in the chart wired a scrape config, so operators had to hand-author a Prometheus job for the SLO rules to light up. Two new toggles converge on the canonical `job="tracecore"` label:
- `serviceMonitor.enabled` (default OFF; the `monitoring.coreos.com/v1` CRD ships with kube-prometheus-stack / prometheus-operator and is absent on bare clusters). Renders a ServiceMonitor with `jobLabel: app.kubernetes.io/name`, which resolves to `job="tracecore"` via the chart's selector labels. Requires a new headless Service (clusterIP: None) over the DaemonSet pods so the Operator's Service-targeted selector resolves to per-pod telemetry endpoints.
- `prometheusScrape.enabled` (default ON). Stamps `prometheus.io/scrape`, `prometheus.io/port`, `prometheus.io/path` annotations on the DaemonSet pods for vanilla-Prometheus kubernetes_sd_configs (role: pod) scrape jobs. Harmless on Operator clusters; operators using ServiceMonitor should disable to avoid double-scrape.
Chart README gains a worked example for each path, including the vanilla `scrape_configs:` block with relabel chain.
Verification:
- helm lint install/kubernetes/tracecore → OK
- helm lint --values ci/all-receivers-off-values.yaml → OK
- helm lint --values ci/one-receiver-on-values.yaml → OK
- helm lint --values ci/pyspy-on-values.yaml → OK
- helm template --set serviceMonitor.enabled=true → renders valid ServiceMonitor with jobLabel: app.kubernetes.io/name and matching headless Service.
Closes #296
Signed-off-by: Tri Lam <tri@maydow.com>

Measures NORTHSTARS O6 (velocity) + O7 (governance) supporting KPIs against current repo state. Six gap sections: CODEOWNERS coverage (0% vs 80% target), lint-enforced principles (4/16 vs 6 target), missing quarterly retros, RFC log + RFC-0013 still draft, `make ci` 148s vs 60s budget, maintainer count 1 vs >=3 by M9. Issues filed: #322 #323 #324 #325 #327 (cap of 5; gap #6 maintainer count is external, tracked via adoption pipeline). Sibling to v1-rc1-cut-criteria.md + v1-rc1-operational-gaps.md. Signed-off-by: Tri Lam <tri@maydow.com>

Read-only audit that measures moat coverage (82.9% module-wide) and inventories the two existing fuzz harnesses, three chaos rows, three benchmark files, and zero property tests. Files five rc1-prep test-gap issues (#328-#332) covering the highest-leverage gaps: three sub-80% packages, missing ncclfrreceiver integration test, chaos matrix gap to shipped patterns, and pyspy framing nightly fuzz soak. Signed-off-by: Tri Lam <tri@maydow.com>

Signed-off-by: Tri Lam <tri@maydow.com>

Three references to a docs/integrations/cert-manager-mtls.md recipe that has never existed in the tree were blocking doc-check. Replaced with inline upstream pointers + 'per-operator until a dedicated recipe lands' wording so the operational guidance survives the link removal. Side cleanup blocking the chore/v1-rc1-knowledge-gaps push; root cause is that #291 (multi-cluster v0 federation) shipped forward-references to a recipe that was deferred. Signed-off-by: Tri Lam <tri@maydow.com>

mark hw.gpu.{throttle,nvlink,pci.bdf,index} + error.{subtype, persistence} as tracecore-ext, point at #265 for upstream proposal. pin all 4 pattern docs on gpu.id (PCI BDF, RFC-0013 §3) as the customer-stable join key; PromQL examples + alerts switch from hw_id to gpu_id. closes #265 closes #276 Signed-off-by: Tri Lam <tri@maydow.com>

native OTLP path = bare-underscored (pattern_id, k8s_node_name). attributes_* prefix is Promtail/Alloy JSON-extraction surface. align loki.md with dashboard PR #264's native-OTLP shape. closes #283 Signed-off-by: Tri Lam <tri@maydow.com>

Adds an opt-in default-deny NetworkPolicy template to the chart:
- Ingress restricted to the telemetry + health ports, scrape sources configurable via `networkPolicy.allowedScrapers` (default `namespaceSelector: {}` = same-namespace).
- Egress restricted to cluster DNS + operator-declared `allowedEgressEndpoints` (CIDR + port pairs for OTLP-out). Empty egress list keeps the path closed until the operator declares destinations explicitly, so it is auditable by construction.
- Default OFF so the first-install path stays compatible with CNIs that ignore NetworkPolicy (Flannel without canal). Enable on Calico / Cilium / kube-router.
Adds docs/integrations/cert-manager-mtls.md: cert-manager ClusterIssuer + Certificate shape, chart wiring through config.exporters.otlphttp.tls.*, renewal contract (reload_interval >= 2 * renewBefore), and verification steps including the falsifier that catches a silently-downgraded one-way-TLS aggregation listener.
Verification:
- docker run --rm alpine/helm:3.16.4 lint install/kubernetes/tracecore -> OK
- ... --set networkPolicy.enabled=true -> OK
- helm template ... --set networkPolicy.enabled=true -> renders valid NetworkPolicy with policyTypes Ingress + Egress, scrape-in on telemetry + health ports, DNS + OTLP-out egress allow-list.
Signed-off-by: Tri Lam <tri@maydow.com>

First written threat model for tracecore. Inventory assets (cluster topology, GPU error patterns, OTLP credentials, kube-apiserver token, host kernel reads, NCCL FR pickle streams). Walk five trust boundaries (hostPath, kube-api, OTLP egress, cgo, FR shared mem). Apply STRIDE per boundary with mitigation status. Top-10 risk ranking with residual work. Audit RFP scope (~17 person-days across 8 subscopes) and in-repo prep checklist that gates handing the work to a paid auditor before v1.0 GA per the Tier 2 prereq. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>

#294: process_resident_memory_bytes is not emitted by tracecore; upstream service/telemetry emits otelcol_process_memory_rss. #295: verdict log records do not carry latency_seconds; drop the LogQL stop-gap that would return empty. Pin SLI 1 binding to v0.5+ when verdict_emit_seconds histogram lands. Note #261 scope excludes the latency histogram (verdicts_emitted_total + evidence_count + consume_logs_duration are #261's scope). #299: B12 is "v0.x → v1.0 one-shot upgrade guide", not verdict-ack; drop fabricated roadmap pointer. PRINCIPLES.md §1 link mismatch in RUNBOOK fixed to RFC-0013 §1 (the actual href target). closes #294 closes #295 closes #299 Signed-off-by: Tri Lam <tri@maydow.com>

panel 1 description listed only 14/15/16 as shipped; HBM ECC (pattern_id=3) landed in #274. add it to the enumeration. README matrix + template-var options already reflected the ship. closes #287 Signed-off-by: Tri Lam <tri@maydow.com>

Pattern #2 (InfiniBand link flap), per NORTHSTARS Appendix A row #2 and the design spec at docs/patterns/02-ib-link-flap.md.
Detector evaluation rule
- bucket IB port-state transitions by (node, HCA, port) within CorrelationWindow (default 2min)
- fire when transitions >= MinTransitions (default 2)
- promote Confidence to full when a stuck NCCL FR cohort (>= MinHangingRanks ranks, non-completed-state) lands on the same node within the same window; otherwise partial
Cross-rank correlation primitive
- groupStuckNCCLByNode lifted as an inline helper inside the ib_link_flap detector; same shape will recur in pattern #7 (dataloader-hang) and #9 (nccl-bootstrap-timeout). Refactor to a shared module follows in the next commit.
Wiring
- NCCLFRRecord.Node added so the cross-rank correlation can join on node identity (k8sattributes resource attr); existing nccl_hang detector ignores it (collective-scoped, not node-scoped)
- projectIBPortStateRecord reads hw.network.ib.port.state + hw.network.ib.device + hw.network.ib.port.num, the customer-stable namespace declared in docs/patterns/02-ib-link-flap.md
- appendIBLinkFlapVerdict promotes (k8s.node.name, hw.network.ib.device, hw.network.ib.port.num, tracecore.alert.ib_link_flap.transition_count, nccl.fr.collective_seq_id) per the issue #270 scalar-promotion contract; pattern.confidence is full|partial
- Config gains ib_link_flap_window + ib_link_flap_min_transitions with Validate floors (>=1s, >=2)
Tests
- 8 library tests (ib_link_flap_test.go): full correlation, partial on IB-alone, single-transition no-fire, transitions-outside-window no-fire, different-ports-do-not-combine, NCCL-on-different-node does not join, configurable transition threshold, deterministic ordering
- 5 processor tests (ib_link_flap_test.go): full verdict + promoted scalars, partial on IB-alone, partial-suppressed toggle, window validation floor, min-transitions validation floor
Cross-link to spec: docs/patterns/02-ib-link-flap.md (authored in parallel; lands first or same-PR).
Signed-off-by: Tri Lam <tri@maydow.com>

The "is there a stuck NCCL cohort on this node?" sieve will recur across pattern #7 (dataloader-hang) and #9 (nccl-bootstrap-timeout) — both join NCCL FR rank records to a node-scoped trigger using the same MinHangingRanks / non-completed-state / age-past-window cohort rule. Lift the inline groupStuckNCCLByNode from ib_link_flap.go into a new module/pkg/patterns/cross_rank.go as StuckNCCLCohortByNode, exposed for cross-pattern reuse. minRanks parameter lets future patterns tune the cohort floor when their semantics differ (zero falls back to the package-default MinHangingRanks). The nccl_hang detector keeps its own cohort logic in nccl_hang.go because it joins on (pg+collective), not node — the duplication is intentional, not accidental, and folding both into one parameterized helper would obscure the load-bearing cohort-key tradeoff. No behavior change; library + processor tests still green. Signed-off-by: Tri Lam <tri@maydow.com>

Issue #261: the patterns Grafana dashboard drives every verdict-rate panel off Loki LogQL because no processor-emitted counter exists for PromQL queries. Adds `otelcol.processor.patterndetector.verdicts_emitted_total` (renders as `otelcol_processor_patterndetector_verdicts_emitted_total` via the Prometheus exporter — RFC-0013 namespace alignment), partitioned by `pattern_id` + `confidence` + `component_id`. Once the metric ships, the three LogQL panels in patterns.json can swap to native PromQL and the dashboard works on Prometheus-only stacks. TDD: red TestVerdictsEmittedCounter_PodEvicted + factory-wiring test first; impl plumbed selfTelemetry through the per-detector loop in ConsumeLogs so each appendXxxVerdict call site ticks once. Falls back to noop selfTelemetry on missing/broken MeterProvider so the data path stays alive when telemetry init fails — mirrors the ncclfrreceiver selftel convention. Closes #261. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>

Extends prometheus-scrape.md with the bridge attribute contract for the four metrics-derived patterns:
- pattern #1 NVLink (#260): the `hw.gpu.nvlink.io` OTTL transform already lands in commit 0baa557; this PR closes #260's recipe-half.
- pattern #3 HBM ECC (#273): `hw.errors.delta` + error.{type,subtype,persistence} + gpu.id contract.
- pattern #4 thermal throttle (#282): `hw.gpu.throttle.duration.delta` in integer seconds + reason=thermal + gpu.id contract.
- pattern #5 PCIe AER Layer 2 (#284): the `tracecore.alert.pcie_rate_collapse.*` namespace contract.
OTTL metrics->logs emission stays upstream-blocked at OTel-contrib v0.130 (RFC-0014): no contrib processor or connector emits log records from a metrics pipeline. The bridge contract documented here is the load-bearing wire format any future emitter (an upstream metricthresholdconnector OR the WithMetrics extension to patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector projections at module/processor/patterndetectorprocessor/patterndetector.go gate on this contract today. last-verified marker bumped to 2026-06-01.
Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed under #285 in a prior commit; Layer 2 contract documented here).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>

Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md.
Detector evaluation rule
- per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow (default 2min, forward-only - fb.Timestamp <= oom.Timestamp)
- if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) -> kind=fragmentation (raise max_split_size_mb, empty_cache)
- if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard)
- if no FB sample joins -> kind=unknown, confidence=partial
Discriminator value
- fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM
- without DCGM cross-check the operator retries with same batch, hits same OOM, wastes a slot
- partial-confidence verdict surfaces the OOM even when DCGM scrape lags, so the operator branches on concurrent pod_evicted / xid_correlation rather than silence
Files
- module/pkg/patterns/cuda_oom.go - detector + verdict + records
- module/processor/patterndetectorprocessor/cuda_oom.go - projections, collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector
- module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring tests + 2 Validate guards
- module/processor/patterndetectorprocessor/example_config.yaml - cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold knobs
- docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries
Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*, cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes, cuda_oom.fb_free_ratio, pattern.confidence. Window-edge fenced both sides per PR #255 lesson. Threshold-boundary fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors xid_correlation / pcie_aer / hbm_ecc.
Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per ADR-0001 (PR-B)
Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor suites green with -race; make check + make build green
Refs #303
Signed-off-by: Tri Lam <tri@maydow.com>

Adds bench/detectors/ with six benchmarks (one per pattern-library detector: pod_evicted, xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, pcie_aer), each driving a fixed 1024-event window so allocs/op is interpretable as allocations-per-evaluate-pass. Pre-built fixtures live outside b.ResetTimer() so the measured allocations are detector-side only.
Ratchet wiring:
- bench/detectors/baselines.json pins current allocs/op + B/op per detector; allocs/op is hardware-invariant so the value gates cleanly across CI and dev hardware.
- make bench-detectors / -check / -baseline targets for the run + compare + regenerate loop.
- scripts/bench-check-detectors.sh is the SOFT gate (issue #302): prints regression deltas but exits 0 today. Graduation criterion to hard-fail (N=10 PRs with alloc-CV < 1%) documented in bench/detectors/README.md; flipping the gate is a one-line change to the script's final exit + baselines.json's gate_mode field.
- .github/workflows/bench.yml runs the soft check on push-to-main and posts deltas to the job summary.
Baseline numbers (Apple M1 Max, darwin/arm64; allocs/op):
- PodEvictedDetector 15635
- XidCorrelationDetector 12699
- NCCLHangDetector 4088
- HBMECCDetector 1429
- ThermalThrottleDetector 780
- PCIeAERDetector 524
Signed-off-by: Tri Lam <tri@maydow.com>
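
To illustrate the soft-gate idea, here is a small sketch that measures allocations for a toy evaluate pass with the standard library's `testing.AllocsPerRun` and compares them against pinned baselines. The names `countByNode` and `softRatchet` are hypothetical; the real harness uses Go benchmarks plus scripts/bench-check-detectors.sh and baselines.json, whose exact formats are not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
	"testing"
)

// sink keeps the compiler from discarding the measured call.
var sink int

// countByNode is a toy stand-in for a detector evaluate pass.
func countByNode(events []string) int {
	seen := make(map[string]int)
	for _, e := range events {
		seen[e]++
	}
	return len(seen)
}

// softRatchet lists detectors whose measured allocs/op exceed the baseline.
// In soft mode it always returns exit code 0; in hard mode any regression
// returns 1, which is the "flip the gate" step.
func softRatchet(measured, baseline map[string]float64, hard bool) ([]string, int) {
	var regressions []string
	for name, base := range baseline {
		if got, ok := measured[name]; ok && got > base {
			regressions = append(regressions,
				fmt.Sprintf("%s: %.0f -> %.0f allocs/op", name, base, got))
		}
	}
	sort.Strings(regressions)
	if hard && len(regressions) > 0 {
		return regressions, 1
	}
	return regressions, 0
}

func main() {
	// Fixture is built once, outside the measured function.
	window := make([]string, 1024)
	for i := range window {
		window[i] = fmt.Sprintf("node-%d", i%16)
	}
	allocs := testing.AllocsPerRun(100, func() { sink = countByNode(window) })

	measured := map[string]float64{"ToyDetector": allocs}
	baseline := map[string]float64{"ToyDetector": allocs} // pinned from an earlier run in practice
	deltas, code := softRatchet(measured, baseline, false)
	fmt.Println("regressions:", deltas, "exit code:", code)
}
```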

The dedicated cert-manager recipe landed in #301; restore the multi-cluster.md links that 0f3833c routed to upstream while the recipe was missing. Signed-off-by: Tri Lam <tri@maydow.com>

Resolve 5 conflicts post-PR #310 / #312 / #313:
- factory.go deleted on main (merged into patterndetector.go); port wave's selftel wiring (#261) into the merged createLogs
- VerdictAttr* unexported per #310; rename 16 wave-added consts + all callers across cuda_oom + ib_link_flap + pcie_aer tests
- docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content reconcile after MILESTONES.md moved to docs/
Address reviewer findings before PR:
- docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md (Linux CI is case-sensitive)
- pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`, code emits "2"/"10"/.../"13"; rewrite spec attribute tables to match shipped customer-stable namespace
- pattern.confidence: 8 specs said `high|partial`, code emits `full|partial`; rewrite
- 02-ib-link-flap.md attribute drift: spec said tracecore.alert.ib_link_flap.{hca_device,port}, code emits hw.network.ib.{device,port.num}; align spec to shipped code
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival ("6 patterns shipped" -> "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true with empty allowedEgressEndpoints (silently kills OTLP exporter) + warning when ServiceMonitor scraper in different namespace
- File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector consumes but recipe gap means it ships dark)
Tests: ./module/processor/patterndetectorprocessor/... + ./module/pkg/patterns/... both ok.
Signed-off-by: Tri Lam <tri@maydow.com>

Close attribute-namespace-check advisory gap surfaced by merge: - tracecore.alert.pcie_rate_collapse.drop_ratio (was emitted by appendPCIeAERVerdict, missing from inventory) - tracecore.alert.ib_link_flap.transition_count - hw.network.ib.{device,port.num,port.state} (new hw.network.* section for IB/RDMA semconv) Attribute-namespace-check now reports 67/67 documented (was 62/67). Signed-off-by: Tri Lam <tri@maydow.com>

The audit docs were authored when NORTHSTARS.md + MILESTONES.md lived at the repo root. main moved them to docs/ in PR #313 just before this wave landed. Sibling docs reference these by relative path; 22 links were stale. Replaced ../{NORTHSTARS,MILESTONES}.md → {NORTHSTARS,MILESTONES}.md across three files. doc-check passes. Signed-off-by: Tri Lam <tri@maydow.com>

Sibling docs at docs/ top level (adoption-pipeline, threat-model, standards-roadmap, reference-environments) had the same ../ → same-dir link drift as the v1-rc1-* siblings. Subdir refs under docs/{research,rfcs,patterns,followups}/ use ../ correctly because docs/X/../ resolves to docs/. Signed-off-by: Tri Lam <tri@maydow.com>

After PR #313 moved NORTHSTARS.md into docs/, the Spec column links added in the pattern-spec commit kept the pre-move ../docs/patterns/ prefix; from docs/NORTHSTARS.md the correct relative path is just patterns/. 12 links fixed; doc-check clears. Signed-off-by: Tri Lam <tri@maydow.com>

The new v1-rc1-cut-criteria bullet added by this wave referenced PRINCIPLES.md as a sibling, but MILESTONES.md now lives in docs/ while PRINCIPLES.md stayed at repo root. Path is ../PRINCIPLES.md from docs/MILESTONES.md. Signed-off-by: Tri Lam <tri@maydow.com>

Authoring drift: the v1-rc1-cut-criteria bullets pre-existed the PR #313 MILESTONES.md → docs/MILESTONES.md move, so links carried docs/ prefix that now double-resolves to docs/docs/. Strip to sibling-relative. Signed-off-by: Tri Lam <tri@maydow.com>

## Summary

Closes #301 by adding a first-class `tls.*` chart surface so operators wire cert-manager-issued mTLS material without a custom DaemonSet patch overlay. The NetworkPolicy half of the issue (opt-in `networkPolicy.enabled`) landed earlier in #338; this PR closes the remaining gap.

- `tls.enabled` (bool, default false), `tls.certificateRef` (kubernetes.io/tls Secret name; required when enabled, otherwise the helm-template render fails closed with a clear error), `tls.mountPath` (absolute dir, schema-validated `^/`, default `/etc/tracecore/tls`).
- DaemonSet projects the Secret read-only (`defaultMode: 0400`); the chart does NOT inject `tls:` clauses into the rendered config. Operators wire `cert_file` / `key_file` / `ca_file` / `client_ca_file` via the free-form `config:` block referencing the projected file literals.
- `docs/integrations/cert-manager-mtls.md` loses the "requires a patch overlay" workaround and gains an aggregation-side example showing `client_ca_file` placement (the falsifier for silent one-way-TLS downgrade).

## Root cause

Issue #301 lists `tls.enabled` and `tls.certificateRef` as required values knobs. The chart never shipped them. The cert-manager mtls recipe instead carried prose telling operators to "patch overlay" the DaemonSet template, which is precisely the kind of friction the chart-surface knob exists to eliminate. This PR fixes the root cause (no typed knob) rather than refreshing the workaround prose.

## Test plan

- [x] `helm lint install/kubernetes/tracecore`: clean.
- [x] `helm lint install/kubernetes/tracecore -f values-production.yaml`: clean.
- [x] `helm template` default render: zero `tls` volumes.
- [x] `helm template --set tls.enabled=true`: fails closed with operator-visible error naming `tls.certificateRef`.
- [x] `helm template --set tls.enabled=true --set tls.certificateRef=foo`: projects `tls` Secret volume + `volumeMount` at default `/etc/tracecore/tls`, readOnly true, mode 0400.
- [x] `helm template --set tls.mountPath=not-absolute`: schema rejects with `Does not match pattern '^/'`.
- [x] CI `.github/workflows/chart.yml` render-job has a five-step falsifier suite covering all of the above.
- [x] Pre-commit gates: `make lint`, `make vet`, `go mod verify`, attribute-namespace-check, hit-line-format-stable, and no-autoupdate-check all green at commit time.

```release-notes
- chart: typed `tls.{enabled,certificateRef,mountPath}` knob mounts a cert-manager-issued mTLS Secret into the DaemonSet read-only. Default off; required Secret reference is enforced at helm-template time so misconfiguration fails closed rather than silently disabling mTLS.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
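The fail-closed contract described for the `tls.*` knobs above can be sketched in Go. This is only an illustration of the rules stated in the commit message: the chart enforces them in its Helm templates and values schema, not in Go, and the names `TLSValues` and `resolveTLS` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TLSValues is a hypothetical Go mirror of the chart's tls.* values.
// The real chart defines these in values.yaml and its JSON schema.
type TLSValues struct {
	Enabled        bool
	CertificateRef string // name of a kubernetes.io/tls Secret
	MountPath      string // absolute directory inside the container
}

// defaultMountPath is the default quoted in the commit message.
const defaultMountPath = "/etc/tracecore/tls"

// resolveTLS applies the default mount path and rejects configurations
// that the commit message says must fail at render time.
func resolveTLS(v TLSValues) (TLSValues, error) {
	if v.MountPath == "" {
		v.MountPath = defaultMountPath
	}
	// Mirrors the schema pattern '^/': the path must be absolute.
	if !strings.HasPrefix(v.MountPath, "/") {
		return v, fmt.Errorf("tls.mountPath %q does not match pattern '^/'", v.MountPath)
	}
	// Fail closed: enabling TLS without a Secret reference is an error,
	// never a silent fallback to plaintext.
	if v.Enabled && v.CertificateRef == "" {
		return v, errors.New("tls.certificateRef is required when tls.enabled=true")
	}
	return v, nil
}

func main() {
	cases := []TLSValues{
		{},                                     // default render, TLS off
		{Enabled: true},                        // missing Secret reference
		{Enabled: true, CertificateRef: "foo"}, // valid
		{MountPath: "not-absolute"},            // rejected by the path rule
	}
	for _, c := range cases {
		out, err := resolveTLS(c)
		fmt.Printf("in=%+v out=%+v err=%v\n", c, out, err)
	}
}
```

The four cases in `main` correspond to the four `helm template` invocations in the test plan; only the third one yields a projected Secret volume.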

…451)

## Summary

- Adds the `transform/cuda_oom` OTTL processor to `docs/integrations/examples/filelog-container.yaml`, stamping `cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) off PyTorch's canonical `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ...` stderr line.
- Closes the integration gap pattern #10's detector (PR #338) carried since merge: `projectCUDAOOMLogRecord` (`module/processor/patterndetectorprocessor/cuda_oom.go`) gates on `cuda_oom.tried_alloc_bytes` + `gpu.id` but no upstream recipe stamped them, so the compiled detector received no real input at runtime.

## Root cause

Issue #303's deliverable list included `projectCUDAOOMLogRecord` (shipped in PR #338) but explicitly deferred the filelog OTTL stanza to a sibling follow-up (issue #285 / #436). The detector compiled green and its wiring tests passed against synthetic plog input, but production stderr never carried the customer-stable attributes the projector reads. This PR is the missing link: a recipe-only change with zero detector-source edits.

## Recipe design

- **Per-unit-branch shape** (KiB / MiB / GiB / TiB) because OTTL has no capture-group-conditional dispatch; the multiplier must be a literal `int64` per stanza.
- **Unit normalization via OTTL Math Expressions**: `Int(whole)*UNIT + Int(frac)*(UNIT/100)` against PyTorch's `%.2f` `format_size` shape (verified against `c10/cuda/CUDACachingAllocator.cpp`). Integer-divide-by-100 floors per-frac-unit precision loss at <1% of the unit base, three orders of magnitude under the detector's 5% fragmentation threshold.
- **`gpu.id` is NOT stamped here**: the CUDA-runtime ordinal `cuda_oom.gpu_index` is not a PCI BDF. The recipe markdown documents two operator paths: (a) k8sattributesprocessor + `nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's resource-attr fallback reads `gpu.id` off the log resource either way.
- **Tight `where IsMatch` guard** on `CUDA out of memory\. Tried to allocate`: generic CUDA errors (illegal memory access, NCCL watchdog, DataLoader worker killed) do not trip the stanza.

## Tests

TDD red → green via three new tests in `module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:

- `TestRecipe_CUDAOOM_StanzaPinsWireContract`: pins 7 load-bearing tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`, KiB/MiB/GiB/TiB, `transform/cuda_oom`) + pipeline-wiring against the live projector.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict`: end-to-end gate. Recipe-shaped log records flow through `CUDAOOMDetector` and emit a `kind=fragmentation` verdict with the expected scalar-promotion contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages`: 5 canonical positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the ≥3-positive A-tier acceptance criterion from #436.

## Self-grade: **A+**

- B: YAML syntactically valid OTel (`tracecore validate` exit 0); regex extracts bytes + GPU index with unit normalization; documented. ✓
- A: integration test green; `make validator-recipe` covers this file; regex tested against ≥3 canonical messages (5 positives total); negative cases verified. ✓
- A+: edge cases handled (multi-line traceback flattening via filelog container parser, mixed-unit messages, OOM without GPU index via tight `IsMatch` guard); cross-linked from `docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" + Open Question #2; new §`cuda_oom.*` attribute stanza in `docs/integrations/filelog-container.md` with unit-normalization arithmetic table, two `gpu.id` source paths, and a Failure-modes row. ✓

## Cross-references

- Detector source (untouched per hard rule): `module/processor/patterndetectorprocessor/cuda_oom.go`.
- Sibling DCGM metric-side recipe: PR #337 / `docs/integrations/examples/prometheus-scrape.yaml`.
- Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` (Open Q#2 resolved).
- Convention: PR #431 (recipe stanzas placement under `docs/integrations/examples/<target>.yaml`).

## Test plan

- [x] `go test ./processor/patterndetectorprocessor/ -run TestRecipe_CUDAOOM -count=1 -v`: PASS (3 tests, 8 sub-tests)
- [x] `go test ./processor/patterndetectorprocessor/ -count=1`: PASS (no regressions)
- [x] `make build`: `_build/tracecore` compiles via OCB
- [x] `./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml`: exit 0
- [x] `make validator-recipe`: 9 validated, 3 skipped (non-linux host) of 12 recipe(s)
- [x] `make doc-check`: PASS (new cross-link resolves)
- [x] `make ci-fast`: PASS (lint, vet, mod-verify, attribute-namespace-check, doc-check)

```release-notes
**Pattern #10 (CUDA OOM, deceptive allocator)**: filelogreceiver + OTTL recipe lands. The `transform/cuda_oom` stanza in `docs/integrations/examples/filelog-container.yaml` projects PyTorch's `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` stderr line onto `cuda_oom.tried_alloc_bytes` (unit-normalized to bytes across KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index`, closing the load-bearing input gap left by the v0.3 detector ship (PR #338).
```

Closes #436. Refs #338, #303, #337.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
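The unit-normalization arithmetic from the recipe design, `Int(whole)*UNIT + Int(frac)*(UNIT/100)`, can be reproduced in Go to check a value by hand. This is a sketch, not the recipe itself: the shipped stanza is written in OTTL, and the regular expression, `unitBytes` table, and `parseCUDAOOM` function below are hypothetical stand-ins built only from the message shape quoted in the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomLine is an assumed Go equivalent of the recipe's guard plus capture
// groups: whole part, two-digit fraction, unit, and CUDA-runtime GPU ordinal.
var oomLine = regexp.MustCompile(
	`CUDA out of memory\. Tried to allocate (\d+)\.(\d{2}) (KiB|MiB|GiB|TiB)\. GPU (\d+)`)

// unitBytes holds the literal multiplier for each unit branch.
var unitBytes = map[string]int64{
	"KiB": 1 << 10,
	"MiB": 1 << 20,
	"GiB": 1 << 30,
	"TiB": 1 << 40,
}

// parseCUDAOOM returns the tried-allocation size in bytes and the GPU index.
// ok is false for lines that do not match the OOM message shape.
func parseCUDAOOM(line string) (triedBytes, gpuIndex int64, ok bool) {
	m := oomLine.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, false
	}
	whole, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	frac, err := strconv.ParseInt(m[2], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	gpuIndex, err = strconv.ParseInt(m[4], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	unit := unitBytes[m[3]]
	// Integer arithmetic only: unit/100 floors, so the result can sit
	// slightly below the exact fractional value.
	return whole*unit + frac*(unit/100), gpuIndex, true
}

func main() {
	lines := []string{
		"RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB. GPU 0 has a total capacity of 79.15 GiB",
		"RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 3 has a total capacity of 79.15 GiB",
		"RuntimeError: CUDA error: an illegal memory access was encountered",
	}
	for _, l := range lines {
		b, g, ok := parseCUDAOOM(l)
		fmt.Printf("ok=%v bytes=%d gpu=%d\n", ok, b, g)
	}
}
```

For the 2.50 GiB line the sketch yields 2684354548 bytes, 12 bytes below the exact 2684354560, which shows the effect of the floored `UNIT/100` term. The third line does not match and is skipped, as the tight guard intends.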

Tri Lam added 30 commits June 1, 2026 00:01

docs: fix stale goreleaser + pyspy references (#268, #269)

d28897d

drop goreleaser. tell truth: inline-shell pipeline + SLSA + cosign. pyspy delete deferred v0.4.0+ per #222, not v0.3.0. closes #268 closes #269 Signed-off-by: Tri Lam <tri@maydow.com>

docs: simplification audit pre-v1.0 freeze

61c85f9

Signed-off-by: Tri Lam <tri@maydow.com>

docs(multi-cluster): relink cert-manager-mtls.md (#301 follow-up)

a6ef3cb

The dedicated cert-manager recipe landed in #301; restore the multi-cluster.md links that 0f3833c routed to upstream while the recipe was missing. Signed-off-by: Tri Lam <tri@maydow.com>

trilamsr enabled auto-merge (squash) June 1, 2026 08:07

trilamsr merged commit 8d9750a into main Jun 1, 2026
21 checks passed

trilamsr deleted the chore/v1-rc1-knowledge-gaps branch June 1, 2026 08:15

trilamsr merged 31 commits into main from chore/v1-rc1-knowledge-gaps

trilamsr commented Jun 1, 2026
