
feat(v1-rc1): 2 detectors + 8 pattern specs + chart NetPol + rc1 audits #338

Merged
trilamsr merged 31 commits into main from chore/v1-rc1-knowledge-gaps on Jun 1, 2026

@trilamsr trilamsr commented Jun 1, 2026

Contributor

Summary

15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing horizon backlog. 31 commits, 81 files, +8650/-180.

Code (5 detectors / features):

8 pattern design specs (docs/patterns/{02,07-13}-*.md):

  • Per pattern: symptom, layers crossed, signal sources, detector evaluation rule, verdict attrs, edge cases, open questions.
  • 7 load-bearing spec gaps flagged for future TDD red-test work (multi-vendor SDC signal, cohort grouping, processor metrics path, etc).

9 v1.0-rc1 audit / knowledge-gap docs:

Doc-drift cleanup: 11 issues closed (#265, #268, #269, #276, #283, #287, #292-295, #299).

OTTL recipe wiring: 6 issues closed (#260, #261, #273, #282, #284, #285); #272 deferred to standards-roadmap.

Multi-cluster auth: bearer-token + mTLS examples (#297).

Merge resolution + reviewer fixes:

Test plan

  • go test ./module/processor/patterndetectorprocessor/... ./module/pkg/patterns/... — ok
  • make lint (golangci-lint via goreleaser-style gate) — 0 issues
  • go vet ./... — clean
  • make doc-check — passes after stale-link sweep
  • scripts/attribute-namespace-check.sh — 67/67 documented
  • helm lint install/kubernetes/tracecore — 0 chart(s) failed
  • promtool check rules on slo-rules.yaml — 13 rules / SUCCESS
  • CI compat-matrix (rc1 criterion #6) — gated on next wave
  • manual smoke install on real cluster — owner clearance pending

Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM
fragmentation-vs-true discriminator), 8 pattern design specs for the
remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy
+ Prometheus Operator ServiceMonitor on the Helm chart, the
EvictedPod → PodName/PodNamespace verdict-attribute deprecation
co-emit, per-detector allocs/event bench harness, SLO-rules label
fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps,
governance gaps, test audit, simplification audit, threat model,
reference envs, adoption pipeline, standards roadmap).

Tri Lam added 30 commits June 1, 2026 00:01
NORTHSTARS O2 hero KPI (install-to-first-data) and O5 hero KPI (>=15
production orgs at M12) both pointed at undefined ground truth. O2
named a 'production-realistic 32-GPU' tier without binding any
hardware spec; O5 had a methodology stub at docs/nps.md but no funnel
mechanics, no tracking artifact, no public-counter rule consistent
with O2 Op-Rule #5 / O5 Op-Rule #1 (never phone home).

docs/reference-environments.md pins both tiers - Minimal (kind, k8s
1.32, DCGM 4.4.x stub, NCCL 2.30+, ubuntu-latest, the existing
bench/install/ harness) and Production-realistic (32xH100 SXM5 across
4 nodes, NVLink+IB, Calico+Multus+RDMA, Lustre, Kueue v0.17.x).
Names three cluster-access paths (partner-volunteered, rented,
quarterly-drill fallback) and flags partner-volunteered as the v1.0
default expectation.

docs/adoption-pipeline.md defines the S0->S3 funnel with explicit
definition-of-done per stage, target pilot profile, the
docs/followups/_pilots.md tracking artifact, the release-prep-PR
public-counter cadence (no scraping, no auto-update), and one-line
comms templates per transition.

Cross-refs added in docs/nps.md and docs/README.md.

Signed-off-by: Tri Lam <tri@maydow.com>
Three rc1-blocking operational gaps documented in docs/v1-rc1-operational-gaps.md with file:line evidence, falsifiable cut criteria, numbered remediation steps tagged by work-type (code/doc/ops/external-dep), S/M/L effort estimates, and explicit external blockers (slsa-go-generator OCB-submodule integration; healthcheckextension single-path upstream limitation; harden-runner audit-only on hosted runners).

MILESTONES.md M21 cross-refs the new doc as a release-prep dependency. Per-step issues #314-#321 filed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
Marks XidCorrelationVerdict.EvictedPod deprecated in v0.4; legacy
field continues to emit alongside the k8s-semconv split PodName /
PodNamespace (already populated since #270) through the v0.4-v0.5
window and is removed in v0.6 per ATTRIBUTES.md soft-lock policy.

Adds TestXidCorrelationVerdict_DeprecatedEvictedPodCoEmits pinning
that both legacy + successor names co-emit in the JSON shape, so a
regression that drops either side fails the gate. Updates
docs/ATTRIBUTES.md with a deprecation row + removal-target table.
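
A minimal sketch of the co-emit contract described above, assuming hypothetical JSON keys (`evicted_pod`, `pod_name`, `pod_namespace`) and a hypothetical `namespace/name` format for the legacy field; `verdictSketch` is a stand-in, not the real `XidCorrelationVerdict`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// verdictSketch stands in for XidCorrelationVerdict. The JSON keys and the
// "namespace/name" legacy format are assumptions made for this example.
type verdictSketch struct {
	// Deprecated: superseded by PodName and PodNamespace.
	EvictedPod   string `json:"evicted_pod"`
	PodName      string `json:"pod_name"`
	PodNamespace string `json:"pod_namespace"`
}

// newVerdict fills the legacy field and its successors from one source so the
// two representations cannot drift apart during the deprecation window.
func newVerdict(namespace, name string) verdictSketch {
	return verdictSketch{
		EvictedPod:   namespace + "/" + name,
		PodName:      name,
		PodNamespace: namespace,
	}
}

// missingKeys returns the required keys that are absent from the marshalled
// verdict, which is the property a co-emit regression test would pin.
func missingKeys(v verdictSketch, required ...string) ([]string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var shape map[string]interface{}
	if err := json.Unmarshal(raw, &shape); err != nil {
		return nil, err
	}
	var missing []string
	for _, k := range required {
		if _, ok := shape[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing, nil
}

func main() {
	v := newVerdict("training", "worker-0")
	missing, err := missingKeys(v, "evicted_pod", "pod_name", "pod_namespace")
	fmt.Println(missing, err)
}
```

Checking the marshalled key set rather than the struct fields means the check still fails if a tag is renamed or gains `omitempty`.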

Signed-off-by: Tri Lam <tri@maydow.com>
Eight 1-page pattern-design specs covering #2 IB link flap, #7
dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout,
#10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss
spike NaN, #13 silent data corruption. Each carries the standard
detector-design shape (symptom, layers, signal sources, evaluation
rule, verdict attrs, edge cases, status, open questions) so the next
contributor can write a TDD red test directly off the spec.

Status: all 8 marked planned. #10 already has issue #303; the spec
frames the design alongside.

NORTHSTARS Appendix A gains a Spec column; docs/README + patterns
README link the new specs.

Signed-off-by: Tri Lam <tri@maydow.com>
drop goreleaser. tell truth: inline-shell pipeline + SLSA + cosign.
pyspy delete deferred v0.4.0+ per #222, not v0.3.0.

closes #268
closes #269

Signed-off-by: Tri Lam <tri@maydow.com>
The drop-rate recording rule grouped `by (pipeline)`, but upstream
OTel obsreport never stamps that label — `processorhelper` only
stamps `processor` and `receiverhelper` stamps `receiver` +
`transport`. The `sum by (pipeline)` collapsed every series into a
single empty `pipeline=""` group, so the alert annotations rendered
empty.

Split the recording rule into per-instance numerator + denominator
and aggregate to a per-instance drop ratio (one pod = one tracecore
instance = one pipeline in our DaemonSet topology). Alert payload now
carries the actual instance label and the description points operators
at the per-component upstream series for localization.

The `TracecoreSelftelemetryDown` rule used
`up{...} == 0 unless on (instance) kube_pod_status_phase{phase="Failed"}`
to separate listener-wedge from pod-crash, but `up{}` and
`kube_pod_status_phase{}` share no `instance` label — the `unless`
clause was a silent no-op and the rule fired identically to
`TracecorePodDown`. Removed the rule; `TracecoreSelftelemetryEmpty`
(which uses a real join via `job`+`instance`) already covers the
listener-wedge case.

Verification: `promtool check rules` reports SUCCESS: 13 rules.

Closes #298

Signed-off-by: Tri Lam <tri@maydow.com>
The chart's `dashboards/slo-rules.yaml` queries `up{job="tracecore"}`,
but nothing in the chart wired a scrape config — operators had to
hand-author a Prometheus job for the SLO rules to light up. Two new
toggles converge on the canonical `job="tracecore"` label:

- `serviceMonitor.enabled` (default OFF — the
  `monitoring.coreos.com/v1` CRD ships with kube-prometheus-stack /
  prometheus-operator and is absent on bare clusters). Renders a
  ServiceMonitor with `jobLabel: app.kubernetes.io/name`, which
  resolves to `job="tracecore"` via the chart's selector labels.
  Requires a new headless Service (clusterIP: None) over the
  DaemonSet pods so the Operator's Service-targeted selector
  resolves to per-pod telemetry endpoints.

- `prometheusScrape.enabled` (default ON). Stamps
  `prometheus.io/scrape`, `prometheus.io/port`, `prometheus.io/path`
  annotations on the DaemonSet pods for vanilla-Prometheus
  kubernetes_sd_configs (role: pod) scrape jobs. Harmless on
  Operator clusters; operators using ServiceMonitor should disable
  to avoid double-scrape.

Chart README gains a worked example for each path, including the
vanilla `scrape_configs:` block with relabel chain.

Verification:
  helm lint install/kubernetes/tracecore                  → OK
  helm lint --values ci/all-receivers-off-values.yaml     → OK
  helm lint --values ci/one-receiver-on-values.yaml       → OK
  helm lint --values ci/pyspy-on-values.yaml              → OK
  helm template --set serviceMonitor.enabled=true         → renders
    valid ServiceMonitor with jobLabel: app.kubernetes.io/name and
    matching headless Service.

Closes #296

Signed-off-by: Tri Lam <tri@maydow.com>
Measures NORTHSTARS O6 (velocity) + O7 (governance) supporting
KPIs against current repo state. Six gap sections: CODEOWNERS
coverage (0% vs 80% target), lint-enforced principles (4/16
vs 6 target), missing quarterly retros, RFC log + RFC-0013
still draft, `make ci` 148s vs 60s budget, maintainer count
1 vs >=3 by M9.

Issues filed: #322 #323 #324 #325 #327 (cap of 5; gap #6
maintainer count is external, tracked via adoption pipeline).

Sibling to v1-rc1-cut-criteria.md + v1-rc1-operational-gaps.md.

Signed-off-by: Tri Lam <tri@maydow.com>
Read-only audit that measures moat coverage (82.9% module-wide) and
inventories the two existing fuzz harnesses, three chaos rows, three
benchmark files, and zero property tests. Files five rc1-prep test-gap issues
(#328-#332) covering the highest-leverage gaps: three sub-80% packages,
missing ncclfrreceiver integration test, chaos matrix gap to shipped
patterns, and pyspy framing nightly fuzz soak.

Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Three references to a docs/integrations/cert-manager-mtls.md recipe that has never existed in the tree were blocking doc-check. Replaced with inline upstream pointers + 'per-operator until a dedicated recipe lands' wording so the operational guidance survives the link removal.

Side cleanup blocking the chore/v1-rc1-knowledge-gaps push; root cause is that #291 (multi-cluster v0 federation) shipped forward-references to a recipe that was deferred.

Signed-off-by: Tri Lam <tri@maydow.com>
mark hw.gpu.{throttle,nvlink,pci.bdf,index} + error.{subtype,
persistence} as tracecore-ext, point at #265 for upstream proposal.
pin all 4 pattern docs on gpu.id (PCI BDF, RFC-0013 §3) as the
customer-stable join key; PromQL examples + alerts switch from
hw_id to gpu_id.

closes #265
closes #276

Signed-off-by: Tri Lam <tri@maydow.com>
native OTLP path = bare-underscored (pattern_id, k8s_node_name).
attributes_* prefix is Promtail/Alloy JSON-extraction surface.
align loki.md with dashboard PR #264's native-OTLP shape.

closes #283

Signed-off-by: Tri Lam <tri@maydow.com>
Adds an opt-in default-deny NetworkPolicy template to the chart:

- Ingress restricted to the telemetry + health ports, scrape sources
  configurable via `networkPolicy.allowedScrapers` (default
  `namespaceSelector: {}` = same-namespace).
- Egress restricted to cluster DNS + operator-declared
  `allowedEgressEndpoints` (CIDR + port pairs for OTLP-out). Empty
  egress list keeps the path closed until the operator declares
  destinations explicitly — auditable by construction.
- Default OFF so the first-install path stays compatible with CNIs
  that ignore NetworkPolicy (Flannel without canal). Enable on
  Calico / Cilium / kube-router.

Adds docs/integrations/cert-manager-mtls.md: cert-manager
ClusterIssuer + Certificate shape, chart wiring through
config.exporters.otlphttp.tls.*, renewal contract
(reload_interval >= 2 * renewBefore), and verification steps
including the falsifier that catches a silently-downgraded
one-way-TLS aggregation listener.

Verification:
  docker run --rm alpine/helm:3.16.4 lint install/kubernetes/tracecore  -> OK
  ... --set networkPolicy.enabled=true                                  -> OK
  helm template ... --set networkPolicy.enabled=true                    -> renders
    valid NetworkPolicy with policyTypes Ingress + Egress, scrape-in
    on telemetry + health ports, DNS + OTLP-out egress allow-list.

Signed-off-by: Tri Lam <tri@maydow.com>
First written threat model for tracecore. Inventory assets (cluster
topology, GPU error patterns, OTLP credentials, kube-apiserver token,
host kernel reads, NCCL FR pickle streams). Walk five trust boundaries
(hostPath, kube-api, OTLP egress, cgo, FR shared mem). Apply STRIDE per
boundary with mitigation status. Top-10 risk ranking with residual
work. Audit RFP scope (~17 person-days across 8 subscopes) and in-repo
prep checklist that gates handing the work to a paid auditor before
v1.0 GA per the Tier 2 prereq.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
#294: process_resident_memory_bytes is not emitted by tracecore;
upstream service/telemetry emits otelcol_process_memory_rss.

#295: verdict log records do not carry latency_seconds; drop the
LogQL stop-gap that would return empty. Pin SLI 1 binding to v0.5+
when verdict_emit_seconds histogram lands. Note #261 scope excludes
the latency histogram (verdicts_emitted_total + evidence_count +
consume_logs_duration are #261's scope).

#299: B12 is "v0.x → v1.0 one-shot upgrade guide", not verdict-ack;
drop fabricated roadmap pointer. PRINCIPLES.md §1 link mismatch in
RUNBOOK fixed to RFC-0013 §1 (the actual href target).

closes #294
closes #295
closes #299

Signed-off-by: Tri Lam <tri@maydow.com>
panel 1 description listed only 14/15/16 as shipped; HBM ECC
(pattern_id=3) landed in #274. add it to the enumeration. README
matrix + template-var options already reflected the ship.

closes #287

Signed-off-by: Tri Lam <tri@maydow.com>
Pattern #2 — InfiniBand link flap — per NORTHSTARS Appendix A row #2
and the design spec at docs/patterns/02-ib-link-flap.md.

Detector evaluation rule
- bucket IB port-state transitions by (node, HCA, port) within
  CorrelationWindow (default 2min)
- fire when transitions >= MinTransitions (default 2)
- promote Confidence to full when a stuck NCCL FR cohort
  (>= MinHangingRanks ranks, non-completed-state) lands on the
  same node within the same window; otherwise partial

Cross-rank correlation primitive
- groupStuckNCCLByNode lifted as an inline helper inside the
  ib_link_flap detector; same shape will recur in pattern #7
  (dataloader-hang) and #9 (nccl-bootstrap-timeout). Refactor to a
  shared module follows in the next commit.

Wiring
- NCCLFRRecord.Node added so the cross-rank correlation can join on
  node identity (k8sattributes resource attr); existing nccl_hang
  detector ignores it (collective-scoped, not node-scoped)
- projectIBPortStateRecord reads hw.network.ib.port.state +
  hw.network.ib.device + hw.network.ib.port.num — the customer-stable
  namespace declared in docs/patterns/02-ib-link-flap.md
- appendIBLinkFlapVerdict promotes (k8s.node.name,
  hw.network.ib.device, hw.network.ib.port.num,
  tracecore.alert.ib_link_flap.transition_count,
  nccl.fr.collective_seq_id) per the issue #270 scalar-promotion
  contract; pattern.confidence is full|partial
- Config gains ib_link_flap_window + ib_link_flap_min_transitions
  with Validate floors (>=1s, >=2)

Tests
- 8 library tests (ib_link_flap_test.go): full correlation, partial
  on IB-alone, single-transition no-fire, transitions-outside-window
  no-fire, different-ports-do-not-combine, NCCL-on-different-node
  does not join, configurable transition threshold, deterministic
  ordering
- 5 processor tests (ib_link_flap_test.go): full verdict + promoted
  scalars, partial on IB-alone, partial-suppressed toggle, window
  validation floor, min-transitions validation floor

Cross-link to spec: docs/patterns/02-ib-link-flap.md (authored in
parallel; lands first or same-PR).

Signed-off-by: Tri Lam <tri@maydow.com>
The "is there a stuck NCCL cohort on this node?" sieve will recur
across pattern #7 (dataloader-hang) and #9 (nccl-bootstrap-timeout)
— both join NCCL FR rank records to a node-scoped trigger using the
same MinHangingRanks / non-completed-state / age-past-window cohort
rule.

Lift the inline groupStuckNCCLByNode from ib_link_flap.go into a
new module/pkg/patterns/cross_rank.go as StuckNCCLCohortByNode,
exposed for cross-pattern reuse. minRanks parameter lets future
patterns tune the cohort floor when their semantics differ (zero
falls back to the package-default MinHangingRanks).

The nccl_hang detector keeps its own cohort logic in nccl_hang.go
because it joins on (pg+collective), not node — the duplication is
intentional, not accidental, and folding both into one parameterized
helper would obscure the load-bearing cohort-key tradeoff.

No behavior change; library + processor tests still green.

Signed-off-by: Tri Lam <tri@maydow.com>
Issue #261: the patterns Grafana dashboard drives every verdict-rate
panel off Loki LogQL because no processor-emitted counter exists for
PromQL queries. Adds
`otelcol.processor.patterndetector.verdicts_emitted_total` (renders
as `otelcol_processor_patterndetector_verdicts_emitted_total` via the
Prometheus exporter — RFC-0013 namespace alignment), partitioned by
`pattern_id` + `confidence` + `component_id`. Once the metric ships,
the three LogQL panels in patterns.json can swap to native PromQL
and the dashboard works on Prometheus-only stacks.

TDD: red TestVerdictsEmittedCounter_PodEvicted + factory-wiring test
first; impl plumbed selfTelemetry through the per-detector loop in
ConsumeLogs so each appendXxxVerdict call site ticks once. Falls
back to noop selfTelemetry on missing/broken MeterProvider so the
data path stays alive when telemetry init fails — mirrors the
ncclfrreceiver selftel convention.

Closes #261.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
Extends prometheus-scrape.md with the bridge attribute contract for the
four metrics-derived patterns:

- pattern #1 NVLink (#260) — the `hw.gpu.nvlink.io` OTTL transform
  already lands in commit 0baa557; this PR closes #260's recipe-half.
- pattern #3 HBM ECC (#273) — `hw.errors.delta` + error.{type,
  subtype,persistence} + gpu.id contract.
- pattern #4 thermal throttle (#282) — `hw.gpu.throttle.duration.delta`
  in integer seconds + reason=thermal + gpu.id contract.
- pattern #5 PCIe AER Layer 2 (#284) — the `tracecore.alert.
  pcie_rate_collapse.*` namespace contract.

OTTL metrics->logs emission stays upstream-blocked at OTel-contrib
v0.130 (RFC-0014): no contrib processor or connector emits log records
from a metrics pipeline. The bridge contract documented here is the
load-bearing wire format any future emitter (an upstream
metricthresholdconnector OR the WithMetrics extension to
patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector
projections at module/processor/patterndetectorprocessor/
patterndetector.go gate on this contract today.

last-verified marker bumped to 2026-06-01.

Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed
under #285 in a prior commit; Layer 2 contract documented here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A
row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md.

Detector evaluation rule
- per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow
  (default 2min, forward-only - fb.Timestamp <= oom.Timestamp)
- if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) ->
  kind=fragmentation (raise max_split_size_mb, empty_cache)
- if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard)
- if no FB sample joins -> kind=unknown, confidence=partial

Discriminator value
- fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM
- without DCGM cross-check the operator retries with same batch, hits
  same OOM, wastes a slot
- partial-confidence verdict surfaces the OOM even when DCGM scrape lags,
  so the operator branches on concurrent pod_evicted / xid_correlation
  rather than silence

Files
- module/pkg/patterns/cuda_oom.go - detector + verdict + records
- module/processor/patterndetectorprocessor/cuda_oom.go - projections,
  collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector
- module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring
  tests + 2 Validate guards
- module/processor/patterndetectorprocessor/example_config.yaml -
  cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold
  knobs
- docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries

Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*,
cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes,
cuda_oom.fb_free_ratio, pattern.confidence.

Window-edge fenced both sides per PR #255 lesson. Threshold-boundary
fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors
xid_correlation / pcie_aer / hbm_ecc.

Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per ADR-0001 (PR-B)

Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in
  red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor
  suites green with -race; make check + make build green

Refs #303

Signed-off-by: Tri Lam <tri@maydow.com>
Adds bench/detectors/ with six benchmarks (one per pattern-library
detector: pod_evicted, xid_correlation, hbm_ecc, nccl_hang,
thermal_throttle, pcie_aer), each driving a fixed 1024-event window
so allocs/op is interpretable as allocations-per-evaluate-pass.
Pre-built fixtures live outside b.ResetTimer() so the measured
allocations are detector-side only.
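
A minimal sketch of the benchmark shape described above, using a stand-in `evaluate` function rather than any real detector. The fixture is built before `b.ResetTimer()`, which also resets the allocation counters, so the reported allocs/op covers the evaluate pass only:

```go
package main

import (
	"fmt"
	"testing"
)

// event is a hypothetical input record.
type event struct {
	Node string
	Seq  int
}

// buildFixture creates a fixed-size window of events spread over 8 nodes.
func buildFixture(n int) []event {
	evs := make([]event, n)
	for i := range evs {
		evs[i] = event{Node: fmt.Sprintf("node-%d", i%8), Seq: i}
	}
	return evs
}

// evaluate is a stand-in detector pass: it groups sequence numbers by node.
func evaluate(evs []event) map[string][]int {
	byNode := make(map[string][]int)
	for _, e := range evs {
		byNode[e.Node] = append(byNode[e.Node], e.Seq)
	}
	return byNode
}

// benchmarkEvaluate drives one 1024-event window per iteration.
func benchmarkEvaluate(b *testing.B) {
	evs := buildFixture(1024) // fixture cost is excluded by ResetTimer below
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		evaluate(evs)
	}
}

func main() {
	testing.Init() // registers testing flags when running outside "go test"
	res := testing.Benchmark(benchmarkEvaluate)
	fmt.Printf("allocs/op=%d B/op=%d\n", res.AllocsPerOp(), res.AllocedBytesPerOp())
}
```

In a real harness the function would be named `BenchmarkXxx` in a `_test.go` file and run with `go test -bench . -benchmem`; `testing.Benchmark` is used here only to keep the sketch a single runnable program.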

Ratchet wiring:
- bench/detectors/baselines.json pins current allocs/op + B/op per
  detector; allocs/op is hardware-invariant so the value gates
  cleanly across CI and dev hardware.
- make bench-detectors / -check / -baseline targets for the run +
  compare + regenerate loop.
- scripts/bench-check-detectors.sh is the SOFT gate (issue #302):
  prints regression deltas but exits 0 today. Graduation criterion
  to hard-fail (N=10 PRs with alloc-CV < 1%) documented in
  bench/detectors/README.md; flipping the gate is a one-line change
  to the script's final exit + baselines.json's gate_mode field.
- .github/workflows/bench.yml runs the soft check on push-to-main
  and posts deltas to the job summary.
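
The soft-versus-hard gate can be illustrated in Go, though the real gate is a shell script. The `"soft"` and `"hard"` mode strings are assumed values for the `gate_mode` field, and the "current" figure in `main` is invented for the example:

```go
package main

import (
	"fmt"
	"sort"
)

// delta records one detector whose allocs/op rose above its baseline.
type delta struct {
	Name     string
	Baseline int64
	Current  int64
}

// regressions returns the detectors whose current allocs/op exceeds the
// pinned baseline, sorted by name for stable output.
func regressions(baseline, current map[string]int64) []delta {
	var out []delta
	for name, base := range baseline {
		cur, ok := current[name]
		if ok && cur > base {
			out = append(out, delta{Name: name, Baseline: base, Current: cur})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// exitCode always returns 0 in soft mode; in hard mode any regression fails.
func exitCode(gateMode string, regs []delta) int {
	if gateMode == "hard" && len(regs) > 0 {
		return 1
	}
	return 0
}

func main() {
	baseline := map[string]int64{"PodEvictedDetector": 15635, "PCIeAERDetector": 524}
	// Illustrative run: PodEvictedDetector regressed, PCIeAERDetector did not.
	current := map[string]int64{"PodEvictedDetector": 15700, "PCIeAERDetector": 524}
	regs := regressions(baseline, current)
	for _, r := range regs {
		fmt.Printf("%s: %d -> %d allocs/op (+%d)\n",
			r.Name, r.Baseline, r.Current, r.Current-r.Baseline)
	}
	fmt.Println("exit", exitCode("soft", regs))
}
```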

Baseline numbers (Apple M1 Max, darwin/arm64; allocs/op):
  PodEvictedDetector       15635
  XidCorrelationDetector   12699
  NCCLHangDetector          4088
  HBMECCDetector            1429
  ThermalThrottleDetector    780
  PCIeAERDetector            524

Signed-off-by: Tri Lam <tri@maydow.com>
The dedicated cert-manager recipe landed in #301; restore the
multi-cluster.md links that 0f3833c routed to upstream while the
recipe was missing.

Signed-off-by: Tri Lam <tri@maydow.com>
Resolve 5 conflicts post-PR #310 / #312 / #313:
- factory.go deleted on main (merged into patterndetector.go);
  port wave's selftel wiring (#261) into the merged createLogs
- VerdictAttr* unexported per #310; rename 16 wave-added consts
  + all callers across cuda_oom + ib_link_flap + pcie_aer tests
- docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content
  reconcile after MILESTONES.md moved to docs/

Address reviewer findings before PR:
- docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md
  (Linux CI is case-sensitive)
- pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`,
  code emits "2"/"10"/.../"13"; rewrite spec attribute tables to
  match shipped customer-stable namespace
- pattern.confidence: 8 specs said `high|partial`, code emits
  `full|partial`; rewrite
- 02-ib-link-flap.md attribute drift: spec said
  tracecore.alert.ib_link_flap.{hca_device,port}, code emits
  hw.network.ib.{device,port.num}; align spec to shipped code
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival
  ("6 patterns shipped" -> "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true
  with empty allowedEgressEndpoints (silently kills OTLP exporter)
  + warning when ServiceMonitor scraper in different namespace
- File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE
  -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector
  consumes but recipe gap means it ships dark)

Tests: ./module/processor/patterndetectorprocessor/... +
./module/pkg/patterns/... both ok.

Signed-off-by: Tri Lam <tri@maydow.com>
Close attribute-namespace-check advisory gap surfaced by merge:
- tracecore.alert.pcie_rate_collapse.drop_ratio (was emitted by
  appendPCIeAERVerdict, missing from inventory)
- tracecore.alert.ib_link_flap.transition_count
- hw.network.ib.{device,port.num,port.state}
  (new hw.network.* section for IB/RDMA semconv)

Attribute-namespace-check now reports 67/67 documented (was 62/67).

Signed-off-by: Tri Lam <tri@maydow.com>
The audit docs were authored when NORTHSTARS.md + MILESTONES.md lived
at the repo root. main moved them to docs/ in PR #313 just before this
wave landed. Sibling docs reference these by relative path; 22 links
were stale. Replaced ../{NORTHSTARS,MILESTONES}.md → {NORTHSTARS,MILESTONES}.md
across three files. doc-check passes.

Signed-off-by: Tri Lam <tri@maydow.com>
Sibling docs at docs/ top level (adoption-pipeline, threat-model,
standards-roadmap, reference-environments) had the same ../ → same-dir
link drift as the v1-rc1-* siblings. Subdir refs under
docs/{research,rfcs,patterns,followups}/ use ../ correctly because
docs/X/../ resolves to docs/.
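
The link drift fixed in this run of commits follows from how relative links resolve against the directory of the file that contains them. A small sketch using Go's `path` package, where the `resolve` helper is hypothetical and the file names are taken from the commit messages:

```go
package main

import (
	"fmt"
	"path"
)

// resolve returns the repo-relative target of a relative link that appears
// in the file at from.
func resolve(from, link string) string {
	return path.Join(path.Dir(from), link)
}

func main() {
	// Top-level doc under docs/: "../" climbs out to the repo root (stale).
	fmt.Println(resolve("docs/threat-model.md", "../NORTHSTARS.md")) // NORTHSTARS.md
	// Same doc with a same-directory link (the fix).
	fmt.Println(resolve("docs/threat-model.md", "NORTHSTARS.md")) // docs/NORTHSTARS.md
	// Subdirectory doc: "../" correctly lands in docs/.
	fmt.Println(resolve("docs/patterns/02-ib-link-flap.md", "../NORTHSTARS.md")) // docs/NORTHSTARS.md
	// A leftover docs/ prefix inside docs/ double-resolves.
	fmt.Println(resolve("docs/MILESTONES.md", "docs/v1-rc1-cut-criteria.md")) // docs/docs/v1-rc1-cut-criteria.md
	// PRINCIPLES.md stayed at the repo root, so "../" is right from docs/.
	fmt.Println(resolve("docs/MILESTONES.md", "../PRINCIPLES.md")) // PRINCIPLES.md
}
```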

Signed-off-by: Tri Lam <tri@maydow.com>
After PR #313 moved NORTHSTARS.md into docs/, the Spec column links
added in the pattern-spec commit kept the pre-move ../docs/patterns/
prefix; from docs/NORTHSTARS.md the correct relative path is just
patterns/. 12 links fixed; doc-check clears.

Signed-off-by: Tri Lam <tri@maydow.com>
The new v1-rc1-cut-criteria bullet added by this wave referenced
PRINCIPLES.md as a sibling, but MILESTONES.md now lives in docs/
while PRINCIPLES.md stayed at repo root. Path is ../PRINCIPLES.md
from docs/MILESTONES.md.

Signed-off-by: Tri Lam <tri@maydow.com>
Authoring drift: the v1-rc1-cut-criteria bullets pre-existed the
PR #313 MILESTONES.md → docs/MILESTONES.md move, so links carried
docs/ prefix that now double-resolves to docs/docs/. Strip to
sibling-relative.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 08:07
@trilamsr trilamsr merged commit 8d9750a into main Jun 1, 2026
21 checks passed
@trilamsr trilamsr deleted the chore/v1-rc1-knowledge-gaps branch June 1, 2026 08:15
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #301 by adding a first-class `tls.*` chart surface so operators
wire cert-manager-issued mTLS material without a custom DaemonSet
patch overlay. The NetworkPolicy half of the issue (opt-in
`networkPolicy.enabled`) landed earlier in #338; this PR closes the
remaining gap.

- `tls.enabled` (bool, default false), `tls.certificateRef`
  (kubernetes.io/tls Secret name; required when enabled — helm-template
  render fails closed with a clear error otherwise), `tls.mountPath`
  (absolute dir, schema-validated `^/`, default `/etc/tracecore/tls`).
- DaemonSet projects the Secret read-only (`defaultMode: 0400`); the
  chart does NOT inject `tls:` clauses into the rendered config —
  operators wire `cert_file` / `key_file` / `ca_file` /
  `client_ca_file` via the free-form `config:` block referencing the
  projected file literals.
- `docs/integrations/cert-manager-mtls.md` loses the "requires a patch
  overlay" workaround and gains an aggregation-side example showing
  `client_ca_file` placement (the falsifier for silent
  one-way-TLS downgrade).

## Root cause

Issue #301 lists `tls.enabled` and `tls.certificateRef` as required
values knobs. The chart never shipped them — the cert-manager mtls
recipe instead carried prose telling operators to "patch overlay" the
DaemonSet template, which is precisely the kind of friction the
chart-surface knob exists to eliminate. This PR fixes the root cause
(no typed knob) rather than refreshing the workaround prose.

## Test plan

- [x] `helm lint install/kubernetes/tracecore` — clean.
- [x] `helm lint install/kubernetes/tracecore -f values-production.yaml`
— clean.
- [x] `helm template` default render — zero `tls` volumes.
- [x] `helm template --set tls.enabled=true` — fails closed with
      operator-visible error naming `tls.certificateRef`.
- [x] `helm template --set tls.enabled=true --set
tls.certificateRef=foo`
      — projects `tls` Secret volume + `volumeMount` at default
      `/etc/tracecore/tls`, readOnly true, mode 0400.
- [x] `helm template --set tls.mountPath=not-absolute` — schema rejects
      with `Does not match pattern '^/'`.
- [x] CI `.github/workflows/chart.yml` render-job has a five-step
      falsifier suite covering all of the above.
- [x] Pre-commit gates: `make lint`, `make vet`, `go mod verify`,
      attribute-namespace-check, hit-line-format-stable, and
      no-autoupdate-check all green at commit time.

```release-notes
- chart: typed `tls.{enabled,certificateRef,mountPath}` knob mounts a
  cert-manager-issued mTLS Secret into the DaemonSet read-only. Default
  off; required Secret reference is enforced at helm-template time so
  misconfiguration fails closed rather than silently disabling mTLS.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…451)

## Summary

- Adds the `transform/cuda_oom` OTTL processor to
`docs/integrations/examples/filelog-container.yaml`, stamping
`cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized
KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) off PyTorch's canonical
`RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N
has a total capacity of ...` stderr line.
- Closes the integration gap pattern #10's detector (PR #338) carried
since merge: `projectCUDAOOMLogRecord`
(`module/processor/patterndetectorprocessor/cuda_oom.go`) gates on
`cuda_oom.tried_alloc_bytes` + `gpu.id` but no upstream recipe stamped
them, so the compiled detector received no real input at runtime.

## Root cause

Issue #303's deliverable list included `projectCUDAOOMLogRecord`
(shipped in PR #338) but explicitly deferred the filelog OTTL stanza to
a sibling follow-up (issue #285 / #436). The detector compiled green and
its wiring tests passed against synthetic plog input, but production
stderr never carried the customer-stable attributes the projector reads.
This PR is the missing link — a recipe-only change with zero
detector-source edits.

## Recipe design

- **Per-unit-branch shape** (KiB / MiB / GiB / TiB) because OTTL has no
capture-group-conditional dispatch — the multiplier must be a literal
`int64` per stanza.
- **Unit normalization via OTTL Math Expressions**: `Int(whole)*UNIT +
Int(frac)*(UNIT/100)` against PyTorch's `%.2f` `format_size` shape
(verified against `c10/cuda/CUDACachingAllocator.cpp`).
Integer-divide-by-100 floors per-frac-unit precision loss at <1% of the
unit base — three orders of magnitude under the detector's 5%
fragmentation threshold.
- **`gpu.id` is NOT stamped here**: the CUDA-runtime ordinal
`cuda_oom.gpu_index` is not a PCI BDF. The recipe markdown documents two
operator paths: (a) k8sattributesprocessor +
`nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM
BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's
resource-attr fallback reads `gpu.id` off the log resource either way.
- **Tight `where IsMatch` guard** on `CUDA out of memory\. Tried to
allocate` — generic CUDA errors (illegal memory access, NCCL watchdog,
DataLoader worker killed) do not trip the stanza.
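
The unit-normalization arithmetic can be checked outside OTTL with a short Go sketch. The regular expression is written for this example and is not the recipe's actual pattern, and the sample log line is illustrative. The formula `whole*UNIT + frac*(UNIT/100)` is the one quoted above:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomRe is an illustrative pattern for the message shape quoted in the PR.
var oomRe = regexp.MustCompile(
	`CUDA out of memory\. Tried to allocate (\d+)\.(\d{2}) (KiB|MiB|GiB|TiB)\. GPU (\d+)`)

// unitBytes maps each unit to its literal byte multiplier.
var unitBytes = map[string]int64{
	"KiB": 1 << 10,
	"MiB": 1 << 20,
	"GiB": 1 << 30,
	"TiB": 1 << 40,
}

// parseOOM extracts the attempted allocation in bytes and the CUDA ordinal.
// The fractional part uses integer division by 100, so the result can be a
// few bytes below the exact value.
func parseOOM(line string) (triedBytes int64, gpuIndex int64, ok bool) {
	m := oomRe.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, false
	}
	whole, err1 := strconv.ParseInt(m[1], 10, 64)
	frac, err2 := strconv.ParseInt(m[2], 10, 64)
	gpu, err3 := strconv.ParseInt(m[4], 10, 64)
	if err1 != nil || err2 != nil || err3 != nil {
		return 0, 0, false
	}
	unit := unitBytes[m[3]]
	return whole*unit + frac*(unit/100), gpu, true
}

func main() {
	line := "RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB. " +
		"GPU 3 has a total capacity of 79.10 GiB"
	b, gpu, ok := parseOOM(line)
	// 2*2^30 + 50*(2^30/100) = 2147483648 + 50*10737418 = 2684354548,
	// which is 12 bytes under the exact 2.5 GiB (2684354560).
	fmt.Println(b, gpu, ok) // 2684354548 3 true
}
```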

## Tests

TDD red → green via three new tests in
`module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:

- `TestRecipe_CUDAOOM_StanzaPinsWireContract` — pins 7 load-bearing
tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`,
KiB/MiB/GiB/TiB, `transform/cuda_oom`) + pipeline-wiring against the
live projector.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict` — end-to-end gate:
recipe-shaped log records flow through `CUDAOOMDetector` and emit a
`kind=fragmentation` verdict with the expected scalar-promotion
contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages` — 5 canonical
positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives
(DataLoader worker killed, NCCL watchdog, illegal memory access).
Exceeds the ≥3-positive A-tier acceptance criterion from #436.

## Self-grade: **A+**

- B: YAML syntactically valid OTel (`tracecore validate` exit 0); regex
extracts bytes + GPU index with unit normalization; documented. ✓
- A: integration test green; `make validator-recipe` covers this file;
regex tested against ≥3 canonical messages (5 positives total); negative
cases verified. ✓
- A+: edge cases handled (multi-line traceback flattening via filelog
container parser, mixed-unit messages, OOM without GPU index via tight
`IsMatch` guard); cross-linked from
`docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" + Open
Question #2; new §`cuda_oom.*` attribute stanza in
`docs/integrations/filelog-container.md` with unit-normalization
arithmetic table, two `gpu.id` source paths, and a Failure-modes row. ✓

## Cross-references

- Detector source (untouched per hard rule):
`module/processor/patterndetectorprocessor/cuda_oom.go`.
- Sibling DCGM metric-side recipe: PR #337 /
`docs/integrations/examples/prometheus-scrape.yaml`.
- Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` — Open Q#2
resolved.
- Convention: PR #431 (recipe stanzas placement under
`docs/integrations/examples/<target>.yaml`).

## Test plan

- [x] `go test ./processor/patterndetectorprocessor/ -run
TestRecipe_CUDAOOM -count=1 -v` — PASS (3 tests, 8 sub-tests)
- [x] `go test ./processor/patterndetectorprocessor/ -count=1` — PASS
(no regressions)
- [x] `make build` — `_build/tracecore` compiles via OCB
- [x] `./_build/tracecore validate
--config=docs/integrations/examples/filelog-container.yaml` — exit 0
- [x] `make validator-recipe` — 9 validated, 3 skipped (non-linux host)
of 12 recipe(s)
- [x] `make doc-check` — PASS (new cross-link resolves)
- [x] `make ci-fast` — PASS (lint, vet, mod-verify,
attribute-namespace-check, doc-check)

```release-notes
**Pattern #10 (CUDA OOM, deceptive allocator)** — filelogreceiver + OTTL recipe lands. The `transform/cuda_oom` stanza in `docs/integrations/examples/filelog-container.yaml` projects PyTorch's `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` stderr line onto `cuda_oom.tried_alloc_bytes` (unit-normalized to bytes across KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index`, closing the load-bearing input gap left by the v0.3 detector ship (PR #338).
```

Closes #436.
Refs #338, #303, #337.

Signed-off-by: Tri Lam <tree@lumalabs.ai>


1 participant