feat(patterndetector): metrics-path consumer for hw.gpu.memory.* (#437)#461
Conversation
Extend patterndetectorprocessor to implement processors.metrics.Metrics
alongside processors.logs.Logs (ADR-0001 PR-B). The metrics processor
projects `hw.gpu.memory.{free,total}` Gauge/Sum data points into
patterns.FBMemoryRecord values and buffers them in a bounded,
per-component.ID ring; the logs processor drains the buffer at
CUDA OOM-log time so the detector emits full-confidence verdicts
without requiring the metrics->logs OTTL recipe (#337).
Sibling-pattern style: the bridge is cuda_oom-specific by design;
no shared abstraction. Bounded buffer (4096 records cap; time-bound
eviction at correlation_window + 30s slack). Logs-path semantics
unchanged when no metrics processor is wired (nil-safe).
Closes #437.
Signed-off-by: Tri Lam <tree@lumalabs.ai>
Independent Adversarial Review: PR #461B/A/A+ CriteriaB (Acceptable): Solves the stated problem (#437), tests pass (-race), architecture is sound, no data loss risk. Critical Concurrency Findings
Findings (Numbered)
Simplification Sweepsnapshot() function (metrics.go line 116-122):
recordsWithin vs separate add/Drain pattern:
Preamble docstring (lines 21-49):
fbBuffer struct with two constants:
Overall: Simplification sweep: CLEAN Design Decisions
Test Coverage Assessment✓ Gauge + Sum metric shapes (line 59-126) ✗ Double-typed data points (truncation path) All existing CUDA OOM tests pass (-race, 7 tests); no regressions. Verdict PointsPositive:
Concerns:
Risk assessment:
VERDICT: A — Feature is solid, thread-safe, well-designed. Two untested edge cases (Double coercion, reverse ordering) are minor gaps that don't block; the design falls back safely. Missing e2e integration is accepted out-of-scope. Ready to merge once the two edge-case tests are added (or mark as acceptable deferred follow-up). |
Independent adversarial review — VERDICT: AConcurrency-safe by construction; cross-stream wiring correct; no critical findings. Findings
Verified clean
Simplification sweep: cleanNo dead code, no premature helpers, 5-paragraph preamble is load-bearing context for non-obvious cross-stream wiring. VERDICT: A — recommend MERGE. Risks 1+2 are post-merge follow-ups; not blockers. |
…460) (#466) ## Summary Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*` identifiers (its current state on `main` — `grep -c` = 0), silently bypassing every gate below it. Fix scopes the skip to the Go-test parity block only (if/else, not `exit`), then surfaces and fixes the dead refs the gates were supposed to be catching. ## Root cause Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test name parity check — so `[ -z "$referenced" ] && exit 0` was correct then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443, #455, #459 (and others) appended gates **below** that line without recognising they'd become dead code whenever `FAILURE-MODES.md` lost its `Test*` references. PR #459 worked around the bug by placing its new YAML gate *above* line 99 and tracked the root cause separately as #460. ## What surfaced Once `exit 0` was removed, three real issues fired: 1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`. The shard was never committed to `main`'s ancestry. Folded into the existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose block (sibling treatment to M9, M14, M16). 2. **Banned-phrase hits** (3x `production-grade`): reworded in `docs/cut-criteria.yaml.md` (2x) and `install/kubernetes/tracecore/README.md` (1x) to falsifiable language. 3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The M6 cap of 5 was set for the quickstart only — `## Install via Helm` and `## Air-gapped install` are alternate deployment paths that landed post-M6 and aren't part of the quickstart budget. Rescoped the gate to count blocks inside the `## Walkthrough` H2 section only (1 block, well under cap). ## Gate count Empirically verified via `grep -c '^doc-check: '` on `make doc-check` output on a clean tree: | State | Status lines emitted | Gates the early-exit was hiding | |---|---|---| | Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link, parity-skip) | 14 | | Post-fix this PR (post-rebase) | 17 | 0 | The "14 gates hidden" number is invariant across the rebase: it counts gates placed below the early-exit line. The "3 → 17" total reflects post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure originally in this PR body), and #459 itself worked around the bug by placing its YAML gate above line 99. ## Mutation tests Each gate below the original early-exit was confirmed to fire post-fix: | Mutation | Gate expected to fire | Exit code post-mutation | Exit code post-restore | |---|---|---|---| | Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` | markdown link-rot | 1 | 0 | | Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` | banned-phrase lint | 1 | 0 | | Delete `<!-- tested-against: ... -->` from `docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 | ## Test plan - [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto origin/main; 17 status lines) - [x] 3 mutation tests above each toggle exit 1 → 0 across mutate / restore - [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`, `go mod verify`, `attribute-namespace-check` (100 attrs, all documented), `register-lint`, `actionlint`, `zizmor`, `deprecation-check`, `no-autoupdate-check` - [x] Rebased onto current `origin/main` (includes #459, #461, #462, #456); no conflicts; gate count re-verified empirically post-rebase - [x] No changes to gates above line 99 (the trust-posture callout + YAML cross-link gate from #459 still run and emit unchanged status lines) ## Self-grade **A+** — root cause named in commit body (a57883f #13 with one gate; gates appended below without exit-path awareness); 3 mutation tests (success criteria required 1–2); rescoped the getting-started gate to match M6 intent rather than papering over the surfaced overflow; the `[ -z "$referenced" ]` legitimate skip is preserved via if/else (not `:` no-op, which would have left the `defined=` / `orphans=` block running on empty input); gate count corrected empirically post-rebase per reviewer B feedback. ```release-notes - fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
…eam (#467) ## Summary Closes #463. Adds two test cases flagged by the reviewer of #461 (#437) as reachable-but-uncovered branches in the metrics-path consumer. ### Root cause PR #461 shipped the metrics-path consumer + cross-stream join for cuda_oom with 10 projection tests. Two branches were reachable but unexercised: 1. `projectFBMemoryFromMetrics` uses `int64(dp.DoubleValue())` at `metrics.go:344` — the `SetDoubleValue` truncation path. All 10 existing tests used `SetIntValue`. A regression that swapped truncation for rounding (or returned 0) would have shipped silently. 2. `TestMetricsProcessor_CrossStreamJoinFragmentationVerdict` fed FB-then-OOM only. The reverse order (OOM arrives before the next dcgm scrape) was an implicit assumption — the detector's empty-buffer branch + the buffer-survives-across-calls invariant were both uncovered for the metrics-path wiring. ### Changes Two new tests in `module/processor/patterndetectorprocessor/metrics_test.go`: - **`TestProjectFBMemoryFromMetrics_DoubleValueTruncatesToInt64`** — feeds a `SetDoubleValue(42.7)` free + `SetDoubleValue(80 GiB)` total; asserts `FreeBytes == 42` (truncate, not round to 43; not zero). - **`TestMetricsProcessor_CrossStreamReverseOrderPartialThenJoin`** — OOM log first (buffer empty) → asserts partial verdict `kind=unknown`, `Confidence=Partial`. Then FB metric arrives, retry OOM log → asserts full verdict `kind=fragmentation` (locks the buffer-survives-across-calls invariant). Test-only; no production-code changes. ### Mutation results Each test verified to fail when its target code is broken (then restored): | Mutation | Test caught it? | | --- | --- | | `numberDataPointIntValue` Double branch -> `return 0` | yes (`expected: 42, actual: 0`) | | `numberDataPointIntValue` Double branch -> rounding `(x + 0.5)` | yes (`expected: 42, actual: 43`) | | `runCUDAOOMDetector` buffer-drain skipped (`if false && buffer != nil`) | yes (retry OOM falls back to `unknown` instead of `fragmentation`) | | Drop all partial verdicts in `runCUDAOOMDetector` | yes (first OOM emits 0 verdicts instead of 1) | ### A+ exhaustiveness audit (non-blocking notes) Other projection branches checked: - Gauge / Sum / non-numeric metric types — covered (`HappyPathGaugeAndSum`, `DropsNonNumericDataPoints`, `IgnoresNonFBMetrics`). - Missing `gpu.id` (resource + DP) — covered (`DropsRecordsMissingGPUID`). - Empty-value DPs — covered (`DropsNonNumericDataPoints`). - DP-level `gpu.id` overriding resource — covered (`MultiGPUSplit`). - Histogram/Summary/ExponentialHistogram metric types named `hw.gpu.memory.*` (the `default` branch in `numberDataPoints`) — technically untested but the customer-stable scrape format pins these to Gauge/Sum; not load-bearing enough to file. ```release-notes Tests-only: cover `SetDoubleValue` truncation + reverse-order cross-stream join in the cuda_oom metrics-path consumer (closes #463). ``` ## Test plan - [x] `go test -race -count=1 ./processor/patterndetectorprocessor/...` — green - [x] Each new test fails when its target is mutated; passes when restored (table above) - [x] Pre-commit hooks: golangci-lint, go vet, go mod verify, attribute-namespace-check — green Signed-off-by: Tri Lam <tree@lumalabs.ai>
## Summary Closes #380. Comment-only sweep updating 20 stale "future PR-B" / RFC-0014 references after PR #461 (#437) shipped the `processor.WithMetrics` extension for `patterndetectorprocessor` (cuda_oom #10 metrics-path consumer with bounded cross-stream buffer keyed on `component.ID`). ## Root cause 12 sites flagged in #380 + 8 sibling sites the audit missed were factually stale: the bridge IS shipped (via #461), just not for every pattern yet. Sweep updates comments to past tense referencing the durable concept (ADR-0001 PR-B) and issue tracker (#437) instead of the squash-merge PR number — per STYLE.md (PR refs in source comments rot in long-lived files; doc-check.sh enforces this). ## Sites updated (20 total) - `module/pkg/patterns/{cuda_oom,pcie_aer,thermal_throttle}.go` - `module/processor/patterndetectorprocessor/{patterndetector,pcie_aer,thermal_throttle,pcie_aer_test,thermal_throttle_test}.go` - `docs/integrations/examples/prometheus-scrape.yaml` - 2 markdown sites (`docs/patterns/pattern-4-thermal-throttle.md`, `docs/integrations/prometheus-scrape.md`) - RFC-0014 status prefix swap to parenthetical qualifier (rfc-status-check.sh compliance) ## Test plan - [x] `make doc-check` exit 0 (post comment-style fixes) - [x] `make rfc-status-check` exit 0 (post Status-line qualifier fix) - [x] `go test -race -count=1 ./module/...` green (comment-only sweep; no source semantics) - [x] `go vet ./module/...` clean ## Notes cuda_oom (#10) sites rewritten to past tense ("shipped via #437"). Other pattern sites (pcie_aer, thermal_throttle) note that the bridge exists but their patterns don't yet have metrics-path consumers — preserved as concrete forward-references with the durable issue tracker. ```release-notes NONE ``` Refs #380 #437 #461. --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
Summary
patterndetectorprocessorto additionally implementprocessors.metrics.Metrics(ADR-0001 PR-B), closing feat(patterndetector): metrics-path consumer for hw.gpu.memory.* (ADR-0001 PR-B; CUDA OOM detector coverage) #437.hw.gpu.memory.{free,total}Gauge/Sum data points intopatterns.FBMemoryRecordvalues, buffers them in a bounded per-component.IDring, and the logs-path consumer drains the buffer at CUDA OOM-log time.correlation_window + 30sslack).Root cause
Pre-PR,
patterndetectorprocessor.NewFactoryregistered onlyprocessor.WithLogs. The CUDA OOM detector joins two inputs — a PyTorch OOM log line and a same-GPUhw.gpu.memory.{free,total}Counter — but the FB Counter arrives on the metrics pipeline by default. With no metrics surface, the only on-ramp for the FB layer was the metrics->logs OTTL recipe (#337); when an operator hadn't configured it, the detector saw no FB records and fell back to partial verdicts (kind=unknown). The fix is to grow the processor's signal surface: addWithMetrics, project FB data points directly offpmetric.NumberDataPoints, and ferry them across the logs/metrics graph boundary via a sharedfbBufferkeyed oncomponent.ID.Test plan
go test -race -count=1 ./module/processor/patterndetectorprocessor/green (all existing tests pass; 10 new test functions added covering: pmetric->FBMemoryRecord projection happy path on Gauge + Sum shapes; missing-gpu.id drop; empty-valued data point drop; non-FB metric ignored; missing-free-or-total emits half-record; multi-GPU split; metrics-path forward-unchanged; cross-stream join fragmentation verdict; bounded buffer eviction; time-bound stale eviction)go test -race -count=1 ./module/pkg/patterns/greengo build ./...greenScope decision
Option (a) BOUNDED per task spec: metrics ingestion + projection + bounded cross-stream buffer + cross-stream join (covered by
TestMetricsProcessor_CrossStreamJoinFragmentationVerdict).Notes
set.IDvia a package-levelfbBufferRegistryso a normal collector config (oneprocessors.patterndetectorblock wired to bothlogs.processorsandmetrics.processors) picks up the same buffer for both signal paths.numberDataPointIntValuecoercion accepts Double-typed datapoints (truncates) — guards against OTTL transform processors that emit floats; FB bytes are inherently integral.