Skip to content

feat(patterndetector): metrics-path consumer for hw.gpu.memory.* (#437)#461

Merged
trilamsr merged 1 commit into
mainfrom
feat/437-metrics-path-consumer
Jun 2, 2026
Merged

feat(patterndetector): metrics-path consumer for hw.gpu.memory.* (#437)#461
trilamsr merged 1 commit into
mainfrom
feat/437-metrics-path-consumer

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Root cause

Pre-PR, patterndetectorprocessor.NewFactory registered only processor.WithLogs. The CUDA OOM detector joins two inputs — a PyTorch OOM log line and a same-GPU hw.gpu.memory.{free,total} Counter — but the FB Counter arrives on the metrics pipeline by default. With no metrics surface, the only on-ramp for the FB layer was the metrics->logs OTTL recipe (#337); when an operator hadn't configured it, the detector saw no FB records and fell back to partial verdicts (kind=unknown). The fix is to grow the processor's signal surface: add WithMetrics, project FB data points directly off pmetric.NumberDataPoints, and ferry them across the logs/metrics graph boundary via a shared fbBuffer keyed on component.ID.

- **feat(patterndetector): metrics-path consumer for `hw.gpu.memory.*`** — the processor now implements `processors.metrics.Metrics` alongside `processors.logs.Logs`. The CUDA OOM detector (pattern #10) now joins log-shaped OOM events against metric-shaped `hw.gpu.memory.{free,total}` data points via a bounded cross-stream buffer, so operators get full-confidence verdicts without the metrics->logs OTTL recipe. ADR-0001 PR-B. (#437)

Test plan

  • go test -race -count=1 ./module/processor/patterndetectorprocessor/ green (all existing tests pass; 10 new test functions added covering: pmetric->FBMemoryRecord projection happy path on Gauge + Sum shapes; missing-gpu.id drop; empty-valued data point drop; non-FB metric ignored; missing-free-or-total emits half-record; multi-GPU split; metrics-path forward-unchanged; cross-stream join fragmentation verdict; bounded buffer eviction; time-bound stale eviction)
  • go test -race -count=1 ./module/pkg/patterns/ green
  • go build ./... green
  • Pre-commit gates (gofumpt, go vet, attribute-namespace-check, hit-line-format-stable) pass

Scope decision

Option (a) BOUNDED per task spec: metrics ingestion + projection + bounded cross-stream buffer + cross-stream join (covered by TestMetricsProcessor_CrossStreamJoinFragmentationVerdict).

Notes

  • Cross-stream sharing keyed on set.ID via a package-level fbBufferRegistry so a normal collector config (one processors.patterndetector block wired to both logs.processors and metrics.processors) picks up the same buffer for both signal paths.
  • The numberDataPointIntValue coercion accepts Double-typed datapoints (truncates) — guards against OTTL transform processors that emit floats; FB bytes are inherently integral.
  • Bounded scope (per hard rule): no full filelog + dcgm receiver integration test (operator-end-to-end flow); that ships as a follow-up if it becomes load-bearing.

Extend patterndetectorprocessor to implement processors.metrics.Metrics
alongside processors.logs.Logs (ADR-0001 PR-B). The metrics processor
projects `hw.gpu.memory.{free,total}` Gauge/Sum data points into
patterns.FBMemoryRecord values and buffers them in a bounded,
per-component.ID ring; the logs processor drains the buffer at
CUDA OOM-log time so the detector emits full-confidence verdicts
without requiring the metrics->logs OTTL recipe (#337).

Sibling-pattern style: the bridge is cuda_oom-specific by design;
no shared abstraction. Bounded buffer (4096 records cap; time-bound
eviction at correlation_window + 30s slack). Logs-path semantics
unchanged when no metrics processor is wired (nil-safe).

Closes #437.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Independent Adversarial Review: PR #461

B/A/A+ Criteria

B (Acceptable): Solves the stated problem (#437), tests pass (-race), architecture is sound, no data loss risk.
A (Good): Adds above + bounded buffer with dual eviction (capacity + time), nil-safe fallback, comprehensive projection logic.
A+ (Excellent): Would require e2e integration test (filelog + dcgm receiver) — deferred as acceptable out-of-scope per "hard rule".

Critical Concurrency Findings

  1. fbBuffer thread-safety: CLEAN

    • add/addAll/recordsWithin/snapshot all use sync.Mutex correctly
    • Eviction logic (drop oldest on overflow) is correct
    • Time-bound stale pruning at drain time is correct
    • Go -race tests pass 3x
  2. fbBufferRegistry global state: SAFE

    • Mutex-protected singleton map, keyed per component.ID
    • sharedFBBuffer() correctly double-checks under lock
    • Lifetime matches process lifetime (acceptable per design)
    • No shutdown or cleanup needed (collector dies = process dies)
  3. Cross-stream join ordering: RISKY (NOT RACE, but ordering assumption)

    • Test TestMetricsProcessor_CrossStreamJoinFragmentationVerdict feeds FB metric THEN OOM log
    • No test for reverse ordering: OOM log arrives BEFORE FB metric in buffer
    • If OOM log hits runCUDAOOMDetector before metrics processor populates buffer, verdict will be partial (kind=unknown)
    • Practical risk is LOW: DCGM scrape cadence (15s default) + filog ingestion (sub-second) = FB almost always in-flight or buffered when OOM log arrives
    • But edge case exists: if metrics scrape cadence >> OOM frequency, or if metrics pipeline backs up

Findings (Numbered)

  1. metrics.go:280-296 (projectFBMemoryFromMetrics) — Missing test case for Double-typed data points

    • Code at line 344-352 (numberDataPointIntValue) accepts Double and truncates: int64(dp.DoubleValue())
    • All 10 test cases use SetIntValue only; no coverage of SetDoubleValue path
    • Risk: moderate — truncation is intentional per doc, but untested edge case
    • Recommendation: Add TestProjectFBMemoryFromMetrics_DoubleValuesAreTruncated
  2. metrics_test.go:364 (TestMetricsProcessor_CrossStreamJoinFragmentationVerdict) — Ordering assumption not tested

    • Test establishes FB metric in buffer FIRST, OOM log SECOND (normal case)
    • No test for pathological ordering: OOM log ConsumeLogs called before FB ConsumeMetrics populates buffer
    • If OOM arrives before FB buffer is filled, verdict will lack FB evidence (kind=unknown instead of kind=fragmentation)
    • Practical impact: LOW (DCGM scrape cadence + log ingestion timing makes FB-first highly likely)
    • Design is nil-safe fallback (correct), but ordering edge case implicit, not documented or tested
  3. cuda_oom.go:210-215 (runCUDAOOMDetector buffer drain)

    • Uses latestOOMTimestamp as anchor for buffer.recordsWithin — correct
    • Appends buffer records to log-path FB records before det.Evaluate — correct
    • Nil-safe (buffer==nil is a no-op) — correct
    • No race: buffer.recordsWithin is mutex-guarded
  4. projectFBMemoryFromMetrics (metrics.go line 252-325) — GPU ID attribute precedence

    • Correctly checks datapoint.Attributes() FIRST, then resource.Attributes() SECOND (per RFC-0013)
    • Multi-GPU test (TestProjectFBMemoryFromMetrics_MultiGPUSplit) covers datapoint-level gpu.id
    • Empty-string gpu.id is correctly dropped (line 294 guard)
    • No issues
  5. Metrics processor forward-unchanged contract

    • Capabilities() returns MutatesData=false (line 220) — correct
    • ConsumeMetrics (line 226) calls p.next.ConsumeMetrics after buffer.addAll — correct
    • Test TestMetricsProcessor_ForwardsMetricsUnchanged verifies it
    • No issues
  6. Factory wire-in (patterndetector.go:48-54)

    • processor.WithMetrics(createMetrics, stability) registered at line 53 — correct
    • Stability = component.StabilityLevelAlpha (line 40) — appropriate
    • createMetrics factory (line 67-77) correctly retrieves sharedFBBuffer(set.ID)
    • Logs factory (line 79-106) also calls sharedFBBuffer(set.ID) on line 88
    • Both use same ID → same buffer ✓
    • No issues

Simplification Sweep

snapshot() function (metrics.go line 116-122):

  • Used only in tests (1 call at metrics_test.go:458)
  • Only 7 lines, correctly documented as "test only"
  • Acceptable; not bloat

recordsWithin vs separate add/Drain pattern:

  • Could be split (drain to separate method, filter in a helper)
  • Current implementation is cleaner: single lock-holding operation, atomic pruning
  • Acceptable; not over-engineered

Preamble docstring (lines 21-49):

  • 5-paragraph explanation of why file exists, cross-stream wiring, design philosophy
  • Load-bearing context for a subtle feature, not noise
  • Acceptable; appropriate for non-obvious design

fbBuffer struct with two constants:

  • fbBufferDefaultMaxRecords (line 59) and fbBufferStaleSlack (line 65) are well-named
  • Constants are used consistently throughout; not over-abstraction
  • Acceptable

Overall: Simplification sweep: CLEAN
No refactor targets, no dead code, no premature helpers.

Design Decisions

  1. No shared abstraction across patterns — correct per sibling-pattern design (line 44-49)
  2. Package-level registry keyed by component.ID — correct; simplest cross-stream bridge given OTel's separate consumer graphs
  3. Metrics processor projects, logs processor drains — asymmetric but correct; OOM is inherently log-shaped (PyTorch stderr)
  4. Bounded buffer + time-bound eviction — both necessary; capacity guards against metric bursts, time guards against stale samples

Test Coverage Assessment

✓ Gauge + Sum metric shapes (line 59-126)
✓ Missing gpu.id drop (line 132-153)
✓ Empty-valued data points drop (line 160-183)
✓ Non-FB metrics ignored (line 189-206)
✓ Missing free-or-total emits partial (line 216-266)
✓ Multi-GPU split (line 271-315)
✓ Metrics forwarded unchanged (line 323-349)
✓ Cross-stream join fragmentation verdict (line 364-439)
✓ Bounded buffer eviction (line 445-463)
✓ Time-bound stale eviction (line 470-484)

Double-typed data points (truncation path)
Reverse ordering: OOM before FB (edge case, low practical risk)

All existing CUDA OOM tests pass (-race, 7 tests); no regressions.

Verdict Points

Positive:

Concerns:

  • Missing test for Double data point truncation (minor)
  • Cross-stream join ordering assumption not tested (low practical risk, but implicit)
  • No e2e integration (out-of-scope per design, acceptable)

Risk assessment:

  • No data loss or corruption (buffer is ephemeral, falls back to partial verdicts)
  • No memory leak (bounded buffer + time-bound eviction)
  • No deadlock (single mutex, no nested locks, no waits)
  • Race condition: NONE DETECTED (Go -race passes)

VERDICT: A — Feature is solid, thread-safe, well-designed. Two untested edge cases (Double coercion, reverse ordering) are minor gaps that don't block; the design falls back safely. Missing e2e integration is accepted out-of-scope. Ready to merge once the two edge-case tests are added (or mark as acceptable deferred follow-up).

@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Independent adversarial review — VERDICT: A

Concurrency-safe by construction; cross-stream wiring correct; no critical findings.

Findings

  1. 🟡 metrics.go:344int64(dp.DoubleValue()) truncation has no test coverage. All 10 projectFBMemoryFromMetrics tests use SetIntValue; SetDoubleValue path untested. Recommend adding 1 test case (not blocking).
  2. 🟡 metrics_test.go:364 — Cross-stream join ordering: only tests FB-then-OOM. Reverse (OOM before FB buffered) untested. Practical risk LOW (DCGM 15s scrape + filelog sub-second = FB almost always buffered first), but implicit assumption — add t.Skip or follow-up test.

Verified clean

  • fbBuffer thread-safety: mutex guards all access; add eviction (drop-oldest) correct; recordsWithin time-bound correct; go test -race 3× clean.
  • fbBufferRegistry singleton: mutex-protected double-check under lock; metrics + logs factories share buffer via same component.ID.
  • runCUDAOOMDetector buffer drain: uses latestOOMTimestamp as anchor; nil-safe fallback; no race.
  • ✓ GPU ID precedence: datapoint-level first, resource-level fallback per RFC-0013 §3. Empty-string dropped correctly.
  • ✓ Factory wire-in: processor.WithMetrics(createMetrics, stability) registered correctly.

Simplification sweep: clean

No dead code, no premature helpers, 5-paragraph preamble is load-bearing context for non-obvious cross-stream wiring.

VERDICT: A — recommend MERGE. Risks 1+2 are post-merge follow-ups; not blockers.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 02:47
@trilamsr trilamsr merged commit 5028e00 into main Jun 2, 2026
12 checks passed
@trilamsr trilamsr deleted the feat/437-metrics-path-consumer branch June 2, 2026 02:47
trilamsr added a commit that referenced this pull request Jun 2, 2026
…460) (#466)

## Summary

Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally
whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*`
identifiers (its current state on `main` — `grep -c` = 0), silently
bypassing every gate below it. Fix scopes the skip to the Go-test parity
block only (if/else, not `exit`), then surfaces and fixes the dead refs
the gates were supposed to be catching.

## Root cause

Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test
name parity check — so `[ -z "$referenced" ] && exit 0` was correct
then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443,
#455, #459 (and others) appended gates **below** that line without
recognising they'd become dead code whenever `FAILURE-MODES.md` lost its
`Test*` references. PR #459 worked around the bug by placing its new
YAML gate *above* line 99 and tracked the root cause separately as #460.

## What surfaced

Once `exit 0` was removed, three real issues fired:

1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`.
The shard was never committed to `main`'s ancestry. Folded into the
existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose
block (sibling treatment to M9, M14, M16).
2. **Banned-phrase hits** (3x `production-grade`): reworded in
`docs/cut-criteria.yaml.md` (2x) and
`install/kubernetes/tracecore/README.md` (1x) to falsifiable language.
3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The
M6 cap of 5 was set for the quickstart only — `## Install via Helm` and
`## Air-gapped install` are alternate deployment paths that landed
post-M6 and aren't part of the quickstart budget. Rescoped the gate to
count blocks inside the `## Walkthrough` H2 section only (1 block, well
under cap).

## Gate count

Empirically verified via `grep -c '^doc-check: '` on `make doc-check`
output on a clean tree:

| State | Status lines emitted | Gates the early-exit was hiding |
|---|---|---|
| Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link,
parity-skip) | 14 |
| Post-fix this PR (post-rebase) | 17 | 0 |

The "14 gates hidden" number is invariant across the rebase: it counts
gates placed below the early-exit line. The "3 → 17" total reflects
post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure
originally in this PR body), and #459 itself worked around the bug by
placing its YAML gate above line 99.

## Mutation tests

Each gate below the original early-exit was confirmed to fire post-fix:

| Mutation | Gate expected to fire | Exit code post-mutation | Exit code
post-restore |
|---|---|---|---|
| Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` |
markdown link-rot | 1 | 0 |
| Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` |
banned-phrase lint | 1 | 0 |
| Delete `<!-- tested-against: ... -->` from
`docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 |

## Test plan

- [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto
origin/main; 17 status lines)
- [x] 3 mutation tests above each toggle exit 1 → 0 across mutate /
restore
- [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`,
`go mod verify`, `attribute-namespace-check` (100 attrs, all
documented), `register-lint`, `actionlint`, `zizmor`,
`deprecation-check`, `no-autoupdate-check`
- [x] Rebased onto current `origin/main` (includes #459, #461, #462,
#456); no conflicts; gate count re-verified empirically post-rebase
- [x] No changes to gates above line 99 (the trust-posture callout +
YAML cross-link gate from #459 still run and emit unchanged status
lines)

## Self-grade

**A+** — root cause named in commit body (a57883f #13 with one gate;
gates appended below without exit-path awareness); 3 mutation tests
(success criteria required 1–2); rescoped the getting-started gate to
match M6 intent rather than papering over the surfaced overflow; the `[
-z "$referenced" ]` legitimate skip is preserved via if/else (not `:`
no-op, which would have left the `defined=` / `orphans=` block running
on empty input); gate count corrected empirically post-rebase per
reviewer B feedback.

```release-notes
- fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…eam (#467)

## Summary

Closes #463. Adds two test cases flagged by the reviewer of #461 (#437)
as reachable-but-uncovered branches in the metrics-path consumer.

### Root cause

PR #461 shipped the metrics-path consumer + cross-stream join for
cuda_oom with 10 projection tests. Two branches were reachable but
unexercised:

1. `projectFBMemoryFromMetrics` uses `int64(dp.DoubleValue())` at
`metrics.go:344` — the `SetDoubleValue` truncation path. All 10 existing
tests used `SetIntValue`. A regression that swapped truncation for
rounding (or returned 0) would have shipped silently.
2. `TestMetricsProcessor_CrossStreamJoinFragmentationVerdict` fed
FB-then-OOM only. The reverse order (OOM arrives before the next dcgm
scrape) was an implicit assumption — the detector's empty-buffer branch
+ the buffer-survives-across-calls invariant were both uncovered for the
metrics-path wiring.

### Changes

Two new tests in
`module/processor/patterndetectorprocessor/metrics_test.go`:

- **`TestProjectFBMemoryFromMetrics_DoubleValueTruncatesToInt64`** —
feeds a `SetDoubleValue(42.7)` free + `SetDoubleValue(80 GiB)` total;
asserts `FreeBytes == 42` (truncate, not round to 43; not zero).
- **`TestMetricsProcessor_CrossStreamReverseOrderPartialThenJoin`** —
OOM log first (buffer empty) → asserts partial verdict `kind=unknown`,
`Confidence=Partial`. Then FB metric arrives, retry OOM log → asserts
full verdict `kind=fragmentation` (locks the
buffer-survives-across-calls invariant).

Test-only; no production-code changes.

### Mutation results

Each test verified to fail when its target code is broken (then
restored):

| Mutation | Test caught it? |
| --- | --- |
| `numberDataPointIntValue` Double branch -> `return 0` | yes
(`expected: 42, actual: 0`) |
| `numberDataPointIntValue` Double branch -> rounding `(x + 0.5)` | yes
(`expected: 42, actual: 43`) |
| `runCUDAOOMDetector` buffer-drain skipped (`if false && buffer !=
nil`) | yes (retry OOM falls back to `unknown` instead of
`fragmentation`) |
| Drop all partial verdicts in `runCUDAOOMDetector` | yes (first OOM
emits 0 verdicts instead of 1) |

### A+ exhaustiveness audit (non-blocking notes)

Other projection branches checked:
- Gauge / Sum / non-numeric metric types — covered
(`HappyPathGaugeAndSum`, `DropsNonNumericDataPoints`,
`IgnoresNonFBMetrics`).
- Missing `gpu.id` (resource + DP) — covered
(`DropsRecordsMissingGPUID`).
- Empty-value DPs — covered (`DropsNonNumericDataPoints`).
- DP-level `gpu.id` overriding resource — covered (`MultiGPUSplit`).
- Histogram/Summary/ExponentialHistogram metric types named
`hw.gpu.memory.*` (the `default` branch in `numberDataPoints`) —
technically untested but the customer-stable scrape format pins these to
Gauge/Sum; not load-bearing enough to file.

```release-notes
Tests-only: cover `SetDoubleValue` truncation + reverse-order cross-stream join in the cuda_oom metrics-path consumer (closes #463).
```

## Test plan

- [x] `go test -race -count=1 ./processor/patterndetectorprocessor/...`
— green
- [x] Each new test fails when its target is mutated; passes when
restored (table above)
- [x] Pre-commit hooks: golangci-lint, go vet, go mod verify,
attribute-namespace-check — green

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Closes #380. Comment-only sweep updating 20 stale "future PR-B" /
RFC-0014 references after PR #461 (#437) shipped the
`processor.WithMetrics` extension for `patterndetectorprocessor`
(cuda_oom #10 metrics-path consumer with bounded cross-stream buffer
keyed on `component.ID`).

## Root cause

12 sites flagged in #380 + 8 sibling sites the audit missed were
factually stale: the bridge IS shipped (via #461), just not for every
pattern yet. Sweep updates comments to past tense referencing the
durable concept (ADR-0001 PR-B) and issue tracker (#437) instead of the
squash-merge PR number — per STYLE.md (PR refs in source comments rot in
long-lived files; doc-check.sh enforces this).

## Sites updated (20 total)

- `module/pkg/patterns/{cuda_oom,pcie_aer,thermal_throttle}.go`
-
`module/processor/patterndetectorprocessor/{patterndetector,pcie_aer,thermal_throttle,pcie_aer_test,thermal_throttle_test}.go`
- `docs/integrations/examples/prometheus-scrape.yaml`
- 2 markdown sites (`docs/patterns/pattern-4-thermal-throttle.md`,
`docs/integrations/prometheus-scrape.md`)
- RFC-0014 status prefix swap to parenthetical qualifier
(rfc-status-check.sh compliance)

## Test plan

- [x] `make doc-check` exit 0 (post comment-style fixes)
- [x] `make rfc-status-check` exit 0 (post Status-line qualifier fix)
- [x] `go test -race -count=1 ./module/...` green (comment-only sweep;
no source semantics)
- [x] `go vet ./module/...` clean

## Notes

cuda_oom (#10) sites rewritten to past tense ("shipped via #437"). Other
pattern sites (pcie_aer, thermal_throttle) note that the bridge exists
but their patterns don't yet have metrics-path consumers — preserved as
concrete forward-references with the durable issue tracker.

```release-notes
NONE
```

Refs #380 #437 #461.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant