Skip to content

refactor(patterns): full appendVerdict consolidation#521

Merged
trilamsr merged 1 commit into
mainfrom
refactor/appendverdict-full
Jun 4, 2026
Merged

refactor(patterns): full appendVerdict consolidation#521
trilamsr merged 1 commit into
mainfrom
refactor/appendverdict-full

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Finishes the appendVerdict[V verdictAttrer] consolidation that Lane M (#508 / merged as #520) started. Before this change, only 7 of 12 verdict types satisfied the verdictAttrer interface — the other 5 (cuda_oom, checkpointer_hang, dataloader_hang, nccl_bootstrap, silent_data_corruption) bypassed the generic emitter and called the lower-level appendVerdictRecord directly with hand-built verdictCommon{...} literals + per-pattern stamp closures.

Root cause

The half-refactor wasn't an interface limitation — it was an unfinished migration. The 5 verdict types simply never got Common() + PutAttrs() methods written for them, and the per-pattern attribute-key constants stayed in the consumer (patterndetectorprocessor) package instead of moving to the producer (patterns) package alongside the keys for the other 7 detectors. The result was three-tier indirection (runner → appendXxxVerdict wrapper → appendVerdictRecord) for half the detectors and one-tier (runner → appendVerdict) for the other half — inconsistent enough that adding a new pattern required a coin-flip on which template to follow.

What changed

  • Adds Common() + PutAttrs(pcommon.Map) methods to the 5 remaining verdict types in module/pkg/patterns/.
  • Moves the per-pattern attribute-key constants into the canonical block in module/pkg/patterns/verdict.go next to the keys for the other 7 — wire-format vocabulary is now entirely owned by the producer.
  • Replaces the 5 inline appendXxxVerdict wrappers with appendVerdict[V] call-site uses.
  • Inlines appendVerdictRecord into appendVerdict (single caller after the consolidation).
  • Drops the now-dead verdictCommon type alias, putStrIfSet helper, and orphaned verdictAttrConfidence / verdictAttrPCIeAERGPUID constants from the processor.

All 12 detector emit paths now route through one generic emitter.

Wire-format preservation

Verdict wire-format is preserved bit-for-bit: PutAttrs methods stamp the same keys with the same values in the same order; concrete-type json.Marshal still honors per-verdict json tags (because V is concrete at the generic call site); the diagnostic kind label flows from VerdictCommon.Kind set by each verdict's Common().

LoC delta

+266 / -369 (-103 net) across 12 files. Net deletion because 5 wrapper functions + their per-pattern key-constant blocks collapsed into method-based stamping with shared constants.

Closes-the-loop

Wave-end-audit next-wave item #6 (full appendVerdict consolidation).

Test plan

  • cd module && go test ./... green (full module, including pattern-detector golden + verdict-attribute tests that pin the wire format).
  • make lint green (0 issues; golangci-lint catches dead constants + unused helpers).
  • go vet ./... clean.
  • Manual diff sweep confirms wire-format equivalence: each PutAttrs method stamps the same key/value/order as the previous inline closure.
NONE

Lane M introduced the generic appendVerdict[V verdictAttrer] seam but
only 4 of 9 active detector files routed through it. The 5 wrapped
runners (cuda_oom, checkpointer_hang, dataloader_hang, nccl_bootstrap,
silent_data_corruption) still called the lower-level appendVerdictRecord
directly with hand-built verdictCommon literals and per-pattern stamp
closures — a half-refactor that left the attribute-key vocabulary split
across producer (patterns) and consumer (patterndetectorprocessor)
packages.

This change finishes the consolidation:

- Adds Common() + PutAttrs(pcommon.Map) methods to the 5 remaining
  verdict types in module/pkg/patterns/ (CUDAOOMVerdict,
  CheckpointerHangVerdict, DataLoaderHangVerdict,
  NCCLBootstrapTimeoutVerdict, SilentDataCorruptionVerdict) so they
  satisfy the verdictAttrer interface alongside the 7 already wired.
- Moves the per-pattern attribute-key constants into the canonical
  block in module/pkg/patterns/verdict.go next to the keys for the
  other 7 detectors — wire-format vocabulary now lives entirely with
  the producer.
- Replaces the 5 inline appendXxxVerdict wrappers in the processor
  with appendVerdict[V] call-site uses. Inlines appendVerdictRecord
  into appendVerdict (it had a single caller). Drops the now-dead
  verdictCommon type alias, putStrIfSet helper, and orphaned
  verdictAttrConfidence / verdictAttrPCIeAERGPUID constants from the
  processor.
- All 12 detector emit paths now go through one generic emitter; the
  closure plumbing is gone.

Verdict wire-format is preserved bit-for-bit: PutAttrs methods stamp
the same keys with the same values in the same order; concrete-type
json.Marshal still honors per-verdict json tags; the diagnostic kind
label flows from VerdictCommon.Kind set by each verdict's Common().

LoC delta: +266 / -369 (-103 net). Pre+post test pass set identical;
the one pre-existing failure on the base branch
(TestPatternDetector_NegativeFixturesEmitNoVerdicts/multi-rank-disk-pressure)
is a pod_evicted detector regression unaffected by this refactor.

Closes wave-end-audit next-wave item #6.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 4, 2026 01:29
@trilamsr trilamsr merged commit df87e75 into main Jun 4, 2026
27 checks passed
@trilamsr trilamsr deleted the refactor/appendverdict-full branch June 4, 2026 01:37
trilamsr added a commit that referenced this pull request Jun 4, 2026
## Summary

Closes the loop on PR #521 (Lane N). #521's body claimed "full
appendVerdict consolidation" but left two typed wrappers in place —
`emitPodEvicted` and `emitIBLinkFlap` — because the reviewer correctly
flagged the gating + telemetry-label divergence as a behavior risk, and
the original carve-out comment justified them with "Go generics cannot
abstract over a struct field." That's an 85% consolidation, not 100%.

This PR removes both wrappers and routes their callsites through the
generic `emitAll[V verdictAttrer]` directly.

### Root cause of the wrapper residue

The wrappers existed because `PodEvictedVerdict.Confidence` and
`IBLinkFlapVerdict.Confidence` are **exported struct fields**, and
adding a `Confidence()` method to the `verdictAttrer` interface would
have shadowed the field name on the same type. The fix is one level up:
`VerdictCommon` (the existing shared-boilerplate carrier returned by
every verdict's `Common()` method) was the right place for Confidence
all along. It's already the interface seam `emitAll` reads through —
moving Confidence there lets the generic path gate partial verdicts and
stamp the real confidence label on `IncVerdict` without touching field
names on individual verdict types.

### What changes

- **`module/pkg/patterns/verdict.go`** — add `Confidence Confidence` to
`VerdictCommon`.
- **7 verdict `Common()` methods** — populate `Confidence: v.Confidence`
(PodEvicted, IBLinkFlap, CUDAOOM, Checkpointer, DataLoader,
NCCLBootstrap, SilentDataCorruption). The other 5 detectors (NCCLHang,
XidCorrelation, HBMECC, ThermalThrottle, PCIeAER) have no `Confidence`
field, so their `Common()` returns the zero value (`""`) —
byte-identical to today's `IncVerdict(patternID, "")` tick.
- **`patterndetector.go`** — `emitAll` now reads `c := v.Common()`,
gates `c.Confidence == ConfidencePartial && !emitPartial`, and ticks
`IncVerdict(c.PatternID, string(c.Confidence))`. Both wrapper funcs
deleted; both callsites call `emitAll` directly.

### Behavior-identical proof

| Detector | Path before | Path after | Telemetry label before |
Telemetry label after |
|---|---|---|---|---|
| PodEvicted | `emitPodEvicted` | `emitAll` | `string(v.Confidence)` |
`string(c.Confidence)` — same |
| IBLinkFlap | `emitIBLinkFlap` | `emitAll` | `string(v.Confidence)` |
`string(c.Confidence)` — same |
| NCCLHang / XidCorrelation / HBMECC / ThermalThrottle / PCIeAER |
`emitAll` | `emitAll` | `""` | `""` (zero-value Confidence) — same |
| CUDAOOM / Checkpointer / SDC / NCCLBootstrap / DataLoader | own `run*`
helper (unchanged) | own `run*` helper (unchanged) | n/a | n/a |

Partial-gating preserved:
`TestPatternDetector_IBLinkFlapWiringPartialSuppressed`,
`TestPatternDetector_PodEvictedWiringPartial*`,
`TestPatternDetector_CUDAOOMWiringPartialVerdictWhenFBMissing` all pass.

Pinned by `TestNoopFallback_IncVerdict_TableDriven` (15 sub-cases
covering every pattern × confidence combination) — green.

### Success criteria

```
$ grep -nE '^func emit[A-Z]' module/processor/patterndetectorprocessor/patterndetector.go
285:func emitAll[V verdictAttrer](p *patterndetectorProcessor, ld plog.Logs, verdicts []V) {
```

Only `emitAll` remains.

## Test plan

- [x] `cd module && go build ./...` — clean
- [x] `cd module && go test ./...` — 12 packages green
- [x] `make lint` — 0 issues
- [x] `grep '^func emit[A-Z]'
module/processor/patterndetectorprocessor/patterndetector.go` returns
only `emitAll`
- [x] `TestNoopFallback_IncVerdict_TableDriven` — 15/15 pass (pins
per-pattern confidence labels)
- [x] `TestPatternDetector_IBLinkFlapWiring*` (3 cases) — pass
- [x] `TestPatternDetector_PodEvicted*` — pass
- [x] `TestPatternDetector_CUDAOOMWiringPartialVerdictWhenFBMissing` —
pass

```release-notes
Internal refactor only. No wire-format, config, or operator-facing change.
Generic `emitAll` is now the single emit path for all twelve registered
detectors; partial-confidence gating + IncVerdict telemetry labels
unchanged.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant