refactor(patterns): full appendVerdict consolidation#521
Merged
Conversation
Lane M introduced the generic appendVerdict[V verdictAttrer] seam but only 4 of 9 active detector files routed through it. The 5 wrapped runners (cuda_oom, checkpointer_hang, dataloader_hang, nccl_bootstrap, silent_data_corruption) still called the lower-level appendVerdictRecord directly with hand-built verdictCommon literals and per-pattern stamp closures — a half-refactor that left the attribute-key vocabulary split across producer (patterns) and consumer (patterndetectorprocessor) packages. This change finishes the consolidation: - Adds Common() + PutAttrs(pcommon.Map) methods to the 5 remaining verdict types in module/pkg/patterns/ (CUDAOOMVerdict, CheckpointerHangVerdict, DataLoaderHangVerdict, NCCLBootstrapTimeoutVerdict, SilentDataCorruptionVerdict) so they satisfy the verdictAttrer interface alongside the 7 already wired. - Moves the per-pattern attribute-key constants into the canonical block in module/pkg/patterns/verdict.go next to the keys for the other 7 detectors — wire-format vocabulary now lives entirely with the producer. - Replaces the 5 inline appendXxxVerdict wrappers in the processor with appendVerdict[V] call-site uses. Inlines appendVerdictRecord into appendVerdict (it had a single caller). Drops the now-dead verdictCommon type alias, putStrIfSet helper, and orphaned verdictAttrConfidence / verdictAttrPCIeAERGPUID constants from the processor. - All 12 detector emit paths now go through one generic emitter; the closure plumbing is gone. Verdict wire-format is preserved bit-for-bit: PutAttrs methods stamp the same keys with the same values in the same order; concrete-type json.Marshal still honors per-verdict json tags; the diagnostic kind label flows from VerdictCommon.Kind set by each verdict's Common(). LoC delta: +266 / -369 (-103 net). Pre+post test pass set identical; the one pre-existing failure on the base branch (TestPatternDetector_NegativeFixturesEmitNoVerdicts/multi-rank-disk-pressure) is a pod_evicted detector regression unaffected by this refactor. Closes wave-end-audit next-wave item #6. Signed-off-by: Tri Lam <tri@maydow.com>
8 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 4, 2026
## Summary Closes the loop on PR #521 (Lane N). #521's body claimed "full appendVerdict consolidation" but left two typed wrappers in place — `emitPodEvicted` and `emitIBLinkFlap` — because the reviewer correctly flagged the gating + telemetry-label divergence as a behavior risk, and the original carve-out comment justified them with "Go generics cannot abstract over a struct field." That's an 85% consolidation, not 100%. This PR removes both wrappers and routes their callsites through the generic `emitAll[V verdictAttrer]` directly. ### Root cause of the wrapper residue The wrappers existed because `PodEvictedVerdict.Confidence` and `IBLinkFlapVerdict.Confidence` are **exported struct fields**, and adding a `Confidence()` method to the `verdictAttrer` interface would have shadowed the field name on the same type. The fix is one level up: `VerdictCommon` (the existing shared-boilerplate carrier returned by every verdict's `Common()` method) was the right place for Confidence all along. It's already the interface seam `emitAll` reads through — moving Confidence there lets the generic path gate partial verdicts and stamp the real confidence label on `IncVerdict` without touching field names on individual verdict types. ### What changes - **`module/pkg/patterns/verdict.go`** — add `Confidence Confidence` to `VerdictCommon`. - **7 verdict `Common()` methods** — populate `Confidence: v.Confidence` (PodEvicted, IBLinkFlap, CUDAOOM, Checkpointer, DataLoader, NCCLBootstrap, SilentDataCorruption). The other 5 detectors (NCCLHang, XidCorrelation, HBMECC, ThermalThrottle, PCIeAER) have no `Confidence` field, so their `Common()` returns the zero value (`""`) — byte-identical to today's `IncVerdict(patternID, "")` tick. - **`patterndetector.go`** — `emitAll` now reads `c := v.Common()`, gates `c.Confidence == ConfidencePartial && !emitPartial`, and ticks `IncVerdict(c.PatternID, string(c.Confidence))`. Both wrapper funcs deleted; both callsites call `emitAll` directly. ### Behavior-identical proof | Detector | Path before | Path after | Telemetry label before | Telemetry label after | |---|---|---|---|---| | PodEvicted | `emitPodEvicted` | `emitAll` | `string(v.Confidence)` | `string(c.Confidence)` — same | | IBLinkFlap | `emitIBLinkFlap` | `emitAll` | `string(v.Confidence)` | `string(c.Confidence)` — same | | NCCLHang / XidCorrelation / HBMECC / ThermalThrottle / PCIeAER | `emitAll` | `emitAll` | `""` | `""` (zero-value Confidence) — same | | CUDAOOM / Checkpointer / SDC / NCCLBootstrap / DataLoader | own `run*` helper (unchanged) | own `run*` helper (unchanged) | n/a | n/a | Partial-gating preserved: `TestPatternDetector_IBLinkFlapWiringPartialSuppressed`, `TestPatternDetector_PodEvictedWiringPartial*`, `TestPatternDetector_CUDAOOMWiringPartialVerdictWhenFBMissing` all pass. Pinned by `TestNoopFallback_IncVerdict_TableDriven` (15 sub-cases covering every pattern × confidence combination) — green. ### Success criteria ``` $ grep -nE '^func emit[A-Z]' module/processor/patterndetectorprocessor/patterndetector.go 285:func emitAll[V verdictAttrer](p *patterndetectorProcessor, ld plog.Logs, verdicts []V) { ``` Only `emitAll` remains. ## Test plan - [x] `cd module && go build ./...` — clean - [x] `cd module && go test ./...` — 12 packages green - [x] `make lint` — 0 issues - [x] `grep '^func emit[A-Z]' module/processor/patterndetectorprocessor/patterndetector.go` returns only `emitAll` - [x] `TestNoopFallback_IncVerdict_TableDriven` — 15/15 pass (pins per-pattern confidence labels) - [x] `TestPatternDetector_IBLinkFlapWiring*` (3 cases) — pass - [x] `TestPatternDetector_PodEvicted*` — pass - [x] `TestPatternDetector_CUDAOOMWiringPartialVerdictWhenFBMissing` — pass ```release-notes Internal refactor only. No wire-format, config, or operator-facing change. Generic `emitAll` is now the single emit path for all twelve registered detectors; partial-confidence gating + IncVerdict telemetry labels unchanged. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Finishes the
appendVerdict[V verdictAttrer]consolidation that Lane M (#508 / merged as #520) started. Before this change, only 7 of 12 verdict types satisfied theverdictAttrerinterface — the other 5 (cuda_oom, checkpointer_hang, dataloader_hang, nccl_bootstrap, silent_data_corruption) bypassed the generic emitter and called the lower-levelappendVerdictRecorddirectly with hand-builtverdictCommon{...}literals + per-pattern stamp closures.Root cause
The half-refactor wasn't an interface limitation — it was an unfinished migration. The 5 verdict types simply never got
Common()+PutAttrs()methods written for them, and the per-pattern attribute-key constants stayed in the consumer (patterndetectorprocessor) package instead of moving to the producer (patterns) package alongside the keys for the other 7 detectors. The result was three-tier indirection (runner →appendXxxVerdictwrapper →appendVerdictRecord) for half the detectors and one-tier (runner →appendVerdict) for the other half — inconsistent enough that adding a new pattern required a coin-flip on which template to follow.What changed
Common()+PutAttrs(pcommon.Map)methods to the 5 remaining verdict types inmodule/pkg/patterns/.module/pkg/patterns/verdict.gonext to the keys for the other 7 — wire-format vocabulary is now entirely owned by the producer.appendXxxVerdictwrappers withappendVerdict[V]call-site uses.appendVerdictRecordintoappendVerdict(single caller after the consolidation).verdictCommontype alias,putStrIfSethelper, and orphanedverdictAttrConfidence/verdictAttrPCIeAERGPUIDconstants from the processor.All 12 detector emit paths now route through one generic emitter.
Wire-format preservation
Verdict wire-format is preserved bit-for-bit:
PutAttrsmethods stamp the same keys with the same values in the same order; concrete-typejson.Marshalstill honors per-verdict json tags (becauseVis concrete at the generic call site); the diagnostic kind label flows fromVerdictCommon.Kindset by each verdict'sCommon().LoC delta
+266 / -369 (-103 net) across 12 files. Net deletion because 5 wrapper functions + their per-pattern key-constant blocks collapsed into method-based stamping with shared constants.
Closes-the-loop
Wave-end-audit next-wave item #6 (full
appendVerdictconsolidation).Test plan
cd module && go test ./...green (full module, including pattern-detector golden + verdict-attribute tests that pin the wire format).make lintgreen (0 issues; golangci-lint catches dead constants + unused helpers).go vet ./...clean.PutAttrsmethod stamps the same key/value/order as the previous inline closure.