refactor(patterndetector): hoist k8s scope + relocate projectors (#375-#378) #387
Merged
added 5 commits
June 1, 2026 14:14
…#378) Adds K8sScope struct + k8sPodScope() resolver that prefers resource attributes over per-record attributes for the k8s.pod/ns/node trio. Adds recordTimestamp() that mirrors the in-tree timestamp fallback ladder (lr.Timestamp() else lr.ObservedTimestamp()). Table-driven tests pin: resource-only, record-only, resource-wins, neither, mixed combinations for K8sScope; ts/observed presence matrix for recordTimestamp including the preserved Unix-epoch fallback when both are unset. Helpers added but not yet called; the next commit dedups the 57 k8s attribute sites + 14 timestamp sites onto these helpers.
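The timestamp fallback ladder described in this commit can be sketched roughly as follows. This is a minimal illustration, not the in-tree code: `logRecord` is a hypothetical stand-in for the collector's log record type, and the sketch assumes that an unset timestamp is represented as zero nanoseconds.

```go
package main

import (
	"fmt"
	"time"
)

// logRecord is a hypothetical stand-in for the real log record type.
// Both fields are nanoseconds since the Unix epoch; zero is assumed to mean "unset".
type logRecord struct {
	Timestamp         uint64
	ObservedTimestamp uint64
}

// recordTimestamp prefers the event timestamp and falls back to the observed
// timestamp. When both are unset, the zero value is converted as-is, which
// yields the Unix epoch (the fallback the tests are said to pin).
func recordTimestamp(lr logRecord) time.Time {
	ts := lr.Timestamp
	if ts == 0 {
		ts = lr.ObservedTimestamp
	}
	return time.Unix(0, int64(ts)).UTC()
}

func main() {
	// A small presence matrix in the spirit of the table-driven tests.
	cases := []struct {
		name string
		lr   logRecord
	}{
		{"timestamp set", logRecord{Timestamp: 2_000_000_000, ObservedTimestamp: 5_000_000_000}},
		{"observed only", logRecord{ObservedTimestamp: 5_000_000_000}},
		{"neither set", logRecord{}},
	}
	for _, c := range cases {
		fmt.Printf("%-14s -> %s\n", c.name, recordTimestamp(c.lr).Format(time.RFC3339))
	}
}
```

Keeping the epoch fallback, rather than substituting the current time, is what lets the refactor claim zero behavior change for records that carry neither timestamp.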
…378)
Routes all 14 projector functions through k8sPodScope() and recordTimestamp(). Replaces ~57 sites of the 'resAttrs.Get(k) else attrs.Get(k)' fallback ladder for k8s pod/ns/node attributes + ~14 sites of the 'lr.Timestamp() else lr.ObservedTimestamp()' timestamp fallback ladder. Net -183 LOC across the package with zero behavior change: every existing test stays green.
Two intentional non-uses of the helper are documented inline:
- projectNodeCondition prefers per-record k8s.node.name (the kubelet-emitted Event carries the canonical identity on the record).
- projectCNINetworkEventRecord namespace + pod use Event-specific fallback chains (k8s.regarding.namespace + k8s.regarding.name) that don't match the helper's keys.
Helper internals: K8sScope{PodName, PodNamespace, Node} returned by k8sPodScope(); resource-preferred; missing keys return empty string (callers gate explicitly where the projection requires non-empty).
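The resource-preferred lookup, and the record-preferred exception for node conditions, might look roughly like the sketch below. It makes several assumptions: `attrs` is a plain map standing in for the collector's attribute maps, `resourceFirst` and `recordFirstNode` are hypothetical names, and the pod and namespace keys are assumed to follow the OpenTelemetry semantic conventions (only k8s.node.name is spelled out in the commit message).

```go
package main

import "fmt"

// attrs is a hypothetical stand-in for an attribute map.
type attrs map[string]string

// Attribute keys. The pod and namespace keys are assumptions based on
// OpenTelemetry semantic conventions.
const (
	keyPodName   = "k8s.pod.name"
	keyNamespace = "k8s.namespace.name"
	keyNodeName  = "k8s.node.name"
)

// K8sScope carries the pod/namespace/node trio resolved for one record.
type K8sScope struct {
	PodName      string
	PodNamespace string
	Node         string
}

// resourceFirst returns the resource-level value when the key is present,
// otherwise the per-record value. A key missing from both yields "".
func resourceFirst(res, rec attrs, key string) string {
	if v, ok := res[key]; ok {
		return v
	}
	return rec[key]
}

// k8sPodScope resolves the trio with resource attributes preferred.
func k8sPodScope(res, rec attrs) K8sScope {
	return K8sScope{
		PodName:      resourceFirst(res, rec, keyPodName),
		PodNamespace: resourceFirst(res, rec, keyNamespace),
		Node:         resourceFirst(res, rec, keyNodeName),
	}
}

// recordFirstNode shows the opposite preference, as described for
// projectNodeCondition: the per-record node name wins over the resource one.
func recordFirstNode(res, rec attrs) string {
	if v, ok := rec[keyNodeName]; ok {
		return v
	}
	return res[keyNodeName]
}

func main() {
	res := attrs{keyPodName: "trainer-0", keyNodeName: "gpu-node-a"}
	rec := attrs{keyPodName: "stale-pod", keyNamespace: "ml", keyNodeName: "gpu-node-b"}

	fmt.Printf("%+v\n", k8sPodScope(res, rec))
	fmt.Println(recordFirstNode(res, rec))
}
```

Because missing keys resolve to an empty string rather than an error, a projector that requires a non-empty pod or node still has to check the field itself, which matches the "callers gate explicitly" note above.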
… files (#375)
Moves five projection functions from the 1064-line patterndetector.go into the sibling files that already host each pattern's detector glue, matching the layout the post-wave detectors (cuda_oom, dataloader_hang, nccl_bootstrap, silent_data_corruption, checkpointer_hang) established. Each pattern's projection + detector now live together.
Moved:
- projectThermalThrottleRecord -> thermal_throttle.go (pattern #4)
- projectXidRecord -> xid_correlation.go (pattern #6)
- projectHBMECCRecord -> hbm_ecc.go (pattern #3)
- projectPCIeAERRecord + projectPCIeIORecord -> pcie_aer.go (pattern #5, two inputs)
- projectIBPortStateRecord -> ib_link_flap.go (pattern #2)
patterndetector.go drops from 1064 to 802 lines. Test files already sit at their canonical names, so the projector + tests now share a filename root for each pattern. Zero behavior change: collectInputs still calls each projector by name (Go package scope makes the relocation transparent).
…ared.go (#376)
Creates projectors_shared.go with the projection functions consumed by 2+ detectors:
- projectTrainingStepStallRecord (moved from checkpointer_hang.go) -- consumed by pattern #7 dataloader_hang + #11 checkpointer_hang
- projectNCCLFRRecord (moved from patterndetector.go) -- consumed by patterns #2 ib_link_flap + #9 nccl_bootstrap + #12 nccl_hang
- projectPodEvent (moved from patterndetector.go) -- consumed by patterns #14 pod_evicted + #6 xid_correlation
- projectObjectRef (moved from patterndetector.go) -- internal helper to projectPodEvent (event-specific fallback chain distinct from k8sPodScope)
Per the audit (#376): the file's top-of-file comment pins the membership rule -- a projector lives here when >=2 patterns consume the record type, not when >=2 source-code call sites read it. Single-pattern projectors stay in their sibling detector file (thermal_throttle.go, xid_correlation.go, hbm_ecc.go, pcie_aer.go, ib_link_flap.go, cuda_oom.go, silent_data_corruption.go, nccl_bootstrap.go, dataloader_hang.go, checkpointer_hang.go).
Updates the stale comment-pointer in dataloader_hang.go (which previously pointed at checkpointer_hang.go) -- the new filename self-documents the sharing.
patterndetector.go drops to 681 lines (was 1118 pre-refactor). Zero behavior change.
…red.go revive flagged the package-comments rule: package comment must sit immediately above the package statement, not above a separate prose block. Move the membership-rule explanation to a file-scope comment below the package line. No code change.
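The comment placement that this lint fix describes could look something like the sketch below. It is only an illustration: the package name, the comment wording, and the `belongsInSharedFile` helper are hypothetical, and the helper merely restates the membership rule from the previous commit.

```go
// Package main sketches the comment layout: the package comment sits directly
// above the package clause, with no blank line or prose block in between.
package main

// Membership rule (file-scope note, not a doc comment): a projector belongs in
// the shared file when two or more patterns consume its record type. The blank
// line below keeps this note from attaching to the import declaration.

import "fmt"

// belongsInSharedFile is a hypothetical helper that restates the rule as code:
// it counts consuming patterns, not source-code call sites.
func belongsInSharedFile(consumingPatterns int) bool {
	return consumingPatterns >= 2
}

func main() {
	fmt.Println(belongsInSharedFile(3))
	fmt.Println(belongsInSharedFile(1))
}
```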
Re: the reviewer's membership-rule finding. The author's comment at projectors_shared.go:143-145 correctly documents both consumers, so the membership rule is honored. Reviewer recommendation declined; no relocation.
This was referenced Jun 1, 2026
Closed
Summary
Bundled refactor closing #375 + #376 + #377 + #378 from the post-wave-audit (docs/v1-rc1-post-wave-audit.md). Zero behavior change: every existing test stays green throughout the 5 commits.
- Adds k8sPodScope() and recordTimestamp() helpers. Routes all 14 projector functions through the helpers. Net -183 LOC of duplicated resAttrs.Get(k) else attrs.Get(k) ladders + timestamp fallback ladders.
- Moves the per-pattern projectors out of patterndetector.go (was 1118 LOC) into the sibling pattern file that already hosts the detector glue: thermal_throttle.go, xid_correlation.go, hbm_ecc.go, pcie_aer.go (hosts both AER + IO), ib_link_flap.go.
- Creates projectors_shared.go for the 4 projectors consumed by 2+ detectors (projectTrainingStepStallRecord, projectNCCLFRRecord, projectPodEvent, projectObjectRef). Filename now self-documents the cross-pattern sharing.
Diff size
Net +120 with tests; -17 LOC without tests (the additions are a 138-line table-driven test for the helpers + a 67-line helper file). The headline win is the deletion of 183 LOC of duplicated ladders in dedup commit 792ebc9: every future detector inherits the helpers instead of re-writing the fallback chain. patterndetector.go shrinks from 1118 → 681 LOC (the file is now orchestration + verdictCommon/appendVerdict* writers only, with no projection functions).
Root cause (per audit doc)
- The resAttrs.Get(k) else attrs.Get(k) ladder for k8s pod/ns/node attributes. Each new detector copy-pasted the ladder. Root cause: no shared helper.
- The lr.Timestamp() else lr.ObservedTimestamp() timestamp fallback ladder. Root cause: no shared helper.
- patterndetector.go accumulated 9 per-pattern projectors as new patterns landed in sibling files. Root cause: layout drift; pre-wave detectors never moved their projectors out when post-wave detectors started co-locating.
- projectTrainingStepStallRecord was explicitly "shared" (by a comment-pointer). Other cross-pattern projectors lived in patterndetector.go by convention. Root cause: no filename signal for "this projector is consumed by ≥2 detectors."
Helper contracts (pinned by table tests in record_utils_test.go)
- k8sPodScope prefers resource attrs (canonical k8sattributes stamp) over per-record attrs. Missing keys return empty string; callers gate explicitly per detector.
- recordTimestamp preserves verbatim in-tree behavior, including the Unix-epoch fallback when both Timestamp and ObservedTimestamp are unset (test pinned).
Intentional non-uses of the helper (documented inline)
Two projectors deliberately do not use k8sPodScope:
- projectNodeCondition: kubelet-emitted node-condition Events carry node identity on the log record itself; record-preferred semantics are intentional.
- projectCNINetworkEventRecord (namespace + pod): uses event-specific fallback chains (k8s.regarding.namespace, k8s.regarding.name, k8s.event.reporting_instance) that don't match k8sPodScope's key set.
Issues closed
Closes #375 #376 #377 #378.
Test plan
- cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... (green)
- cd module && go test ./... (green, full module sweep)
- cd module && go vet ./... (clean)
- cd module && go tool golangci-lint run ./processor/patterndetectorprocessor/... (baseline unchanged: was 14 findings on main, now 13; I introduced + fixed 1 new revive finding in commit e987bee)
- Table tests in record_utils_test.go pin both helper contracts (resource-only / record-only / resource-wins / neither / mixed; ts+observed presence matrix)
Commit sequence
- 0a19a77 test(patterndetector): pin k8sPodScope + recordTimestamp helpers
- 792ebc9 refactor(patterndetector): dedup k8s scope + timestamp ladders (#377 #378)
- bd12278 refactor(patterndetector): relocate per-pattern projectors to sibling files (#375)
- 7a8106e refactor(patterndetector): hoist shared projectors into projectors_shared.go (#376)
- e987bee lint(patterndetector): fix detached package comment in projectors_shared.go