Skip to content

refactor(patterndetector): hoist k8s scope + relocate projectors (#375-#378)#387

Merged
trilamsr merged 5 commits into
mainfrom
refactor/patterndetector-dedup-375-378
Jun 1, 2026
Merged

refactor(patterndetector): hoist k8s scope + relocate projectors (#375-#378)#387
trilamsr merged 5 commits into
mainfrom
refactor/patterndetector-dedup-375-378

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor
NONE — internal refactor of `patterndetectorprocessor`. Zero behavior change; no new public API, config, or wire-format surface. Closes #375 #376 #377 #378. Net -183 LOC of duplicated k8s-scope + timestamp fallback ladders via two new package-internal helpers (`k8sPodScope`, `recordTimestamp`); per-pattern projectors relocated into sibling files; cross-pattern projectors hoisted into `projectors_shared.go`. `patterndetector.go` shrinks 1118 → 681 LOC.

Summary

Bundled refactor closing #375 + #376 + #377 + #378 from the post-wave-audit (docs/v1-rc1-post-wave-audit.md). Zero behavior change — every existing test stays green throughout the 5 commits.

Diff size

14 files changed, 800 insertions(+), 680 deletions(-)

Net +120 with tests; -17 LOC without tests (138-line table-driven test for the helpers + 67-line helper file are the additions). The headline win is the deletion of 183 LOC of duplicated ladders in dedup commit 792ebc9 — every future detector inherits the helpers instead of re-writing the fallback chain.

patterndetector.go shrinks from 1118 → 681 LOC (the file is now orchestration + verdictCommon/appendVerdict* writers only — no projection functions).

Root cause (per audit doc)

  1. refactor(patterndetector): hoist k8s pod/ns/node attribute scope helper (~80 LOC delete) #377: 57 sites repeat a resAttrs.Get(k) else attrs.Get(k) ladder for k8s pod/ns/node attributes. Each new detector copy-pasted the ladder. Root cause: no shared helper.
  2. refactor(patterndetector): hoist recordTimestamp(lr) helper #378: 14 sites repeat a lr.Timestamp() else lr.ObservedTimestamp() timestamp fallback ladder. Root cause: no shared helper.
  3. refactor(patterndetector): move per-pattern projectors out of patterndetector.go #375: patterndetector.go accumulated 9 per-pattern projectors as new patterns landed in sibling files. Root cause: layout drift — pre-wave detectors never moved their projectors out when post-wave detectors started co-locating.
  4. refactor(patterndetector): consolidate shared projectors into projectors_shared.go #376: only projectTrainingStepStallRecord was explicitly "shared" (by a comment-pointer). Other cross-pattern projectors lived in patterndetector.go by convention. Root cause: no filename signal for "this projector is consumed by ≥2 detectors."

Helper contracts (pinned by table tests in record_utils_test.go)

type K8sScope struct { PodName, PodNamespace, Node string }
func k8sPodScope(lr plog.LogRecord, resAttrs pcommon.Map) K8sScope
func recordTimestamp(lr plog.LogRecord) time.Time
  • k8sPodScope prefers resource attrs (canonical k8sattributes stamp) over per-record attrs. Missing keys return empty string; callers gate explicitly per detector.
  • recordTimestamp preserves verbatim in-tree behavior including the Unix-epoch fallback when both Timestamp and ObservedTimestamp are unset (test pinned).

Intentional non-uses of the helper (documented inline)

Two projectors deliberately do not use k8sPodScope:

  • projectNodeCondition — kubelet-emitted node-condition Events carry node identity on the log record itself; record-preferred semantics intentional.
  • projectCNINetworkEventRecord (namespace + pod) — uses event-specific fallback chains (k8s.regarding.namespace, k8s.regarding.name, k8s.event.reporting_instance) that don't match k8sPodScope's key set.

Issues closed

Closes #375 #376 #377 #378.

Test plan

  • cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... — green
  • cd module && go test ./... — green (full module sweep)
  • cd module && go vet ./... — clean
  • cd module && go tool golangci-lint run ./processor/patterndetectorprocessor/... — baseline unchanged (was 14 findings on main, now 13 — I introduced + fixed 1 new revive finding in commit e987bee)
  • New regression tests in record_utils_test.go pin both helper contracts (resource-only / record-only / resource-wins / neither / mixed; ts+observed presence matrix)

Commit sequence

  1. 0a19a77 test(patterndetector): pin k8sPodScope + recordTimestamp helpers
  2. 792ebc9 refactor(patterndetector): dedup k8s scope + timestamp ladders (refactor(patterndetector): hoist k8s pod/ns/node attribute scope helper (~80 LOC delete) #377 refactor(patterndetector): hoist recordTimestamp(lr) helper #378)
  3. bd12278 refactor(patterndetector): relocate per-pattern projectors to sibling files (refactor(patterndetector): move per-pattern projectors out of patterndetector.go #375)
  4. 7a8106e refactor(patterndetector): hoist shared projectors into projectors_shared.go (refactor(patterndetector): consolidate shared projectors into projectors_shared.go #376)
  5. e987bee lint(patterndetector): fix detached package comment in projectors_shared.go

Tri Lam added 5 commits June 1, 2026 14:14
…#378)

Adds K8sScope struct + k8sPodScope() resolver that prefers resource
attributes over per-record attributes for the k8s.pod/ns/node trio.
Adds recordTimestamp() that mirrors the in-tree timestamp fallback
ladder (lr.Timestamp() else lr.ObservedTimestamp()).

Table-driven tests pin: resource-only, record-only, resource-wins,
neither, mixed combinations for K8sScope; ts/observed presence matrix
for recordTimestamp including the preserved Unix-epoch fallback when
both are unset.

Helpers added but not yet called; the next commit dedups the 57 k8s
attribute sites + 14 timestamp sites onto these helpers.
…378)

Routes all 14 projector functions through k8sPodScope() and
recordTimestamp(). Replaces ~57 sites of the
'resAttrs.Get(k) else attrs.Get(k)' fallback ladder for k8s
pod/ns/node attributes + ~14 sites of the 'lr.Timestamp() else
lr.ObservedTimestamp()' timestamp fallback ladder.

Net -183 LOC across the package with zero behavior change — every
existing test stays green.

Two intentional non-uses of the helper documented inline:
  - projectNodeCondition prefers per-record k8s.node.name (kubelet-
    emitted Event carries the canonical identity on the record).
  - projectCNINetworkEventRecord namespace + pod use Event-specific
    fallback chains (k8s.regarding.namespace + k8s.regarding.name)
    that don't match the helper's keys.

Helper internals: K8sScope{PodName, PodNamespace, Node} returned by
k8sPodScope(); resource-preferred; missing keys return empty string
(callers gate explicitly where the projection requires non-empty).
… files (#375)

Moves five projection functions from the 1064-line patterndetector.go
into the sibling files that already host each pattern's detector glue,
matching the layout the post-wave (cuda_oom, dataloader_hang,
nccl_bootstrap, silent_data_corruption, checkpointer_hang) detectors
established. Each pattern's projection + detector now live together.

Moved:
  - projectThermalThrottleRecord -> thermal_throttle.go (pattern #4)
  - projectXidRecord            -> xid_correlation.go (pattern #6)
  - projectHBMECCRecord         -> hbm_ecc.go (pattern #3)
  - projectPCIeAERRecord +
    projectPCIeIORecord         -> pcie_aer.go (pattern #5, two inputs)
  - projectIBPortStateRecord    -> ib_link_flap.go (pattern #2)

patterndetector.go drops from 1064 to 802 lines. Test files already
sit at their canonical names — the projector + tests now share a
filename root for each pattern.

Zero behavior change — collectInputs still calls each projector by
name (Go package scope makes the relocation transparent).
…ared.go (#376)

Creates projectors_shared.go with the projection functions consumed
by 2+ detectors:

  - projectTrainingStepStallRecord (moved from checkpointer_hang.go)
    -- consumed by pattern #7 dataloader_hang + #11 checkpointer_hang
  - projectNCCLFRRecord (moved from patterndetector.go)
    -- consumed by patterns #2 ib_link_flap + #9 nccl_bootstrap + #12 nccl_hang
  - projectPodEvent (moved from patterndetector.go)
    -- consumed by patterns #14 pod_evicted + #6 xid_correlation
  - projectObjectRef (moved from patterndetector.go)
    -- internal helper to projectPodEvent (event-specific fallback chain
       distinct from k8sPodScope)

Per the audit (#376): the file's top-of-file comment pins the
membership rule -- a projector lives here when >=2 patterns consume
the record type, not when >=2 source-code call sites read it.
Single-pattern projectors stay in their sibling detector file
(thermal_throttle.go, xid_correlation.go, hbm_ecc.go, pcie_aer.go,
ib_link_flap.go, cuda_oom.go, silent_data_corruption.go,
nccl_bootstrap.go, dataloader_hang.go, checkpointer_hang.go).

Updates the stale comment-pointer in dataloader_hang.go (which
previously pointed at checkpointer_hang.go) -- the new filename
self-documents the sharing.

patterndetector.go drops to 681 lines (was 1118 pre-refactor).
Zero behavior change.
…red.go

revive flagged the package-comments rule: package comment must sit
immediately above the package statement, not above a separate prose
block. Move the membership-rule explanation to a file-scope comment
below the package line.

No code change.
@trilamsr

trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Re reviewer membership-rule finding: projectPodEvent IS consumed by 2 patterns. Direct grep finds only 1 call site (patterndetector.go:395), but the projected events slice is consumed downstream by BOTH det.Evaluate(events, nodeConds) (pattern #14 pod_evicted, line 255) AND xidDet.Evaluate(xidRecs, events) (pattern #6 xid_correlation joins on pod events to attribute Xid to workload, line 272).

Author's comment at projectors_shared.go:143-145 correctly documents both consumers. Membership rule honored. projectObjectRef stays co-resident as it's the projection-side helper for projectPodEvent (single declaration site keeps the fallback chain in one place).

Reviewer recommendation declined; no relocate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor(patterndetector): move per-pattern projectors out of patterndetector.go

1 participant