feat(dataloader): pattern-7 hang detector + verdict schema #346
Signed-off-by: Tri Lam <tri@maydow.com>
Addressed reviewer findings:

🔴 Fixed (real correctness issue): when worker_killed and storage_event both join, the tie-break now emits BOTH evidence refs in evidence_trail (3 entries: stall + dataloader_error + storage_event). The discriminator stays worker_killed per spec, but the trail is the union of correlated signals, so operators see a node-wide storage failure when one is present. EventReason is still bound to the chosen discriminator (worker_killed → empty; storage_event → reason). The schema description clarifies this. Tests: extended TestDataLoaderHangDetector_PrefersWorkerKilledOverStorage to assert a 3-entry trail that includes the storage_event kind; added a new processor-level test, TestPatternDetector_DataLoaderHangWiringBothDiscriminatorsJoin.

🟡 Quick wins:
Spec follow-ups (deferred, will file):
Acknowledged, no action:
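The tie-break fix described in this comment can be illustrated with a minimal Go sketch. It is not the detector's actual code: the `Evidence` and `Verdict` types and the `BuildVerdict` function are hypothetical names chosen for illustration, and only the behavior (worker_killed wins the discriminator, the trail carries the union of joined signals, EventReason follows the chosen discriminator) comes from the comment above.

```go
package main

import "fmt"

// Evidence is a hypothetical reference to one correlated signal.
type Evidence struct {
	Kind   string // "training_step_stall", "dataloader_error", or "storage_event"
	Reason string // Kubernetes Event reason; only set on storage events
}

// Verdict is a hypothetical, trimmed-down verdict shape.
type Verdict struct {
	Discriminator string
	EventReason   string
	EvidenceTrail []Evidence
}

// BuildVerdict reports false when neither discriminator joined, so a stall
// alone never produces a verdict.
func BuildVerdict(stall Evidence, workerErr, storageEv *Evidence) (Verdict, bool) {
	if workerErr == nil && storageEv == nil {
		return Verdict{}, false
	}
	// The trail is the union of every signal that joined, in a fixed order.
	v := Verdict{EvidenceTrail: []Evidence{stall}}
	if workerErr != nil {
		v.EvidenceTrail = append(v.EvidenceTrail, *workerErr)
	}
	if storageEv != nil {
		v.EvidenceTrail = append(v.EvidenceTrail, *storageEv)
	}
	// The pod-scoped signal wins the discriminator when both joined.
	if workerErr != nil {
		v.Discriminator = "worker_killed"
		return v, true // EventReason stays empty for worker_killed
	}
	v.Discriminator = "storage_event"
	v.EventReason = storageEv.Reason
	return v, true
}

func main() {
	stall := Evidence{Kind: "training_step_stall"}
	worker := &Evidence{Kind: "dataloader_error"}
	storage := &Evidence{Kind: "storage_event", Reason: "FailedMount"}

	v, _ := BuildVerdict(stall, worker, storage)
	fmt.Println(v.Discriminator, len(v.EvidenceTrail), v.EventReason == "")
	// prints: worker_killed 3 true
}
```

Keeping the trail construction separate from the discriminator choice is what lets the storage event stay visible to operators even when it does not decide the verdict.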
…ader-hang # Conflicts: # docs/ATTRIBUTES.md # module/processor/patterndetectorprocessor/config.go
Summary
- Status flips from ☐ planned to ☑ shipped.
- docs/patterns/07-dataloader-hang.md: a stall alone never fires; the verdict requires either a same-pod dataloader.error_class log (worker_killed) or a same-node FailedMount / VolumeMountFailure Kubernetes Event (storage_event) within the correlation window.
- False-positive guards: warmup, phase=="eval" validation pauses, and sub-threshold cadence variance.

Tests
- module/pkg/patterns/dataloader_hang_test.go (14 cases): canonical worker-killed + storage-event triggers, the three false-positive guards (warmup, eval, sub-threshold), per-pod / per-node join scopes, window semantics, discriminator priority when both layers join, configurable threshold, deterministic ordering, schema conformance + 10 drift-rejection falsifiers.
- module/processor/patterndetectorprocessor/dataloader_hang_test.go (5 cases): worker-killed wiring with promoted scalars, storage-event wiring, stall-alone-no-fire conservative-guard wiring, YAML threshold knob, sub-second validation rejection.
- make check clean (gofumpt, golangci-lint 0 issues, go vet, go mod verify, attribute-namespace-check 73/73 documented).

Schema
module/pkg/patterns/testdata/dataloader_hang_verdict.schema.json pins:
- pattern.id const "7"
- discriminator enum {worker_killed, storage_event}
- confidence enum {full, partial} (v1 emits only full; partial reserved for the future py-spy degraded path per spec Open Q#2)
- tracecore.alert.dataloader_hang.stall_seconds integer ≥ 0
- evidence_trail.kind enum {training_step_stall, dataloader_error, storage_event} with additionalProperties: false

Spec ambiguity choices (per PRINCIPLES.md: maximize operator UX, minimize false positives)
- confidence=full only when a discriminator joins; confidence=partial is reserved in the schema enum so we can land the py-spy branch additively. Trade-off: stalls without storage / worker evidence are silent, so operators triage off pattern #6 (stragglers) and pattern #11 (checkpointer) instead.
- tracecore.alert.training_step_stalled.no_progress_seconds bridge attribute, mirroring the pattern-5 tracecore.alert.pcie_rate_collapse.* shape. OTTL recipe wire-up follows the metrics→logs convention from RFC-0014; the recipe itself is the integration-gap follow-up (out of scope for the detector PR).
- dataloader.error_class string; the recipe owns per-driver regex stanzas.
- gen_ai.training.phase (not yet upstream): detector reads it as a tracecore-ext alpha attribute; documented in docs/ATTRIBUTES.md. Switches to upstream-semconv when O4 lands it.
- worker_killed wins: pod-scoped is more precise (storage events on the node may be unrelated to the failing pod's mount). Pinned by TestDataLoaderHangDetector_PrefersWorkerKilledOverStorage.

Closes
None: no GitHub issue exists for pattern #7. Closes the spec-vs-detector gap from
docs/patterns/07-dataloader-hang.md (status flips from ☐ planned to ☑ shipped).
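
For reviewers who want the join rule from the Summary in one place, here is a rough Go sketch of the conservative guard. It is an illustration under stated assumptions, not the code in this PR: the `Stall` and `Signal` types, the `Classify` function, the symmetric window check, and the `Warmup` flag are all hypothetical. Only the rule itself (guards first, then same-pod worker error, then same-node storage event, otherwise silence) is taken from the description.

```go
package main

import (
	"fmt"
	"time"
)

// Stall is a hypothetical view of a training-step stall signal.
type Stall struct {
	Pod, Node string
	At        time.Time
	Seconds   int    // seconds without step progress
	Phase     string // e.g. "train" or "eval"
	Warmup    bool   // assumed flag: true while the job is still warming up
}

// Signal is a hypothetical discriminator candidate: a dataloader error log
// (pod-scoped) or a storage-related Kubernetes Event (node-scoped).
type Signal struct {
	Pod, Node string
	At        time.Time
}

// within reports whether a and b are at most window apart.
// A symmetric window is an assumption made for this sketch.
func within(a, b time.Time, window time.Duration) bool {
	d := a.Sub(b)
	if d < 0 {
		d = -d
	}
	return d <= window
}

// Classify returns the discriminator, or "" when the detector stays silent.
func Classify(s Stall, workerErrs, storageEvents []Signal, threshold int, window time.Duration) string {
	// False-positive guards: warmup, eval pauses, sub-threshold stalls.
	if s.Warmup || s.Phase == "eval" || s.Seconds < threshold {
		return ""
	}
	// Pod-scoped evidence is checked first, so it wins when both join.
	for _, w := range workerErrs {
		if w.Pod == s.Pod && within(w.At, s.At, window) {
			return "worker_killed"
		}
	}
	for _, e := range storageEvents {
		if e.Node == s.Node && within(e.At, s.At, window) {
			return "storage_event"
		}
	}
	return "" // a stall alone never fires
}

func main() {
	now := time.Now()
	window := 5 * time.Minute
	s := Stall{Pod: "trainer-0", Node: "node-a", At: now, Seconds: 120, Phase: "train"}
	storage := []Signal{{Node: "node-a", At: now.Add(-time.Minute)}}

	fmt.Println(Classify(s, nil, nil, 60, window) == "") // stall alone stays silent
	fmt.Println(Classify(s, nil, storage, 60, window))   // same-node storage event joins
}
```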