Skip to content

feat(dataloader): pattern-7 hang detector + verdict schema#346

Merged
trilamsr merged 5 commits into
mainfrom
feat/pattern-7-dataloader-hang
Jun 1, 2026
Merged

feat(dataloader): pattern-7 hang detector + verdict schema#346
trilamsr merged 5 commits into
mainfrom
feat/pattern-7-dataloader-hang

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Ships pattern Add AI review + PR-creation gates and lifecycle skills #7 (dataloader hang) end-to-end: detector library, JSON-schema verdict shape, processor wiring, integration tests, and docs flip from ☐ planned to ☑ shipped.
  • Conservative two-discriminator rule per docs/patterns/07-dataloader-hang.md: a stall alone never fires; the verdict requires either a same-pod dataloader.error_class log (worker_killed) or a same-node FailedMount / VolumeMountFailure Kubernetes Event (storage_event) within the correlation window.
  • Three false-positive guards land with the detector: step-0/1 warmup-prefetch, phase=="eval" validation pauses, and sub-threshold cadence variance.
patterns: pattern #7 (dataloader hang) detector ships with worker_killed + storage_event discriminators, conservative warmup / eval-phase / sub-threshold guards, and a pinned JSON-schema verdict shape.

Tests

cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -run DataLoader
  • Library (module/pkg/patterns/dataloader_hang_test.go, 14 cases): canonical worker-killed + storage-event triggers, the three false-positive guards (warmup, eval, sub-threshold), per-pod / per-node join scopes, window semantics, discriminator priority when both layers join, configurable threshold, deterministic ordering, schema conformance + 10 drift-rejection falsifiers.
  • Processor (module/processor/patterndetectorprocessor/dataloader_hang_test.go, 5 cases): worker-killed wiring with promoted scalars, storage-event wiring, stall-alone-no-fire conservative-guard wiring, YAML threshold knob, sub-second validation rejection.
  • make check clean (gofumpt, golangci-lint 0 issues, go vet, go mod verify, attribute-namespace-check 73/73 documented).

Schema

module/pkg/patterns/testdata/dataloader_hang_verdict.schema.json — pins:

  • pattern.id const "7"
  • discriminator enum {worker_killed, storage_event}
  • confidence enum {full, partial} (v1 emits only full; partial reserved for the future py-spy degraded path per spec Open Q#2)
  • tracecore.alert.dataloader_hang.stall_seconds integer ≥ 0
  • evidence_trail.kind enum {training_step_stall, dataloader_error, storage_event} with additionalProperties: false

Spec ambiguity choices (per PRINCIPLES.md: maximize operator UX, minimize false positives)

  • Open Q#2 (py-spy degraded path): deferred. Spec notes the py-spy / parca-agent receiver is undecided at v1 (RFC-0013 §7 slates pyspy for delete; parca-agent contract not yet authored). v1 detector emits confidence=full only when a discriminator joins; confidence=partial is reserved in the schema enum so we can land the py-spy branch additively. Trade-off: stalls without storage / worker evidence are silent — operators triage off pattern Make RFC process optional, not gated #6 (stragglers) and pattern Add developer foundation: Git hooks suite, issue templates, editorconfig #11 (checkpointer) instead.
  • Open Q#1 (step-progress signal source): detector consumes a tracecore.alert.training_step_stalled.no_progress_seconds bridge attribute, mirroring the pattern-5 tracecore.alert.pcie_rate_collapse.* shape. OTTL recipe wire-up follows the metrics→logs convention from RFC-0014; the recipe itself is the integration-gap follow-up (out of scope for the detector PR).
  • Open Q#3 (FUSE error vocabulary): deferred to the OTTL recipe — detector reads a single dataloader.error_class string; the recipe owns per-driver regex stanzas.
  • Open Q#4 (gen_ai.training.phase upstream): detector reads it as a tracecore-ext alpha attribute; documented in docs/ATTRIBUTES.md. Switches to upstream-semconv when O4 lands it.
  • Discriminator priority: when both a same-pod worker-killed log and a same-node storage event join the stall, worker_killed wins — pod-scoped is more precise (storage events on the node may be unrelated to the failing pod's mount). Pinned by TestDataLoaderHangDetector_PrefersWorkerKilledOverStorage.

Closes

None — no GitHub issue exists for pattern #7. Closes the spec-vs-detector gap from docs/patterns/07-dataloader-hang.md (status flips from ☐ planned to ☑ shipped).

@trilamsr

trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Addressed reviewer findings:

🔴 Fixed (real correctness issue): tie-break (worker_killed + storage_event both join) now emits BOTH evidence refs in evidence_trail (3 entries: stall + dataloader_error + storage_event). Discriminator stays worker_killed per spec, but the trail is union-of-correlated-signals so operators see node-wide storage failure when present. EventReason still bound to chosen discriminator (worker_killed → empty; storage_event → reason). Schema description clarifies. Tests: extended TestDataLoaderHangDetector_PrefersWorkerKilledOverStorage to assert len=3 trail + storage_event kind; new TestPatternDetector_DataLoaderHangWiringBothDiscriminatorsJoin processor-level test.

🟡 Quick wins:

  • Added TestDataLoaderHangDetector_MissingPhaseField confirming fail-open guard semantics.
  • Pinned dataloaderHangStorageReasons allowlist to k8sobjectsreceiver v0.130.0 / k8s.io/api transitive in comment.

Spec follow-ups (deferred, will file):

Acknowledged, no action: dataloader.* namespace per RFC-0013 tracecore-owned policy.

…ader-hang

# Conflicts:
#	docs/ATTRIBUTES.md
#	module/processor/patterndetectorprocessor/config.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant