Pattern walkthroughs

Operator-facing pages for each NORTHSTARS Appendix A root-cause pattern this receiver enables. Each walkthrough follows the same path: symptom → receiver-emitted signal → query → alert → escalation.

The point: an operator hitting a Pattern-#N incident shouldn't have to read tracecore source to understand which metrics signal it. Each walkthrough is self-contained.

Input sources (post-RFC-0013)

Input signals come from upstream-adopted receivers per the adoption matrix in RFC-0013 §2. The pattern detection engine itself (tracecore-components/processor/patterndetectorprocessor) stays in-tree: it is the differentiator (the moat), not the receivers underneath.

The cross-vendor gpu.vendor resource attribute (nvidia | amd | intel | habana) is normalized by an OTTL transform processor in the bundled recipe; the attribute is also an upstream-contribution target (to be proposed to the OTel hw.* semconv).
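The normalization itself lives in the recipe's OTTL statements, which are not reproduced here. As a rough picture of what "normalized" means, the sketch below maps raw vendor strings onto the four enum values in Go. The alias strings and the function name are hypothetical, chosen only for illustration; they are not the recipe's actual mapping.

```go
package main

import (
	"fmt"
	"strings"
)

// vendorAliases maps raw vendor strings to the four gpu.vendor enum values
// named in this README. The alias keys are hypothetical examples, not the
// table the bundled recipe uses.
var vendorAliases = map[string]string{
	"nvidia":                 "nvidia",
	"nvidia corporation":     "nvidia",
	"amd":                    "amd",
	"advanced micro devices": "amd",
	"intel":                  "intel",
	"habana":                 "habana",
	"habana labs":            "habana",
}

// normalizeGPUVendor lower-cases and trims the raw string, then looks it up.
// It returns the enum value and true, or "" and false for unknown input.
func normalizeGPUVendor(raw string) (string, bool) {
	v, ok := vendorAliases[strings.ToLower(strings.TrimSpace(raw))]
	return v, ok
}

func main() {
	for _, raw := range []string{"NVIDIA Corporation", " AMD ", "unknown-vendor"} {
		v, ok := normalizeGPUVendor(raw)
		fmt.Printf("%q -> %q (known=%t)\n", raw, v, ok)
	}
}
```

Returning an explicit "unknown" flag rather than passing the raw string through keeps the enum closed, which matters if downstream queries group by gpu.vendor.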

Pattern outputs (M17 / M18 / M19) preserve their attribute names (gpu.id, kernelevents.xid, gpu.vendor) and the 11-entry k8s.event.hint enum across the adoption swap. These are customer-stable contracts - see RFC-0013 §3.

These pages assume:

  • The bundled OTel recipe is deployed (Helm chart tracecore-recipes).
  • The relevant vendor exporter is reachable on its scrape endpoint (e.g., dcgm-exporter running as a DaemonSet for NVIDIA hosts).
  • An OTLP backend is receiving the metrics (Prometheus, Datadog, Honeycomb, Mimir all confirmed in the backend matrix).

Seven of NORTHSTARS Appendix A's 15 patterns have operator-facing walkthroughs today — the DCGM-observable subset plus pattern #2 (InfiniBand link flap, fabric-observable) and pattern #14 (pod evicted, k8s-events-observable). The remaining patterns carry a design spec under this directory pending implementation, or sit in the "reserved / unfilled" table below where the pattern number is named in NORTHSTARS Appendix A but no detector has been written yet. New walkthroughs land alongside the detector that surfaces the pattern.

Operator walkthroughs (shipped)

| Pattern | File | Signal |
| --- | --- | --- |
| #1 NVLink silent degradation | 01-nvlink-degradation-walkthrough.md | per-link hw.gpu.nvlink.io Tx/Rx divergence |
| #2 InfiniBand link flap | 02-ib-link-flap-walkthrough.md | hw.network.ib.port.state ACTIVE↔DOWN transitions (>=2 in 2m window) joined to same-node NCCL FR stuck collective |
| #3 Uncorrectable HBM ECC | 03-hbm-ecc-walkthrough.md | hw.errors{error.type=uncorrected} non-zero |
| #4 Thermal throttling cascade | 04-thermal-throttle-walkthrough.md | hw.gpu.throttle.duration{reason=thermal} rate-of-change |
| #5 PCIe AER cascade | 05-pcie-aer-walkthrough.md | hw.gpu.io Tx/Rx counter discontinuities |
| #7 Dataloader hang | 07-dataloader-hang.md | tracecore.alert.training_step_stalled.* + dataloader.error_class OR k8s.event.reason{FailedMount\|VolumeMountFailure} |
| #14 Pod evicted / NodeNotReady | 14-pod-evicted-walkthrough.md | k8s.event.hint=pod_evicted joined to k8s.node.condition.{type,status} transition within join_window (default 30s) |
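To make the pattern #2 threshold concrete, here is a minimal Go sketch of the ">=2 transitions in a 2m window" half of that rule. It does not model the join to the NCCL FR stuck collective, and it is not the shipped detector: the `portSample` type, the helper names, and the choice to attribute a transition to the later of two samples are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// portSample is one observation of an InfiniBand port's state.
// The type and its fields are hypothetical.
type portSample struct {
	At     time.Time
	Active bool // true = ACTIVE, false = DOWN
}

// at builds a timestamp sec seconds after the Unix epoch (demo helper).
func at(sec int) time.Time {
	return time.Unix(int64(sec), 0)
}

// countTransitions counts ACTIVE<->DOWN flips whose later sample falls in
// (now-window, now]. Samples must be sorted by time, oldest first.
func countTransitions(samples []portSample, now time.Time, window time.Duration) int {
	start := now.Add(-window)
	n := 0
	for i := 1; i < len(samples); i++ {
		cur := samples[i]
		inWindow := cur.At.After(start) && !cur.At.After(now)
		if inWindow && cur.Active != samples[i-1].Active {
			n++
		}
	}
	return n
}

// isFlapping applies the threshold from the table: >=2 transitions in 2m.
func isFlapping(samples []portSample, now time.Time) bool {
	return countTransitions(samples, now, 2*time.Minute) >= 2
}

func main() {
	samples := []portSample{
		{at(0), true},   // ACTIVE
		{at(30), false}, // DOWN
		{at(60), true},  // ACTIVE again
	}
	fmt.Println(isFlapping(samples, at(90))) // two flips inside the window
}
```

A single ACTIVE→DOWN transition does not trip the rule; only a port that changes state at least twice inside the window counts as flapping.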

Design specs (planned detectors — TDD red-test inputs)

Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each follows a fixed shape: Symptom / Layers crossed / Signal sources / Detector evaluation rule / Verdict attributes / Edge cases / Tested-against / Open questions. A spec's status stays ☐ planned until its detector lands; it is promoted to an operator walkthrough when the detector ships.

| Pattern | Spec | Status |
| --- | --- | --- |
| #2 InfiniBand link flap | 02-ib-link-flap.md | ☑ shipped (operator walkthrough: 02-ib-link-flap-walkthrough.md) |
| #8 NCCL timeout, no hardware cause | 08-nccl-timeout-no-hw.md | ☐ planned |
| #9 NCCL bootstrap timeout | 09-nccl-bootstrap-timeout.md | ☑ shipped |
| #10 CUDA OOM, deceptive allocator | 10-cuda-oom-deceptive.md | ☐ planned (#303 filed) |
| #11 Checkpointer hang | 11-checkpointer-hang.md | ☑ shipped (detector + processor wiring) |
| #12 Loss spike → NaN | 12-loss-spike-nan.md | ☐ planned |
| #13 Silent data corruption | 13-silent-data-corruption.md | ☑ shipped |

Reserved / unfilled (NORTHSTARS Appendix A patterns with no doc yet)

Pattern numbers named in NORTHSTARS Appendix A that have neither a design spec nor an operator walkthrough in this directory. Listed explicitly so the numeric gaps in the tables above are documented, not silent.

| Pattern | Status | Rationale |
| --- | --- | --- |
| #6 Stragglers from slow node (data_time 3× normal on one node) | ☐ unimplemented; milestone M18; no detector, no spec, no walkthrough yet | NORTHSTARS Appendix A names this pattern and M6 (v0) targets coverage. The chaos injector failure-inject cpu-steal lands the symptom (per M4b rubric), but the cross-rank data_time aggregation detector is build-time coupled to M17's cross_rank.go infra and is at risk per the NORTHSTARS-coupling note in M21 v0.1.0 release. NORTHSTARS Appendix A's "☑ in-tree detector" mark for this pattern is aspirational, not current state; it is to be reconciled when the detector lands. Doc will land alongside the detector; design-spec follow-up tracked at the M18 milestone. |
| #15 Image pull / FailedMount on restart | ☐ no spec yet (NORTHSTARS Appendix A entry; no detector planned for v1) | The k8s.event.hint 11-entry enum (RFC-0013 §3) already carries mount_failure and image_pull_failure hints, so the input signal exists. A detector would join repeated mount/pull failures on the same job's retry pods within a window; not yet scheduled. |

When a reserved pattern's detector lands, it promotes to the "Design specs" table (☐ planned → ☑ shipped) or directly to the "Operator walkthroughs (shipped)" table — same numeric ID, same filename convention.
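No detector exists yet for pattern #6, so the following is only a sketch of how the "data_time 3× normal on one node" symptom could be evaluated across nodes. It assumes "normal" means the fleet median of data_time, which this README does not specify; the function names and the input shape (node name to data_time in seconds) are likewise hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of xs, or 0 for an empty slice.
// It sorts a copy so the caller's slice is left untouched.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// stragglers returns the nodes whose data_time exceeds factor times the
// median across all nodes, sorted by name. Using the median as the
// "normal" baseline is an assumption of this sketch.
func stragglers(dataTime map[string]float64, factor float64) []string {
	vals := make([]float64, 0, len(dataTime))
	for _, v := range dataTime {
		vals = append(vals, v)
	}
	base := median(vals)
	var out []string
	for node, v := range dataTime {
		if base > 0 && v > factor*base {
			out = append(out, node)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	dataTime := map[string]float64{
		"node-a": 0.10,
		"node-b": 0.11,
		"node-c": 0.09,
		"node-d": 0.40, // well above 3x the median of 0.105
	}
	fmt.Println(stragglers(dataTime, 3))
}
```

A median baseline is robust to the straggler itself inflating the average, which is why it is used here instead of the mean; a real detector may well choose differently.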

Correlation-window semantics

Three v1 detectors use three different correlation-window shapes, each chosen independently to match its pattern's physical event ordering. Nothing else warns operators about this difference when they tune windows, so this table serves as the cross-reference.

| Pattern | Shape | Knob(s) | Why |
| --- | --- | --- | --- |
| #7 Dataloader hang | One-sided look-back | dataloader_hang_correlation_window | Discriminator (worker error, storage event, py-spy sample) must precede the stall: cause→symptom ordering. |
| #11 Checkpointer hang | Asymmetric (-30s, +60s) | checkpointer_hang_backward_window, checkpointer_hang_forward_window | Phase-record vs. stall-record race: torch may log "checkpoint begin" before OR after the sampler reports the stall. Two independently tunable legs. |
| #13 Silent data corruption | Symmetric, job-scoped | (no explicit window; job-bounded) | SDC counter rise and accuracy regression are concurrent symptoms of the same corruption, not cause→symptom. Order-free attribution. |

A unified asymmetric two-knob form would silently zero one leg for #7 and not apply to #13 — operators would tune knobs with no physical meaning for their pattern.
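The three shapes can be compared side by side in a small Go sketch. It is not the detectors' code: the `window` type, the `sameJob` helper, and the 60s look-back used for #7 are assumptions for illustration (this README does not state the default for dataloader_hang_correlation_window). Only the (-30s, +60s) pair for #11 comes from the table above.

```go
package main

import (
	"fmt"
	"time"
)

// window is a correlation window around an anchor, here the stall time.
// Back is how far before the anchor an event may fall, Fwd how far after.
// The type is hypothetical.
type window struct {
	Back time.Duration
	Fwd  time.Duration
}

// contains reports whether event lies in [anchor-Back, anchor+Fwd].
func (w window) contains(anchor, event time.Time) bool {
	d := event.Sub(anchor)
	return d >= -w.Back && d <= w.Fwd
}

// #7: one-sided look-back. Fwd stays zero because the discriminator must
// precede the stall. The 60s value is illustrative, not a documented default.
var dataloaderWindow = window{Back: 60 * time.Second}

// #11: asymmetric, two independently tunable legs (-30s, +60s).
var checkpointerWindow = window{Back: 30 * time.Second, Fwd: 60 * time.Second}

// #13: no time window at all; two signals correlate if they share a job.
func sameJob(a, b string) bool {
	return a != "" && a == b
}

// at builds a timestamp sec seconds after the Unix epoch (demo helper).
func at(sec int) time.Time {
	return time.Unix(int64(sec), 0)
}

func main() {
	stall := at(100)
	fmt.Println(dataloaderWindow.contains(stall, at(70)))    // 30s before the stall
	fmt.Println(dataloaderWindow.contains(stall, at(110)))   // after the stall
	fmt.Println(checkpointerWindow.contains(stall, at(150))) // 50s after the stall
	fmt.Println(sameJob("job-42", "job-42"))
}
```

Expressed this way, forcing #7 into the two-knob form means carrying a Fwd leg that must always be zero, and #13 has no meaningful value for either leg.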

Replay test fixture

The scripted-replay test fixture that previously lived at components/receivers/dcgm/pattern_replay_test.go was retired with the in-tree DCGM receiver in RFC-0013 PR-F.1. Pattern replay now flows through the upstream dcgm-exporter recipe. See docs/integrations/prometheus-scrape.md for the recipe and for the synthetic-sample shape that each pattern's walkthrough references.

When this directory is wrong

Patterns are emergent - operators in the field find them. Contributions welcome via the standard PR flow; the format below should be preserved for consistency.

Filename convention

Every file in this directory uses a zero-padded numeric prefix matching the NORTHSTARS Appendix A pattern number, so lexsort and pattern-number ordering agree:

  • NN-slug.md — engineering-facing design spec (the TDD red-test input). One per pattern. Lands first when a detector is spec'd before implementation; remains as the contract once the detector ships.
  • NN-slug-walkthrough.md — operator-facing runbook (Symptom → Signal → Query → Alert → Escalation → Replay). Lands when the detector ships and an operator needs to triage a real incident. May coexist with the spec at NN-slug.md (they cross-reference).

Pattern #2 carries both today (02-ib-link-flap.md spec + 02-ib-link-flap-walkthrough.md runbook). New patterns follow the same split when both audiences need a page.
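The convention is regular enough to check mechanically. The Go sketch below parses a filename into pattern number, slug, and page kind; the regular expression, the `docFile` type, and `parseDocName` are hypothetical and are not part of this repository's tooling.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// docFile is the parsed form of a filename in this directory (hypothetical).
type docFile struct {
	Pattern     int    // NORTHSTARS Appendix A pattern number
	Slug        string // slug without the -walkthrough suffix
	Walkthrough bool   // true for NN-slug-walkthrough.md
}

// docNameRE matches a two-digit prefix, a lower-case hyphenated slug, and
// the .md extension. The exact slug alphabet is an assumption.
var docNameRE = regexp.MustCompile(`^(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$`)

// parseDocName splits a filename per the NN-slug[-walkthrough].md
// convention. It returns false for names that do not follow it.
func parseDocName(name string) (docFile, bool) {
	m := docNameRE.FindStringSubmatch(name)
	if m == nil {
		return docFile{}, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return docFile{}, false
	}
	slug := m[2]
	walk := strings.HasSuffix(slug, "-walkthrough")
	slug = strings.TrimSuffix(slug, "-walkthrough")
	return docFile{Pattern: n, Slug: slug, Walkthrough: walk}, true
}

func main() {
	for _, name := range []string{
		"02-ib-link-flap.md",
		"02-ib-link-flap-walkthrough.md",
		"README.md",
	} {
		d, ok := parseDocName(name)
		fmt.Printf("%s -> %+v (ok=%t)\n", name, d, ok)
	}
}
```

A check like this could run in CI to catch files that break the zero-padded prefix, since an unpadded name would sort out of pattern-number order.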

Format

Each pattern walkthrough has the same sections:

  1. Symptom - what the operator sees at the workload level.
  2. Why DCGM sees it - physical mechanism.
  3. Receiver-emitted signal - exact metric + attribute set.
  4. Query - PromQL / OTLP filter syntax.
  5. Alert - when to page.
  6. Escalation - who owns the fix.
  7. Replay - how to reproduce in test, citing the synthetic-sample shape in docs/integrations/prometheus-scrape.md (the pattern_replay_test.go fixture has been retired).