Operator-facing pages for each NORTHSTARS Appendix A root-cause pattern this receiver enables. Walkthrough = symptom → receiver-emitted-signal → query → alert → escalation.
The point: an operator hitting a Pattern-#N incident shouldn't have to read tracecore source to understand what metrics signal it. Each walkthrough is self-contained.
Input signals come from upstream-adopted receivers per the adoption
matrix in RFC-0013 §2. The pattern detection engine itself
(tracecore-components/processor/patterndetectorprocessor) stays
in-tree: it is the differentiator (the moat); the receivers
underneath are not.
The cross-vendor gpu.vendor resource attribute (nvidia | amd |
intel | habana) is normalized by an OTTL transform processor
in the bundled recipe; it is also an upstream-contribution target
(to be proposed to the OTel hw.* semconv).
Pattern outputs (M17 / M18 / M19) preserve their attribute names
(gpu.id, kernelevents.xid, gpu.vendor) and the 11-entry
k8s.event.hint event hint enum across the adoption swap. These are
customer-stable contracts - see RFC-0013 §3.
These pages assume:
- The bundled OTel recipe is deployed (Helm chart tracecore-recipes).
- The relevant vendor exporter is reachable on its scrape endpoint (e.g., dcgm-exporter running as a DaemonSet for NVIDIA hosts).
- An OTLP backend is receiving the metrics (Prometheus, Datadog, Honeycomb, Mimir all confirmed in the backend matrix).
Seven of NORTHSTARS Appendix A's 15 patterns have operator-facing walkthroughs today: the DCGM-observable subset (#1, #3, #4, #5) plus pattern #2 (InfiniBand link flap, fabric-observable), pattern #7 (dataloader hang, observable through the training-stall alert and k8s events), and pattern #14 (pod evicted, k8s-events-observable). The remaining patterns carry a design spec under this directory pending implementation, or sit in the "reserved / unfilled" table below where the pattern number is named in NORTHSTARS Appendix A but no detector has been written yet. New walkthroughs land alongside the detector that surfaces the pattern.
| Pattern | File | Signal |
|---|---|---|
| #1 NVLink silent degradation | 01-nvlink-degradation-walkthrough.md | per-link hw.gpu.nvlink.io Tx/Rx divergence |
| #2 InfiniBand link flap | 02-ib-link-flap-walkthrough.md | hw.network.ib.port.state ACTIVE↔DOWN transitions (>=2 in 2m window) joined to same-node NCCL FR stuck collective |
| #3 Uncorrectable HBM ECC | 03-hbm-ecc-walkthrough.md | hw.errors{error.type=uncorrected} non-zero |
| #4 Thermal throttling cascade | 04-thermal-throttle-walkthrough.md | hw.gpu.throttle.duration{reason=thermal} rate-of-change |
| #5 PCIe AER cascade | 05-pcie-aer-walkthrough.md | hw.gpu.io Tx/Rx counter discontinuities |
| #7 Dataloader hang | 07-dataloader-hang.md | tracecore.alert.training_step_stalled.* + dataloader.error_class OR k8s.event.reason{FailedMount\|VolumeMountFailure} |
| #14 Pod evicted / NodeNotReady | 14-pod-evicted-walkthrough.md | k8s.event.hint=pod_evicted joined to k8s.node.condition.{type,status} transition within join_window (default 30s) |
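
As an illustration of how a windowed signal in the table above can be read, the following is a minimal Python sketch of the pattern #2 threshold (at least 2 ACTIVE↔DOWN transitions inside a 2-minute window). It is not the detector's implementation: the function names, the `(timestamp, state)` sample shape, and the choice to anchor the window at `now` are assumptions made for the example. It covers only the port-state half of the signal and omits the join to the same-node NCCL FR stuck collective.

```python
from typing import List, Tuple

WINDOW_SECONDS = 120.0  # the 2m window named in the table
MIN_TRANSITIONS = 2     # the ">=2" threshold named in the table

# Hypothetical sample shape: (unix timestamp in seconds, port state string).
Sample = Tuple[float, str]


def count_transitions(samples: List[Sample]) -> int:
    """Count ACTIVE<->DOWN changes between consecutive samples (time-ordered)."""
    ordered = sorted(samples)
    count = 0
    for (_, prev_state), (_, cur_state) in zip(ordered, ordered[1:]):
        # A transition is any adjacent pair that holds both states.
        if {prev_state, cur_state} == {"ACTIVE", "DOWN"}:
            count += 1
    return count


def is_flapping(samples: List[Sample], now: float) -> bool:
    """True when the trailing window holds at least MIN_TRANSITIONS changes."""
    recent = [(ts, state) for ts, state in samples
              if now - WINDOW_SECONDS <= ts <= now]
    return count_transitions(recent) >= MIN_TRANSITIONS
```

A single ACTIVE→DOWN change counts as one transition, so a port that goes down and stays down does not satisfy the flap threshold in this sketch; it takes at least one recovery followed by another change of state.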
Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each follows a fixed shape: Symptom / Layers crossed / Signal sources / Detector evaluation rule / Verdict attributes / Edge cases / Tested-against / Open questions. Status ☐ planned until a detector lands; promotes to an operator walkthrough when the detector ships.
| Pattern | Spec | Status |
|---|---|---|
| #2 InfiniBand link flap | 02-ib-link-flap.md | ☑ shipped (operator walkthrough: 02-ib-link-flap-walkthrough.md) |
| #8 NCCL timeout, no hardware cause | 08-nccl-timeout-no-hw.md | ☐ planned |
| #9 NCCL bootstrap timeout | 09-nccl-bootstrap-timeout.md | ☑ shipped |
| #10 CUDA OOM, deceptive allocator | 10-cuda-oom-deceptive.md | ☐ planned (#303 filed) |
| #11 Checkpointer hang | 11-checkpointer-hang.md | ☑ shipped (detector + processor wiring) |
| #12 Loss spike → NaN | 12-loss-spike-nan.md | ☐ planned |
| #13 Silent data corruption | 13-silent-data-corruption.md | ☑ shipped |
Pattern numbers named in NORTHSTARS Appendix A that have neither a design spec nor an operator walkthrough in this directory. Listed explicitly so the numeric gaps in the tables above are documented, not silent.
| Pattern | Status | Rationale |
|---|---|---|
| #6 Stragglers from slow node (data_time 3× normal on one node) | ☐ unimplemented (milestone M18): no detector, no spec, no walkthrough yet | NORTHSTARS Appendix A names this pattern and M6 (v0) targets coverage. The chaos injector failure-inject cpu-steal lands the symptom (per M4b rubric), but the cross-rank data_time aggregation detector is build-time coupled to M17's cross_rank.go infra and is at risk per the NORTHSTARS-coupling note in M21 v0.1.0 release. NORTHSTARS Appendix A's "☑ in-tree detector" mark for this pattern is aspirational, not current state; it is to be reconciled when the detector lands. Doc will land alongside the detector; design-spec follow-up tracked at the M18 milestone. |
| #15 Image pull / FailedMount on restart | ☐ no spec yet (NORTHSTARS Appendix A entry; no detector planned for v1) | The k8s.event.hint 11-entry enum (RFC-0013 §3) already carries mount_failure and image_pull_failure hints, so the input signal exists. A detector would join repeated mount/pull failures on the same job's retry pods within a window; not yet scheduled. |
When a reserved pattern's detector lands, it promotes to the "Design specs" table (☐ planned → ☑ shipped) or directly to the "Operator walkthroughs (shipped)" table — same numeric ID, same filename convention.
Three v1 detectors use three different correlation-window shapes — chosen independently to match each pattern's physical event-ordering. Operators tuning windows hit these without warning today; this table is the cross-link.
| Pattern | Shape | Knob(s) | Why |
|---|---|---|---|
| #7 Dataloader hang | One-sided look-back | dataloader_hang_correlation_window | Discriminator (worker error, storage event, py-spy sample) must precede the stall (cause→symptom ordering). |
| #11 Checkpointer hang | Asymmetric (-30s, +60s) | checkpointer_hang_backward_window, checkpointer_hang_forward_window | Phase-record vs. stall-record race: torch may log "checkpoint begin" before OR after the sampler reports the stall. Two independently tunable legs. |
| #13 Silent data corruption | Symmetric, job-scoped | (no explicit window — job-bounded) | SDC counter rise and accuracy regression are concurrent symptoms of the same corruption, not cause→symptom. Order-free attribution. |
A unified asymmetric two-knob form would silently zero one leg for #7 and not apply to #13 — operators would tune knobs with no physical meaning for their pattern.
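
The three shapes can be compared side by side as predicates. The Python sketch below is an illustration under stated assumptions, not the detectors' code: the function names are hypothetical, timestamps are plain seconds, and the stall timestamp is assumed to be the anchor that the backward and forward legs are measured from.

```python
def one_sided_lookback(stall_ts: float, evidence_ts: float, window: float) -> bool:
    """#7 shape: evidence must precede (or coincide with) the stall, within window."""
    return 0 <= stall_ts - evidence_ts <= window


def asymmetric(stall_ts: float, phase_ts: float,
               backward: float = 30.0, forward: float = 60.0) -> bool:
    """#11 shape: the phase record may land before OR after the stall record.

    Defaults mirror the (-30s, +60s) legs from the table; anchoring them on
    the stall timestamp is an assumption of this sketch.
    """
    delta = phase_ts - stall_ts
    return -backward <= delta <= forward


def job_scoped(job_a: str, job_b: str) -> bool:
    """#13 shape: no time window at all; two signals correlate if the job matches."""
    return job_a == job_b
```

In this sketch, forcing pattern #7 into the two-knob form means calling `asymmetric(stall_ts, evidence_ts, backward=w, forward=0)`: the forward leg is pinned to zero and carries no meaning, and `job_scoped` takes no timestamps at all, which is the mismatch the paragraph above describes.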
The scripted-replay test fixture that previously lived at
components/receivers/dcgm/pattern_replay_test.go was retired with
the in-tree DCGM receiver in RFC-0013 PR-F.1. Pattern replay now
flows through the upstream dcgm-exporter recipe — see
docs/integrations/prometheus-scrape.md
for the recipe and the synthetic-sample shape that each pattern's
walkthrough below references.
Patterns are emergent - operators in the field find them. Contributions welcome via the standard PR flow; the format below should be preserved for consistency.
Every file in this directory uses a zero-padded numeric prefix matching the NORTHSTARS Appendix A pattern number, so lexsort and pattern-number ordering agree:
- NN-slug.md - engineering-facing design spec (the TDD red-test input). One per pattern. Lands first when a detector is spec'd before implementation; remains as the contract once the detector ships.
- NN-slug-walkthrough.md - operator-facing runbook (Symptom → Signal → Query → Alert → Escalation → Replay). Lands when the detector ships and an operator needs to triage a real incident. May coexist with the spec at NN-slug.md (they cross-reference).
Pattern #2 carries both today (02-ib-link-flap.md spec +
02-ib-link-flap-walkthrough.md runbook). New patterns follow the
same split when both audiences need a page.
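
The naming convention above can be checked mechanically. The following Python sketch is one possible way to do it; the regular expression and the `classify` helper are assumptions made for this example and do not exist in the repository.

```python
import re
from typing import Optional, Tuple

# Two-digit prefix, a lowercase slug, and an optional "-walkthrough" suffix.
NAME_RE = re.compile(
    r"^(?P<num>\d{2})-(?P<slug>[a-z0-9-]+?)(?P<walk>-walkthrough)?\.md$"
)


def classify(filename: str) -> Optional[Tuple[int, str, str]]:
    """Return (pattern number, slug, kind), or None if the name is off-convention."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    kind = "walkthrough" if match.group("walk") else "spec"
    return int(match.group("num")), match.group("slug"), kind
```

Because the prefix is zero-padded to two digits, sorting the filenames as strings gives the same order as sorting by the parsed pattern number, which is the property the convention is meant to guarantee.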
Each pattern walkthrough has the same sections:
- Symptom - what the operator sees at the workload level.
- Why DCGM sees it - physical mechanism.
- Receiver-emitted signal - exact metric + attribute set.
- Query - PromQL / OTLP filter syntax.
- Alert - when to page.
- Escalation - who owns the fix.
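- Replay - how to reproduce in test, citing the synthetic-sample shape in the dcgm-exporter recipe (docs/integrations/prometheus-scrape.md), which replaced the retired pattern_replay_test.go fixture.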