From 0bd23c8b754ad55728ae74d6e0cab5fdc63f9505 Mon Sep 17 00:00:00 2001
From: Tri Lam
Date: Wed, 3 Jun 2026 17:34:04 -0700
Subject: [PATCH] chore(docs): unify patterns/ naming to NN-slug[-walkthrough].md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patterns 1-5 carried `pattern-N-slug.md` while 7-13 used the
lexsort-stable `NN-slug.md` prefix. Pattern #2 had both an engineering
design spec (`02-ib-link-flap.md`) AND an operator walkthrough
(`pattern-2-ib-link-flap.md`): distinct doc types, not duplicates.

Unify everything under a single numeric-prefix scheme that preserves
the spec/walkthrough split as a filename suffix:

  - `NN-slug.md` = engineering design spec (TDD red-test input)
  - `NN-slug-walkthrough.md` = operator runbook (symptom -> escalation)

Renames (5):
  pattern-1-nvlink-degradation.md -> 01-nvlink-degradation-walkthrough.md
  pattern-2-ib-link-flap.md -> 02-ib-link-flap-walkthrough.md
  pattern-3-hbm-ecc.md -> 03-hbm-ecc-walkthrough.md
  pattern-4-thermal-throttle.md -> 04-thermal-throttle-walkthrough.md
  pattern-5-pcie-aer.md -> 05-pcie-aer-walkthrough.md

Inbound refs updated in 11 files (MILESTONES, NORTHSTARS,
prometheus-scrape, RFC-0014, M4b followup, README, 02-ib-link-flap
spec, 3 module/pkg/patterns Go files, thermal_throttle replay
manifest). README "Filename convention" section now documents the
NN-/NN-walkthrough split explicitly.

Pattern #2 keeps both files; they cross-reference each other (spec
links to walkthrough for operators; walkthrough links to spec for the
engineering contract). docs/patterns/README.md tables already
reflected this dual-doc reality; this change makes the filenames
honest about it.

`scripts/recipes-path-check{.sh,_test.sh}` retained: that gate lints
commit-subject references to a non-existent `recipes/pattern-N/`
*directory* (issue #427) and is unrelated to `docs/patterns/`
filenames.

Pre-change `make doc-check`: exit 0 (217 anchors + 1105 md links).
Post-change `make doc-check`: exit 0 (same counts; zero broken refs). Signed-off-by: Tri Lam --- docs/MILESTONES.md | 2 +- docs/NORTHSTARS.md | 8 ++--- docs/followups/M4b.md | 7 +++-- docs/integrations/prometheus-scrape.md | 10 +++--- ...d => 01-nvlink-degradation-walkthrough.md} | 0 ...flap.md => 02-ib-link-flap-walkthrough.md} | 0 docs/patterns/02-ib-link-flap.md | 2 +- ...3-hbm-ecc.md => 03-hbm-ecc-walkthrough.md} | 0 ....md => 04-thermal-throttle-walkthrough.md} | 0 ...pcie-aer.md => 05-pcie-aer-walkthrough.md} | 0 docs/patterns/README.md | 31 +++++++++++++++---- .../0014-metrics-to-logs-pattern-input.md | 2 +- module/pkg/patterns/hbm_ecc.go | 4 +-- module/pkg/patterns/pcie_aer.go | 4 +-- module/pkg/patterns/thermal_throttle.go | 2 +- .../thermal_throttle/canonical/manifest.json | 2 +- 16 files changed, 47 insertions(+), 27 deletions(-) rename docs/patterns/{pattern-1-nvlink-degradation.md => 01-nvlink-degradation-walkthrough.md} (100%) rename docs/patterns/{pattern-2-ib-link-flap.md => 02-ib-link-flap-walkthrough.md} (100%) rename docs/patterns/{pattern-3-hbm-ecc.md => 03-hbm-ecc-walkthrough.md} (100%) rename docs/patterns/{pattern-4-thermal-throttle.md => 04-thermal-throttle-walkthrough.md} (100%) rename docs/patterns/{pattern-5-pcie-aer.md => 05-pcie-aer-walkthrough.md} (100%) diff --git a/docs/MILESTONES.md b/docs/MILESTONES.md index ab10363a..680356f7 100644 --- a/docs/MILESTONES.md +++ b/docs/MILESTONES.md @@ -315,7 +315,7 @@ Lane 6 covers NVIDIA-side device telemetry (DCGM), NCCL collective diagnostics ( - **Status:** ☐ - **Depends on:** M8 cgo, M9, M11, M4b -- **Reference:** [`docs/patterns/pattern-1-nvlink-degradation.md`](patterns/pattern-1-nvlink-degradation.md) +- **Reference:** [`docs/patterns/01-nvlink-degradation-walkthrough.md`](patterns/01-nvlink-degradation-walkthrough.md) - **Hardware:** replay corpus: none. Integration variants exercising real DCGM: Linux + NVIDIA GPU (flood-gated). 
**Rubric summary:** Replay fixture `internal/synthesis/replay/nvlink_silent/fixture_xid79_nccl_hang/` (synthetic Xid 79 dmesg + NCCL FR hang pickle across 8 ranks) → exactly one `pattern.id="1"` verdict within 60s. **Cross-rank join export:** M17 creates `pkg/nccl/fr_parser/cross_rank.go` exporting `JoinByCollectiveSeq(perRank [][]Record) StepTree` as public API (M18 build-time consumer); tolerates `N-1` missing ranks (`state=unknown_hung`). Evidence trail (a) Xid 79 / Xid 74 [Hopper/Ampere] kernel event, (b) DCGM `hw.gpu.nvlink.io` 50%-of-median divergence, (c) NCCL FR `state=started, !completed`; missing layer → `confidence=partial`. Structured-query lint (no raw PromQL). Hermetic replay (no network/GPU); `//go:build integration` for real DCGM variants. Headline regex `^NVLink degradation on rank \d+ at step \d+$`. Negative + adjacent-pattern fixtures (#2/#6/#8) zero-verdict. NFR: zero FP on healthy + adjacent corpora; 64-rank join ≤100ms p99 *(unverified)*; NCCL 2.29.x vs 2.30.x golden-identical; component+rank+step+missing-layer in every return path. diff --git a/docs/NORTHSTARS.md b/docs/NORTHSTARS.md index bb4a7d03..679706df 100644 --- a/docs/NORTHSTARS.md +++ b/docs/NORTHSTARS.md @@ -407,11 +407,11 @@ The pattern set tracecore is built to root-cause end-to-end. 
Coverage of this li | # | Pattern | Symptom | Layers crossed | Spec | |---|---|---|---|---| -| 1 | NVLink silent degradation | Xid 79/74; NCCL hangs on AllReduce | DCGM + Xid + NVLink fabric + NCCL | ☑ [walkthrough](patterns/pattern-1-nvlink-degradation.md) | +| 1 | NVLink silent degradation | Xid 79/74; NCCL hangs on AllReduce | DCGM + Xid + NVLink fabric + NCCL | ☑ [walkthrough](patterns/01-nvlink-degradation-walkthrough.md) | | 2 | InfiniBand link flap | NCCL "HCA error" or sudden bandwidth collapse | IB/RDMA + NCCL | ☐ [spec](patterns/02-ib-link-flap.md) | -| 3 | Uncorrectable HBM ECC | Xid 48/63/64; single GPU dies | DCGM + dmesg | ☑ [walkthrough](patterns/pattern-3-hbm-ecc.md) | -| 4 | Thermal throttling cascade | Rack throttles; GPUs become stragglers | DCGM + power telemetry + stragglers | ☑ [walkthrough](patterns/pattern-4-thermal-throttle.md) | -| 5 | PCIe AER cascade | Correctable PCIe errors; no crash | PCIe/BMC + DCGM | ☑ [walkthrough](patterns/pattern-5-pcie-aer.md) | +| 3 | Uncorrectable HBM ECC | Xid 48/63/64; single GPU dies | DCGM + dmesg | ☑ [walkthrough](patterns/03-hbm-ecc-walkthrough.md) | +| 4 | Thermal throttling cascade | Rack throttles; GPUs become stragglers | DCGM + power telemetry + stragglers | ☑ [walkthrough](patterns/04-thermal-throttle-walkthrough.md) | +| 5 | PCIe AER cascade | Correctable PCIe errors; no crash | PCIe/BMC + DCGM | ☑ [walkthrough](patterns/05-pcie-aer-walkthrough.md) | | 6 | Stragglers from slow node | `data_time` 3× normal on one node | CPU/DRAM + dataloader I/O + stragglers | ☑ in-tree detector | | 7 | Dataloader hang | Worker death, FUSE stall, S3 throttling | Dataloader + storage + Python runtime | ☐ [spec](patterns/07-dataloader-hang.md) | | 8 | NCCL timeout, no hardware cause | All-reduce hangs without HW signal | NCCL + distributed framework | ☐ [spec](patterns/08-nccl-timeout-no-hw.md) | diff --git a/docs/followups/M4b.md b/docs/followups/M4b.md index a0ab8278..fcbc226c 100644 --- a/docs/followups/M4b.md +++ 
b/docs/followups/M4b.md @@ -87,10 +87,11 @@ implementation gated on an upstream package that has not landed yet. cover `failure-inject` at all. Fix: extend the flag help text + add a RUNBOOK section. *Trigger:* first operator script that depends on the exit code, OR M21 docs sweep. -- **`docs/patterns/pattern-14-pod-evicted.md` missing.** Patterns +- **`docs/patterns/14-pod-evicted-walkthrough.md` missing.** Patterns #1, #3, #4, #5 each carry a research-note companion under - `docs/patterns/`. M19 inverted the order (detector first) and - the companion was not tracked. *Trigger:* first M17/M18 + `docs/patterns/` (per the `NN-slug-walkthrough.md` convention in + `docs/patterns/README.md`). M19 inverted the order (detector first) + and the companion was not tracked. *Trigger:* first M17/M18 contributor notices the asymmetry, OR M21 docs sweep. - **Verdict JSON Schema path pattern-#14-specific at package root.** `module/pkg/patterns/testdata/verdict.schema.json` hard- diff --git a/docs/integrations/prometheus-scrape.md b/docs/integrations/prometheus-scrape.md index 3cc971ca..762a787e 100644 --- a/docs/integrations/prometheus-scrape.md +++ b/docs/integrations/prometheus-scrape.md @@ -296,7 +296,7 @@ The link index lift uses `Int(ExtractPatterns(metric.name, so the resulting attribute is integer-typed (matches the semconv proposal's `hw.gpu.nvlink.link: int`). Per-link decomposition is the diagnostic-critical surface for -[pattern #1 silent NVLink degradation](../patterns/pattern-1-nvlink-degradation.md); +[pattern #1 silent NVLink degradation](../patterns/01-nvlink-degradation-walkthrough.md); without it the alert query has no group-by axis. > **dcgm-exporter opt-in required.** The @@ -325,7 +325,7 @@ references. The attribute names match the semconv `hw.errors` shape (see [hw common](https://opentelemetry.io/docs/specs/semconv/hardware/common/)). 
-[Pattern #3 doc](../patterns/pattern-3-hbm-ecc.md) consumes the +[Pattern #3 doc](../patterns/03-hbm-ecc-walkthrough.md) consumes the `error.persistence=volatile` row in its alert query. ### Pattern #4 — Thermal throttle cascade @@ -352,7 +352,7 @@ resolves the vocabulary. Tracked at [#272](https://github.com/TraceCoreAI/tracecore/issues/272) for the upstream proposal extension. -[Pattern #4 doc](../patterns/pattern-4-thermal-throttle.md) alerts +[Pattern #4 doc](../patterns/04-thermal-throttle-walkthrough.md) alerts on the `reason=thermal` row; the other reasons are diagnostic context (`power` correlates with PSU sag, `hw_slowdown` is the "GPU has decided to clock itself down" hard signal). @@ -372,7 +372,7 @@ GPU set. The `pci_bus_id` label is lifted to the resource-level `hw.gpu.pci.bdf` so pattern #5's escalation matrix can cross- reference dmesg `PCIe Bus Error: Corrected` lines against the -same BDF without joining series. [Pattern #5 doc](../patterns/pattern-5-pcie-aer.md) +same BDF without joining series. [Pattern #5 doc](../patterns/05-pcie-aer-walkthrough.md) shows the divergence query in PromQL form. ### Pattern #10 — CUDA OOM (framebuffer) @@ -446,7 +446,7 @@ statement evaluates. Renaming first would short-circuit the attribute stamps because the second statement's guard would no longer find the original name. 
-[Pattern #2 doc](../patterns/pattern-2-ib-link-flap.md) consumes the +[Pattern #2 doc](../patterns/02-ib-link-flap-walkthrough.md) consumes the joined record via [`projectIBPortStateRecord`](../../module/processor/patterndetectorprocessor/ib_link_flap.go) (gate: `hw.network.ib.port.state` AND `hw.network.ib.device` AND diff --git a/docs/patterns/pattern-1-nvlink-degradation.md b/docs/patterns/01-nvlink-degradation-walkthrough.md similarity index 100% rename from docs/patterns/pattern-1-nvlink-degradation.md rename to docs/patterns/01-nvlink-degradation-walkthrough.md diff --git a/docs/patterns/pattern-2-ib-link-flap.md b/docs/patterns/02-ib-link-flap-walkthrough.md similarity index 100% rename from docs/patterns/pattern-2-ib-link-flap.md rename to docs/patterns/02-ib-link-flap-walkthrough.md diff --git a/docs/patterns/02-ib-link-flap.md b/docs/patterns/02-ib-link-flap.md index 4aa96f61..883fb5c2 100644 --- a/docs/patterns/02-ib-link-flap.md +++ b/docs/patterns/02-ib-link-flap.md @@ -1,6 +1,6 @@ # Pattern #2 — InfiniBand link flap -**Status:** ☑ shipped — detector library at `module/pkg/patterns/ib_link_flap.go`, processor wiring at `module/processor/patterndetectorprocessor/ib_link_flap.go`. Operator-facing walkthrough at [pattern-2-ib-link-flap.md](pattern-2-ib-link-flap.md). +**Status:** ☑ shipped — detector library at `module/pkg/patterns/ib_link_flap.go`, processor wiring at `module/processor/patterndetectorprocessor/ib_link_flap.go`. Operator-facing walkthrough at [02-ib-link-flap-walkthrough.md](02-ib-link-flap-walkthrough.md). Design spec for the pattern-#2 detector. Distinct from the operator-facing walkthroughs (`pattern-N-*.md`) — this is the engineering contract that a TDD red test gets written against. 
diff --git a/docs/patterns/pattern-3-hbm-ecc.md b/docs/patterns/03-hbm-ecc-walkthrough.md similarity index 100% rename from docs/patterns/pattern-3-hbm-ecc.md rename to docs/patterns/03-hbm-ecc-walkthrough.md diff --git a/docs/patterns/pattern-4-thermal-throttle.md b/docs/patterns/04-thermal-throttle-walkthrough.md similarity index 100% rename from docs/patterns/pattern-4-thermal-throttle.md rename to docs/patterns/04-thermal-throttle-walkthrough.md diff --git a/docs/patterns/pattern-5-pcie-aer.md b/docs/patterns/05-pcie-aer-walkthrough.md similarity index 100% rename from docs/patterns/pattern-5-pcie-aer.md rename to docs/patterns/05-pcie-aer-walkthrough.md diff --git a/docs/patterns/README.md b/docs/patterns/README.md index f52c29c0..1bb3e507 100644 --- a/docs/patterns/README.md +++ b/docs/patterns/README.md @@ -41,11 +41,11 @@ Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15 | Pattern | File | DCGM signal | |---|---|---| -| #1 NVLink silent degradation | [pattern-1-nvlink-degradation.md](pattern-1-nvlink-degradation.md) | per-link `hw.gpu.nvlink.io` Tx/Rx divergence | -| #2 InfiniBand link flap | [pattern-2-ib-link-flap.md](pattern-2-ib-link-flap.md) | `hw.network.ib.port.state` ACTIVE↔DOWN transitions (>=2 in 2m window) joined to same-node NCCL FR stuck collective | -| #3 Uncorrectable HBM ECC | [pattern-3-hbm-ecc.md](pattern-3-hbm-ecc.md) | `hw.errors{error.type=uncorrected}` non-zero | -| #4 Thermal throttling cascade | [pattern-4-thermal-throttle.md](pattern-4-thermal-throttle.md) | `hw.gpu.throttle.duration{reason=thermal}` rate-of-change | -| #5 PCIe AER cascade | [pattern-5-pcie-aer.md](pattern-5-pcie-aer.md) | `hw.gpu.io` Tx/Rx counter discontinuities | +| #1 NVLink silent degradation | [01-nvlink-degradation-walkthrough.md](01-nvlink-degradation-walkthrough.md) | per-link `hw.gpu.nvlink.io` Tx/Rx divergence | +| #2 InfiniBand link flap | [02-ib-link-flap-walkthrough.md](02-ib-link-flap-walkthrough.md) | 
`hw.network.ib.port.state` ACTIVE↔DOWN transitions (>=2 in 2m window) joined to same-node NCCL FR stuck collective | +| #3 Uncorrectable HBM ECC | [03-hbm-ecc-walkthrough.md](03-hbm-ecc-walkthrough.md) | `hw.errors{error.type=uncorrected}` non-zero | +| #4 Thermal throttling cascade | [04-thermal-throttle-walkthrough.md](04-thermal-throttle-walkthrough.md) | `hw.gpu.throttle.duration{reason=thermal}` rate-of-change | +| #5 PCIe AER cascade | [05-pcie-aer-walkthrough.md](05-pcie-aer-walkthrough.md) | `hw.gpu.io` Tx/Rx counter discontinuities | | #7 Dataloader hang | [07-dataloader-hang.md](07-dataloader-hang.md) | `tracecore.alert.training_step_stalled.*` + `dataloader.error_class` OR `k8s.event.reason{FailedMount\|VolumeMountFailure}` | ## Design specs (planned detectors — TDD red-test inputs) @@ -54,7 +54,7 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each fol | Pattern | Spec | Status | |---|---|---| -| #2 InfiniBand link flap | [02-ib-link-flap.md](02-ib-link-flap.md) | ☑ shipped (operator walkthrough: [pattern-2-ib-link-flap.md](pattern-2-ib-link-flap.md)) | +| #2 InfiniBand link flap | [02-ib-link-flap.md](02-ib-link-flap.md) | ☑ shipped (operator walkthrough: [02-ib-link-flap-walkthrough.md](02-ib-link-flap-walkthrough.md)) | | #8 NCCL timeout, no hardware cause | [08-nccl-timeout-no-hw.md](08-nccl-timeout-no-hw.md) | ☐ planned | | #9 NCCL bootstrap timeout | [09-nccl-bootstrap-timeout.md](09-nccl-bootstrap-timeout.md) | ☑ shipped | | #10 CUDA OOM, deceptive allocator | [10-cuda-oom-deceptive.md](10-cuda-oom-deceptive.md) | ☐ planned ([#303](https://github.com/TraceCoreAI/tracecore/issues/303) filed) | @@ -90,6 +90,25 @@ Patterns are emergent - operators in the field find them. Contributions welcome via the standard PR flow; the format below should be preserved for consistency. 
+## Filename convention + +Every file in this directory uses a zero-padded numeric prefix +matching the NORTHSTARS Appendix A pattern number, so lexsort and +pattern-number ordering agree: + +- **`NN-slug.md`** — engineering-facing design spec (the TDD + red-test input). One per pattern. Lands first when a detector is + spec'd before implementation; remains as the contract once the + detector ships. +- **`NN-slug-walkthrough.md`** — operator-facing runbook (Symptom → + Signal → Query → Alert → Escalation → Replay). Lands when the + detector ships and an operator needs to triage a real incident. + May coexist with the spec at `NN-slug.md` (they cross-reference). + +Pattern #2 carries both today (`02-ib-link-flap.md` spec + +`02-ib-link-flap-walkthrough.md` runbook). New patterns follow the +same split when both audiences need a page. + ## Format Each pattern walkthrough has the same sections: diff --git a/docs/rfcs/0014-metrics-to-logs-pattern-input.md b/docs/rfcs/0014-metrics-to-logs-pattern-input.md index 13e63afd..89dfb369 100644 --- a/docs/rfcs/0014-metrics-to-logs-pattern-input.md +++ b/docs/rfcs/0014-metrics-to-logs-pattern-input.md @@ -102,7 +102,7 @@ No operator-visible change at this RFC. PR-A's OTTL transform (already shipped) - Issue [#260](https://github.com/tracecoreai/tracecore/issues/260) — recipe extension + metrics-side detector plumbing. - [`docs/rfcs/0013-distro-first-pivot.md`](0013-distro-first-pivot.md) §3 (customer-stable contracts), §5 (upstream contribution policy), §6 (the four in-house moat scopes). -- [`docs/patterns/pattern-1-nvlink-degradation.md`](../patterns/pattern-1-nvlink-degradation.md) — the first metrics-sourced pattern unblocked by this decision. +- [`docs/patterns/01-nvlink-degradation-walkthrough.md`](../patterns/01-nvlink-degradation-walkthrough.md) — the first metrics-sourced pattern unblocked by this decision. 
- [`docs/proposals/semconv-hw-gpu-extensions.md`](../proposals/semconv-hw-gpu-extensions.md) §3 — the `hw.gpu.nvlink.io` shape PR-A's OTTL transform emits. - [transformprocessor README @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config) — the upstream signal-context table cited above. - [opentelemetry-collector-contrib connector tree @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector) — surveyed for any metrics→logs primitive. diff --git a/module/pkg/patterns/hbm_ecc.go b/module/pkg/patterns/hbm_ecc.go index 033a9172..f4951ffd 100644 --- a/module/pkg/patterns/hbm_ecc.go +++ b/module/pkg/patterns/hbm_ecc.go @@ -13,7 +13,7 @@ import ( // DefaultHBMECCWindow is the maximum gap between a DCGM uncorrected // double-bit volatile ECC counter rise and an uncorrectable Xid // kernel event on the same GPU for the detector to join them. 5min -// mirrors the alert window in docs/patterns/pattern-3-hbm-ecc.md +// mirrors the alert window in docs/patterns/03-hbm-ecc-walkthrough.md // (PromQL `increase(...[5m]) > 0`) and the typical operator-facing // "is this happening now" window. Operators on long scrape paths // (30s+ scrape interval) raise this via @@ -42,7 +42,7 @@ var uncorrectableHBMXidCodes = map[int]struct{}{ // double-bit volatile ECC counter delta the detector consumes. The // patterndetectorprocessor builds these from log records derived // (via OTTL transform on the metrics→logs path) from the customer- -// stable `hw.errors` Counter (per the docs/patterns/pattern-3-hbm-ecc.md +// stable `hw.errors` Counter (per the docs/patterns/03-hbm-ecc-walkthrough.md // receiver-emitted signal). Detectors read HBMECCRecord values // directly — no plog grep — so a schema rename surfaces as a // compile error. 
diff --git a/module/pkg/patterns/pcie_aer.go b/module/pkg/patterns/pcie_aer.go index d99daf6a..ebb06302 100644 --- a/module/pkg/patterns/pcie_aer.go +++ b/module/pkg/patterns/pcie_aer.go @@ -13,7 +13,7 @@ import ( // DefaultPCIeAERWindow is the maximum gap between a PCIe AER kernel // message and a same-BDF hw.gpu.io rate-collapse sample for the // detector to join them. 5min mirrors the rate window in -// docs/patterns/pattern-5-pcie-aer.md (PromQL `rate(...[5m])`) and +// docs/patterns/05-pcie-aer-walkthrough.md (PromQL `rate(...[5m])`) and // the typical post-AER Tx/Rx renegotiation latency — the link // re-trains to a lower Gen/width within tens of seconds, and the // next scrape lands on the new rate within one scrape interval. @@ -70,7 +70,7 @@ type PCIeAERRecord struct { // rate sample. The patterndetectorprocessor builds these from log // records derived (via OTTL transform on the metrics→logs path) // from the customer-stable hw.gpu.io Counter (per the -// docs/patterns/pattern-5-pcie-aer.md receiver-emitted signal). The +// docs/patterns/05-pcie-aer-walkthrough.md receiver-emitted signal). The // ADR-0001 PR-B metrics-path consumer pattern shipped for cuda_oom // (#10) via issue #437; the PCIe AER metrics-path consumer is a // pending sibling follow-up under RFC-0014. BaselineBytesPerSecond diff --git a/module/pkg/patterns/thermal_throttle.go b/module/pkg/patterns/thermal_throttle.go index 34e02469..f43400a2 100644 --- a/module/pkg/patterns/thermal_throttle.go +++ b/module/pkg/patterns/thermal_throttle.go @@ -14,7 +14,7 @@ import ( // DefaultThermalThrottleWindow is the rolling time window over which // per-GPU thermal throttle deltas are summed before applying the // cascade predicate. 5min mirrors the spec's PromQL `[5m]` rate -// window in docs/patterns/pattern-4-thermal-throttle.md and the +// window in docs/patterns/04-thermal-throttle-walkthrough.md and the // typical operator-facing "is this happening now" horizon. 
Operators // on long scrape paths (30s+ scrape interval) raise this via // ThermalThrottleDetector.Window. diff --git a/module/pkg/replay/thermal_throttle/canonical/manifest.json b/module/pkg/replay/thermal_throttle/canonical/manifest.json index f1a55629..2851cbb4 100644 --- a/module/pkg/replay/thermal_throttle/canonical/manifest.json +++ b/module/pkg/replay/thermal_throttle/canonical/manifest.json @@ -1,6 +1,6 @@ { "pattern_id": "4", "fixture_name": "canonical_4gpu_cascade", - "description": "Four GPUs on gpu-node-0001 each accumulate >=30s of thermal throttle duration within the 5-min rolling window. Pattern #4 canonical positive case — the half-rack cascade described in docs/patterns/pattern-4-thermal-throttle.md escalation step 2. Headline must name thermal throttling and the node; remediation must reference airflow / HVAC / cooling / facilities.", + "description": "Four GPUs on gpu-node-0001 each accumulate >=30s of thermal throttle duration within the 5-min rolling window. Pattern #4 canonical positive case — the half-rack cascade described in docs/patterns/04-thermal-throttle-walkthrough.md escalation step 2. Headline must name thermal throttling and the node; remediation must reference airflow / HVAC / cooling / facilities.", "expected_timing": "4 GPU records spaced 10s apart, each ThrottleDelta >= 30s; one verdict naming the node and 4 GPUIDs" }
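
Reviewer note (illustrative only, not part of this patch): the `NN-slug.md` / `NN-slug-walkthrough.md` split documented in the README's "Filename convention" section could be checked mechanically. The sketch below is a minimal Go example under assumptions: the `classify` helper, the `patternDocName` regex, and the restriction of slugs to lowercase letters, digits, and hyphens are hypothetical and do not exist in the repository.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// patternDocName is a hypothetical rule for files under docs/patterns/:
// a two-digit prefix, a lowercase hyphenated slug, and a ".md" extension.
// The allowed slug characters are an assumption, not taken from the patch.
var patternDocName = regexp.MustCompile(`^\d{2}-[a-z0-9]+(-[a-z0-9]+)*\.md$`)

// classify returns "walkthrough" for NN-slug-walkthrough.md and "spec" for
// NN-slug.md. ok is false when the name does not follow the convention,
// for example a legacy pattern-N-slug.md name.
func classify(name string) (kind string, ok bool) {
	if !patternDocName.MatchString(name) {
		return "", false
	}
	if strings.HasSuffix(name, "-walkthrough.md") {
		return "walkthrough", true
	}
	return "spec", true
}

func main() {
	names := []string{
		"02-ib-link-flap.md",
		"02-ib-link-flap-walkthrough.md",
		"pattern-2-ib-link-flap.md",
	}
	for _, n := range names {
		kind, ok := classify(n)
		fmt.Printf("%s: kind=%q ok=%v\n", n, kind, ok)
	}
}
```

A caller walking `docs/patterns/` would need to skip `README.md` itself, since it intentionally carries no numeric prefix.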