diff --git a/docs/patterns/14-pod-evicted-walkthrough.md b/docs/patterns/14-pod-evicted-walkthrough.md new file mode 100644 index 00000000..c4e2bebb --- /dev/null +++ b/docs/patterns/14-pod-evicted-walkthrough.md @@ -0,0 +1,257 @@ +# Pattern #14 — Pod evicted / `NodeNotReady` + +A training rank disappears mid-step because the kubelet evicted its +pod after the node entered a resource-pressure condition (disk, +memory, or PID). The remaining ranks block on the next collective +and the operator sees "NCCL hang" — but the real cause is one node +running out of headroom, several seconds upstream. Tracecore catches +the eviction Event, joins it back to the node-pressure transition +that triggered it, and emits a one-line verdict naming the pod, the +node, and the pressure root. + +**Input:** Kubernetes events scraped by the upstream +[`k8sobjectsreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sobjectsreceiver) +in `watch` mode against the `events` resource, normalized by an OTTL +`transform` stanza onto the customer-stable 11-entry +`k8s.event.hint` enum (`pod_evicted`, `mount_failure`, `backoff`, +`oom_killed`, `node_unhealthy`, …). Recipe at +[`docs/integrations/k8sobjects-events.md`](../integrations/k8sobjects-events.md). + +## Symptom + +- A multi-rank training job loses one rank without an obvious GPU, + fabric, or framework error; the workload either fails on the next + AllReduce or `kubectl get pods -l job-name=` shows one pod in + `Failed` / `Evicted` state. +- `kubectl describe pod ` reports + `Status: Failed`, `Reason: Evicted`, with a free-form note such as + `The node was low on resource: ephemeral-storage.` or + `…resource: memory.` or `…resource: pids.` +- `kubectl describe node ` shows the matching node-status + condition transitioned to `True` in the same window — + `DiskPressure`, `MemoryPressure`, or `PIDPressure`. +- DCGM and Xid streams on the node are clean. 
The node itself stayed + up: it was the *pod* that got evicted, not the node that crashed. +- Evictions often cluster: one node hits disk pressure, kubelet evicts the + largest consumer (training pods are usually it), the freed pod + reschedules elsewhere, and the job retries, but downstream cohort + ranks have already errored on the missing peer. + +## Why `k8sobjectsreceiver` sees it + +The kubelet records a structured `Event` object +(`events.k8s.io/v1`) through the API server every time its +eviction manager evicts a pod. The Event carries `Reason: Evicted` plus a `note` +field naming the pressure root in free-form English (the kubelet +templates it). Independently, the kubelet's node-status update sets the +matching `NodeCondition` (`DiskPressure=True`, `MemoryPressure=True`, +or `PIDPressure=True`) in the Node object's `.status.conditions` +array within a few seconds of detection. + +`k8sobjectsreceiver` in `mode: watch` against the `events` resource +emits one log record per Event the API server publishes, with no +client-side dedup and no polling latency past the watch-stream's +~1s heartbeat. The bundled OTTL recipe normalizes the +`reason: "Evicted"` Event into `k8s.event.hint: "pod_evicted"`, +which the pattern detector matches structurally against the typed +`Record` model (no string-grep against the raw Event JSON). + +The same receiver, configured separately against the `nodes` +resource, streams `NodeCondition` transitions, which form the detector's +second evidence layer. The join window between the two is +operator-tunable (default 30s, matching kubelet's typical eviction-reaction +latency). + +## Receiver-emitted signal + +Two log streams, both flowing from `k8sobjectsreceiver` → +OTTL `transform` → `patterndetectorprocessor`: + +**Stream 1: Pod-Evicted Event** (per-eviction log record): + +| Attribute | Type | Value | +|---|---|---| +| `k8s.event.reason` | string | `Evicted` (upstream Event Reason). | +| `k8s.event.hint` | string | `pod_evicted` (OTTL-normalized). 
| +| `k8s.event.uid` | string | API-server-assigned Event UID. | +| `k8s.pod.name` | string | Evicted pod's `metadata.name`. | +| `k8s.pod.namespace` | string | Evicted pod's `metadata.namespace`. | +| `k8s.node.name` | string | Kubelet's `reporting_instance` — the + node the pod was evicted from. | +| (body) | string | The kubelet's free-form note, e.g. `"The node was + low on resource: ephemeral-storage."` — the detector parses this + to derive pressure root for the headline. | + +**Stream 2 — NodeCondition transition** (per-transition log record): + +| Attribute | Type | Value | +|---|---|---| +| `k8s.node.name` | string | Node carrying the transitioning + condition. | +| `k8s.node.condition.type` | string | `DiskPressure` \| + `MemoryPressure` \| `PIDPressure`. | +| `k8s.node.condition.status` | string | `True` (transition into + pressure) — `False` transitions don't trigger the detector. | +| (timestamp) | nanos | `lastTransitionTime` from the condition. | + +The 11-entry `k8s.event.hint` enum is a **customer-stable contract** +(RFC-0013 §3). Operators writing dashboards may filter on +`hint = "pod_evicted"` and that filter survives every upstream +schema change to the underlying Event shape — the OTTL recipe +absorbs the churn. 
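The join between the two streams can be sketched as follows. This is a minimal Python illustration, not the shipped Go detector: the class names, `pressure_root`, `join_eviction`, the note-to-root mapping, and the use of an absolute time gap are all assumptions made for the example. Only the 30s default window, the `full` / `partial` confidence values, and the "most recent matching transition" rule come from the walkthrough.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

# Hypothetical mapping from the resource named in the kubelet note to the
# headline pressure root, and from the root to the node condition type.
_ROOT_BY_RESOURCE = {"ephemeral-storage": "disk", "memory": "memory", "pids": "pid"}
_CONDITION_BY_ROOT = {"disk": "DiskPressure", "memory": "MemoryPressure", "pid": "PIDPressure"}
_NOTE_RE = re.compile(r"low on resource:\s*([A-Za-z-]+)")


@dataclass(frozen=True)
class EvictionEvent:  # Stream 1 record (illustrative shape)
    pod: str
    namespace: str
    node: str
    note: str
    event_time: datetime


@dataclass(frozen=True)
class NodeConditionTransition:  # Stream 2 record (illustrative shape)
    node: str
    condition_type: str
    status: str
    last_transition_time: datetime


def pressure_root(note: str) -> Optional[str]:
    """Derive disk / memory / pid from the kubelet's free-form note."""
    match = _NOTE_RE.search(note)
    return _ROOT_BY_RESOURCE.get(match.group(1)) if match else None


def join_eviction(
    event: EvictionEvent,
    transitions: Sequence[NodeConditionTransition],
    join_window: timedelta = timedelta(seconds=30),
) -> dict:
    """Join one eviction to the most recent matching transition in the window."""
    root = pressure_root(event.note)
    # If the note could not be parsed, accept any of the three pressure types.
    wanted = {_CONDITION_BY_ROOT[root]} if root else set(_CONDITION_BY_ROOT.values())
    candidates = [
        t for t in transitions
        if t.node == event.node
        and t.status == "True"  # only transitions into pressure count
        and t.condition_type in wanted
        # Assumption: the gap is measured as an absolute difference.
        and abs(event.event_time - t.last_transition_time) <= join_window
    ]
    if not candidates:
        return {"confidence": "partial", "root": root, "condition": None}
    latest = max(candidates, key=lambda t: t.last_transition_time)
    return {"confidence": "full", "root": root, "condition": latest.condition_type}
```

Calling `join_eviction` with an eviction and an empty transition list yields a `partial` result, which corresponds to the case where the eviction Event is observed but the node-pressure cause is not.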
+ +## Query (OTLP filter, post-OTTL) + +For operators dashboarding directly off the log stream while the +detector verdict wiring lands, filter for evicted-pod events and +group by `(k8s.node.name, pressure_root)`: + +``` +# Count of pod-evicted events in the last 5 minutes, +# grouped by node and (operator-parsed) pressure root: +sum by (k8s.node.name, pressure_root) ( + count_over_time({k8s.event.hint="pod_evicted"}[5m]) +) + +# Same window, NodeCondition transitions into pressure on any node: +sum by (k8s.node.name, k8s.node.condition.type) ( + count_over_time({ + k8s.node.condition.status="True", + k8s.node.condition.type=~"DiskPressure|MemoryPressure|PIDPressure" + }[5m]) +) +``` + +The `5m` window is sized to span both kinds of kubelet eviction. +Hard thresholds act with no grace period, typically within tens of +seconds of the signal crossing the threshold; soft thresholds first +wait out an operator-configured grace period. A 5-minute look-back +therefore normally shows both legs of a clustered eviction storm; +widen it on clusters whose soft grace periods run longer. + +## Alert + +```yaml +- alert: PodEvictedOnTrainingNode + expr: | + sum by (k8s.node.name) ( + count_over_time({k8s.event.hint="pod_evicted", + k8s.pod.namespace="training"}[5m]) + ) >= 1 + for: 0s + labels: + severity: critical + annotations: + summary: "Training pod evicted on {{ $labels.k8s_node_name }}" + description: | + One or more pods in the `training` namespace were evicted from + {{ $labels.k8s_node_name }} in the last 5 minutes. The + patterndetector processor will emit a confidence=full + pod_evicted verdict when the matching NodeCondition transition + joins within the configured `join_window` (default 30s). +``` + +`for: 0s` is deliberate: an eviction is irreversible, so every occurrence is +alert-worthy. Tune the namespace selector to the training workloads' real +namespace; the default install assumes `training`. + +## Escalation + +1. **Capture the evidence before reschedule wipes it.** `kubectl get + events -n --field-selector reason=Evicted -o json`; + the kubelet's note field names the pressure root verbatim. +2. 
**Identify the pressure root on the node.** + `kubectl describe node ` for the condition transitions, plus + on-node `df -h /var/lib/kubelet` (disk), `free -h` (memory), or + `ps -eLf | wc -l` vs. `cat /proc/sys/kernel/pid_max` (pid). +3. **Disk pressure.** Most common in training: dataset shards, log + tarballs, or `core` dumps filling the kubelet root. Relocate the + training write path to NVMe / dedicated PVC, tighten the kubelet + `--eviction-hard nodefs.available` threshold, or scale the node's + `imagefs` mount. +4. **Memory pressure.** Reduce the per-rank host memory budget + (node `MemoryPressure` tracks host memory, not GPU memory), + raise the pod's `resources.requests.memory` so the scheduler + places it elsewhere, or evict noisy-neighbor pods proactively. +5. **PID pressure.** Cap the workload's fork rate (PyTorch + `DataLoader num_workers` is the usual culprit), or raise + `kernel.pid_max` on the host. +6. **Recurring evictions on the same node:** the node is + under-provisioned for the workload class. Drain and re-pool it, or + exclude it from the workload with a `nodeAffinity` rule that uses the + `NotIn` operator. + +## Replay + +The detector library is exercised by Go test fixtures and a +canonical replay manifest, not a synthetic on-cluster reproducer: + +- `module/pkg/patterns/pod_evicted_test.go`: unit tests on the + detector library covering the canonical disk-pressure full-join, + memory-pressure and PID-pressure variants, partial-confidence + fallback (no node-condition joined), negative-hint short-circuit + (`Killing` / `Preempted` / `FailedScheduling` don't fire), and the + `JoinWindow` boundary. +- `module/pkg/patterns/pod_evicted_bench_test.go`: allocation + + latency bench against an 819-eviction fixture; the ≤2 allocs/event + NORTHSTAR is the steady-state target. +- `module/pkg/replay/pod_evicted/canonical/`: JSON-fixture replay + manifest (`manifest.json`, `events.json`, `node_conditions.json`, + `golden.json`) consumed by `module/pkg/replay/runner.go`. 
The + canonical case asserts headline regex + `/Pod .* evicted at .* due to disk pressure/` and remediation + regex `/relocate.*NVMe/`. Negative fixtures live under + `module/pkg/replay/pod_evicted/_negative/`; real-world capture + shapes under `module/pkg/replay/pod_evicted/_real_world/`. + +Copy the canonical fixture's `events.json` + +`node_conditions.json` shape into a downstream recipe-validation +harness — they match the wire shape `k8sobjectsreceiver` emits +against a real cluster, so a recipe-side OTTL change can be unit- +tested without a live API server. + +## Detector status + +Detector implemented: ☑ shipped (library + processor wiring) — +[`patterns.PodEvictedDetector`](../../module/pkg/patterns/pod_evicted.go) +emits one verdict per evicted pod, joining the most recent +matching `NodeCondition` transition within `JoinWindow` to promote +`Confidence` from `partial` to `full`. Operator-facing YAML keys on +the processor: + +| Key | Default | Effect | +|---|---|---| +| `join_window` | `30s` | Max gap between a node-pressure condition's `lastTransitionTime` and an evicted pod's `event_time` for the detector to join them. Floor: 1s. Raise on clusters running long `eviction-soft` thresholds. | +| `emit_partial_verdicts` | `true` | When `false`, evictions with no joined node-condition (the cause is unobserved) are suppressed before forwarding. | + +A minimal config sits in +[`module/processor/patterndetectorprocessor/example_config.yaml`](../../module/processor/patterndetectorprocessor/example_config.yaml). + +## Verdict shape + +The processor emits one log record per evicted pod, carrying the +verdict JSON in `pattern.verdict_json` plus these promoted scalars +(per the issue +[#270](https://github.com/TraceCoreAI/tracecore/issues/270) +scalar-promotion contract): + +| Attribute | Type | Description | +|---|---|---| +| `pattern.id` | string | `14` | +| `pattern.confidence` | string | `full` (node-condition joined) or `partial` (eviction Event alone). 
| +| `pattern.headline` | string | `"Pod / evicted at due to pressure"` — root is `disk` \| `memory` \| `pid`. | +| `pattern.remediation` | string | Detector-controlled prose keyed by pressure root (e.g. disk → "relocate the training write path to NVMe; tighten kubelet `--eviction-hard nodefs.available`"). | +| `pattern.verdict_json` | string | Full evidence trail (the Pod-Evicted Event + the joined NodeCondition transition). | +| `k8s.pod.name` | string | Evicted pod. | +| `k8s.pod.namespace` | string | Evicted pod's namespace. | +| `k8s.node.name` | string | Node the pod was evicted from. | +| `k8s.event.reason` | string | Upstream Event Reason — always `Evicted` for this pattern. | + +The full attribute set lives in +[`docs/ATTRIBUTES.md`](../ATTRIBUTES.md). + +## Integration recipe + +The OTTL stanza that normalizes raw `events.k8s.io/v1` Event JSON +into the `k8s.event.hint` 11-entry enum ships at +[`docs/integrations/k8sobjects-events.md`](../integrations/k8sobjects-events.md). +The same recipe streams `NodeCondition` transitions (separate +`k8sobjectsreceiver` instance against `nodes`) to feed the +detector's second evidence layer. The bundled Helm chart wires both +in the default values. diff --git a/docs/patterns/README.md b/docs/patterns/README.md index 1bb3e507..8c4ef21e 100644 --- a/docs/patterns/README.md +++ b/docs/patterns/README.md @@ -35,7 +35,7 @@ These pages assume: - An OTLP backend is receiving the metrics (Prometheus, Datadog, Honeycomb, Mimir all confirmed in the backend matrix). -Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have operator-facing walkthroughs today — the DCGM-observable subset plus pattern #2 (InfiniBand link flap, fabric-observable). The remaining patterns either ship a detector with its own README contract (pattern #14 pod-evicted in `patterndetectorprocessor/`) or carry a design spec under this directory pending implementation. 
New walkthroughs land alongside the detector that surfaces the pattern. +Seven of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have operator-facing walkthroughs today — the DCGM-observable subset plus pattern #2 (InfiniBand link flap, fabric-observable) and pattern #14 (pod evicted, k8s-events-observable). The remaining patterns carry a design spec under this directory pending implementation, or sit in the "reserved / unfilled" table below where the pattern number is named in NORTHSTARS Appendix A but no detector has been written yet. New walkthroughs land alongside the detector that surfaces the pattern. ## Operator walkthroughs (shipped) @@ -47,6 +47,7 @@ Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15 | #4 Thermal throttling cascade | [04-thermal-throttle-walkthrough.md](04-thermal-throttle-walkthrough.md) | `hw.gpu.throttle.duration{reason=thermal}` rate-of-change | | #5 PCIe AER cascade | [05-pcie-aer-walkthrough.md](05-pcie-aer-walkthrough.md) | `hw.gpu.io` Tx/Rx counter discontinuities | | #7 Dataloader hang | [07-dataloader-hang.md](07-dataloader-hang.md) | `tracecore.alert.training_step_stalled.*` + `dataloader.error_class` OR `k8s.event.reason{FailedMount\|VolumeMountFailure}` | +| #14 Pod evicted / `NodeNotReady` | [14-pod-evicted-walkthrough.md](14-pod-evicted-walkthrough.md) | `k8s.event.hint=pod_evicted` joined to `k8s.node.condition.{type,status}` transition within `join_window` (default 30s) | ## Design specs (planned detectors — TDD red-test inputs) @@ -62,6 +63,23 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. 
Each fol | #12 Loss spike → NaN | [12-loss-spike-nan.md](12-loss-spike-nan.md) | ☐ planned | | #13 Silent data corruption | [13-silent-data-corruption.md](13-silent-data-corruption.md) | ☑ shipped | +## Reserved / unfilled (NORTHSTARS Appendix A patterns with no doc yet) + +Pattern numbers named in NORTHSTARS Appendix A that have neither a +design spec nor an operator walkthrough in this directory. Listed +explicitly so the numeric gaps in the tables above are documented, +not silent. + +| Pattern | Status | Rationale | +|---|---|---| +| #6 Stragglers from slow node (`data_time` 3× normal on one node) | ☐ unimplemented — milestone M18 — no detector, no spec, no walkthrough yet | NORTHSTARS Appendix A names this pattern and M6 (v0) targets coverage. The chaos injector `failure-inject cpu-steal` lands the symptom (per [M4b rubric](../history/MILESTONES-shipped-lanes.md#m4b-failure-injection-harness)), but the cross-rank `data_time` aggregation detector is build-time coupled to M17's `cross_rank.go` infra and is at risk per the NORTHSTARS-coupling note in [M21 v0.1.0 release](../MILESTONES.md#m21-v010-release). NORTHSTARS Appendix A's "☑ in-tree detector" mark for this pattern is aspirational, not current state — to be reconciled when the detector lands. Doc will land alongside the detector; design-spec follow-up tracked at the M18 milestone. | +| #15 Image pull / `FailedMount` on restart | ☐ no spec yet (NORTHSTARS Appendix A entry; no detector planned for v1) | The `k8s.event.hint` 11-entry enum (RFC-0013 §3) already carries `mount_failure` and `image_pull_failure` hints — the input signal exists. A detector would join repeated mount/pull failures on the same job's retry pods within a window; not yet scheduled. | + +When a reserved pattern's detector lands, it promotes to the +"Design specs" table (☐ planned → ☑ shipped) or directly to the +"Operator walkthroughs (shipped)" table — same numeric ID, same +filename convention. 
+ ## Correlation-window semantics Three v1 detectors use three different correlation-window shapes — chosen independently to match each pattern's physical event-ordering. Operators tuning windows hit these without warning today; this table is the cross-link.