257 changes: 257 additions & 0 deletions docs/patterns/14-pod-evicted-walkthrough.md
@@ -0,0 +1,257 @@
# Pattern #14 — Pod evicted / `NodeNotReady`

A training rank disappears mid-step because the kubelet evicted its
pod after the node entered a resource-pressure condition (disk,
memory, or PID). The remaining ranks block on the next collective
and the operator sees "NCCL hang" — but the real cause is one node
running out of headroom, several seconds upstream. Tracecore catches
the eviction Event, joins it back to the node-pressure transition
that triggered it, and emits a one-line verdict naming the pod, the
node, and the pressure root.

**Input:** Kubernetes events scraped by the upstream
[`k8sobjectsreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sobjectsreceiver)
in `watch` mode against the `events` resource, normalized by an OTTL
`transform` stanza onto the customer-stable 11-entry
`k8s.event.hint` enum (`pod_evicted`, `mount_failure`, `backoff`,
`oom_killed`, `node_unhealthy`, …). Recipe at
[`docs/integrations/k8sobjects-events.md`](../integrations/k8sobjects-events.md).

## Symptom

- A multi-rank training job loses one rank without an obvious GPU,
fabric, or framework error; the workload either fails on the next
AllReduce or `kubectl get pods -l job-name=<job>` shows one pod in
`Failed` / `Evicted` state.
- `kubectl describe pod <evicted>` reports
`Status: Failed`, `Reason: Evicted`, with a free-form note such as
`The node was low on resource: ephemeral-storage.` or
`…resource: memory.` or `…resource: pids.`
- `kubectl describe node <node>` shows the matching node-status
condition transitioned to `True` in the same window —
`DiskPressure`, `MemoryPressure`, or `PIDPressure`.
- DCGM and Xid streams on the node are clean. The node itself stayed
up — it was the *pod* that got evicted, not the node that crashed.
- Evictions often cluster: one node hits disk pressure, the kubelet
evicts the largest consumer (usually a training pod), a replacement
pod is scheduled elsewhere, and the job retries, but the other ranks
in the cohort have already errored on the missing peer.

## Why `k8sobjectsreceiver` sees it

The kubelet's eviction manager records a structured `Event` object
(`events.k8s.io/v1`) through the API server every time it evicts a
pod. The Event carries `Reason: Evicted` plus a `note` field naming
the pressure root in free-form English (the kubelet templates it).
Independently, the kubelet's node-status update sets the matching
`NodeCondition` (`DiskPressure=True`, `MemoryPressure=True`, or
`PIDPressure=True`) in the Node object's `.status.conditions` array
within a few seconds of detection.

`k8sobjectsreceiver` in `mode: watch` against the `events` resource
emits one log record per Event the API server publishes — no
client-side dedup, no polling latency past the watch-stream's
~1s heartbeat. The bundled OTTL recipe normalizes the
`reason: "Evicted"` Event into `k8s.event.hint: "pod_evicted"`,
which the pattern detector matches structurally against the typed
`Record` model (no string-grep against the raw Event JSON).

The same receiver, configured separately against the `nodes`
resource, streams `NodeCondition` transitions, which form the
detector's second evidence layer. The join window between the two is
operator-tunable (default 30s, matching kubelet's typical
eviction-reaction latency).

## Receiver-emitted signal

Two log streams, both flowing from `k8sobjectsreceiver` →
OTTL `transform` → `patterndetectorprocessor`:

**Stream 1 — Pod-Evicted Event** (per-eviction log record):

| Attribute | Type | Value |
|---|---|---|
| `k8s.event.reason` | string | `Evicted` (upstream Event Reason). |
| `k8s.event.hint` | string | `pod_evicted` (OTTL-normalized). |
| `k8s.event.uid` | string | API-server-assigned Event UID. |
| `k8s.pod.name` | string | Evicted pod's `metadata.name`. |
| `k8s.pod.namespace` | string | Evicted pod's `metadata.namespace`. |
| `k8s.node.name` | string | Kubelet's `reporting_instance`: the node the pod was evicted from. |
| (body) | string | The kubelet's free-form note, e.g. `"The node was low on resource: ephemeral-storage."`; the detector parses this to derive pressure root for the headline. |
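
As an illustration of the body parsing mentioned in the last row, the sketch below maps the kubelet note onto the `disk` | `memory` | `pid` vocabulary. It is an assumption about how such parsing could work, not the detector's actual code: `pressureRoot` is a hypothetical name, and only the three resource tokens quoted in the Symptom section are recognized.

```go
package main

import (
	"fmt"
	"strings"
)

// pressureRoot extracts the resource token that follows
// "low on resource:" in the kubelet note and maps it to a headline
// root. Unknown or missing tokens return ok=false so the caller can
// fall back to a root-less verdict.
func pressureRoot(note string) (root string, ok bool) {
	const marker = "low on resource:"
	i := strings.Index(note, marker)
	if i < 0 {
		return "", false
	}
	fields := strings.Fields(note[i+len(marker):])
	if len(fields) == 0 {
		return "", false
	}
	// The token usually ends the sentence, so drop trailing punctuation.
	switch strings.TrimRight(fields[0], ".,;") {
	case "ephemeral-storage":
		return "disk", true
	case "memory":
		return "memory", true
	case "pids":
		return "pid", true
	}
	return "", false
}

func main() {
	for _, note := range []string{
		"The node was low on resource: ephemeral-storage.",
		"The node was low on resource: memory.",
		"The node was low on resource: pids.",
	} {
		root, ok := pressureRoot(note)
		fmt.Println(root, ok)
	}
}
```

Matching on the token after a fixed marker, rather than on the whole sentence, keeps the sketch tolerant of any detail the kubelet appends after the first sentence.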

**Stream 2 — NodeCondition transition** (per-transition log record):

| Attribute | Type | Value |
|---|---|---|
| `k8s.node.name` | string | Node carrying the transitioning condition. |
| `k8s.node.condition.type` | string | `DiskPressure` \| `MemoryPressure` \| `PIDPressure`. |
| `k8s.node.condition.status` | string | `True` (transition into pressure); `False` transitions don't trigger the detector. |
| (timestamp) | nanos | `lastTransitionTime` from the condition. |

The 11-entry `k8s.event.hint` enum is a **customer-stable contract**
(RFC-0013 §3). Operators writing dashboards may filter on
`hint = "pod_evicted"` and that filter survives every upstream
schema change to the underlying Event shape — the OTTL recipe
absorbs the churn.

## Query (OTLP filter — post-OTTL)

For operators dashboarding directly off the log stream while the
detector verdict wiring lands, filter for evicted-pod events and
group by `(k8s.node.name, pressure_root)`:

```
# Count of pod-evicted events in the last 5 minutes,
# grouped by node and (operator-parsed) pressure root:
sum by (k8s.node.name, pressure_root) (
  count_over_time({k8s.event.hint="pod_evicted"}[5m])
)

# Same window, NodeCondition transitions into pressure on any node:
sum by (k8s.node.name, k8s.node.condition.type) (
  count_over_time({
    k8s.node.condition.status="True",
    k8s.node.condition.type=~"DiskPressure|MemoryPressure|PIDPressure"
  }[5m])
)
```

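The `5m` window is deliberately generous. Hard-eviction thresholds
fire within tens of seconds of the node crossing them, so a 5-minute
look-back normally shows both legs of a clustered eviction storm: the
condition transition and the evictions it triggered. Soft-eviction
grace periods are operator-configured; widen the window if yours
exceed five minutes.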

## Alert

```yaml
- alert: PodEvictedOnTrainingNode
  expr: |
    sum by (k8s.node.name) (
      count_over_time({k8s.event.hint="pod_evicted",
                       k8s.pod.namespace="training"}[5m])
    ) >= 1
  for: 0s
  labels:
    severity: critical
  annotations:
    summary: "Training pod evicted on {{ $labels.k8s_node_name }}"
    description: |
      One or more pods in the `training` namespace were evicted from
      {{ $labels.k8s_node_name }} in the last 5 minutes. The
      patterndetector processor will emit a confidence=full
      pod_evicted verdict when the matching NodeCondition transition
      joins within the configured `join_window` (default 30s).
```

`for: 0s`: an eviction is irreversible, so every occurrence is
alert-worthy. Tune the namespace selector to the training workloads'
real namespace; the default install assumes `training`.

## Escalation

1. **Capture the evidence before reschedule wipes it.** `kubectl get
events -n <namespace> --field-selector reason=Evicted -o json` —
the kubelet's note field names the pressure root verbatim.
2. **Identify the pressure root on the node.**
`kubectl describe node <node>` for the condition transitions, plus
on-node `df -h /var/lib/kubelet` (disk), `free -h` (memory), or
`ps -eLf | wc -l` vs. `cat /proc/sys/kernel/pid_max` (pid).
3. **Disk pressure.** Most common in training: dataset shards, log
tarballs, or `core` dumps filling the kubelet root. Relocate the
training write path to NVMe / dedicated PVC, tighten the kubelet
`--eviction-hard nodefs.available` threshold, or scale the node's
`imagefs` mount.
4. **Memory pressure.** Reduce per-rank GPU+host memory budget,
raise the pod's `resources.requests.memory` so the scheduler
places elsewhere, or evict noisy-neighbor pods proactively.
5. **PID pressure.** Cap the workload's fork rate (PyTorch
`DataLoader num_workers` is the usual culprit), or raise
`kernel.pid_max` on the host.
6. **Recurring evictions on the same node:** the node is
under-provisioned for the workload class. Drain and re-pool it, or
exclude it from the workload with a `nodeAffinity` rule that uses the
`NotIn` operator.

## Replay

The detector library is exercised by Go test fixtures and a
canonical replay manifest, not a synthetic on-cluster reproducer:

- `module/pkg/patterns/pod_evicted_test.go` — unit tests on the
detector library covering the canonical disk-pressure full-join,
memory-pressure and PID-pressure variants, partial-confidence
fallback (no node-condition joined), negative-hint short-circuit
(`Killing` / `Preempted` / `FailedScheduling` don't fire), and the
`JoinWindow` boundary.
- `module/pkg/patterns/pod_evicted_bench_test.go` — allocation +
latency bench against an 819-eviction fixture; the ≤2 allocs/event
NORTHSTAR is the steady-state target.
- `module/pkg/replay/pod_evicted/canonical/` — JSON-fixture replay
manifest (`manifest.json`, `events.json`, `node_conditions.json`,
`golden.json`) consumed by `module/pkg/replay/runner.go`. The
canonical case asserts headline regex
`/Pod .* evicted at .* due to disk pressure/` and remediation
regex `/relocate.*NVMe/`. Negative fixtures live under
`module/pkg/replay/pod_evicted/_negative/`; real-world capture
shapes under `module/pkg/replay/pod_evicted/_real_world/`.

Copy the canonical fixture's `events.json` +
`node_conditions.json` shape into a downstream recipe-validation
harness. The fixtures match the wire shape `k8sobjectsreceiver`
emits against a real cluster, so a recipe-side OTTL change can be
unit-tested without a live API server.

## Detector status

Detector implemented: ☑ shipped (library + processor wiring) —
[`patterns.PodEvictedDetector`](../../module/pkg/patterns/pod_evicted.go)
emits one verdict per evicted pod, joining the most recent
matching `NodeCondition` transition within `JoinWindow` to promote
`Confidence` from `partial` to `full`. Operator-facing YAML keys on
the processor:

| Key | Default | Effect |
|---|---|---|
| `join_window` | `30s` | Max gap between a node-pressure condition's `lastTransitionTime` and an evicted pod's `event_time` for the detector to join them. Floor: 1s. Raise on clusters running long `eviction-soft` thresholds. |
| `emit_partial_verdicts` | `true` | When `false`, evictions with no joined node-condition (the cause is unobserved) are suppressed before forwarding. |

A minimal config sits in
[`module/processor/patterndetectorprocessor/example_config.yaml`](../../module/processor/patterndetectorprocessor/example_config.yaml).

## Verdict shape

The processor emits one log record per evicted pod, carrying the
verdict JSON in `pattern.verdict_json` plus these promoted scalars
(per the scalar-promotion contract in issue
[#270](https://github.com/TraceCoreAI/tracecore/issues/270)):

| Attribute | Type | Description |
|---|---|---|
| `pattern.id` | string | `14` |
| `pattern.confidence` | string | `full` (node-condition joined) or `partial` (eviction Event alone). |
| `pattern.headline` | string | `"Pod <ns>/<name> evicted at <ts> due to <root> pressure"` — root is `disk` \| `memory` \| `pid`. |
| `pattern.remediation` | string | Detector-controlled prose keyed by pressure root (e.g. disk → "relocate the training write path to NVMe; tighten kubelet `--eviction-hard nodefs.available`"). |
| `pattern.verdict_json` | string | Full evidence trail (the Pod-Evicted Event + the joined NodeCondition transition). |
| `k8s.pod.name` | string | Evicted pod. |
| `k8s.pod.namespace` | string | Evicted pod's namespace. |
| `k8s.node.name` | string | Node the pod was evicted from. |
| `k8s.event.reason` | string | Upstream Event Reason — always `Evicted` for this pattern. |

The full attribute set lives in
[`docs/ATTRIBUTES.md`](../ATTRIBUTES.md).

## Integration recipe

The OTTL stanza that normalizes raw `events.k8s.io/v1` Event JSON
into the `k8s.event.hint` 11-entry enum ships at
[`docs/integrations/k8sobjects-events.md`](../integrations/k8sobjects-events.md).
The same recipe streams `NodeCondition` transitions (separate
`k8sobjectsreceiver` instance against `nodes`) to feed the
detector's second evidence layer. The bundled Helm chart wires both
in the default values.
20 changes: 19 additions & 1 deletion docs/patterns/README.md
@@ -35,7 +35,7 @@ These pages assume:
- An OTLP backend is receiving the metrics (Prometheus, Datadog,
Honeycomb, Mimir all confirmed in the backend matrix).

Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have operator-facing walkthroughs today — the DCGM-observable subset plus pattern #2 (InfiniBand link flap, fabric-observable). The remaining patterns either ship a detector with its own README contract (pattern #14 pod-evicted in `patterndetectorprocessor/`) or carry a design spec under this directory pending implementation. New walkthroughs land alongside the detector that surfaces the pattern.
Seven of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have operator-facing walkthroughs today — the DCGM-observable subset plus pattern #2 (InfiniBand link flap, fabric-observable) and pattern #14 (pod evicted, k8s-events-observable). The remaining patterns carry a design spec under this directory pending implementation, or sit in the "reserved / unfilled" table below where the pattern number is named in NORTHSTARS Appendix A but no detector has been written yet. New walkthroughs land alongside the detector that surfaces the pattern.

## Operator walkthroughs (shipped)

@@ -47,6 +47,7 @@ Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15
| #4 Thermal throttling cascade | [04-thermal-throttle-walkthrough.md](04-thermal-throttle-walkthrough.md) | `hw.gpu.throttle.duration{reason=thermal}` rate-of-change |
| #5 PCIe AER cascade | [05-pcie-aer-walkthrough.md](05-pcie-aer-walkthrough.md) | `hw.gpu.io` Tx/Rx counter discontinuities |
| #7 Dataloader hang | [07-dataloader-hang.md](07-dataloader-hang.md) | `tracecore.alert.training_step_stalled.*` + `dataloader.error_class` OR `k8s.event.reason{FailedMount\|VolumeMountFailure}` |
| #14 Pod evicted / `NodeNotReady` | [14-pod-evicted-walkthrough.md](14-pod-evicted-walkthrough.md) | `k8s.event.hint=pod_evicted` joined to `k8s.node.condition.{type,status}` transition within `join_window` (default 30s) |

## Design specs (planned detectors — TDD red-test inputs)

@@ -62,6 +63,23 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each fol
| #12 Loss spike → NaN | [12-loss-spike-nan.md](12-loss-spike-nan.md) | ☐ planned |
| #13 Silent data corruption | [13-silent-data-corruption.md](13-silent-data-corruption.md) | ☑ shipped |

## Reserved / unfilled (NORTHSTARS Appendix A patterns with no doc yet)

Pattern numbers named in NORTHSTARS Appendix A that have neither a
design spec nor an operator walkthrough in this directory. Listed
explicitly so the numeric gaps in the tables above are documented,
not silent.

| Pattern | Status | Rationale |
|---|---|---|
| #6 Stragglers from slow node (`data_time` 3× normal on one node) | ☐ unimplemented — milestone M18 — no detector, no spec, no walkthrough yet | NORTHSTARS Appendix A names this pattern and M6 (v0) targets coverage. The chaos injector `failure-inject cpu-steal` lands the symptom (per [M4b rubric](../history/MILESTONES-shipped-lanes.md#m4b-failure-injection-harness)), but the cross-rank `data_time` aggregation detector is build-time coupled to M17's `cross_rank.go` infra and is at risk per the NORTHSTARS-coupling note in [M21 v0.1.0 release](../MILESTONES.md#m21-v010-release). NORTHSTARS Appendix A's "☑ in-tree detector" mark for this pattern is aspirational, not current state — to be reconciled when the detector lands. Doc will land alongside the detector; design-spec follow-up tracked at the M18 milestone. |
| #15 Image pull / `FailedMount` on restart | ☐ no spec yet (NORTHSTARS Appendix A entry; no detector planned for v1) | The `k8s.event.hint` 11-entry enum (RFC-0013 §3) already carries `mount_failure` and `image_pull_failure` hints — the input signal exists. A detector would join repeated mount/pull failures on the same job's retry pods within a window; not yet scheduled. |

When a reserved pattern's detector lands, it promotes to the
"Design specs" table (☐ planned → ☑ shipped) or directly to the
"Operator walkthroughs (shipped)" table — same numeric ID, same
filename convention.

## Correlation-window semantics

Three v1 detectors use three different correlation-window shapes — chosen independently to match each pattern's physical event-ordering. Operators tuning windows hit these without warning today; this table is the cross-link.