TraceCoreAI · trilamsr · Jun 4, 2026
diff --git a/docs/patterns/14-pod-evicted-walkthrough.md b/docs/patterns/14-pod-evicted-walkthrough.md
@@ -0,0 +1,257 @@
+# Pattern #14 — Pod evicted / `NodeNotReady`
+
+A training rank disappears mid-step because the kubelet evicted its
+pod after the node entered a resource-pressure condition (disk,
+memory, or PID). The remaining ranks block on the next collective
+and the operator sees "NCCL hang" — but the real cause is one node
+running out of headroom, several seconds upstream. Tracecore catches
+the eviction Event, joins it back to the node-pressure transition
+that triggered it, and emits a one-line verdict naming the pod, the
+node, and the pressure root.
+
+**Input:** Kubernetes events scraped by the upstream
+[`k8sobjectsreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sobjectsreceiver)
+in `watch` mode against the `events` resource, normalized by an OTTL
+`transform` stanza onto the customer-stable 11-entry
+`k8s.event.hint` enum (`pod_evicted`, `mount_failure`, `backoff`,
+`oom_killed`, `node_unhealthy`, …). Recipe at
+[`docs/integrations/k8sobjects-events.md`](../integrations/k8sobjects-events.md).
+
+## Symptom
+
+- A multi-rank training job loses one rank without an obvious GPU,
+  fabric, or framework error; the workload either fails on the next
+  AllReduce or `kubectl get pods -l job-name=<job>` shows one pod in
+  `Failed` / `Evicted` state.
+- `kubectl describe pod <evicted>` reports
+  `Status: Failed`, `Reason: Evicted`, with a free-form note such as
+  `The node was low on resource: ephemeral-storage.` or
+  `…resource: memory.` or `…resource: pids.`
+- `kubectl describe node <node>` shows the matching node-status
+  condition transitioned to `True` in the same window —
+  `DiskPressure`, `MemoryPressure`, or `PIDPressure`.
+- DCGM and Xid streams on the node are clean. The node itself stayed
+  up — it was the *pod* that got evicted, not the node that crashed.
+- Evictions often cluster: one node hits disk pressure, the kubelet
+  evicts the largest consumer (usually a training pod), the evicted
+  pod reschedules elsewhere, and the job retries, but by then the
+  other ranks in the cohort have already errored on the missing peer.
+
+## Why `k8sobjectsreceiver` sees it
+
+The kubelet's eviction manager records a structured `Event` object
+(`events.k8s.io/v1`) through the API server every time it evicts a
+pod. The Event carries `Reason: Evicted` plus a `note` field naming
+the pressure root in free-form English (the kubelet templates it).
+Independently, the kubelet's node-status updates set the matching
+`NodeCondition` (`DiskPressure=True`, `MemoryPressure=True`, or
+`PIDPressure=True`) on the Node object's `.status.conditions` array
+within a few seconds of detection.
+
+`k8sobjectsreceiver` in `mode: watch` against the `events` resource
+emits one log record per Event the API server publishes — no
+client-side dedup, no polling latency past the watch-stream's
+~1s heartbeat. The bundled OTTL recipe normalizes the
+`reason: "Evicted"` Event into `k8s.event.hint: "pod_evicted"`,
+which the pattern detector matches structurally against the typed
+`Record` model (no string-grep against the raw Event JSON).
+
+The same receiver, configured separately against the `nodes`
+resource, streams `NodeCondition` transitions — the detector's
+second evidence layer. The join window between the two is operator-
+tunable (default 30s, matching kubelet's typical eviction-reaction
+latency).
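+
+As a rough sketch of the two receiver instances described above,
+assuming the upstream `k8sobjectsreceiver` config keys (`auth_type`,
+`objects`, `name`, `mode`, `group`): the instance names
+`k8sobjects/events` and `k8sobjects/nodes` are illustrative, and the
+bundled recipe in `docs/integrations/k8sobjects-events.md` remains
+the reference for the actual values.
+
+```yaml
+receivers:
+  # Instance 1: one log record per Event the API server publishes.
+  k8sobjects/events:            # illustrative instance name
+    auth_type: serviceAccount
+    objects:
+      - name: events
+        mode: watch
+        group: events.k8s.io
+  # Instance 2: Node objects, the source of NodeCondition transitions.
+  k8sobjects/nodes:             # illustrative instance name
+    auth_type: serviceAccount
+    objects:
+      - name: nodes
+        mode: watch
+```
+
+A `watch` on `nodes` delivers whole Node objects on each change, so
+this sketch assumes the downstream OTTL stage is what extracts the
+individual condition transitions.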
+
+## Receiver-emitted signal
+
+Two log streams, both flowing from `k8sobjectsreceiver` →
+OTTL `transform` → `patterndetectorprocessor`:
+
+**Stream 1 — Pod-Evicted Event** (per-eviction log record):
+
+| Attribute | Type | Value |
+|---|---|---|
+| `k8s.event.reason` | string | `Evicted` (upstream Event Reason). |
+| `k8s.event.hint` | string | `pod_evicted` (OTTL-normalized). |
+| `k8s.event.uid` | string | API-server-assigned Event UID. |
+| `k8s.pod.name` | string | Evicted pod's `metadata.name`. |
+| `k8s.pod.namespace` | string | Evicted pod's `metadata.namespace`. |
+| `k8s.node.name` | string | Kubelet's `reporting_instance`: the node the pod was evicted from. |
+| (body) | string | The kubelet's free-form note, e.g. `"The node was low on resource: ephemeral-storage."`; the detector parses this to derive the pressure root for the headline. |
+
+**Stream 2 — NodeCondition transition** (per-transition log record):
+
+| Attribute | Type | Value |
+|---|---|---|
+| `k8s.node.name` | string | Node carrying the transitioning condition. |
+| `k8s.node.condition.type` | string | `DiskPressure` \| `MemoryPressure` \| `PIDPressure`. |
+| `k8s.node.condition.status` | string | `True` (transition into pressure); `False` transitions don't trigger the detector. |
+| (timestamp) | nanos | `lastTransitionTime` from the condition. |
+
+The 11-entry `k8s.event.hint` enum is a **customer-stable contract**
+(RFC-0013 §3). Operators writing dashboards may filter on
+`k8s.event.hint = "pod_evicted"`, and that filter survives upstream
+schema changes to the underlying Event shape because the OTTL recipe
+absorbs the churn.
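+
+A minimal sketch of what one normalization rule in the OTTL recipe
+could look like, using the upstream `transform` processor. The
+instance name `transform/k8s_event_hint` is hypothetical, and the
+sketch assumes an earlier statement has already populated
+`k8s.event.reason` from the raw Event; the shipped recipe may
+differ.
+
+```yaml
+processors:
+  transform/k8s_event_hint:     # hypothetical instance name
+    error_mode: ignore
+    log_statements:
+      - context: log
+        statements:
+          # Map the upstream Reason onto the stable hint enum.
+          - set(attributes["k8s.event.hint"], "pod_evicted") where attributes["k8s.event.reason"] == "Evicted"
+```
+
+Dashboards keyed on the hint then only depend on this one statement
+staying correct when the raw Event shape changes.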
+
+## Query (OTLP filter — post-OTTL)
+
+For operators dashboarding directly off the log stream while the
+detector verdict wiring lands, filter for evicted-pod events and
+group by `(k8s.node.name, pressure_root)`:
+
+```
+# Count of pod-evicted events in the last 5 minutes,
+# grouped by node and (operator-parsed) pressure root:
+sum by (k8s.node.name, pressure_root) (
+  count_over_time({k8s.event.hint="pod_evicted"}[5m])
+)
+
+# Same window, NodeCondition transitions into pressure on any node:
+sum by (k8s.node.name, k8s.node.condition.type) (
+  count_over_time({
+    k8s.node.condition.status="True",
+    k8s.node.condition.type=~"DiskPressure|MemoryPressure|PIDPressure"
+  }[5m])
+)
+```
+
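+The `5m` window leaves room for both eviction paths: hard-eviction
+thresholds typically fire within tens of seconds, while
+`eviction-soft` thresholds wait out a configured grace period first.
+A 5-minute look-back therefore shows the operator both legs (the
+eviction Event and the node-condition transition) of a clustered
+eviction storm, unless the soft grace period is set longer than that.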
+
+## Alert
+
+```yaml
+- alert: PodEvictedOnTrainingNode
+  expr: |
+    sum by (k8s.node.name) (
+      count_over_time({k8s.event.hint="pod_evicted",
+                       k8s.pod.namespace="training"}[5m])
+    ) >= 1
+  for: 0s
+  labels:
+    severity: critical
+  annotations:
+    summary: "Training pod evicted on {{ $labels.k8s_node_name }}"
+    description: |
+      One or more pods in the `training` namespace were evicted from
+      {{ $labels.k8s_node_name }} in the last 5 minutes. The
+      patterndetector processor will emit a confidence=full
+      pod_evicted verdict when the matching NodeCondition transition
+      joins within the configured `join_window` (default 30s).
+```
+
+`for: 0s` because an eviction is irreversible, so every occurrence
+is alert-worthy. Tune the namespace selector to the training
+workloads' real namespace; the default install assumes `training`.
+
+## Escalation
+
+1. **Capture the evidence before reschedule wipes it.** `kubectl get
+   events -n <namespace> --field-selector reason=Evicted -o json` —
+   the kubelet's note field names the pressure root verbatim.
+2. **Identify the pressure root on the node.**
+   `kubectl describe node <node>` for the condition transitions, plus
+   on-node `df -h /var/lib/kubelet` (disk), `free -h` (memory), or
+   `ps -eLf | wc -l` vs. `cat /proc/sys/kernel/pid_max` (pid).
+3. **Disk pressure.** Most common in training: dataset shards, log
+   tarballs, or `core` dumps filling the kubelet root. Relocate the
+   training write path to NVMe / dedicated PVC, tighten the kubelet
+   `--eviction-hard nodefs.available` threshold, or scale the node's
+   `imagefs` mount.
+4. **Memory pressure.** Reduce per-rank GPU+host memory budget,
+   raise the pod's `resources.requests.memory` so the scheduler
+   places elsewhere, or evict noisy-neighbor pods proactively.
+5. **PID pressure.** Cap the workload's fork rate (PyTorch
+   `DataLoader num_workers` is the usual culprit), or raise
+   `kernel.pid_max` on the host.
+6. **Recurring evictions on the same node:** the node is
+   under-provisioned for the workload class. Drain and re-pool it, or
+   exclude it from the workload with a `nodeAffinity` rule using the
+   `NotIn` operator (Kubernetes has no `nodeAntiAffinity` field).
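+
+For step 6, Kubernetes expresses node exclusion as a `nodeAffinity`
+rule with the `NotIn` operator. A sketch of the pod-spec fragment,
+assuming the node is identified by its `kubernetes.io/hostname`
+label; `<node>` is a placeholder for the recurring offender:
+
+```yaml
+spec:
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              # Keep the workload off the node that keeps evicting it.
+              - key: kubernetes.io/hostname
+                operator: NotIn
+                values:
+                  - <node>
+```
+
+A hard (`required…`) rule is used here so the scheduler never places
+the pod back on that node; a `preferred…` rule would only bias
+placement away from it.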
+
+## Replay
+
+The detector library is exercised by Go test fixtures and a
+canonical replay manifest, not a synthetic on-cluster reproducer:
+
+- `module/pkg/patterns/pod_evicted_test.go` — unit tests on the
+  detector library covering the canonical disk-pressure full-join,
+  memory-pressure and PID-pressure variants, partial-confidence
+  fallback (no node-condition joined), negative-hint short-circuit
+  (`Killing` / `Preempted` / `FailedScheduling` don't fire), and the
+  `JoinWindow` boundary.
+- `module/pkg/patterns/pod_evicted_bench_test.go` — allocation +
+  latency bench against an 819-eviction fixture; the ≤2 allocs/event
+  NORTHSTAR is the steady-state target.
+- `module/pkg/replay/pod_evicted/canonical/` — JSON-fixture replay
+  manifest (`manifest.json`, `events.json`, `node_conditions.json`,
+  `golden.json`) consumed by `module/pkg/replay/runner.go`. The
+  canonical case asserts headline regex
+  `/Pod .* evicted at .* due to disk pressure/` and remediation
+  regex `/relocate.*NVMe/`. Negative fixtures live under
+  `module/pkg/replay/pod_evicted/_negative/`; real-world capture
+  shapes under `module/pkg/replay/pod_evicted/_real_world/`.
+
+Copy the canonical fixture's `events.json` +
+`node_conditions.json` shape into a downstream recipe-validation
+harness — they match the wire shape `k8sobjectsreceiver` emits
+against a real cluster, so a recipe-side OTTL change can be unit-
+tested without a live API server.
+
+## Detector status
+
+Detector implemented: ☑ shipped (library + processor wiring) —
+[`patterns.PodEvictedDetector`](../../module/pkg/patterns/pod_evicted.go)
+emits one verdict per evicted pod, joining the most recent
+matching `NodeCondition` transition within `JoinWindow` to promote
+`Confidence` from `partial` to `full`. Operator-facing YAML keys on
+the processor:
+
+| Key | Default | Effect |
+|---|---|---|
+| `join_window` | `30s` | Max gap between a node-pressure condition's `lastTransitionTime` and an evicted pod's `event_time` for the detector to join them. Floor: 1s. Raise on clusters running long `eviction-soft` thresholds. |
+| `emit_partial_verdicts` | `true` | When `false`, evictions with no joined node-condition (the cause is unobserved) are suppressed before forwarding. |
+
+A minimal config sits in
+[`module/processor/patterndetectorprocessor/example_config.yaml`](../../module/processor/patterndetectorprocessor/example_config.yaml).
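+
+For orientation, the two keys from the table might appear as
+follows. This is a sketch only: the component ID `patterndetector`
+and the flat key placement are assumptions, so treat
+`example_config.yaml` as the reference.
+
+```yaml
+processors:
+  patterndetector:                # assumed component ID
+    join_window: 30s              # default; floor is 1s
+    emit_partial_verdicts: true   # default; false suppresses unjoined evictions
+```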
+
+## Verdict shape
+
+The processor emits one log record per evicted pod, carrying the
+verdict JSON in `pattern.verdict_json` plus these promoted scalars
+(per the scalar-promotion contract in issue
+[#270](https://github.com/TraceCoreAI/tracecore/issues/270)):
+
+| Attribute | Type | Description |
+|---|---|---|
+| `pattern.id` | string | `14` |
+| `pattern.confidence` | string | `full` (node-condition joined) or `partial` (eviction Event alone). |
+| `pattern.headline` | string | `"Pod <ns>/<name> evicted at <ts> due to <root> pressure"` — root is `disk` \| `memory` \| `pid`. |
+| `pattern.remediation` | string | Detector-controlled prose keyed by pressure root (e.g. disk → "relocate the training write path to NVMe; tighten kubelet `--eviction-hard nodefs.available`"). |
+| `pattern.verdict_json` | string | Full evidence trail (the Pod-Evicted Event + the joined NodeCondition transition). |
+| `k8s.pod.name` | string | Evicted pod. |
+| `k8s.pod.namespace` | string | Evicted pod's namespace. |
+| `k8s.node.name` | string | Node the pod was evicted from. |
+| `k8s.event.reason` | string | Upstream Event Reason — always `Evicted` for this pattern. |
+
+The full attribute set lives in
+[`docs/ATTRIBUTES.md`](../ATTRIBUTES.md).
+
+## Integration recipe
+
+The OTTL stanza that normalizes raw `events.k8s.io/v1` Event JSON
+into the `k8s.event.hint` 11-entry enum ships at
+[`docs/integrations/k8sobjects-events.md`](../integrations/k8sobjects-events.md).
+The same recipe streams `NodeCondition` transitions (separate
+`k8sobjectsreceiver` instance against `nodes`) to feed the
+detector's second evidence layer. The bundled Helm chart wires both
+in the default values.
diff --git a/docs/patterns/README.md b/docs/patterns/README.md
@@ -35,7 +35,7 @@ These pages assume:
 - An OTLP backend is receiving the metrics (Prometheus, Datadog,
   Honeycomb, Mimir all confirmed in the backend matrix).
 
-Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have operator-facing walkthroughs today — the DCGM-observable subset plus pattern #2 (InfiniBand link flap, fabric-observable). The remaining patterns either ship a detector with its own README contract (pattern #14 pod-evicted in `patterndetectorprocessor/`) or carry a design spec under this directory pending implementation. New walkthroughs land alongside the detector that surfaces the pattern.
+Seven of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have operator-facing walkthroughs today — the DCGM-observable subset plus pattern #2 (InfiniBand link flap, fabric-observable) and pattern #14 (pod evicted, k8s-events-observable). The remaining patterns carry a design spec under this directory pending implementation, or sit in the "reserved / unfilled" table below where the pattern number is named in NORTHSTARS Appendix A but no detector has been written yet. New walkthroughs land alongside the detector that surfaces the pattern.
 
 ## Operator walkthroughs (shipped)
 
@@ -47,6 +47,7 @@ Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15
 | #4 Thermal throttling cascade | [04-thermal-throttle-walkthrough.md](04-thermal-throttle-walkthrough.md) | `hw.gpu.throttle.duration{reason=thermal}` rate-of-change |
 | #5 PCIe AER cascade | [05-pcie-aer-walkthrough.md](05-pcie-aer-walkthrough.md) | `hw.gpu.io` Tx/Rx counter discontinuities |
 | #7 Dataloader hang | [07-dataloader-hang.md](07-dataloader-hang.md) | `tracecore.alert.training_step_stalled.*` + `dataloader.error_class` OR `k8s.event.reason{FailedMount\|VolumeMountFailure}` |
+| #14 Pod evicted / `NodeNotReady` | [14-pod-evicted-walkthrough.md](14-pod-evicted-walkthrough.md) | `k8s.event.hint=pod_evicted` joined to `k8s.node.condition.{type,status}` transition within `join_window` (default 30s) |
 
 ## Design specs (planned detectors — TDD red-test inputs)
 
@@ -62,6 +63,23 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each fol
 | #12 Loss spike → NaN | [12-loss-spike-nan.md](12-loss-spike-nan.md) | ☐ planned |
 | #13 Silent data corruption | [13-silent-data-corruption.md](13-silent-data-corruption.md) | ☑ shipped |
 
+## Reserved / unfilled (NORTHSTARS Appendix A patterns with no doc yet)
+
+Pattern numbers named in NORTHSTARS Appendix A that have neither a
+design spec nor an operator walkthrough in this directory. Listed
+explicitly so the numeric gaps in the tables above are documented,
+not silent.
+
+| Pattern | Status | Rationale |
+|---|---|---|
+| #6 Stragglers from slow node (`data_time` 3× normal on one node) | ☐ unimplemented — milestone M18 — no detector, no spec, no walkthrough yet | NORTHSTARS Appendix A names this pattern and M6 (v0) targets coverage. The chaos injector `failure-inject cpu-steal` lands the symptom (per [M4b rubric](../history/MILESTONES-shipped-lanes.md#m4b-failure-injection-harness)), but the cross-rank `data_time` aggregation detector is build-time coupled to M17's `cross_rank.go` infra and is at risk per the NORTHSTARS-coupling note in [M21 v0.1.0 release](../MILESTONES.md#m21-v010-release). NORTHSTARS Appendix A's "☑ in-tree detector" mark for this pattern is aspirational, not current state — to be reconciled when the detector lands. Doc will land alongside the detector; design-spec follow-up tracked at the M18 milestone. |
+| #15 Image pull / `FailedMount` on restart | ☐ no spec yet (NORTHSTARS Appendix A entry; no detector planned for v1) | The `k8s.event.hint` 11-entry enum (RFC-0013 §3) already carries `mount_failure` and `image_pull_failure` hints — the input signal exists. A detector would join repeated mount/pull failures on the same job's retry pods within a window; not yet scheduled. |
+
+When a reserved pattern's detector lands, it promotes to the
+"Design specs" table (☐ planned → ☑ shipped) or directly to the
+"Operator walkthroughs (shipped)" table — same numeric ID, same
+filename convention.
+
 ## Correlation-window semantics
 
 Three v1 detectors use three different correlation-window shapes — chosen independently to match each pattern's physical event-ordering. Operators tuning windows hit these without warning today; this table is the cross-link.