TraceCoreAI · trilamsr · Jun 2, 2026 · Jun 2, 2026
diff --git a/docs/integrations/examples/filelog-container.yaml b/docs/integrations/examples/filelog-container.yaml
@@ -133,6 +133,75 @@ processors:
           # directly). Catch-all for the "transport died" runbook
           # branch.
           - 'set(attributes["dataloader.error_class"], "Connection reset by peer") where IsMatch(body, "Connection reset by peer") and attributes["dataloader.error_class"] == nil'
+
+  # Project PyTorch's `RuntimeError: CUDA out of memory. Tried to
+  # allocate X.YY <unit>. GPU N has a total capacity of ...` stderr
+  # line onto the customer-stable `cuda_oom.tried_alloc_bytes` (Int,
+  # bytes) + `cuda_oom.gpu_index` (Int) attributes that pattern #10's
+  # detector (module/processor/patterndetectorprocessor/cuda_oom.go,
+  # `projectCUDAOOMLogRecord`) consumes. The detector's projection
+  # gate is BOTH `cuda_oom.tried_alloc_bytes` AND `gpu.id` (PCI BDF
+  # per RFC-0013 §3); the stanzas below stamp the bytes scalar and
+  # the human-visible GPU index off the body. `gpu.id` is the
+  # operator-configurable mapping — `cuda_oom.gpu_index` is the
+  # CUDA-runtime ordinal that PyTorch's allocator prints, NOT a PCI
+  # BDF, so the recipe DOES NOT alias it onto `gpu.id`. Two paths
+  # to populate `gpu.id` are documented in
+  # docs/integrations/filelog-container.md §"`cuda_oom.*` attribute
+  # stanza (pattern #10)":
+  #   (a) k8sattributesprocessor + `nvidia.com/gpu` device-plugin
+  #       resource — the pod allocation maps to one PCI BDF, lifted
+  #       onto the log resource as `gpu.id`.
+  #   (b) a sibling DCGM BDF-lookup transform indexed by
+  #       `cuda_oom.gpu_index` — the DCGM exporter ships a per-host
+  #       (index → BDF) table on its scrape endpoint.
+  # Either path stamps `gpu.id` on the resource; the detector's
+  # resource-attr fallback (cuda_oom.go:65) reads it from there.
+  #
+  # Unit normalization: PyTorch's `format_size` emits `%.2f <unit>`
+  # with the four IEC binary prefixes below (KiB / MiB / GiB / TiB).
+  # OTTL Math Expressions support `*` and `+` on int64, so we capture
+  # `whole` (digits before the dot) + 2-digit `frac` (digits after)
+  # and compute `Int(whole)*UNIT + Int(frac)*(UNIT/100)`. Flooring
+  # `UNIT/100` loses under 1 byte per frac step (under 100 bytes
+  # total); the ~10 MB granularity of a GiB-scale value comes from
+  # PyTorch's `%.2f`, far below the detector's 5% fragmentation threshold.
+  #
+  # Per-unit branches instead of one omnibus regex: OTTL has no
+  # capture-group-conditional dispatch, so the multiplier must be a
+  # literal int64 per stanza. The four-row repetition is the smallest
+  # shape that compiles. The `where IsMatch(...)` guard is tight on
+  # `CUDA out of memory\. Tried to allocate` so a generic CUDA error
+  # (illegal memory access, NCCL watchdog) does not trip the stanza.
+  transform/cuda_oom:
+    log_statements:
+      - context: log
+        statements:
+          # ---- GPU index extraction (any OOM line) ----
+          # PyTorch prints `GPU N has a total capacity of ...` after
+          # the alloc-size scalar. The index is the CUDA-runtime
+          # ordinal, NOT a PCI BDF — the detector's `gpu.id` projection
+          # is satisfied via the k8sattributes / DCGM-lookup paths
+          # documented above; `cuda_oom.gpu_index` is operator-facing
+          # context the verdict's evidence trail uses.
+          - 'set(attributes["cuda_oom.gpu_index"], Int(ExtractPatterns(body, "GPU (?P<idx>\\d+) has a total capacity")["idx"])) where IsMatch(body, "CUDA out of memory\\. Tried to allocate") and IsMatch(body, "GPU \\d+ has a total capacity")'
+
+          # ---- KiB branch ----
+          # 1 KiB = 1024 B. frac-unit step = 1024/100 = 10 (floor).
+          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) KiB")["w"]) * 1024 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) KiB")["f"]) * 10) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} KiB")'
+
+          # ---- MiB branch ----
+          # 1 MiB = 1048576 B. frac-unit step = 1048576/100 = 10485 (floor).
+          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) MiB")["w"]) * 1048576 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) MiB")["f"]) * 10485) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} MiB")'
+
+          # ---- GiB branch ----
+          # 1 GiB = 1073741824 B. frac-unit step = 1073741824/100 = 10737418 (floor).
+          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) GiB")["w"]) * 1073741824 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) GiB")["f"]) * 10737418) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} GiB")'
+
+          # ---- TiB branch ----
+          # 1 TiB = 1099511627776 B. frac-unit step = 1099511627776/100 = 10995116277 (floor).
+          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) TiB")["w"]) * 1099511627776 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) TiB")["f"]) * 10995116277) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} TiB")'
+
   k8sattributes:
     auth_type: serviceAccount
     passthrough: false
@@ -181,6 +250,10 @@ service:
       # body strings produced by the container parser and stamp the
       # customer-stable `dataloader.error_class` /
       # `dataloader.worker_pid` attributes pattern #7's detector
-      # consumes.
-      processors: [k8sattributes, transform/dataloader_errors, batch]
+      # consumes. `transform/cuda_oom` runs alongside dataloader_errors
+      # (order-insensitive — they gate on disjoint body substrings)
+      # to stamp `cuda_oom.tried_alloc_bytes` + `cuda_oom.gpu_index`
+      # off PyTorch's `RuntimeError: CUDA out of memory` line for
+      # pattern #10's detector.
+      processors: [k8sattributes, transform/dataloader_errors, transform/cuda_oom, batch]
       exporters: [otlphttp]
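
Note on the per-unit arithmetic in `transform/cuda_oom`: the four stanzas in the YAML diff above can be sanity-checked outside the collector. The following is a minimal Python sketch, not part of the recipe. It copies the multipliers and floored frac steps from the stanzas. The function names `tried_alloc_bytes` and `max_floor_loss` are hypothetical, and the single regex with a unit capture group is a convenience that Python allows but the OTTL recipe deliberately avoids.

```python
import re

# (bytes per unit, floored bytes per 0.01 unit), copied from the four stanzas.
UNITS = {
    "KiB": (1024, 10),
    "MiB": (1048576, 10485),
    "GiB": (1073741824, 10737418),
    "TiB": (1099511627776, 10995116277),
}

# Same guard substring as the recipe's `where IsMatch(...)` clause.
OOM_GUARD = re.compile(r"CUDA out of memory\. Tried to allocate")
# Assumption: one regex with a unit group stands in for the four branches.
ALLOC = re.compile(r"Tried to allocate (?P<w>\d+)\.(?P<f>\d{2}) (?P<unit>[KMGT]iB)")


def tried_alloc_bytes(body):
    """Return the value the matching stanza would stamp, or None if none fires."""
    if OOM_GUARD.search(body) is None:
        return None
    m = ALLOC.search(body)
    if m is None:
        return None
    base, step = UNITS[m["unit"]]
    return int(m["w"]) * base + int(m["f"]) * step


def max_floor_loss(unit):
    """Worst-case bytes lost to flooring UNIT/100, reached at frac == 99."""
    base, step = UNITS[unit]
    return 99 * base / 100 - 99 * step


line = ("RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB. "
        "GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free.")
print(tried_alloc_bytes(line))  # 2684354548 (exact 2.50 GiB is 2684354560)
print(max(max_floor_loss(u) for u in UNITS) < 100)  # True
```

For `2.50 GiB` the sketch lands 12 bytes under the exact value, which illustrates that the floored literal costs bytes, not megabytes. A line that fails the guard or the `\d{2}` capture returns `None`, mirroring a stanza that leaves the attribute unset.
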
diff --git a/docs/integrations/filelog-container.md b/docs/integrations/filelog-container.md
@@ -13,6 +13,11 @@ projects per-driver PyTorch `DataLoader` error vocabulary (FUSE, S3,
 Lustre, multiprocessing queue, worker-killed) onto the customer-stable
 `dataloader.error_class` / `dataloader.worker_pid` attributes that
 [pattern #7's detector](../patterns/07-dataloader-hang.md) consumes.
+A sibling `transform/cuda_oom` stanza projects PyTorch's
+`RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` line
+onto the customer-stable `cuda_oom.tried_alloc_bytes` (Int, bytes;
+unit-normalized) + `cuda_oom.gpu_index` (Int) attributes that
+[pattern #10's detector](../patterns/10-cuda-oom-deceptive.md) consumes.
 Replaces the in-tree `containerstdout` receiver scheduled for deletion
 at v0.2.0 per
 [RFC-0013 §migration PR-K](../rfcs/0013-distro-first-pivot.md#migration-rollout)
@@ -169,6 +174,63 @@ at `module/pkg/patterns/dataloader_hang.go`).
 > error classes (e.g. a future Ceph-class driver) extend the table
 > here, not by widening an existing regex.
 
+## `cuda_oom.*` attribute stanza (pattern #10)
+
+The `transform/cuda_oom` processor projects PyTorch's canonical
+out-of-memory stderr line — `RuntimeError: CUDA out of memory. Tried
+to allocate 2.00 GiB. GPU 0 has a total capacity of 79.18 GiB of
+which 16.00 GiB is free.` — onto the customer-stable
+[`cuda_oom.tried_alloc_bytes`](../ATTRIBUTES.md) +
+[`cuda_oom.gpu_index`](../ATTRIBUTES.md) attributes that
+[pattern #10's detector](../patterns/10-cuda-oom-deceptive.md)
+(`projectCUDAOOMLogRecord` at
+`module/processor/patterndetectorprocessor/cuda_oom.go`) consumes.
+The detector's projection gate is BOTH `cuda_oom.tried_alloc_bytes`
+AND `gpu.id` (PCI BDF per
+[RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts));
+this stanza stamps the bytes scalar and the human-visible GPU index
+off the body. `gpu.id` is **not** stamped here, because
+`cuda_oom.gpu_index` is the CUDA-runtime device ordinal, not a PCI
+BDF. Two operator-configurable paths populate `gpu.id` on the log
+resource, where the detector's resource-attr fallback reads it:
+
+| `gpu.id` source path | When to use |
+|---|---|
+| **k8sattributesprocessor + `nvidia.com/gpu` device-plugin resource** | The trainer pod requests one GPU via `resources.limits.nvidia.com/gpu: 1`. The NVIDIA device plugin annotates the pod with the allocated PCI BDF (`nvidia.com/gpu-PCIDeviceBusID` since device-plugin v0.16). Extend `k8sattributes::extract::annotations` to lift this annotation onto the log resource as `gpu.id`. Cheapest path — already in the cluster's GPU scheduling fabric. |
+| **DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`** | Multi-GPU pods (one container ↔ N GPUs) where the device-plugin annotation is the per-pod list, not the per-OOM GPU. Scrape the DCGM exporter's `DCGM_FI_DEV_PCI_BUSID` series, materialize a per-host `{gpu_index → BDF}` lookup, then add a sibling OTTL stanza that joins `cuda_oom.gpu_index` against the table to stamp `gpu.id`. Sibling to the [pattern-2 / pattern-10 DCGM recipe](prometheus-scrape.md). |
+
+The recipe uses four per-unit-prefix branches (KiB / MiB / GiB / TiB)
+because OTTL has no capture-group-conditional dispatch — the
+multiplier must be a literal `int64` per stanza. The body match
+captures `whole` (digits before the decimal) and `frac` (two digits
+after) and computes
+`Int(whole) * UNIT + Int(frac) * (UNIT / 100)`. PyTorch's
+`format_size` always emits `%.2f`, so the 2-digit `frac` capture is
+exhaustive. Flooring `UNIT / 100` loses under 1 byte per `frac` step
+(under 100 bytes in total); the ~10 MB granularity of a GiB-scale
+value comes from `%.2f` itself, far below the detector's 5% fragmentation threshold.
+
+| Body shape | Captured | Stamped attributes |
+|---|---|---|
+| `CUDA out of memory. Tried to allocate \d+\.\d{2} KiB` | `whole`, `frac` (2 digits) | `cuda_oom.tried_alloc_bytes = whole*1024 + frac*10` |
+| `CUDA out of memory. Tried to allocate \d+\.\d{2} MiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1048576 + frac*10485` |
+| `CUDA out of memory. Tried to allocate \d+\.\d{2} GiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1073741824 + frac*10737418` |
+| `CUDA out of memory. Tried to allocate \d+\.\d{2} TiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1099511627776 + frac*10995116277` |
+| `... GPU \d+ has a total capacity` | `idx` | `cuda_oom.gpu_index = idx` |
+
+The `where IsMatch(body, "CUDA out of memory\. Tried to allocate")`
+guard matches only the OOM-summary line, so other stderr errors
+(`an illegal memory access was encountered`, NCCL watchdog timeouts,
+`DataLoader worker (pid N) is killed`) do not trip the stanza, which
+keeps the detector quiet on non-OOM stderr noise.
+
+> **Multi-line tracebacks.** A PyTorch OOM emits a Python traceback
+> (`File "train.py", line 42, in ...`) that ends with the summary line.
+> The container parser emits each newline-delimited log line as its
+> own log record; only the summary line matches the regex above, so
+> the detector sees exactly one stamp per OOM event regardless of
+> traceback depth. This answers Open Question 2 in the pattern #10 spec.
+
 ## Placeholders
 
 | Placeholder | What to fill in |
@@ -192,6 +254,8 @@ fails immediately instead of silently dropping logs.
 | High-cardinality label explosion | The container parser surfaces every label from `app.kubernetes.io/name` plus whatever you add under `extract::labels`. Audit the list against the receiving backend's cardinality budget before adding more. |
 | Pattern #7 verdict never fires despite known DataLoader stalls | The `transform/dataloader_errors` stanzas gate on substring matches against the container `body`. If your trainer wraps DataLoader errors (e.g. a custom logger that prefixes with JSON), the body shape changes. Confirm via `kubectl logs <trainer-pod> --container=<c> --previous 2>&1 | grep -E 'DataLoader worker|Transport endpoint|SlowDown|Stale file handle'` and extend the regexes in `transform/dataloader_errors`. |
 | `dataloader.error_class` empty on a known error line | The OTTL stanza fell through silently — the body substring did not match any branch. Add a row to the table above and a matching `set(attributes["dataloader.error_class"], ...)` statement. The detector's projection gate requires the attribute, so a missing class drops the discriminator. |
+| Pattern #10 verdict never fires despite a known CUDA OOM | The `transform/cuda_oom` stanzas gate on substring matches against the container `body`. Confirm via `kubectl logs <trainer-pod> --container=<c> --previous 2>&1 \| grep -E 'CUDA out of memory\. Tried to allocate'`. If the trainer wraps PyTorch errors (custom logger, JSON envelope), the body shape changes — extend the `IsMatch` predicates to match the wrapper format. Also check that `gpu.id` is being stamped onto the log resource via one of the two paths in the `cuda_oom.*` section: a missing `gpu.id` drops the projection at `cuda_oom.go`'s gate and the detector stays quiet. |
+| `cuda_oom.tried_alloc_bytes` stamped with a wildly wrong magnitude | A unit-prefix branch was modified without updating its multiplier or its `frac` step. Body-shape drift from `%.2f` is a separate failure mode: if a customer fork emits `%.0f` or `%.4f`, the recipe's `\d{2}` capture misses and the stanza leaves the attribute unset rather than stamping a wrong value. Verify the format against the `format_size` implementation in the PyTorch source your trainer ships. |
 
 Upstream component docs:
 [`receiver/filelogreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver),

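Note on the `IsMatch` guard and the multi-line traceback behaviour described in the `cuda_oom.*` section above: the one-stamp-per-OOM claim can be illustrated with a minimal Python sketch. It is not part of the recipe. It assumes the container parser yields one log record per line and that `IsMatch` behaves like an unanchored regex search (modelled here with `re.search`). The function name `stamp_gpu_index` and the sample records are hypothetical.

```python
import re

# Same patterns as the GPU-index stanza, with single backslashes
# because Python raw strings need no YAML/OTTL escaping.
OOM_GUARD = re.compile(r"CUDA out of memory\. Tried to allocate")
GPU_INDEX = re.compile(r"GPU (?P<idx>\d+) has a total capacity")


def stamp_gpu_index(body):
    """Return the cuda_oom.gpu_index the stanza would stamp, or None."""
    if OOM_GUARD.search(body) is None:
        return None  # guard failed: traceback frames, other CUDA errors
    m = GPU_INDEX.search(body)
    return int(m["idx"]) if m else None


# One record per stderr line (assumed parser behaviour).
records = [
    "Traceback (most recent call last):",
    '  File "train.py", line 42, in <module>',
    "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. "
    "GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free.",
    "RuntimeError: CUDA error: an illegal memory access was encountered",
]

stamped = [stamp_gpu_index(r) for r in records]
print(stamped)  # [None, None, 0, None]
```

Only the summary record is stamped; the traceback frames and the unrelated CUDA error fall through the guard untouched.
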
diff --git a/docs/patterns/10-cuda-oom-deceptive.md b/docs/patterns/10-cuda-oom-deceptive.md
@@ -16,7 +16,7 @@ Training fails with `RuntimeError: CUDA out of memory. Tried to allocate X MiB.
 
 ## Signal sources
 
-- `filelogreceiver` tailing training-container stderr — OTTL stanza on the `RuntimeError: CUDA out of memory` line extracts `cuda_oom.tried_alloc_bytes`, `cuda_oom.total_bytes`, `cuda_oom.free_bytes`, `cuda_oom.gpu_index`.
+- `filelogreceiver` tailing training-container stderr — OTTL stanza on the `RuntimeError: CUDA out of memory` line extracts `cuda_oom.tried_alloc_bytes` (unit-normalized KiB/MiB/GiB/TiB → bytes) and `cuda_oom.gpu_index`. Recipe: [docs/integrations/filelog-container.md §`cuda_oom.*` attribute stanza (pattern #10)](../integrations/filelog-container.md#cuda_oom-attribute-stanza-pattern-10) (issue [#436](https://github.com/tracecoreai/tracecore/issues/436); sibling to detector PR [#338](https://github.com/tracecoreai/tracecore/pull/338) and metric-side recipe [#337](https://github.com/tracecoreai/tracecore/issues/337)). `gpu.id` (PCI BDF per RFC-0013 §3) is stamped via the sibling k8sattributesprocessor + `nvidia.com/gpu` device-plugin resource OR a DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index` — the recipe documents both paths.
 - `prometheusreceiver` scraping `dcgm-exporter` — `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` projected via OTTL transform to `hw.gpu.memory.{used,free}` Gauges (unit `By`) with `gpu.id` resource attr. Per-GPU `hw.gpu.memory.total = used + free` is computed at the metrics-to-logs bridge layer, not at OTTL — `transformprocessor` v0.130 cannot perform cross-series arithmetic on a metrics pipeline. Recipe: [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer); bridge log-shape spec: [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) (issue [#337](https://github.com/tracecoreai/tracecore/issues/337)).
 - (optional) `torch.cuda.memory_summary()` dump from a faulthandler / SIGUSR2 hook — far richer fragmentation detail; out of v1 scope.
 
@@ -72,6 +72,6 @@ Per issue #303 scalar-promotion checklist:
 ## Open questions
 
 1. **`DCGM_FI_DEV_FB_*` OTTL recipe extension.** Resolved by issue [#337](https://github.com/tracecoreai/tracecore/issues/337): metric-side projection (`DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used`, `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free`) ships in [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer). The `hw.gpu.memory.total = used + free` derivation + log-record emission belongs to the RFC-0014 PR-B `WithMetrics` bridge; the log-shape spec the bridge MUST honor is pinned in the recipe's [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) section.
-2. **filelogreceiver OTTL stanza for the OOM regex.** Sibling to #285. Multi-line `RuntimeError` traceback handling: OTTL recipe stops at the first stanza match or continues into the traceback?
+2. **filelogreceiver OTTL stanza for the OOM regex.** Resolved by issue [#436](https://github.com/tracecoreai/tracecore/issues/436): the `transform/cuda_oom` stanza ships in [docs/integrations/filelog-container.md §`cuda_oom.*` attribute stanza (pattern #10)](../integrations/filelog-container.md#cuda_oom-attribute-stanza-pattern-10). Each per-unit statement is gated by a `where IsMatch(body, "CUDA out of memory\. Tried to allocate \d+\.\d{2} <unit>")` guard, so traceback lines (`File "train.py", line 42, in ...`) do not match the OOM-summary regex and pass through untransformed; the detector receives one `cuda_oom.tried_alloc_bytes` stamp per OOM event regardless of traceback depth.
 3. **Metrics-path on patterndetectorprocessor.** Per ADR-0001 PR-B — the processor today consumes logs only. CUDA-OOM joins a log to a metric. Either the metric is projected to a log via the metrics→logs OTTL bridge (RFC-0014 PR-B, also blocking pattern #3 today), or the processor grows a metrics input.
 4. **`cuda_oom.kind` enum namespace.** Should this be `pattern.cuda_oom.kind` or top-level `cuda_oom.kind`? ATTRIBUTES.md prefers `pattern.*` for tracecore-internal verdict scalars, but issue #303 used `cuda_oom.kind` directly. Reconcile.