77 changes: 75 additions & 2 deletions docs/integrations/examples/filelog-container.yaml
@@ -133,6 +133,75 @@ processors:
# directly). Catch-all for the "transport died" runbook
# branch.
- 'set(attributes["dataloader.error_class"], "Connection reset by peer") where IsMatch(body, "Connection reset by peer") and attributes["dataloader.error_class"] == nil'

# Project PyTorch's `RuntimeError: CUDA out of memory. Tried to
# allocate X.YY <unit>. GPU N has a total capacity of ...` stderr
# line onto the customer-stable `cuda_oom.tried_alloc_bytes` (Int,
# bytes) + `cuda_oom.gpu_index` (Int) attributes that pattern #10's
# detector (module/processor/patterndetectorprocessor/cuda_oom.go,
# `projectCUDAOOMLogRecord`) consumes. The detector's projection
# gate is BOTH `cuda_oom.tried_alloc_bytes` AND `gpu.id` (PCI BDF
# per RFC-0013 §3); the stanzas below stamp the bytes scalar and
# the human-visible GPU index off the body. `gpu.id` is the
# operator-configurable mapping — `cuda_oom.gpu_index` is the
# CUDA-runtime ordinal that PyTorch's allocator prints, NOT a PCI
# BDF, so the recipe DOES NOT alias it onto `gpu.id`. Two paths
# to populate `gpu.id` are documented in
# docs/integrations/filelog-container.md §"`cuda_oom.*` attribute
# stanza (pattern #10)":
# (a) k8sattributesprocessor + `nvidia.com/gpu` device-plugin
# resource — the pod allocation maps to one PCI BDF, lifted
# onto the log resource as `gpu.id`.
# (b) a sibling DCGM BDF-lookup transform indexed by
# `cuda_oom.gpu_index` — the DCGM exporter ships a per-host
# (index → BDF) table on its scrape endpoint.
# Either path stamps `gpu.id` on the resource; the detector's
# resource-attr fallback (cuda_oom.go:65) reads it from there.
#
# Unit normalization: PyTorch's `format_size` emits `%.2f <unit>`
# with the four IEC binary prefixes below (KiB / MiB / GiB / TiB).
# OTTL Math Expressions support `*` and `+` on int64, so we capture
# `whole` (digits before the dot) + 2-digit `frac` (digits after)
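# and compute `Int(whole)*UNIT + Int(frac)*(UNIT/100)`, with
# `UNIT/100` pre-floored into each stanza's literal. Flooring the
# per-frac-unit step loses under 1 byte per step (under 100 bytes
# per record). The coarser limit is the printed `%.2f` resolution
# itself: one frac step is 1% of the unit base (~10 MiB on a
# 99.99 GiB alloc), below the detector's 5% fragmentation threshold.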
#
# Per-unit branches instead of one omnibus regex: OTTL has no
# capture-group-conditional dispatch, so the multiplier must be a
# literal int64 per stanza. The four-row repetition is the smallest
# shape that compiles. The `where IsMatch(...)` guard is tight on
# `CUDA out of memory\. Tried to allocate` so a generic CUDA error
# (illegal memory access, NCCL watchdog) does not trip the stanza.
transform/cuda_oom:
log_statements:
- context: log
statements:
# ---- GPU index extraction (any OOM line) ----
# PyTorch prints `GPU N has a total capacity of ...` after
# the alloc-size scalar. The index is the CUDA-runtime
# ordinal, NOT a PCI BDF — the detector's `gpu.id` projection
# is satisfied via the k8sattributes / DCGM-lookup paths
# documented above; `cuda_oom.gpu_index` is operator-facing
# context the verdict's evidence trail uses.
- 'set(attributes["cuda_oom.gpu_index"], Int(ExtractPatterns(body, "GPU (?P<idx>\\d+) has a total capacity")["idx"])) where IsMatch(body, "CUDA out of memory\\. Tried to allocate") and IsMatch(body, "GPU \\d+ has a total capacity")'

# ---- KiB branch ----
# 1 KiB = 1024 B. frac-unit step = 1024/100 = 10 (floor).
- 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) KiB")["w"]) * 1024 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) KiB")["f"]) * 10) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} KiB")'

# ---- MiB branch ----
# 1 MiB = 1048576 B. frac-unit step = 1048576/100 = 10485 (floor).
- 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) MiB")["w"]) * 1048576 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) MiB")["f"]) * 10485) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} MiB")'

# ---- GiB branch ----
# 1 GiB = 1073741824 B. frac-unit step = 1073741824/100 = 10737418 (floor).
- 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) GiB")["w"]) * 1073741824 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) GiB")["f"]) * 10737418) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} GiB")'

# ---- TiB branch ----
# 1 TiB = 1099511627776 B. frac-unit step = 1099511627776/100 = 10995116277 (floor).
- 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) TiB")["w"]) * 1099511627776 + Int(ExtractPatterns(body, "Tried to allocate (?P<w>\\d+)\\.(?P<f>\\d{2}) TiB")["f"]) * 10995116277) where IsMatch(body, "CUDA out of memory\\. Tried to allocate \\d+\\.\\d{2} TiB")'

k8sattributes:
auth_type: serviceAccount
passthrough: false
@@ -181,6 +250,10 @@ service:
# body strings produced by the container parser and stamp the
# customer-stable `dataloader.error_class` /
# `dataloader.worker_pid` attributes pattern #7's detector
# consumes.
processors: [k8sattributes, transform/dataloader_errors, batch]
# consumes. `transform/cuda_oom` runs alongside dataloader_errors
# (order-insensitive — they gate on disjoint body substrings)
# to stamp `cuda_oom.tried_alloc_bytes` + `cuda_oom.gpu_index`
# off PyTorch's `RuntimeError: CUDA out of memory` line for
# pattern #10's detector.
processors: [k8sattributes, transform/dataloader_errors, transform/cuda_oom, batch]
exporters: [otlphttp]
64 changes: 64 additions & 0 deletions docs/integrations/filelog-container.md
@@ -13,6 +13,11 @@ projects per-driver PyTorch `DataLoader` error vocabulary (FUSE, S3,
Lustre, multiprocessing queue, worker-killed) onto the customer-stable
`dataloader.error_class` / `dataloader.worker_pid` attributes that
[pattern #7's detector](../patterns/07-dataloader-hang.md) consumes.
A sibling `transform/cuda_oom` stanza projects PyTorch's
`RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` line
onto the customer-stable `cuda_oom.tried_alloc_bytes` (Int, bytes;
unit-normalized) + `cuda_oom.gpu_index` (Int) attributes that
[pattern #10's detector](../patterns/10-cuda-oom-deceptive.md) consumes.
Replaces the in-tree `containerstdout` receiver scheduled for deletion
at v0.2.0 per
[RFC-0013 §migration PR-K](../rfcs/0013-distro-first-pivot.md#migration-rollout)
@@ -169,6 +174,63 @@ at `module/pkg/patterns/dataloader_hang.go`).
> error classes (e.g. a future Ceph-class driver) extend the table
> here, not by widening an existing regex.

## `cuda_oom.*` attribute stanza (pattern #10)

The `transform/cuda_oom` processor projects PyTorch's canonical
out-of-memory stderr line — `RuntimeError: CUDA out of memory. Tried
to allocate 2.00 GiB. GPU 0 has a total capacity of 79.18 GiB of
which 16.00 GiB is free.` — onto the customer-stable
[`cuda_oom.tried_alloc_bytes`](../ATTRIBUTES.md) +
[`cuda_oom.gpu_index`](../ATTRIBUTES.md) attributes that
[pattern #10's detector](../patterns/10-cuda-oom-deceptive.md)
(`projectCUDAOOMLogRecord` at
`module/processor/patterndetectorprocessor/cuda_oom.go`) consumes.
The detector's projection gate is BOTH `cuda_oom.tried_alloc_bytes`
AND `gpu.id` (PCI BDF per
[RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts));
this stanza stamps the bytes scalar and the human-visible GPU index
off the body. `gpu.id` is **not** stamped here: `cuda_oom.gpu_index`
is the CUDA-runtime ordinal that PyTorch's allocator prints, not a
PCI BDF. Two operator-configurable paths populate `gpu.id` on the log
resource, where the detector's resource-attr fallback reads it:

| `gpu.id` source path | When to use |
|---|---|
| **k8sattributesprocessor + `nvidia.com/gpu` device-plugin resource** | The trainer pod requests one GPU via `resources.limits.nvidia.com/gpu: 1`. The NVIDIA device plugin annotates the pod with the allocated PCI BDF (`nvidia.com/gpu-PCIDeviceBusID` since device-plugin v0.16). Extend `k8sattributes::extract::annotations` to lift this annotation onto the log resource as `gpu.id`. Cheapest path — already in the cluster's GPU scheduling fabric. |
| **DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`** | Multi-GPU pods (one container ↔ N GPUs) where the device-plugin annotation is the per-pod list, not the per-OOM GPU. Scrape the DCGM exporter's `DCGM_FI_DEV_PCI_BUSID` series, materialize a per-host `{gpu_index → BDF}` lookup, then add a sibling OTTL stanza that joins `cuda_oom.gpu_index` against the table to stamp `gpu.id`. Sibling to the [pattern-2 / pattern-10 DCGM recipe](prometheus-scrape.md). |
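
The join in the second row can be sketched in Python as a plain dictionary lookup. This is a minimal illustration, not the collector's implementation: `GPU_INDEX_TO_BDF`, `stamp_gpu_id`, and the BDF values are hypothetical, and the sketch assumes the CUDA ordinal printed by PyTorch coincides with the table key (a process restricted by `CUDA_VISIBLE_DEVICES` sees remapped ordinals, so a real table would need to account for that).

```python
from typing import Optional

# Hypothetical per-host table, as it might be materialized from the
# DCGM exporter's PCI bus ID series. The BDF values are made up.
GPU_INDEX_TO_BDF = {
    0: "00000000:17:00.0",
    1: "00000000:65:00.0",
}


def stamp_gpu_id(attributes: dict, resource: dict, table: dict) -> Optional[str]:
    """Copy the BDF for cuda_oom.gpu_index onto the resource as gpu.id.

    Leaves the resource untouched when the index is absent or unknown.
    """
    idx = attributes.get("cuda_oom.gpu_index")
    bdf = table.get(idx)
    if bdf is not None:
        resource["gpu.id"] = bdf
    return bdf


record_attributes = {"cuda_oom.gpu_index": 1}
record_resource = {}
stamp_gpu_id(record_attributes, record_resource, GPU_INDEX_TO_BDF)
print(record_resource)
```

An unknown index leaves `gpu.id` unset, which matches the recipe's preference for a missing stamp over a wrong one.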

The recipe uses four per-unit-prefix branches (KiB / MiB / GiB / TiB)
because OTTL has no capture-group-conditional dispatch — the
multiplier must be a literal `int64` per stanza. The body match
captures `whole` (digits before the decimal) and `frac` (two digits
after) and computes
`Int(whole) * UNIT + Int(frac) * (UNIT / 100)`. PyTorch's
`format_size` always emits `%.2f`, so the 2-digit `frac` capture is
exhaustive. Flooring `UNIT / 100` to an integer loses under 1 byte
per frac step (under 100 bytes per record); the coarser limit is the
printed two-decimal resolution itself, where one frac step is 1% of
the unit base (~10 MiB on a 99.99 GiB alloc), below the detector's 5%
fragmentation threshold.
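
The arithmetic above can be modeled in a few lines of Python to check how much the floored step costs. This is an illustrative sketch of the OTTL expression, not the collector's code; `UNIT_BYTES`, `ALLOC_RE`, and `tried_alloc_bytes` are hypothetical names introduced here.

```python
import re
from typing import Optional

# Hypothetical names; the multipliers mirror the literals in the recipe.
UNIT_BYTES = {
    "KiB": 1024,
    "MiB": 1024**2,
    "GiB": 1024**3,
    "TiB": 1024**4,
}

ALLOC_RE = re.compile(
    r"Tried to allocate (?P<w>\d+)\.(?P<f>\d{2}) (?P<unit>KiB|MiB|GiB|TiB)"
)


def tried_alloc_bytes(body: str) -> Optional[int]:
    """Return whole*UNIT + frac*floor(UNIT/100), or None if the body does not match."""
    m = ALLOC_RE.search(body)
    if m is None:
        return None
    unit = UNIT_BYTES[m["unit"]]
    return int(m["w"]) * unit + int(m["f"]) * (unit // 100)


line = ("RuntimeError: CUDA out of memory. Tried to allocate 1.50 MiB. "
        "GPU 0 has a total capacity of 79.18 GiB")
approx = tried_alloc_bytes(line)
exact = int(1.50 * 1024**2)
print(approx, exact - approx)  # stamped value and the bytes lost to flooring
```

Python can pick the multiplier from the captured unit, so one regex suffices here; the OTTL recipe needs one stanza per unit because its multiplier has to be a literal.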

| Body shape | Captured | Stamped attributes |
|---|---|---|
| `CUDA out of memory. Tried to allocate \d+\.\d{2} KiB` | `whole`, `frac` (2 digits) | `cuda_oom.tried_alloc_bytes = whole*1024 + frac*10` |
| `CUDA out of memory. Tried to allocate \d+\.\d{2} MiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1048576 + frac*10485` |
| `CUDA out of memory. Tried to allocate \d+\.\d{2} GiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1073741824 + frac*10737418` |
| `CUDA out of memory. Tried to allocate \d+\.\d{2} TiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1099511627776 + frac*10995116277` |
| `... GPU \d+ has a total capacity` | `idx` | `cuda_oom.gpu_index = idx` |

The `where IsMatch(body, "CUDA out of memory\. Tried to allocate")`
guard matches only the OOM summary line, so unrelated stderr lines
(`an illegal memory access was encountered`, NCCL watchdog timeouts,
`DataLoader worker (pid N) is killed`) do not trip the stanza, which
keeps the detector quiet on non-OOM noise.
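
The gating on the GPU-index stanza can be approximated with Python's `re` module. This is a sketch only: `OOM_GUARD`, `GPU_INDEX`, and `gpu_index` are hypothetical names, and it assumes Python's `re.search` behaves like the unanchored match the recipe relies on for these simple patterns.

```python
import re
from typing import Optional

OOM_GUARD = re.compile(r"CUDA out of memory\. Tried to allocate")
GPU_INDEX = re.compile(r"GPU (?P<idx>\d+) has a total capacity")


def gpu_index(body: str) -> Optional[int]:
    """Return the CUDA ordinal from an OOM summary line, else None."""
    if OOM_GUARD.search(body) is None:
        return None  # not an OOM summary line: leave the record unstamped
    m = GPU_INDEX.search(body)
    return int(m["idx"]) if m else None


oom = ("RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. "
       "GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free.")
other = "RuntimeError: CUDA error: an illegal memory access was encountered"

print(gpu_index(oom))
print(gpu_index(other))
```

Both conditions must hold, so a line that mentions GPU capacity without the OOM prefix is left unstamped.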

> **Multi-line tracebacks.** A PyTorch OOM emits the summary line
> alongside a Python traceback (`File "train.py", line 42, in ...`);
> in a standard Python traceback the frame lines come before the
> `RuntimeError` line.
> The container parser emits each newline-delimited line as its own
> log record; only the summary line matches the regex above, so the
> detector sees exactly one stamp per OOM event regardless of
> traceback depth. This answers Open Question 2 in the pattern #10 spec.
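
The one-stamp-per-event behavior can be checked with a small Python sketch that splits a sample stderr into per-line records. The traceback text is made up for illustration, and `OOM_SUMMARY` is a hypothetical name that folds the four per-unit guards into one pattern.

```python
import re

# Hypothetical single pattern standing in for the four per-unit guards.
OOM_SUMMARY = re.compile(
    r"CUDA out of memory\. Tried to allocate \d+\.\d{2} (KiB|MiB|GiB|TiB)"
)

# Made-up stderr from a failing trainer, one log record per line.
stderr = "\n".join([
    "Traceback (most recent call last):",
    '  File "train.py", line 42, in <module>',
    "    loss.backward()",
    "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. "
    "GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free.",
])

records = stderr.splitlines()
stamped = [r for r in records if OOM_SUMMARY.search(r)]
print(len(records), len(stamped))
```

Adding more frame lines to the sample changes the record count but not the number of stamped records.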

## Placeholders

| Placeholder | What to fill in |
@@ -192,6 +254,8 @@ fails immediately instead of silently dropping logs.
| High-cardinality label explosion | The container parser surfaces every label from `app.kubernetes.io/name` plus whatever you add under `extract::labels`. Audit the list against the receiving backend's cardinality budget before adding more. |
| Pattern #7 verdict never fires despite known DataLoader stalls | The `transform/dataloader_errors` stanzas gate on substring matches against the container `body`. If your trainer wraps DataLoader errors (e.g. a custom logger that prefixes with JSON), the body shape changes. Confirm via `kubectl logs <trainer-pod> --container=<c> --previous 2>&1 | grep -E 'DataLoader worker|Transport endpoint|SlowDown|Stale file handle'` and extend the regexes in `transform/dataloader_errors`. |
| `dataloader.error_class` empty on a known error line | The OTTL stanza fell through silently — the body substring did not match any branch. Add a row to the table above and a matching `set(attributes["dataloader.error_class"], ...)` statement. The detector's projection gate requires the attribute, so a missing class drops the discriminator. |
| Pattern #10 verdict never fires despite a known CUDA OOM | The `transform/cuda_oom` stanzas gate on substring matches against the container `body`. Confirm via `kubectl logs <trainer-pod> --container=<c> --previous 2>&1 \| grep -E 'CUDA out of memory\. Tried to allocate'`. If the trainer wraps PyTorch errors (custom logger, JSON envelope), the body shape changes — extend the `IsMatch` predicates to match the wrapper format. Also check that `gpu.id` is being stamped onto the log resource via one of the two paths in the `cuda_oom.*` section: a missing `gpu.id` drops the projection at `cuda_oom.go`'s gate and the detector stays quiet. |
| `cuda_oom.tried_alloc_bytes` stamped with a wildly wrong magnitude | A unit-prefix branch was modified without updating its multiplier, or the body shape drifted from `%.2f`. PyTorch's `format_size` has used `%.2f` for the entire CUDA-allocator lifetime; if a customer fork emits `%.0f` or `%.4f` the recipe's `\d{2}` capture misses, and the stanza fails open (no stamp) rather than producing a wrong value. Verify against `pytorch/c10/util/Exception.h`'s formatter. |
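
The fail-open behavior in the last row can be confirmed with a short Python check. This is a sketch under the assumption that Python's `re` agrees with the collector's regex engine on this pattern; `GIB_BRANCH` and `gib_branch_fires` are hypothetical names.

```python
import re

# Same shape as the GiB branch guard in the recipe.
GIB_BRANCH = re.compile(r"CUDA out of memory\. Tried to allocate \d+\.\d{2} GiB")


def gib_branch_fires(body: str) -> bool:
    """True when the GiB stanza's guard would match this body."""
    return GIB_BRANCH.search(body) is not None


prefix = "RuntimeError: CUDA out of memory. Tried to allocate "
for rendered in ("2.00 GiB", "2 GiB", "2.0000 GiB"):
    print(rendered, gib_branch_fires(prefix + rendered + ". GPU 0 has a total capacity of 79.18 GiB"))
```

Only the two-decimal rendering matches, so a drifted format leaves the record unstamped instead of stamping a wrong byte count.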

Upstream component docs:
[`receiver/filelogreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver),
4 changes: 2 additions & 2 deletions docs/patterns/10-cuda-oom-deceptive.md
@@ -16,7 +16,7 @@ Training fails with `RuntimeError: CUDA out of memory. Tried to allocate X MiB.

## Signal sources

- `filelogreceiver` tailing training-container stderr — OTTL stanza on the `RuntimeError: CUDA out of memory` line extracts `cuda_oom.tried_alloc_bytes`, `cuda_oom.total_bytes`, `cuda_oom.free_bytes`, `cuda_oom.gpu_index`.
- `filelogreceiver` tailing training-container stderr — OTTL stanza on the `RuntimeError: CUDA out of memory` line extracts `cuda_oom.tried_alloc_bytes` (unit-normalized KiB/MiB/GiB/TiB → bytes) and `cuda_oom.gpu_index`. Recipe: [docs/integrations/filelog-container.md §`cuda_oom.*` attribute stanza (pattern #10)](../integrations/filelog-container.md#cuda_oom-attribute-stanza-pattern-10) (issue [#436](https://github.com/tracecoreai/tracecore/issues/436); sibling to detector PR [#338](https://github.com/tracecoreai/tracecore/pull/338) and metric-side recipe [#337](https://github.com/tracecoreai/tracecore/issues/337)). `gpu.id` (PCI BDF per RFC-0013 §3) is stamped via the sibling k8sattributesprocessor + `nvidia.com/gpu` device-plugin resource OR a DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index` — the recipe documents both paths.
- `prometheusreceiver` scraping `dcgm-exporter` — `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` projected via OTTL transform to `hw.gpu.memory.{used,free}` Gauges (unit `By`) with `gpu.id` resource attr. Per-GPU `hw.gpu.memory.total = used + free` is computed at the metrics-to-logs bridge layer, not at OTTL — `transformprocessor` v0.130 cannot perform cross-series arithmetic on a metrics pipeline. Recipe: [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer); bridge log-shape spec: [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) (issue [#337](https://github.com/tracecoreai/tracecore/issues/337)).
- (optional) `torch.cuda.memory_summary()` dump from a faulthandler / SIGUSR2 hook — far richer fragmentation detail; out of v1 scope.

@@ -72,6 +72,6 @@ Per issue #303 scalar-promotion checklist:
## Open questions

1. **`DCGM_FI_DEV_FB_*` OTTL recipe extension.** Resolved by issue [#337](https://github.com/tracecoreai/tracecore/issues/337): metric-side projection (`DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used`, `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free`) ships in [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer). The `hw.gpu.memory.total = used + free` derivation + log-record emission belongs to the RFC-0014 PR-B `WithMetrics` bridge; the log-shape spec the bridge MUST honor is pinned in the recipe's [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) section.
2. **filelogreceiver OTTL stanza for the OOM regex.** Sibling to #285. Multi-line `RuntimeError` traceback handling: OTTL recipe stops at the first stanza match or continues into the traceback?
2. **filelogreceiver OTTL stanza for the OOM regex.** Resolved by issue [#436](https://github.com/tracecoreai/tracecore/issues/436): the `transform/cuda_oom` stanza ships in [docs/integrations/filelog-container.md §`cuda_oom.*` attribute stanza (pattern #10)](../integrations/filelog-container.md#cuda_oom-attribute-stanza-pattern-10). The recipe stops at the per-unit `where IsMatch(...) ... Tried to allocate \d+\.\d{2} <unit>` guard — multi-line traceback lines (`File "train.py", line 42, in ...`) do not match the OOM-summary regex and pass through untransformed, so the detector receives one `cuda_oom.tried_alloc_bytes` stamp per OOM event regardless of traceback depth.
3. **Metrics-path on patterndetectorprocessor.** Per ADR-0001 PR-B — the processor today consumes logs only. CUDA-OOM joins a log to a metric. Either the metric is projected to a log via the metrics→logs OTTL bridge (RFC-0014 PR-B, also blocking pattern #3 today), or the processor grows a metrics input.
4. **`cuda_oom.kind` enum namespace.** Should this be `pattern.cuda_oom.kind` or top-level `cuda_oom.kind`? ATTRIBUTES.md prefers `pattern.*` for tracecore-internal verdict scalars, but issue #303 used `cuda_oom.kind` directly. Reconcile.