feat(integrations/examples): pattern-10 CUDA OOM filelog OTTL stanza #451
Adds the `transform/cuda_oom` processor to `docs/integrations/examples/filelog-container.yaml` that projects PyTorch's canonical `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ...` stderr line onto the customer-stable `cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) attributes that pattern #10's detector (PR #338) reads via `projectCUDAOOMLogRecord`. Closes the load-bearing filelog→detector integration gap flagged in issue #303 follow-ups.

Per-unit-branch shape (one stanza per KiB/MiB/GiB/TiB prefix) because OTTL has no capture-group-conditional dispatch: the multiplier must be a literal int64 per stanza. Uses OTTL Math Expressions (`Int(whole)*UNIT + Int(frac)*(UNIT/100)`) to handle PyTorch's `%.2f` `format_size` output; precision loss capped at <1% of the unit base, three orders of magnitude under the detector's 5% fragmentation threshold.

`gpu.id` (PCI BDF per RFC-0013 §3) is NOT stamped by this transform: the CUDA-runtime ordinal `cuda_oom.gpu_index` is not a PCI BDF. Two operator-configurable paths are documented in the recipe markdown: (a) k8sattributesprocessor + `nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's resource-attr fallback reads `gpu.id` from the log resource either way.

Tests (TDD red→green): three new recipe-parity tests under `module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:
- `TestRecipe_CUDAOOM_StanzaPinsWireContract`: pins the 7 load-bearing tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`, KiB/MiB/GiB/TiB unit prefixes, `transform/cuda_oom`) + pipeline-wiring against the live projector. Mirrors PR #393's IB-flap shape.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict`: end-to-end gate. Log records carrying the exact attribute shape the recipe stamps flow through CUDAOOMDetector and emit a kind=fragmentation verdict with the expected scalar-promotion contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages`: 5 canonical positive PyTorch OOM messages (KiB/MiB/GiB/TiB/fractional) + 3 negative messages (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the >=3-positive A-tier acceptance criterion.

Validates clean:
- `tracecore validate docs/integrations/examples/filelog-container.yaml` exits 0.
- `make validator-recipe` covers this file (tested-against: tracecore).
- `make doc-check` resolves the new pattern-10 cross-link.
- Full `make ci-fast` green.

Cross-links:
- Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" now references this recipe and resolves Open Question #2 ("filelogreceiver OTTL stanza for the OOM regex").
- Recipe markdown: new §`cuda_oom.*` attribute stanza (pattern #10) in `docs/integrations/filelog-container.md` with the unit-normalization arithmetic table, the two `gpu.id` source paths, and a Failure-modes row.

Closes #436. Refs #338, #303, #337.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
B/A/A+ Criteria
- B (Acceptable): Recipe ships; regex matches target OOM patterns; at least one integration test passes; operator can copy+paste into their environment.
- A (Good): B + recipe fully documents both gpu.id sources and unit normalization; all 3+ integration tests pass; OTTL stanzas are justified (not redundant); operator has clear guidance on which gpu.id path applies to their stack.
- A+ (Excellent): A + simplicity-first (no dead code, <3-rule enforced, fold stanzas if safe); regex is robust (mixed-locale, version variance); multi-line traceback handling documented; processor name follows sibling conventions; ci-fast runtime acceptable; PR body signals close of #436 without defer.

Findings

Simplification Sweep
Clean. No dead code, no premature helpers (atoi is test-only, 4 call sites), no redundant abstractions. Comment density in the YAML preamble (28 lines) is load-bearing for operators verifying gpu.id paths and unit math. Repetition across YAML/markdown sections is justified by audience (config authors vs. operator runbooks).

VERDICT: A+. Closes the load-bearing input gap for the pattern #10 detector.
Ship as-is.
Summary
- Adds the `transform/cuda_oom` OTTL processor to `docs/integrations/examples/filelog-container.yaml`, stamping `cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) off PyTorch's canonical `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ...` stderr line.
- `projectCUDAOOMLogRecord` (`module/processor/patterndetectorprocessor/cuda_oom.go`) gates on `cuda_oom.tried_alloc_bytes` + `gpu.id`, but no upstream recipe stamped them, so the compiled detector received no real input at runtime.

Root cause
Issue #303's deliverable list included `projectCUDAOOMLogRecord` (shipped in PR #338) but explicitly deferred the filelog OTTL stanza to a sibling follow-up (issue #285 / #436). The detector compiled green and its wiring tests passed against synthetic plog input, but production stderr never carried the customer-stable attributes the projector reads. This PR is the missing link: a recipe-only change with zero detector-source edits.

Recipe design
- Per-unit-branch shape (one stanza per KiB/MiB/GiB/TiB prefix): OTTL has no capture-group-conditional dispatch, so the multiplier must be a literal `int64` per stanza.
- OTTL Math Expressions (`Int(whole)*UNIT + Int(frac)*(UNIT/100)`) run against PyTorch's `%.2f` `format_size` shape (verified against `c10/cuda/CUDACachingAllocator.cpp`). Integer division by 100 caps the per-frac-unit precision loss at <1% of the unit base, three orders of magnitude under the detector's 5% fragmentation threshold.
- `gpu.id` is NOT stamped here: the CUDA-runtime ordinal `cuda_oom.gpu_index` is not a PCI BDF. The recipe markdown documents two operator paths: (a) k8sattributesprocessor + `nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's resource-attr fallback reads `gpu.id` off the log resource either way.
- `where IsMatch` guard on `CUDA out of memory\. Tried to allocate`: generic CUDA errors (illegal memory access, NCCL watchdog, DataLoader worker killed) do not trip the stanza.

Tests
TDD red → green via three new tests in `module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:
- `TestRecipe_CUDAOOM_StanzaPinsWireContract`: pins 7 load-bearing tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`, KiB/MiB/GiB/TiB, `transform/cuda_oom`) + pipeline-wiring against the live projector.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict`: end-to-end gate. Recipe-shaped log records flow through `CUDAOOMDetector` and emit a `kind=fragmentation` verdict with the expected scalar-promotion contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages`: 5 canonical positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the ≥3-positive A-tier acceptance criterion from #436.

Self-grade: A+
- […] `tracecore validate` exit 0); regex extracts bytes + GPU index with unit normalization; documented. ✓
- `make validator-recipe` covers this file; regex tested against ≥3 canonical messages (5 positives total); negative cases verified. ✓
- […] `IsMatch` guard); cross-linked from `docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" + Open Question #2; new §`cuda_oom.*` attribute stanza in `docs/integrations/filelog-container.md` with unit-normalization arithmetic table, two `gpu.id` source paths, and a Failure-modes row. ✓

Cross-references
- `module/processor/patterndetectorprocessor/cuda_oom.go`.
- `docs/integrations/examples/prometheus-scrape.yaml`.
- `docs/patterns/10-cuda-oom-deceptive.md`: Open Q#2 resolved.
- […] `docs/integrations/examples/<target>.yaml`).

Test plan
- `go test ./processor/patterndetectorprocessor/ -run TestRecipe_CUDAOOM -count=1 -v`: PASS (3 tests, 8 sub-tests)
- `go test ./processor/patterndetectorprocessor/ -count=1`: PASS (no regressions)
- `make build`: `_build/tracecore` compiles via OCB
- `./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml`: exit 0
- `make validator-recipe`: 9 validated, 3 skipped (non-linux host) of 12 recipe(s)
- `make doc-check`: PASS (new cross-link resolves)
- `make ci-fast`: PASS (lint, vet, mod-verify, attribute-namespace-check, doc-check)

Closes #436.
Refs #338, #303, #337.
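
For reviewers who want to sanity-check the unit-normalization arithmetic described under Recipe design, here is a minimal Go sketch of the same projection. It is an illustration only: the actual recipe is an OTTL stanza, and the regex, the `projectOOM` helper, and the `unitBytes` table below are hypothetical names and patterns assumed for this example, not code from the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomLine follows the message shape quoted in the Summary. The exact pattern
// used by the recipe is not shown in this PR text, so this regex is an assumption.
var oomLine = regexp.MustCompile(
	`CUDA out of memory\. Tried to allocate (\d+)\.(\d{2}) (KiB|MiB|GiB|TiB)\. GPU (\d+) has a total capacity of`)

// unitBytes holds one literal int64 multiplier per unit prefix, mirroring the
// one-stanza-per-unit shape of the recipe.
var unitBytes = map[string]int64{
	"KiB": 1 << 10,
	"MiB": 1 << 20,
	"GiB": 1 << 30,
	"TiB": 1 << 40,
}

// projectOOM (hypothetical helper) returns the values the recipe would stamp as
// cuda_oom.tried_alloc_bytes and cuda_oom.gpu_index, or ok=false when the line
// is not a CUDA OOM message.
func projectOOM(line string) (triedAllocBytes, gpuIndex int64, ok bool) {
	m := oomLine.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, false
	}
	whole, err1 := strconv.ParseInt(m[1], 10, 64)
	frac, err2 := strconv.ParseInt(m[2], 10, 64)
	gpu, err3 := strconv.ParseInt(m[4], 10, 64)
	if err1 != nil || err2 != nil || err3 != nil {
		return 0, 0, false
	}
	unit := unitBytes[m[3]]
	// Same shape as Int(whole)*UNIT + Int(frac)*(UNIT/100); unit/100 is
	// integer division, so the fractional part is slightly truncated.
	return whole*unit + frac*(unit/100), gpu, true
}

func main() {
	line := "RuntimeError: CUDA out of memory. Tried to allocate 1.50 GiB. GPU 3 has a total capacity of 79.15 GiB"
	n, gpu, ok := projectOOM(line)
	fmt.Println(n, gpu, ok) // prints: 1610612724 3 true
}
```

For the `1.50 GiB` input the sketch yields 1610612724 bytes, 12 bytes below the exact 1610612736, which shows the truncation introduced by the integer `UNIT/100` term. A generic CUDA error such as an illegal-memory-access message does not match the pattern and returns `ok == false`.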