feat(integrations/examples): pattern-10 CUDA OOM filelog OTTL stanza by trilamsr · Pull Request #451 · TraceCoreAI/tracecore

trilamsr · 2026-06-02T02:01:11Z

Summary

Adds the transform/cuda_oom OTTL processor to docs/integrations/examples/filelog-container.yaml, stamping cuda_oom.tried_alloc_bytes (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and cuda_oom.gpu_index (Int) off PyTorch's canonical RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ... stderr line.
Closes the integration gap that pattern #10's detector (PR #338) has carried since merge: projectCUDAOOMLogRecord (module/processor/patterndetectorprocessor/cuda_oom.go) gates on cuda_oom.tried_alloc_bytes + gpu.id, but no upstream recipe stamped them, so the compiled detector received no real input at runtime.

Root cause

Issue #303's deliverable list included projectCUDAOOMLogRecord (shipped in PR #338) but explicitly deferred the filelog OTTL stanza to a sibling follow-up (issue #285 / #436). The detector compiled green and its wiring tests passed against synthetic plog input, but production stderr never carried the customer-stable attributes the projector reads. This PR is the missing link — a recipe-only change with zero detector-source edits.

Recipe design

Per-unit-branch shape (KiB / MiB / GiB / TiB) because OTTL has no capture-group-conditional dispatch — the multiplier must be a literal int64 per stanza.
Unit normalization via OTTL Math Expressions: Int(whole)*UNIT + Int(frac)*(UNIT/100) against PyTorch's %.2f format_size shape (verified against c10/cuda/CUDACachingAllocator.cpp). Because UNIT/100 is an integer division, each hundredth loses under one byte, which keeps the per-frac-unit precision loss at <1% of the unit base, three orders of magnitude under the detector's 5% fragmentation threshold.
gpu.id is NOT stamped here: the CUDA-runtime ordinal cuda_oom.gpu_index is not a PCI BDF. The recipe markdown documents two operator paths: (a) k8sattributesprocessor + nvidia.com/gpu-PCIDeviceBusID device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by cuda_oom.gpu_index. The detector's resource-attr fallback reads gpu.id off the log resource either way.
Tight where IsMatch guard on CUDA out of memory\. Tried to allocate — generic CUDA errors (illegal memory access, NCCL watchdog, DataLoader worker killed) do not trip the stanza.

Tests

TDD red → green via three new tests in module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go:

TestRecipe_CUDAOOM_StanzaPinsWireContract — pins 7 load-bearing tokens (cuda_oom.tried_alloc_bytes, cuda_oom.gpu_index, KiB/MiB/GiB/TiB, transform/cuda_oom) + pipeline-wiring against the live projector.
TestRecipe_CUDAOOM_RoundTripFiresVerdict — end-to-end gate: recipe-shaped log records flow through CUDAOOMDetector and emit a kind=fragmentation verdict with the expected scalar-promotion contract.
TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages — 5 canonical positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the ≥3-positive A-tier acceptance criterion from issue #436 ("ottl(pattern-10): filelogreceiver stanza for PyTorch CUDA OOM line → cuda_oom.tried_alloc_bytes + gpu.id").

Self-grade: A+

B: YAML syntactically valid OTel (tracecore validate exit 0); regex extracts bytes + GPU index with unit normalization; documented. ✓
A: integration test green; make validator-recipe covers this file; regex tested against ≥3 canonical messages (5 positives total); negative cases verified. ✓
A+: edge cases handled (multi-line traceback flattening via filelog container parser, mixed-unit messages, OOM without GPU index via tight IsMatch guard); cross-linked from docs/patterns/10-cuda-oom-deceptive.md §"Signal sources" + Open Question #2; new §cuda_oom.* attribute stanza in docs/integrations/filelog-container.md with unit-normalization arithmetic table, two gpu.id source paths, and a Failure-modes row. ✓

Cross-references

Detector source (untouched per hard rule): module/processor/patterndetectorprocessor/cuda_oom.go.
Sibling DCGM metric-side recipe: PR #337 ("[rc1-prep] OTTL recipe: project DCGM_FI_DEV_FB_USED/FB_FREE → hw.gpu.memory.{free,total} log shape (pattern #10 wiring)") / docs/integrations/examples/prometheus-scrape.yaml.
Pattern doc: docs/patterns/10-cuda-oom-deceptive.md — Open Q#2 resolved.
Convention: PR #431 ("docs(style): fix recipes/pattern-N commit convention (#427)"): recipe stanzas are placed under docs/integrations/examples/<target>.yaml.

Test plan

go test ./processor/patterndetectorprocessor/ -run TestRecipe_CUDAOOM -count=1 -v — PASS (3 tests, 8 sub-tests)
go test ./processor/patterndetectorprocessor/ -count=1 — PASS (no regressions)
make build — _build/tracecore compiles via OCB
./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml — exit 0
make validator-recipe — 9 validated, 3 skipped (non-linux host) of 12 recipe(s)
make doc-check — PASS (new cross-link resolves)
make ci-fast — PASS (lint, vet, mod-verify, attribute-namespace-check, doc-check)

**Pattern #10 (CUDA OOM, deceptive allocator)** — filelogreceiver + OTTL recipe lands. The `transform/cuda_oom` stanza in `docs/integrations/examples/filelog-container.yaml` projects PyTorch's `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` stderr line onto `cuda_oom.tried_alloc_bytes` (unit-normalized to bytes across KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index`, closing the load-bearing input gap left by the v0.3 detector ship (PR #338).

Closes #436.
Refs #338, #303, #337.

Adds the `transform/cuda_oom` processor to `docs/integrations/examples/filelog-container.yaml` that projects PyTorch's canonical `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ...` stderr line onto the customer-stable `cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) attributes that pattern #10's detector (PR #338) reads via `projectCUDAOOMLogRecord`. Closes the load-bearing filelog→detector integration gap flagged in issue #303 follow-ups.

Per-unit-branch shape (one stanza per KiB/MiB/GiB/TiB prefix) because OTTL has no capture-group-conditional dispatch: the multiplier must be a literal int64 per stanza. Uses OTTL Math Expressions (`Int(whole)*UNIT + Int(frac)*(UNIT/100)`) to handle PyTorch's `%.2f` `format_size` output; precision loss capped at <1% of the unit base, three orders of magnitude under the detector's 5% fragmentation threshold.

`gpu.id` (PCI BDF per RFC-0013 §3) is NOT stamped by this transform: the CUDA-runtime ordinal `cuda_oom.gpu_index` is not a PCI BDF. Two operator-configurable paths documented in the recipe markdown: (a) k8sattributesprocessor + `nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's resource-attr fallback reads `gpu.id` from the log resource either way.

Tests (TDD red→green): three new recipe-parity tests under `module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:

- `TestRecipe_CUDAOOM_StanzaPinsWireContract`: pins the 7 load-bearing tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`, KiB/MiB/GiB/TiB unit prefixes, `transform/cuda_oom`) + pipeline-wiring against the live projector. Mirrors PR #393's IB-flap shape.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict`: end-to-end gate; log records carrying the exact attribute shape the recipe stamps flow through CUDAOOMDetector and emit a kind=fragmentation verdict with the expected scalar-promotion contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages`: 5 canonical positive PyTorch OOM messages (KiB/MiB/GiB/TiB/fractional) + 3 negative messages (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the >=3-positive A-tier acceptance criterion.

Validates clean:

- `tracecore validate docs/integrations/examples/filelog-container.yaml` exits 0.
- `make validator-recipe` covers this file (tested-against: tracecore).
- `make doc-check` resolves the new pattern-10 cross-link.
- Full `make ci-fast` green.

Cross-links:

- Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" now references this recipe and resolves Open Question #2 ("filelogreceiver OTTL stanza for the OOM regex").
- Recipe markdown: new §`cuda_oom.*` attribute stanza (pattern #10) in `docs/integrations/filelog-container.md` with the unit-normalization arithmetic table, the two `gpu.id` source paths, and a Failure-modes row.

Closes #436. Refs #338, #303, #337.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr · 2026-06-02T02:05:51Z

B/A/A+ Criteria

B (Acceptable): Recipe ships; regex matches target OOM patterns; at least one integration test passes; operator can copy+paste into their environment.

A (Good): B + recipe fully documents both gpu.id sources and unit normalization; all 3+ integration tests pass; OTTL stanzas are justified (not redundant); operator has clear guidance on which gpu.id path applies to their stack.

A+ (Excellent): A + simplicity-first (no dead code, <3-rule enforced, fold stanzas if safe); regex is robust (mixed-locale, version variance); multi-line traceback handling documented; processor name follows sibling conventions; ci-fast runtime acceptable; PR body signals close of #436 without defer.

Findings

Unit normalization constants verified: All four multipliers (KiB=10, MiB=10485, GiB=10737418, TiB=10995116277) are mathematically correct floor-divisions of UNIT/100. No precision drift.
Test coverage exceeds requirement: 5 positive test cases (KiB, MiB, GiB, GiB-fractional, TiB) cover all unit prefixes; 3 negative cases (DataLoader worker killed, NCCL watchdog, illegal memory access) prevent false-positive attribute stamping. Exceeds the ≥3 acceptance gate from issue #436.
Wire contract pinned: TestRecipe_CUDAOOM_StanzaPinsWireContract checks 7 load-bearing tokens against the recipe YAML (attribute names, unit prefixes, transform name, pipeline wiring). Detects semantic drift between recipe and detector source.
gpu.id documentation is clear: Markdown section explains that cuda_oom.gpu_index (CUDA-runtime ordinal) is NOT aliased to gpu.id (PCI BDF per RFC-0013 §3). Two explicit operator paths documented (k8sattributesprocessor + nvidia.com/gpu annotation, or DCGM BDF-lookup transform). No ambiguity.
Multi-line traceback handling documented: Pattern #10 doc Open Q#2 answered: container parser flattens newlines into separate log records; only the summary line matches the regex, so detector sees exactly one stamp per OOM event regardless of traceback depth. Clear and correct.
Processor pipeline wiring correct: transform/cuda_oom properly added to logs/container pipeline after k8sattributes and alongside (order-insensitive) transform/dataloader_errors. Comment explains both run disjoint regex gates.
OTTL stanza necessity justified: Per-unit-prefix repetition (KiB/MiB/GiB/TiB) is required because OTTL has no capture-group-conditional dispatch. Multiplier must be literal int64 per stanza. Design necessity, not over-engineering.

Simplification Sweep

Clean. No dead code, no premature helpers (atoi is test-only, 4 call sites), no redundant abstractions. Comment density in YAML preamble (28 lines) is load-bearing for operators verifying gpu.id paths and unit math. Repetition across YAML/markdown sections is justified by audience (config authors vs. operator runbooks).

VERDICT: A+ — Closes load-bearing input gap for pattern #10 detector

Recipe is minimal and correct: 4 OTTL stanzas + 1 GPU-index stanza with tight IsMatch guards
Test discipline solid: wire-contract pinning, round-trip e2e, 8 test cases (5 positive + 3 negative)
Operator UX excellent: both gpu.id source paths documented, multi-line traceback behavior explained, failure modes in troubleshooting table
Detector integration: closes #436 without scope creep; detector source (cuda_oom.go) untouched per hard rule
Simplicity: no refactoring targets, all comments are load-bearing, no hidden TODOs

Ship as-is.

trilamsr enabled auto-merge (squash) June 2, 2026 02:06

trilamsr merged commit b175412 into main Jun 2, 2026
12 checks passed

trilamsr deleted the recipe/436-cuda-oom-filelog-ottl branch June 2, 2026 02:10

trilamsr mentioned this pull request Jun 2, 2026

audit(wave-2026-06-02): autonomous-wave cross-cut review #488 (closed, 2 tasks)

trilamsr merged 1 commit into main from recipe/436-cuda-oom-filelog-ottl
