feat(integrations/examples): pattern-10 CUDA OOM filelog OTTL stanza #451

Merged
trilamsr merged 1 commit into main from recipe/436-cuda-oom-filelog-ottl
Jun 2, 2026

Conversation

@trilamsr trilamsr commented Jun 2, 2026

Contributor

Summary

  • Adds the transform/cuda_oom OTTL processor to docs/integrations/examples/filelog-container.yaml, stamping cuda_oom.tried_alloc_bytes (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and cuda_oom.gpu_index (Int) off PyTorch's canonical RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ... stderr line.
  • Closes the integration gap that pattern #10's detector (PR #338) has carried since merge: projectCUDAOOMLogRecord (module/processor/patterndetectorprocessor/cuda_oom.go) gates on cuda_oom.tried_alloc_bytes + gpu.id, but no upstream recipe stamped them, so the compiled detector received no real input at runtime.

Root cause

Issue #303's deliverable list included projectCUDAOOMLogRecord (shipped in PR #338) but explicitly deferred the filelog OTTL stanza to a sibling follow-up (issue #285 / #436). The detector compiled green and its wiring tests passed against synthetic plog input, but production stderr never carried the customer-stable attributes the projector reads. This PR is the missing link — a recipe-only change with zero detector-source edits.

Recipe design

  • Per-unit-branch shape (KiB / MiB / GiB / TiB) because OTTL has no capture-group-conditional dispatch — the multiplier must be a literal int64 per stanza.
  • Unit normalization via OTTL Math Expressions: Int(whole)*UNIT + Int(frac)*(UNIT/100) against PyTorch's %.2f format_size shape (verified against c10/cuda/CUDACachingAllocator.cpp). Integer-divide-by-100 floors per-frac-unit precision loss at <1% of the unit base — three orders of magnitude under the detector's 5% fragmentation threshold.
  • gpu.id is NOT stamped here: the CUDA-runtime ordinal cuda_oom.gpu_index is not a PCI BDF. The recipe markdown documents two operator paths: (a) k8sattributesprocessor + nvidia.com/gpu-PCIDeviceBusID device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by cuda_oom.gpu_index. The detector's resource-attr fallback reads gpu.id off the log resource either way.
  • Tight where IsMatch guard on CUDA out of memory\. Tried to allocate — generic CUDA errors (illegal memory access, NCCL watchdog, DataLoader worker killed) do not trip the stanza.
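
The unit-normalization arithmetic above can be sketched in Go for illustration. This is a minimal sketch, not the recipe's OTTL: the regex allocRe, the helper triedAllocBytes, and the sample log line are hypothetical names and values chosen for the example. Only the formula whole*UNIT + frac*(UNIT/100) and the guard text come from the description above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// allocRe is an illustrative pattern (an assumption, not the recipe's regex).
// It captures the whole part, the two-digit fraction, and the unit prefix.
var allocRe = regexp.MustCompile(
	`CUDA out of memory\. Tried to allocate (\d+)\.(\d{2}) (KiB|MiB|GiB|TiB)`)

// unitBytes maps each binary prefix to its size in bytes.
var unitBytes = map[string]int64{
	"KiB": 1 << 10,
	"MiB": 1 << 20,
	"GiB": 1 << 30,
	"TiB": 1 << 40,
}

// triedAllocBytes applies whole*UNIT + frac*(UNIT/100) using integer
// arithmetic only, mirroring the shape described for the OTTL stanzas.
// It reports false when the line does not match the guard.
func triedAllocBytes(line string) (int64, bool) {
	m := allocRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	whole, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	frac, err := strconv.ParseInt(m[2], 10, 64)
	if err != nil {
		return 0, false
	}
	unit := unitBytes[m[3]]
	// unit/100 is an integer division, so the result is floored.
	return whole*unit + frac*(unit/100), true
}

func main() {
	// Hypothetical sample line in the shape quoted in the summary.
	line := "RuntimeError: CUDA out of memory. Tried to allocate 1.50 GiB. " +
		"GPU 0 has a total capacity of 79.15 GiB"
	b, ok := triedAllocBytes(line)
	fmt.Println(b, ok) // prints "1610612724 true"
}
```

The exact value of 1.50 GiB is 1610612736 bytes, so the floored result in this example is 12 bytes low.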

Tests

TDD red → green via three new tests in module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go:

  • TestRecipe_CUDAOOM_StanzaPinsWireContract — pins 7 load-bearing tokens (cuda_oom.tried_alloc_bytes, cuda_oom.gpu_index, KiB/MiB/GiB/TiB, transform/cuda_oom) + pipeline-wiring against the live projector.
  • TestRecipe_CUDAOOM_RoundTripFiresVerdict — end-to-end gate: recipe-shaped log records flow through CUDAOOMDetector and emit a kind=fragmentation verdict with the expected scalar-promotion contract.
  • TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages: 5 canonical positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the ≥3-positive A-tier acceptance criterion from #436.
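
As a rough illustration of the positive/negative coverage described above, the following Go sketch runs a guard-plus-capture regex over five OOM-shaped lines and three unrelated error lines. It is not the contents of cuda_oom_recipe_test.go: the regex oomRe, the helper gpuIndex, and every sample message are assumptions written for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomRe is a hypothetical stand-in for the recipe's pattern. It requires the
// "CUDA out of memory. Tried to allocate" guard and captures the GPU ordinal.
var oomRe = regexp.MustCompile(
	`CUDA out of memory\. Tried to allocate \d+\.\d{2} (?:KiB|MiB|GiB|TiB)\. GPU (\d+)`)

// gpuIndex returns the CUDA-runtime ordinal from an OOM line, or false when
// the line does not pass the guard.
func gpuIndex(line string) (int64, bool) {
	m := oomRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	idx, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	return idx, true
}

func main() {
	const prefix = "RuntimeError: CUDA out of memory. Tried to allocate "
	const suffix = " has a total capacity of 79.15 GiB"
	cases := []struct {
		line string
		want bool
	}{
		// Illustrative positives, one per unit prefix plus a fractional value.
		{prefix + "512.00 KiB. GPU 0" + suffix, true},
		{prefix + "20.00 MiB. GPU 1" + suffix, true},
		{prefix + "2.00 GiB. GPU 2" + suffix, true},
		{prefix + "1.50 GiB. GPU 3" + suffix, true},
		{prefix + "1.00 TiB. GPU 7" + suffix, true},
		// Illustrative negatives: other failure classes must not match.
		{"RuntimeError: DataLoader worker (pid 4242) is killed by signal: Killed.", false},
		{"NCCL watchdog thread terminated with exception", false},
		{"RuntimeError: CUDA error: an illegal memory access was encountered", false},
	}
	for _, c := range cases {
		_, got := gpuIndex(c.line)
		fmt.Printf("match=%t want=%t\n", got, c.want)
	}
}
```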

Self-grade: A+

  • B: YAML syntactically valid OTel (tracecore validate exit 0); regex extracts bytes + GPU index with unit normalization; documented. ✓
  • A: integration test green; make validator-recipe covers this file; regex tested against ≥3 canonical messages (5 positives total); negative cases verified. ✓
  • A+: edge cases handled (multi-line traceback flattening via filelog container parser, mixed-unit messages, OOM without GPU index via tight IsMatch guard); cross-linked from docs/patterns/10-cuda-oom-deceptive.md §"Signal sources" + Open Question #2; new §cuda_oom.* attribute stanza in docs/integrations/filelog-container.md with unit-normalization arithmetic table, two gpu.id source paths, and a Failure-modes row. ✓

Cross-references

Test plan

  • go test ./processor/patterndetectorprocessor/ -run TestRecipe_CUDAOOM -count=1 -v — PASS (3 tests, 8 sub-tests)
  • go test ./processor/patterndetectorprocessor/ -count=1 — PASS (no regressions)
  • make build: _build/tracecore compiles via OCB
  • ./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml — exit 0
  • make validator-recipe — 9 validated, 3 skipped (non-linux host) of 12 recipe(s)
  • make doc-check — PASS (new cross-link resolves)
  • make ci-fast — PASS (lint, vet, mod-verify, attribute-namespace-check, doc-check)
**Pattern #10 (CUDA OOM, deceptive allocator)** — filelogreceiver + OTTL recipe lands. The `transform/cuda_oom` stanza in `docs/integrations/examples/filelog-container.yaml` projects PyTorch's `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` stderr line onto `cuda_oom.tried_alloc_bytes` (unit-normalized to bytes across KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index`, closing the load-bearing input gap left by the v0.3 detector ship (PR #338).

Closes #436.
Refs #338, #303, #337.

Adds the `transform/cuda_oom` processor to
`docs/integrations/examples/filelog-container.yaml` that projects
PyTorch's canonical `RuntimeError: CUDA out of memory. Tried to
allocate X.YY <unit>. GPU N has a total capacity of ...` stderr line
onto the customer-stable `cuda_oom.tried_alloc_bytes` (Int, bytes;
unit-normalized KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int)
attributes that pattern #10's detector (PR #338) reads via
`projectCUDAOOMLogRecord`. Closes the load-bearing filelog→detector
integration gap flagged in issue #303 follow-ups.

Per-unit-branch shape (one stanza per KiB/MiB/GiB/TiB prefix) because
OTTL has no capture-group-conditional dispatch — the multiplier must
be a literal int64 per stanza. Uses OTTL Math Expressions
(`Int(whole)*UNIT + Int(frac)*(UNIT/100)`) to handle PyTorch's
`%.2f` `format_size` output; precision loss capped at <1% of the
unit base, three orders of magnitude under the detector's 5%
fragmentation threshold.

`gpu.id` (PCI BDF per RFC-0013 §3) is NOT stamped by this transform
— the CUDA-runtime ordinal `cuda_oom.gpu_index` is not a PCI BDF.
Two operator-configurable paths documented in the recipe markdown:
(a) k8sattributesprocessor + `nvidia.com/gpu-PCIDeviceBusID` device-
plugin annotation, or (b) DCGM BDF-lookup transform indexed by
`cuda_oom.gpu_index`. The detector's resource-attr fallback reads
`gpu.id` from the log resource either way.

Tests (TDD red→green): three new recipe-parity tests under
`module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:

- `TestRecipe_CUDAOOM_StanzaPinsWireContract`: pins the 7 load-
  bearing tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`,
  KiB/MiB/GiB/TiB unit prefixes, `transform/cuda_oom`) + pipeline-
  wiring against the live projector. Mirrors PR #393's IB-flap
  shape.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict`: end-to-end gate —
  log records carrying the exact attribute shape the recipe stamps
  flow through CUDAOOMDetector and emit a kind=fragmentation
  verdict with the expected scalar-promotion contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages`: 5
  canonical positive PyTorch OOM messages (KiB/MiB/GiB/TiB/fractional)
  + 3 negative messages (DataLoader worker killed, NCCL watchdog,
  illegal memory access). Exceeds the >=3-positive A-tier
  acceptance criterion.

Validates clean:
- `tracecore validate docs/integrations/examples/filelog-container.yaml`
  exits 0.
- `make validator-recipe` covers this file (tested-against: tracecore).
- `make doc-check` resolves the new pattern-10 cross-link.
- Full `make ci-fast` green.

Cross-links:
- Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` §"Signal
  sources" now references this recipe and resolves Open Question #2
  ("filelogreceiver OTTL stanza for the OOM regex").
- Recipe markdown: new §`cuda_oom.*` attribute stanza (pattern #10)
  in `docs/integrations/filelog-container.md` with the unit-
  normalization arithmetic table, the two `gpu.id` source paths,
  and a Failure-modes row.

Closes #436.
Refs #338, #303, #337.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr commented Jun 2, 2026

Contributor Author

B/A/A+ Criteria

B (Acceptable): Recipe ships; regex matches target OOM patterns; at least one integration test passes; operator can copy+paste into their environment.

A (Good): B + recipe fully documents both gpu.id sources and unit normalization; all 3+ integration tests pass; OTTL stanzas are justified (not redundant); operator has clear guidance on which gpu.id path applies to their stack.

A+ (Excellent): A + simplicity-first (no dead code, <3-rule enforced, fold stanzas if safe); regex is robust (mixed-locale, version variance); multi-line traceback handling documented; processor name follows sibling conventions; ci-fast runtime acceptable; PR body signals close of #436 without defer.

Findings

  1. Unit normalization constants verified: All four multipliers (KiB=10, MiB=10485, GiB=10737418, TiB=10995116277) are mathematically correct floor-divisions of UNIT/100. No precision drift.

  2. Test coverage exceeds requirement: 5 positive test cases (KiB, MiB, GiB, GiB-fractional, TiB) cover all unit prefixes; 3 negative cases (DataLoader worker killed, NCCL watchdog, illegal memory access) prevent false-positive attribute stamping. Exceeds the ≥3-positive acceptance gate from #436.

  3. Wire contract pinned: TestRecipe_CUDAOOM_StanzaPinsWireContract checks 7 load-bearing tokens against the recipe YAML (attribute names, unit prefixes, transform name, pipeline wiring). Detects semantic drift between recipe and detector source.

  4. gpu.id documentation is clear: Markdown section explains that cuda_oom.gpu_index (CUDA-runtime ordinal) is NOT aliased to gpu.id (PCI BDF per RFC-0013 §3). Two explicit operator paths documented (k8sattributesprocessor + nvidia.com/gpu-PCIDeviceBusID device-plugin annotation, or DCGM BDF-lookup transform indexed by cuda_oom.gpu_index). No ambiguity.

  5. Multi-line traceback handling documented: Pattern #10 doc Open Q#2 answered: the container parser emits each traceback line as its own log record; only the summary line matches the regex, so the detector sees exactly one stamp per OOM event regardless of traceback depth. Clear and correct.

  6. Processor pipeline wiring correct: transform/cuda_oom properly added to logs/container pipeline after k8sattributes and alongside (order-insensitive) transform/dataloader_errors. Comment explains both run disjoint regex gates.

  7. OTTL stanza necessity justified: Per-unit-prefix repetition (KiB/MiB/GiB/TiB) is required because OTTL has no capture-group-conditional dispatch. Multiplier must be literal int64 per stanza. Design necessity, not over-engineering.
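
The multipliers in finding 1 can be re-derived with a few lines of Go. This is a small sketch for checking the arithmetic, not code from the PR: the type binaryUnit and the helper perHundredth are hypothetical names introduced here.

```go
package main

import "fmt"

// binaryUnit pairs a unit prefix with its size in bytes.
type binaryUnit struct {
	name  string
	bytes int64
}

var binaryUnits = []binaryUnit{
	{"KiB", 1 << 10},
	{"MiB", 1 << 20},
	{"GiB", 1 << 30},
	{"TiB", 1 << 40},
}

// perHundredth returns floor(unit/100), the literal multiplier applied to the
// two-digit fraction, and unit%100, the hundredths of a byte that the floor
// drops for each fractional step.
func perHundredth(unit int64) (multiplier, remainder int64) {
	return unit / 100, unit % 100
}

func main() {
	for _, u := range binaryUnits {
		mult, rem := perHundredth(u.bytes)
		fmt.Printf("%s: multiplier=%d, dropped per hundredth=%d/100 byte\n",
			u.name, mult, rem)
	}
}
```

The printed multipliers are 10, 10485, 10737418, and 10995116277, matching the values listed in finding 1. The floor drops 0.24 or 0.76 of a byte per hundredth depending on the unit.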

Simplification Sweep

Clean. No dead code, no premature helpers (atoi is test-only, 4 call sites), no redundant abstractions. Comment density in YAML preamble (28 lines) is load-bearing for operators verifying gpu.id paths and unit math. Repetition across YAML/markdown sections is justified by audience (config authors vs. operator runbooks).

VERDICT: A+ — Closes load-bearing input gap for pattern #10 detector

  • Recipe is minimal and correct: 4 OTTL stanzas + 1 GPU-index stanza with tight IsMatch guards
  • Test discipline solid: wire-contract pinning, round-trip e2e, 8 test cases (5 positive + 3 negative)
  • Operator UX excellent: both gpu.id source paths documented, multi-line traceback behavior explained, failure modes in troubleshooting table
  • Detector integration: closes #436 without scope creep; detector source (cuda_oom.go) untouched per hard rule
  • Simplicity: no refactoring targets, all comments are load-bearing, no hidden TODOs

Ship as-is.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 02:06
@trilamsr trilamsr merged commit b175412 into main Jun 2, 2026
12 checks passed
@trilamsr trilamsr deleted the recipe/436-cuda-oom-filelog-ottl branch June 2, 2026 02:10

Development

Successfully merging this pull request may close these issues.

ottl(pattern-10): filelogreceiver stanza for PyTorch CUDA OOM line → cuda_oom.tried_alloc_bytes + gpu.id

1 participant