docs(style): fix recipes/pattern-N commit convention (#427)#431
Conversation
Audit issue #421 flagged that PR #406 + PR #415 commit titles read "feat(recipes): ... pattern #7" and "feat(recipe): pattern-2 ..." implying a top-level `recipes/pattern-N/{ottl.yaml,README.md}` layout that does not exist. The actual placement is `docs/integrations/examples/<target>.yaml`. Option B per #427: keep current placement (migration would rot cross-links across 6+ docs), fix the convention going forward. Changes: - STYLE.md §"Commits": new bullet pointing PR titles / commit subjects at the real path and naming the accepted subject shapes (`feat(integrations/examples): ...`, `feat(<target>): ...`, `feat(pattern-N): ...`). - .github/PULL_REQUEST_TEMPLATE.md: brief reminder pointing at STYLE.md §"Commits" and issue #427 for context. - scripts/recipes-path-check.sh + recipes-path-check_test.sh: TDD-driven gate that rejects PR titles referencing the aspirational `recipes/pattern-N/` path. Two rules: (1) literal path `recipes/pattern-N` (with negative lookahead for `internal/recipes/`, `module/recipes/`, etc.), (2) bare `feat(recipes):` / `feat(recipe):` scope paired with a `pattern N` mention — the exact shape #421 flagged. Test fixture covers 5 reject + 6 accept cases including the two historical PR titles. - Makefile: `recipes-path-check` target wired into `ci-fast` and `ci-full`; runs the regression test suite (not a tree scan — the gate operates on a subject string passed in). - .github/workflows/pr-lint.yml: new step invokes scripts/recipes-path-check.sh with the PR title; rejects titles that imply the aspirational layout. No files migrated. No docs reference a `recipes/pattern-N/` path (verified via `grep -rn 'recipes/pattern-' docs/` — only hit is `./internal/recipes/...` in HARDWARE-TESTING.md, a Go package path, correctly accepted by the gate). `make ci-fast` passes locally; actionlint + zizmor clean on the workflow change. Closes #427. Refs #406 #415 #421. Signed-off-by: Tri Lam <tree@lumalabs.ai>
Independent Adversarial Review — PR #431Grading CriteriaB (acceptable): Adds STYLE.md + PR template + gate script + tests; no migrations (option B as recommended in #427); gate rejects both historical bad titles (#406, #415). A (good): Does B + gate logic is sound (two rules); regex doesn't overcatch (negative lookahead excludes internal/recipes/, module/recipes/ Go paths); test fixture is comprehensive (5 reject + 6 accept cases). A+ (excellent): Does A + audits tree for stale references (none found); STYLE.md bullet is terse + links #427 + lists 3 acceptable shapes; exact bad titles from #406/#415 in test fixture; rule 2 pairs pattern mention to avoid false positives on unrelated recipes uses; process overhead minimal (+159/-3, 6 files). Findings
Simplification SweepTrim targets: Items 1, 2 (logic clarity), 3, 4, 6 (boilerplate), 5 (style consistency). None are load-bearing; all reduce false-positive surface or cut noise. Item 4 especially: 8-line docstring for a 2-rule grep script is scaffolding. Verdict: A — gate + docs + tests are sound; implementation has low-hanging polish targets (regex simplification, template trim, docstring cut, workflow condition, STYLE.md reflow) but none change correctness. Audit trail is thorough (tree checked, historical bad titles included in fixtures, #427 context explained). PR option B (document + lint, no migration) is the right call over option A (migrate).Recommendation: Merge pending cleanup (items 1, 2, 3, 4, 6 are optional sub-5-minute fixes; item 5 is nice-to-have for consistency). Do NOT block on polish. |
Address 6 trim findings from independent review on PR #431: 1. recipes-path-check.sh rule 1: switch the path-component-boundary regex `(^|[^A-Za-z0-9_/-])recipes/pattern-N` to the simpler word boundary `\brecipes/pattern-N`. Add a regression fixture for `chore(foo):recipes/pattern-7` to lock in colon-prefix rejection. 2. recipes-path-check.sh rule 2: replace the explicit optional-chars `pattern[ -]?#?[0-9]+` with `pattern[^a-zA-Z0-9]*[0-9]+` — one quantifier instead of two, while preserving coverage of the three real audit-flagged shapes (`pattern #7`, `pattern-2`, `pattern 7`). 3. PR template reminder: collapse the 3-line pattern-OTTL paragraph to a single pointer to STYLE.md §"Commits". 4. Trim 8-17 line docstrings in recipes-path-check{,_test}.sh to one-liners. 5. STYLE.md §"Commits": reflow the 777-char monoline bullet as a parent bullet + 3 sub-bullets within the existing per-bullet width norm. 6. pr-lint.yml recipes-path-check step: drop the redundant `pull_request.draft == false` guard; preserve the dependabot exclusion. Drafts already break other gates, so the duplicate guard added no signal. Fixtures: 11 → 12 (added colon-prefix reject case). Net diff -37 lines. Signed-off-by: Tri Lam <tree@lumalabs.ai>
|
Reviewer trim sweep applied — commit
Verification (all exit 0):
Net diff −37 lines across 5 files. Re-requesting review. |
Independent Adversarial Re-Review — Commit 04f54a7Resolution of 6 Prior Findings✓ Finding 1 (word boundary) — switched to Test suite reports 12/12 pass (up from 11/11, with new colon-prefix reject case). New Finding: Rule 1 Regex Over-Broadscripts/recipes-path-check.sh:12 Word boundary Root cause: Original intent (correct): The prior regex Why tests pass: Test fixture uses Fix: Use Simplification Sweep ResultClean. No additional trim targets. Docstrings, template, and style changes are all terse and load-bearing. VERDICT: BGate logic is sound for the audit-flagged targets (PR #406, #415) and all test fixtures pass. However, rule 1 boundary is technically incorrect: the simpler Not merging. File issue if out-of-scope for this PR. |
Rule 1 regex changed from \\brecipes/pattern-[0-9]+ to (^|[^/])recipes/pattern-[0-9]+ so that path-prefixed contexts like internal/recipes/pattern-7 do not falsely reject. \\b matches after a slash; the new boundary explicitly excludes a preceding slash. Added accept fixture 'test: refactor internal/recipes/pattern-7 detector' to lock the boundary. Signed-off-by: Tri Lam <tree@lumalabs.ai>
|
Boundary regression fixed: rule 1 regex now |
Review: Fix Verification + Final JudgmentFix Verification ✓Regex boundary corrected: Changed from
Accept fixture added: Test suite: All 13 cases pass (6 reject + 7 accept). Boundary regression prevented. Simplification Sweep: Clean
No new findingsEarlier B-verdict issue (boundary regression) is now fixed and locked by test. VERDICT: A Merge when ready. |
…451) ## Summary - Adds the `transform/cuda_oom` OTTL processor to `docs/integrations/examples/filelog-container.yaml`, stamping `cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) off PyTorch's canonical `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N has a total capacity of ...` stderr line. - Closes the integration gap pattern #10's detector (PR #338) carried since merge: `projectCUDAOOMLogRecord` (`module/processor/patterndetectorprocessor/cuda_oom.go`) gates on `cuda_oom.tried_alloc_bytes` + `gpu.id` but no upstream recipe stamped them, so the compiled detector received no real input at runtime. ## Root cause Issue #303's deliverable list included `projectCUDAOOMLogRecord` (shipped in PR #338) but explicitly deferred the filelog OTTL stanza to a sibling follow-up (issue #285 / #436). The detector compiled green and its wiring tests passed against synthetic plog input, but production stderr never carried the customer-stable attributes the projector reads. This PR is the missing link — a recipe-only change with zero detector-source edits. ## Recipe design - **Per-unit-branch shape** (KiB / MiB / GiB / TiB) because OTTL has no capture-group-conditional dispatch — the multiplier must be a literal `int64` per stanza. - **Unit normalization via OTTL Math Expressions**: `Int(whole)*UNIT + Int(frac)*(UNIT/100)` against PyTorch's `%.2f` `format_size` shape (verified against `c10/cuda/CUDACachingAllocator.cpp`). Integer-divide-by-100 floors per-frac-unit precision loss at <1% of the unit base — three orders of magnitude under the detector's 5% fragmentation threshold. - **`gpu.id` is NOT stamped here**: the CUDA-runtime ordinal `cuda_oom.gpu_index` is not a PCI BDF. The recipe markdown documents two operator paths: (a) k8sattributesprocessor + `nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's resource-attr fallback reads `gpu.id` off the log resource either way. - **Tight `where IsMatch` guard** on `CUDA out of memory\. Tried to allocate` — generic CUDA errors (illegal memory access, NCCL watchdog, DataLoader worker killed) do not trip the stanza. ## Tests TDD red → green via three new tests in `module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`: - `TestRecipe_CUDAOOM_StanzaPinsWireContract` — pins 7 load-bearing tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`, KiB/MiB/GiB/TiB, `transform/cuda_oom`) + pipeline-wiring against the live projector. - `TestRecipe_CUDAOOM_RoundTripFiresVerdict` — end-to-end gate: recipe-shaped log records flow through `CUDAOOMDetector` and emit a `kind=fragmentation` verdict with the expected scalar-promotion contract. - `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages` — 5 canonical positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives (DataLoader worker killed, NCCL watchdog, illegal memory access). Exceeds the ≥3-positive A-tier acceptance criterion from #436. ## Self-grade: **A+** - B: YAML syntactically valid OTel (`tracecore validate` exit 0); regex extracts bytes + GPU index with unit normalization; documented. ✓ - A: integration test green; `make validator-recipe` covers this file; regex tested against ≥3 canonical messages (5 positives total); negative cases verified. ✓ - A+: edge cases handled (multi-line traceback flattening via filelog container parser, mixed-unit messages, OOM without GPU index via tight `IsMatch` guard); cross-linked from `docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" + Open Question #2; new §`cuda_oom.*` attribute stanza in `docs/integrations/filelog-container.md` with unit-normalization arithmetic table, two `gpu.id` source paths, and a Failure-modes row. ✓ ## Cross-references - Detector source (untouched per hard rule): `module/processor/patterndetectorprocessor/cuda_oom.go`. - Sibling DCGM metric-side recipe: PR #337 / `docs/integrations/examples/prometheus-scrape.yaml`. - Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` — Open Q#2 resolved. - Convention: PR #431 (recipe stanzas placement under `docs/integrations/examples/<target>.yaml`). ## Test plan - [x] `go test ./processor/patterndetectorprocessor/ -run TestRecipe_CUDAOOM -count=1 -v` — PASS (3 tests, 8 sub-tests) - [x] `go test ./processor/patterndetectorprocessor/ -count=1` — PASS (no regressions) - [x] `make build` — `_build/tracecore` compiles via OCB - [x] `./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml` — exit 0 - [x] `make validator-recipe` — 9 validated, 3 skipped (non-linux host) of 12 recipe(s) - [x] `make doc-check` — PASS (new cross-link resolves) - [x] `make ci-fast` — PASS (lint, vet, mod-verify, attribute-namespace-check, doc-check) ```release-notes **Pattern #10 (CUDA OOM, deceptive allocator)** — filelogreceiver + OTTL recipe lands. The `transform/cuda_oom` stanza in `docs/integrations/examples/filelog-container.yaml` projects PyTorch's `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` stderr line onto `cuda_oom.tried_alloc_bytes` (unit-normalized to bytes across KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index`, closing the load-bearing input gap left by the v0.3 detector ship (PR #338). ``` Closes #436. Refs #338, #303, #337. Signed-off-by: Tri Lam <tree@lumalabs.ai>
…ention-option-b # Conflicts: # Makefile
## Summary PR #481 shipped `securityHardening.appArmorProfile.enabled: true` as the default in `install/kubernetes/tracecore/values.yaml`. Kubelet rejects pod-create when `pod.securityContext.appArmorProfile` references a profile the host cannot resolve, so the chart no longer installs on AppArmor-less nodes — including the ubuntu-latest GitHub Actions runner image (AppArmor dropped post-2024) and RHEL/SELinux production hosts. install-bench regressed; PRs #491, #484, #479, #431 are blocked behind this. This PR implements option (a) from #492: flip the default to opt-in. `values-production.yaml` keeps `enabled: true` since AppArmor-equipped Linux clusters (the production target) ship `RuntimeDefault` via containerd / CRI-O. ## Root cause Default-on AppArmor in `values.yaml` violated the chart contract that the default render installs on a vanilla cluster. The defense-in-depth posture is correct for production-preset users; it was wrong as the unconditional default. PR #481 didn't add a CI gate to assert "default render installs on a host without AppArmor", so the regression escaped review. ## Changes - `install/kubernetes/tracecore/values.yaml`: `securityHardening.appArmorProfile.enabled: true` -> `false`; in-line guidance reflects opt-in posture and names the failing-host classes (CI runners, RHEL/SELinux). - `install/kubernetes/tracecore/values-production.yaml`: unchanged — production preset still hardens with `enabled: true`. - `install/kubernetes/tracecore/README.md`: defaults table + Defense-in-depth section explain the opt-in posture, point operators at `values-production.yaml` for the prior behavior, and link #492. - `.github/workflows/chart.yml`: AppArmor mutation tests reshuffled from 6 to 8 cases. T1/T2 now assert default render emits **no** AppArmor field or annotation on K8s 1.30 + 1.28 (regression-prevent for #492). T3/T4 cover the opt-in path (`--set enabled=true`) and pin pre-#492 production-preset behavior. T7/T8 explicitly pass `--set enabled=true` so the Localhost-profile contract still fires under the new default. Production-preset assertion (`appArmorProfile.type=RuntimeDefault` from `values-production.yaml`) is untouched. ## Backward compatibility **Behavior change for default-values users.** Operators who installed via `helm install ... install/kubernetes/tracecore` (no production preset) and depended on the AppArmor hardening that #481 added will see it disappear on next upgrade. Two ways to keep the prior behavior: ```bash # Option 1 — adopt the production preset (recommended). helm upgrade demo install/kubernetes/tracecore \ --values install/kubernetes/tracecore/values-production.yaml # Option 2 — keep your current values, just flip the flag. helm upgrade demo install/kubernetes/tracecore \ --set securityHardening.appArmorProfile.enabled=true ``` Operators who relied on the chart's documented default (#481 was three days old; opt-in is the chart-hygiene norm for defense-in-depth knobs) get a quieter install on AppArmor-less hosts. ## Test plan - [x] `helm lint install/kubernetes/tracecore` — 0 warnings. - [x] `helm template ... --kube-version 1.30.0 --show-only templates/daemonset.yaml | grep -i apparmor` — empty (default render has no AppArmor). - [x] Same with `--kube-version 1.28.0` — empty. - [x] `helm template ... --values values-production.yaml --kube-version 1.30.0` — renders `appArmorProfile.type: RuntimeDefault` (production preset unchanged). - [x] `helm template ... --set securityHardening.appArmorProfile.enabled=true --kube-version 1.30.0` — renders structured field (opt-in works). - [x] All 8 mutation tests in `.github/workflows/chart.yml` AppArmor step run locally and pass. - [x] conftest: 52/52 default render, 91/91 production render. - [x] actionlint: 0 issues on `chart.yml`. - [x] Pre-commit (golangci-lint, vet, attribute-namespace-check, test-flake-audit) — all green. - [ ] CI: chart workflow turns green on this PR. - [ ] CI: install-bench turns green on this PR (and unblocks #491 / #484 / #479 / #431 once merged). ## Refs Closes #492 (refs #481). ```release-notes **Breaking (default-values users only).** `securityHardening.appArmorProfile.enabled` now defaults to `false` in `values.yaml` so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux). The `values-production.yaml` preset still ships `enabled: true` — production Linux clusters that package the `RuntimeDefault` profile (every distro with containerd / CRI-O) keep the hardening when they layer that preset. Operators upgrading default-values installs who want the prior behavior can either adopt `values-production.yaml` or set `--set securityHardening.appArmorProfile.enabled=true`. Fixes the install-bench regression introduced in #481. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
…ention-option-b # Conflicts: # Makefile
…ention-option-b # Conflicts: # Makefile
…ention-option-b # Conflicts: # Makefile
What this PR does
Closes #427 via the reviewer-recommended option B: keep current pattern OTTL stanza placement (
docs/integrations/examples/<target>.yaml), fix the misleading PR-title convention going forward, and add an automated gate so it doesn't recur.Audit issue #421 flagged that PR #406 ("feat(recipes): OTTL stanzas + bridge for pattern #7") and PR #415 ("feat(recipe): pattern-2 IB link flap OTTL stanza") imply a top-level
recipes/pattern-N/{ottl.yaml,README.md}directory layout — butfind . -maxdepth 2 -type d -name 'recipes'returns nothing. The actual placement isdocs/integrations/examples/<target>.yaml. Option A (migrate) would rot cross-links across six-plus docs; option B (fix the convention) costs near-zero discoverability.Linked issue(s)
Closes #427. Refs #406 #415 #421.
Changes
STYLE.md§"Commits" — new bullet pointing PR titles / commit subjects at the real path and naming the three accepted subject shapes:feat(integrations/examples): pattern-N OTTL stanzafeat(<target>): pattern-N ...feat(pattern-N): OTTL stanza in docs/integrations/examples/.github/PULL_REQUEST_TEMPLATE.md— brief reminder pointing at STYLE.md §"Commits" and issue chore(recipes): create top-level recipes/ directory and migrate pattern-7 stanzas #427 for context.scripts/recipes-path-check.sh+_test.sh— TDD-driven gate. Two rules:recipes/pattern-N(with negative lookahead sointernal/recipes/,module/recipes/, etc. pass).feat(recipes):/feat(recipe):scope paired with apattern Nmention — the exact shape audit(wave-2026-06-01): post-wave cross-cut review #421 flagged on PRs feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365) #406 / feat(recipe): pattern-2 IB link flap OTTL stanza (#393) #415.Test fixture: 5 reject cases (including the two historical PR titles verbatim) + 6 accept cases (real placement, per-target scope,
pattern-Nscope, unrelated subjects, the Go./internal/recipes/...package path fromHARDWARE-TESTING.md, empty input).Makefile—recipes-path-checktarget wired intoci-fastandci-full. Runs the regression test suite, not a tree scan; the gate itself operates on a subject string passed as$1..github/workflows/pr-lint.yml— new step invokesscripts/recipes-path-check.shwith${{ github.event.pull_request.title }}and fails the workflow on a forbidden shape.Audit results
grep -rn 'recipes/pattern-' docs/— only hit is./internal/recipes/...indocs/HARDWARE-TESTING.md(a Go package path, not a docs path). The gate accepts it. No stale references to fix.grep -rn 'recipes/' docs/ .github/ CONTRIBUTING.md PRINCIPLES.md README.md | grep -v 'docs/integrations/examples'— same single hit. Clean.Hard rules honoured
Release notes
Test plan
bash scripts/recipes-path-check_test.sh— 11/11 fixtures pass (5 reject + 6 accept).make recipes-path-check— runs the regression test via the new make target.make ci-fast— passes end-to-end including the new gate (lint + vet + mod-verify + attribute-namespace-check + doc-check + recipes-path-check).go tool actionlint .github/workflows/pr-lint.yml— clean.make zizmor— no findings; 28 suppressed, 32 ignored (baseline unchanged).feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365)andfeat(recipe): pattern-2 IB link flap OTTL stanza (#393).Checklist
make ci-fastpasses.