
feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365) #406

Merged: trilamsr merged 1 commit into main from feat/pattern7-recipes-364-365 on Jun 1, 2026

@trilamsr trilamsr commented Jun 1, 2026

Summary

Bundles #364 (per-driver OTTL stanzas for dataloader.error_class) and #365 (training-step-stalled metrics-to-logs bridge contract) — both pattern-7, both touch the same recipe surface. This PR REPLACES #401 (which was opened against pre-wave main and accumulated unrelated scope-creep via merge-resolution).

#364: dataloader.error_class per-driver regex

docs/integrations/filelog-container.md + examples/filelog-container.yaml ship a transform/dataloader_errors processor with one stanza per driver class. Each stanza stamps the customer-stable dataloader.error_class value the runbook branches on, plus an optional dataloader.worker_pid int when the line carries one.

Six driver classes: DataLoader worker killed, FUSE transport error, S3 throttle, Stale file handle, DataLoader queue empty, Connection reset by peer. Resolves pattern #7 spec Open Q#3.
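
A minimal Python sketch of the per-driver classification idea, for illustration only. The shipped stanzas are OTTL in examples/filelog-container.yaml; the class value strings (such as `worker_killed`) and every regex below are hypothetical stand-ins, not the values the runbook actually branches on.

```python
import re

# Hypothetical (class value, regex) pairs, one per driver class named above.
# Neither the value strings nor the regexes are taken from the shipped recipe.
DRIVER_PATTERNS = [
    ("worker_killed", re.compile(r"DataLoader worker \(pid (?P<pid>\d+)\) is killed")),
    ("fuse_transport_error", re.compile(r"Transport endpoint is not connected")),
    ("s3_throttle", re.compile(r"SlowDown|Throttl(?:ed|ing)")),
    ("stale_file_handle", re.compile(r"Stale file handle")),
    ("queue_empty", re.compile(r"queue\.Empty")),
    ("connection_reset", re.compile(r"Connection reset by peer")),
]


def classify(line: str) -> dict:
    """Return the attributes one stanza would stamp, or {} when nothing matches."""
    for error_class, pattern in DRIVER_PATTERNS:
        match = pattern.search(line)
        if match is None:
            continue
        attrs = {"dataloader.error_class": error_class}
        # The pid attribute is optional: only set when the line carries one.
        pid = match.groupdict().get("pid")
        if pid is not None:
            attrs["dataloader.worker_pid"] = int(pid)
        return attrs
    return {}
```

The first matching stanza wins in this sketch; whether the real processor evaluates stanzas in order or independently is not stated here.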

#365 — training-step-stalled bridge contract

docs/integrations/prometheus-scrape.md grows a §"Pattern #7" section documenting the load-bearing wire contract for the bridge log record (tracecore.alert.training_step_stalled.no_progress_seconds + last_step_ns + gen_ai.training.step + phase). Per RFC-0014 the emitter half stays upstream-blocked until WithMetrics PR-B or metricthresholdconnector lands; the detector reads the contract today.

Test wiring

9 sub-tests in module/processor/patterndetectorprocessor/dataloader_hang_test.go pin every per-driver class value + every bridge attribute guard against live detector wiring (eval-phase guard + warmup step >= 2 guard).
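
The two guards can be pictured with the Python sketch below. It is an assumption-laden illustration, not the detector's Go code: the attribute keys for `last_step_ns` and `phase` are used in the short form written above (the real keys may carry a prefix), the phase value `"eval"` is assumed, and the function name is hypothetical.

```python
# Keys as written in this PR; the short forms may be abbreviations of longer keys.
NO_PROGRESS_KEY = "tracecore.alert.training_step_stalled.no_progress_seconds"
LAST_STEP_NS_KEY = "last_step_ns"  # assumption: real key may be prefixed
STEP_KEY = "gen_ai.training.step"
PHASE_KEY = "phase"  # assumption: real key may be prefixed

REQUIRED_KEYS = (NO_PROGRESS_KEY, LAST_STEP_NS_KEY, STEP_KEY, PHASE_KEY)


def passes_bridge_guards(attrs: dict) -> bool:
    """Hypothetical helper: True when a bridge log record should reach the detector."""
    # A record missing any contract attribute is not treated as a stall signal.
    if any(key not in attrs for key in REQUIRED_KEYS):
        return False
    # Eval-phase guard: steps do not advance during evaluation, so skip it.
    if attrs[PHASE_KEY] == "eval":
        return False
    # Warmup guard: require step >= 2 before a lack of progress counts.
    if attrs[STEP_KEY] < 2:
        return False
    return True
```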

Why a redo PR

#401 was branched from pre-wave main (commit 636c2a2). When it merged main back in, it accumulated re-deletions of recently-merged #389 (composite action) + re-adds of #379/#381 already shipped via #392 + #386 already shipped via #397. Cleanest path: cherry-pick the recipe-only commit onto current main + resolve the one true conflict (DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7 NOTE block).

Test plan

  • golangci-lint, go vet, attribute-namespace-check — green
  • go test ./module/processor/patterndetectorprocessor/... -run DataLoader — pass
  • Pre-push hook gates all green

Closes #364. Closes #365. Supersedes #401.

feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record.

For #364, docs/integrations/filelog-container.md and
examples/filelog-container.yaml ship a transform/dataloader_errors
processor with per-driver stanzas (FUSE / S3 / Lustre /
multiprocessing-queue / worker-killed / connection-reset). Each stamps
the customer-stable `dataloader.error_class` value the runbook branches
on, plus an optional `dataloader.worker_pid` int when the line carries
one. Resolves pattern #7 spec Open Q#3.

For #365, docs/integrations/prometheus-scrape.md documents the
load-bearing wire contract for the training-step-stalled bridge
log record (`tracecore.alert.training_step_stalled.no_progress_seconds`
+ `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014
the emitter half remains upstream-blocked until WithMetrics PR-B or
a `metricthresholdconnector` lands; the detector reads the contract
today.

Adds 9 sub-tests pinning every per-driver class value + every
bridge attribute guard against the live detector wiring.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 22:29
@trilamsr trilamsr merged commit c342210 into main Jun 1, 2026
12 checks passed
@trilamsr trilamsr deleted the feat/pattern7-recipes-364-365 branch June 1, 2026 22:40
trilamsr added a commit that referenced this pull request Jun 2, 2026
## What this PR does

Closes #427 via the reviewer-recommended **option B**: keep current
pattern OTTL stanza placement
(`docs/integrations/examples/<target>.yaml`), fix the misleading
PR-title convention going forward, and add an automated gate so it
doesn't recur.

Audit issue #421 flagged that PR #406 ("feat(recipes): OTTL stanzas +
bridge for pattern #7") and PR #415 ("feat(recipe): pattern-2 IB link
flap OTTL stanza") imply a top-level
`recipes/pattern-N/{ottl.yaml,README.md}` directory layout — but `find .
-maxdepth 2 -type d -name 'recipes'` returns nothing. The actual
placement is `docs/integrations/examples/<target>.yaml`. Option A
(migrate) would rot cross-links across six-plus docs; option B (fix the
convention) costs near-zero discoverability.

## Linked issue(s)

Closes #427. Refs #406 #415 #421.

## Changes

- **`STYLE.md` §"Commits"** — new bullet pointing PR titles / commit
subjects at the real path and naming the three accepted subject shapes:
  - `feat(integrations/examples): pattern-N OTTL stanza`
  - `feat(<target>): pattern-N ...`
  - `feat(pattern-N): OTTL stanza in docs/integrations/examples/`
- **`.github/PULL_REQUEST_TEMPLATE.md`** — brief reminder pointing at
STYLE.md §"Commits" and issue #427 for context.
- **`scripts/recipes-path-check.sh` + `_test.sh`** — TDD-driven gate.
  Two rules:
  1. Literal path `recipes/pattern-N` (with negative lookahead so
     `internal/recipes/`, `module/recipes/`, etc. pass).
  2. Bare `feat(recipes):` / `feat(recipe):` scope paired with a
     `pattern N` mention, the exact shape #421 flagged on PRs #406 / #415.

  Test fixture: 5 reject cases (including the two historical PR titles
  verbatim) + 6 accept cases (real placement, per-target scope,
  `pattern-N` scope, unrelated subjects, the Go `./internal/recipes/...`
  package path from `HARDWARE-TESTING.md`, empty input).
- **`Makefile`** — `recipes-path-check` target wired into `ci-fast` and
`ci-full`. Runs the regression test suite, not a tree scan; the gate
itself operates on a subject string passed as `$1`.
- **`.github/workflows/pr-lint.yml`** — new step invokes
`scripts/recipes-path-check.sh` with `${{
github.event.pull_request.title }}` and fails the workflow on a
forbidden shape.
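
For readers who want the two gate rules in executable form, here is a rough Python sketch. It is not the contents of `scripts/recipes-path-check.sh`: the function name `is_forbidden` and all three regexes are assumptions, and a lookbehind stands in for whatever exclusion the shell script really uses to let `internal/recipes/` style paths pass.

```python
import re

# Rule 1: a literal top-level recipes/pattern-N path. The lookbehind rejects a
# preceding word character or slash, so nested paths such as
# internal/recipes/... are not flagged.
RECIPES_PATH = re.compile(r"(?<![\w/])recipes/pattern-\d+")

# Rule 2: a bare feat(recipes): or feat(recipe): scope combined with a
# pattern mention such as "pattern #7", "pattern 7" or "pattern-2".
RECIPE_SCOPE = re.compile(r"^feat\(recipes?\):")
PATTERN_MENTION = re.compile(r"pattern[\s-]*#?\d+", re.IGNORECASE)


def is_forbidden(subject: str) -> bool:
    """Return True when a PR title or commit subject has a forbidden shape."""
    if RECIPES_PATH.search(subject):
        return True
    return bool(RECIPE_SCOPE.match(subject) and PATTERN_MENTION.search(subject))
```

Under these assumptions, both historical titles are rejected, while the three accepted subject shapes and an empty subject pass.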

## Audit results

- `grep -rn 'recipes/pattern-' docs/` — only hit is
`./internal/recipes/...` in `docs/HARDWARE-TESTING.md` (a Go package
path, not a docs path). The gate accepts it. No stale references to fix.
- `grep -rn 'recipes/' docs/ .github/ CONTRIBUTING.md PRINCIPLES.md
README.md | grep -v 'docs/integrations/examples'` — same single hit.
Clean.

## Hard rules honoured

- **No files migrated.** Option B is documentation + lint only.
- **No bureaucratic process docs.** One bullet added to STYLE.md, one
block in PR template, one TDD-tested gate script. Total diff: 6 files,
+159 / -3.

## Release notes

```release-notes
NONE
```

## Test plan

- [x] `bash scripts/recipes-path-check_test.sh` — 11/11 fixtures pass (5
reject + 6 accept).
- [x] `make recipes-path-check` — runs the regression test via the new
make target.
- [x] `make ci-fast` — passes end-to-end including the new gate (lint +
vet + mod-verify + attribute-namespace-check + doc-check +
recipes-path-check).
- [x] `go tool actionlint .github/workflows/pr-lint.yml` — clean.
- [x] `make zizmor` — no findings; 28 suppressed, 32 ignored (baseline
unchanged).
- [x] Verified the gate rejects both historical titles verbatim:
`feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365)` and
`feat(recipe): pattern-2 IB link flap OTTL stanza (#393)`.

## Checklist

- [x] Tests added or updated (TDD: test written before gate; both ship
in this PR).
- [x] `make ci-fast` passes.
- [x] Commits are signed off.
- [x] PR title and Summary reflect the current diff.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>