feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365) #406
Merged
per-driver stanzas (FUSE / S3 / Lustre / multiprocessing-queue / worker-killed / connection-reset). Each stamps the customer-stable `dataloader.error_class` value the runbook branches on, plus an optional `dataloader.worker_pid` int when the line carries one. Resolves pattern #7 spec Open Q#3.

the load-bearing wire contract for the training-step-stalled bridge log record (`tracecore.alert.training_step_stalled.no_progress_seconds` + `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the emitter half remains upstream-blocked until WithMetrics PR-B or a `metricthresholdconnector` lands; the detector reads the contract today.

Adds 9 sub-tests pinning every per-driver class value + every bridge attribute guard against the live detector wiring.

Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr added a commit that referenced this pull request Jun 2, 2026
## What this PR does

Closes #427 via the reviewer-recommended **option B**: keep current pattern OTTL stanza placement (`docs/integrations/examples/<target>.yaml`), fix the misleading PR-title convention going forward, and add an automated gate so it doesn't recur.

Audit issue #421 flagged that PR #406 ("feat(recipes): OTTL stanzas + bridge for pattern #7") and PR #415 ("feat(recipe): pattern-2 IB link flap OTTL stanza") imply a top-level `recipes/pattern-N/{ottl.yaml,README.md}` directory layout, but `find . -maxdepth 2 -type d -name 'recipes'` returns nothing. The actual placement is `docs/integrations/examples/<target>.yaml`. Option A (migrate) would rot cross-links across six-plus docs; option B (fix the convention) costs near-zero discoverability.

## Linked issue(s)

Closes #427. Refs #406 #415 #421.

## Changes

- **`STYLE.md` §"Commits"**: new bullet pointing PR titles / commit subjects at the real path and naming the three accepted subject shapes:
  - `feat(integrations/examples): pattern-N OTTL stanza`
  - `feat(<target>): pattern-N ...`
  - `feat(pattern-N): OTTL stanza in docs/integrations/examples/`
- **`.github/PULL_REQUEST_TEMPLATE.md`**: brief reminder pointing at STYLE.md §"Commits" and issue #427 for context.
- **`scripts/recipes-path-check.sh` + `_test.sh`**: TDD-driven gate. Two rules:
  1. Literal path `recipes/pattern-N` (with negative lookahead so `internal/recipes/`, `module/recipes/`, etc. pass).
  2. Bare `feat(recipes):` / `feat(recipe):` scope paired with a `pattern N` mention, the exact shape #421 flagged on PRs #406 / #415.

  Test fixture: 5 reject cases (including the two historical PR titles verbatim) + 6 accept cases (real placement, per-target scope, `pattern-N` scope, unrelated subjects, the Go `./internal/recipes/...` package path from `HARDWARE-TESTING.md`, empty input).
- **`Makefile`**: `recipes-path-check` target wired into `ci-fast` and `ci-full`. Runs the regression test suite, not a tree scan; the gate itself operates on a subject string passed as `$1`.
- **`.github/workflows/pr-lint.yml`**: new step invokes `scripts/recipes-path-check.sh` with `${{ github.event.pull_request.title }}` and fails the workflow on a forbidden shape.

## Audit results

- `grep -rn 'recipes/pattern-' docs/`: only hit is `./internal/recipes/...` in `docs/HARDWARE-TESTING.md` (a Go package path, not a docs path). The gate accepts it. No stale references to fix.
- `grep -rn 'recipes/' docs/ .github/ CONTRIBUTING.md PRINCIPLES.md README.md | grep -v 'docs/integrations/examples'`: same single hit. Clean.

## Hard rules honoured

- **No files migrated.** Option B is documentation + lint only.
- **No bureaucratic process docs.** One bullet added to STYLE.md, one block in PR template, one TDD-tested gate script. Total diff: 6 files, +159 / -3.

## Release notes

```release-notes
NONE
```

## Test plan

- [x] `bash scripts/recipes-path-check_test.sh`: 11/11 fixtures pass (5 reject + 6 accept).
- [x] `make recipes-path-check`: runs the regression test via the new make target.
- [x] `make ci-fast`: passes end-to-end including the new gate (lint + vet + mod-verify + attribute-namespace-check + doc-check + recipes-path-check).
- [x] `go tool actionlint .github/workflows/pr-lint.yml`: clean.
- [x] `make zizmor`: no findings; 28 suppressed, 32 ignored (baseline unchanged).
- [x] Verified the gate rejects both historical titles verbatim: `feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365)` and `feat(recipe): pattern-2 IB link flap OTTL stanza (#393)`.

## Checklist

- [x] Tests added or updated (TDD: test written before gate; both ship in this PR).
- [x] `make ci-fast` passes.
- [x] Commits are signed off.
- [x] PR title and Summary reflect the current diff.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
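
For readers who want a feel for the two gate rules described in the commit message above, here is a minimal bash sketch. It is not the shipped `scripts/recipes-path-check.sh`: the function name `check_subject` and all three regexes are assumptions made for illustration. In particular, bash's `=~` operator uses POSIX ERE, which has no lookaround, so the nested-path exemption (`internal/recipes/`, `module/recipes/`) is approximated by requiring that `recipes/pattern-N` is not preceded by a path character.

```bash
#!/usr/bin/env bash
# Sketch of a subject-line gate with the two rules named in the commit message.
# NOT the shipped scripts/recipes-path-check.sh; every regex is an assumption.

# check_subject SUBJECT
# Returns 0 when the subject is acceptable, 1 when it matches a forbidden shape.
check_subject() {
  local subject=${1-}

  # Rule 1 (assumed regex): literal recipes/pattern-<digits> that starts a path.
  # A preceding letter, digit, '_', '-', or '/' means a nested path such as
  # internal/recipes/..., which is allowed to pass.
  local rule1='(^|[^[:alnum:]_/-])recipes/pattern-[0-9]+'

  # Rule 2 (assumed regexes): bare feat(recipes): or feat(recipe): scope
  # combined with a pattern mention such as "pattern #7" or "pattern-2".
  local scope='^feat\(recipes?\):'
  local mention='[Pp]attern[ -]?#?[0-9]+'

  if [[ $subject =~ $rule1 ]]; then
    echo "forbidden path shape: recipes/pattern-N" >&2
    return 1
  fi
  if [[ $subject =~ $scope && $subject =~ $mention ]]; then
    echo "forbidden scope: feat(recipes) with a pattern mention" >&2
    return 1
  fi
  return 0
}
```

Appending `check_subject "${1-}" || exit 1` would turn the sketch into a script that takes the subject as `$1`, which is how the commit message says the real gate is invoked from `pr-lint.yml`.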
## Summary

Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and #365 (training-step-stalled metrics-to-logs bridge contract). Both are pattern-7 and touch the same recipe surface. This PR REPLACES #401 (which was opened against pre-wave main and accumulated unrelated scope-creep via merge-resolution).

### #364: `dataloader.error_class` per-driver regex

`docs/integrations/filelog-container.md` + `examples/filelog-container.yaml` ship a `transform/dataloader_errors` processor with one stanza per driver class. Each stanza stamps the customer-stable `dataloader.error_class` value the runbook branches on, plus an optional `dataloader.worker_pid` int when the line carries one.

Six driver classes: `DataLoader worker killed`, `FUSE transport error`, `S3 throttle`, `Stale file handle`, `DataLoader queue empty`, `Connection reset by peer`. Resolves pattern #7 spec Open Q#3.

### #365: training-step-stalled bridge contract

`docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section documenting the load-bearing wire contract for the bridge log record (`tracecore.alert.training_step_stalled.no_progress_seconds` + `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the emitter half stays upstream-blocked until WithMetrics PR-B or `metricthresholdconnector` lands; the detector reads the contract today.

## Test wiring

9 sub-tests in `module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin every per-driver class value + every bridge attribute guard against live detector wiring (eval-phase guard + warmup `step >= 2` guard).

## Why a redo PR

#401 was branched from pre-wave main (commit `636c2a2`). When it merged main back in, it accumulated re-deletions of recently-merged #389 (composite action) + re-adds of #379/#381 already shipped via #392 + #386 already shipped via #397. Cleanest path: cherry-pick the recipe-only commit onto current main + resolve the one true conflict (DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7 NOTE block).

## Test plan

- `golangci-lint`, `go vet`, `attribute-namespace-check`: green
- `go test ./module/processor/patterndetectorprocessor/... -run DataLoader`: pass

Closes #364. Closes #365. Supersedes #401.
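
The per-driver classification summarized above (one match per driver class, plus an optional worker PID) can be mirrored in a few lines of bash for local triage of log lines. This is only a sketch under stated assumptions: the shipped logic is the OTTL `transform/dataloader_errors` processor, the function name `classify_line` is hypothetical, and every match regex below is a guess at what a typical log line looks like rather than the regex from `filelog-container.yaml`. Only the six class labels come from this PR.

```bash
#!/usr/bin/env bash
# Hypothetical local mirror of the per-driver classification.
# The six class labels come from the PR summary; every regex is an assumption.

# classify_line LINE
# Prints "<error_class><TAB><worker_pid or empty>"; returns 1 if nothing matches.
classify_line() {
  local line=${1-} class="" pid=""

  # Assumed match patterns, first match wins.
  local re_killed='DataLoader worker .*killed'
  local re_fuse='Transport endpoint is not connected'
  local re_s3='SlowDown|Throttling'
  local re_stale='Stale file handle'
  local re_queue='queue\.Empty'
  local re_reset='Connection reset by peer'
  # Assumed PID shape: "(pid 1234)".
  local re_pid='\(pid ([0-9]+)\)'

  if   [[ $line =~ $re_killed ]]; then class='DataLoader worker killed'
  elif [[ $line =~ $re_fuse   ]]; then class='FUSE transport error'
  elif [[ $line =~ $re_s3     ]]; then class='S3 throttle'
  elif [[ $line =~ $re_stale  ]]; then class='Stale file handle'
  elif [[ $line =~ $re_queue  ]]; then class='DataLoader queue empty'
  elif [[ $line =~ $re_reset  ]]; then class='Connection reset by peer'
  else
    return 1
  fi

  # The PID is optional: stamp it only when the line carries one.
  if [[ $line =~ $re_pid ]]; then
    pid=${BASH_REMATCH[1]}
  fi

  printf '%s\t%s\n' "$class" "$pid"
}
```

A call such as `classify_line "$log_line"` prints the class and PID separated by a tab, so the output can be split with `cut -f1` and `cut -f2`. Keeping the first-match-wins order explicit matters if two patterns can appear on the same line.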