feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365)#401
Closed
trilamsr wants to merge 3 commits into
added 3 commits
June 1, 2026 15:13
#364: filelog-container ships `transform/dataloader_errors` with per-driver stanzas (FUSE / S3 / Lustre / multiprocessing-queue / worker-killed / connection-reset). Each stamps the customer-stable `dataloader.error_class` value the runbook branches on, plus an optional `dataloader.worker_pid` int when the line carries one. Resolves pattern #7 spec Open Q#3.

#365: prometheus-scrape grows a §"Pattern #7" section documenting the load-bearing wire contract for the training-step-stalled bridge log record (`tracecore.alert.training_step_stalled.no_progress_seconds` + `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the emitter half remains upstream-blocked until WithMetrics PR-B or a `metricthresholdconnector` lands; the detector reads the contract today.

Adds 9 sub-tests pinning every per-driver class value + every bridge attribute guard against the live detector wiring.

Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
3 tasks
Contributor
Author
trilamsr
added a commit that referenced this pull request on Jun 1, 2026
## Summary

Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and #365 (training-step-stalled metrics-to-logs bridge contract); both are pattern-7 and both touch the same recipe surface. This PR REPLACES #401 (which was opened against pre-wave main and accumulated unrelated scope-creep via merge-resolution).

### #364: `dataloader.error_class` per-driver regex

`docs/integrations/filelog-container.md` + `examples/filelog-container.yaml` ship a `transform/dataloader_errors` processor with one stanza per driver class. Each stanza stamps the customer-stable `dataloader.error_class` value the runbook branches on, plus an optional `dataloader.worker_pid` int when the line carries one.

Six driver classes:

- `DataLoader worker killed`
- `FUSE transport error`
- `S3 throttle`
- `Stale file handle`
- `DataLoader queue empty`
- `Connection reset by peer`

Resolves pattern #7 spec Open Q#3.

### #365: training-step-stalled bridge contract

`docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section documenting the load-bearing wire contract for the bridge log record (`tracecore.alert.training_step_stalled.no_progress_seconds` + `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the emitter half stays upstream-blocked until WithMetrics PR-B or `metricthresholdconnector` lands; the detector reads the contract today.

### Test wiring

9 sub-tests in `module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin every per-driver class value + every bridge attribute guard against live detector wiring (eval-phase guard + warmup `step >= 2` guard).

## Why a redo

PR #401 was branched from pre-wave main (commit `636c2a2`). When it merged main back in, it accumulated re-deletions of recently-merged #389 (composite action) + re-adds of #379/#381 already shipped via #392 + #386 already shipped via #397.

Cleanest path: cherry-pick the recipe-only commit onto current main + resolve the one true conflict (DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7 NOTE block).

## Test plan

- [x] `golangci-lint`, `go vet`, `attribute-namespace-check`: green
- [x] `go test ./module/processor/patterndetectorprocessor/... -run DataLoader`: pass
- [x] Pre-push hook gates all green

Closes #364. Closes #365. Supersedes #401.

```release-notes
feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
## Summary

Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and #365 (training-step-stalled metrics-to-logs bridge contract). Both are pattern-7 and both touch the same recipe surface, so there is no churn benefit to splitting.

### #364: `dataloader.error_class` per-driver regex

`docs/integrations/filelog-container.md` + `examples/filelog-container.yaml` ship a new `transform/dataloader_errors` processor with one stanza per driver class. The `dataloader.error_class` values are:

- `DataLoader worker killed`
- `FUSE transport error`
- `S3 throttle`
- `Stale file handle`
- `DataLoader queue empty`
- `Connection reset by peer`

Per-driver, not omnibus, because the operator runbook (pattern-7 spec §"Edge cases") branches on storage class. Resolves spec Open Q#3.
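To make the per-driver idea concrete, here is a minimal Go sketch of first-match classification. The recipe itself does this in OTTL inside `transform/dataloader_errors`; this is only an illustration of the logic. The six class values come from this PR, but the regexes, the `driverStanza` type, and the `classify` helper are hypothetical and are not the patterns the recipe ships.

```go
package main

import (
	"regexp"
	"strconv"
)

// driverStanza pairs one per-driver pattern with the
// dataloader.error_class value it stamps (hypothetical type).
type driverStanza struct {
	pattern *regexp.Regexp
	class   string
}

// Class values are taken from the PR description. The regexes are
// illustrative assumptions, NOT the patterns shipped in the recipe.
var stanzas = []driverStanza{
	{regexp.MustCompile(`(?i)dataloader worker .*killed`), "DataLoader worker killed"},
	{regexp.MustCompile(`(?i)transport endpoint is not connected`), "FUSE transport error"},
	{regexp.MustCompile(`(?i)\(slowdown\)|s3.*throttl`), "S3 throttle"},
	{regexp.MustCompile(`(?i)stale file handle`), "Stale file handle"},
	{regexp.MustCompile(`(?i)queue\.empty|queue empty`), "DataLoader queue empty"},
	{regexp.MustCompile(`(?i)connection reset by peer`), "Connection reset by peer"},
}

// pidPattern captures a worker PID such as "(pid 12345)" or "pid=12345".
// The accepted spellings are an assumption of this sketch.
var pidPattern = regexp.MustCompile(`(?i)\bpid[ =:]+(\d+)`)

// classify returns the class of the first matching stanza, plus the worker
// PID when the line carries one. An unmatched line yields an empty class.
func classify(line string) (class string, pid int, hasPID bool) {
	for _, s := range stanzas {
		if s.pattern.MatchString(line) {
			class = s.class
			break
		}
	}
	if class == "" {
		return "", 0, false
	}
	if m := pidPattern.FindStringSubmatch(line); m != nil {
		if n, err := strconv.Atoi(m[1]); err == nil {
			return class, n, true
		}
	}
	return class, 0, false
}
```

Keeping one pattern per class is what lets a runbook branch on storage class; an omnibus regex would collapse, for example, the FUSE and S3 rows into a single value. The block defines only helpers, so add a `func main` that calls `classify` to run it as a program.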
### #365: `tracecore.alert.training_step_stalled.*` bridge contract

`docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section mirroring the pattern-5 `pcie_rate_collapse` and pattern-10 `hw.gpu.memory.*` shapes. Documents the load-bearing wire format the future emitter MUST honor.

Root-cause status: RFC-0014 documented that OTel-contrib v0.130 cannot emit log records from a metrics pipeline (no contrib `transformprocessor` config or connector spans metrics→logs). This PR ships the wire contract (frozen via `TestPatternDetector_DataLoaderHangTrainingStepStalledWireContract`); the emitter half lands when RFC-0014 PR-B (`processor.WithMetrics` extension) ships or a `metricthresholdconnector` lands upstream. Not a workaround: the upstream blocker is named.

## Tests

Adds two new tests with 9 sub-tests in `dataloader_hang_test.go`:

- `TestPatternDetector_DataLoaderHangPerDriverErrorClasses` (6 sub-tests): each recipe-stamped `dataloader.error_class` value drives the detector through to a verdict whose scalar matches the input. Future recipe drift (e.g. omnibus regex collapsing FUSE+S3) trips the table row that lost coverage.
- `TestPatternDetector_DataLoaderHangTrainingStepStalledWireContract` (3 sub-tests): pins the bridge log-record contract: full contract fires, `phase=eval` suppresses, `step<2` suppresses.

Closes #364 #365
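The three wire-contract cases (full contract fires, `phase=eval` suppresses, `step<2` suppresses) can be pictured as a single guard function. The Go sketch below is illustrative only and is not the detector's code. It assumes that `last_step_ns` shares the `tracecore.alert.training_step_stalled.` prefix, that `phase` is a bare key, that values arrive as `float64`, `int64`, and `string`, and that a missing attribute suppresses. The `shouldFire` helper and the key constants are hypothetical names.

```go
package main

// Attribute keys named in the PR description. The fully qualified form of
// last_step_ns and the bare "phase" key are assumptions of this sketch.
const (
	keyNoProgress = "tracecore.alert.training_step_stalled.no_progress_seconds"
	keyLastStepNs = "tracecore.alert.training_step_stalled.last_step_ns"
	keyStep       = "gen_ai.training.step"
	keyPhase      = "phase"
)

// shouldFire reports whether a bridge log record satisfies the contract:
// every attribute present with the assumed type, phase not "eval",
// and step >= 2 (the warmup guard).
func shouldFire(attrs map[string]any) bool {
	if _, ok := attrs[keyNoProgress].(float64); !ok {
		return false
	}
	if _, ok := attrs[keyLastStepNs].(int64); !ok {
		return false
	}
	step, ok := attrs[keyStep].(int64)
	if !ok || step < 2 {
		return false // warmup guard: too early to call a stall
	}
	phase, ok := attrs[keyPhase].(string)
	if !ok || phase == "eval" {
		return false // eval-phase guard: no training steps are expected
	}
	return true
}
```

A table-driven test over this function would have one row per guard, which mirrors the sub-test layout described above: a row that stops failing when its guard is removed shows exactly which part of the contract lost coverage.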
## Test plan

- `make build`: tracecore binary builds clean
- `make validator-recipe`: 9 of 12 recipes validate locally (3 skipped on darwin; CI ubuntu covers)
- `make fmt`: gofumpt clean
- `make lint`: golangci-lint 0 issues
- `go vet`: clean
- `go test ./processor/patterndetectorprocessor/`: full suite green, 9 new sub-tests pass
- `make doc-check` / `make alert-check` / `make rfc-status-check` / `make chart-appversion-check`: all green via pre-commit hook