
feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365) #401

Closed
trilamsr wants to merge 3 commits into main from worktree-agent-a40aabadfd2eb55d7

trilamsr commented Jun 1, 2026

Contributor

Summary

Bundles #364 (per-driver OTTL stanzas for dataloader.error_class) and #365 (training-step-stalled metrics-to-logs bridge contract). Both belong to pattern 7 and touch the same recipe surface, so splitting them would only add review churn.

#364: dataloader.error_class per-driver regex

docs/integrations/filelog-container.md + examples/filelog-container.yaml ship a new transform/dataloader_errors processor with one stanza per driver class:

| dataloader.error_class | Runbook branch |
| --- | --- |
| DataLoader worker killed | PyTorch SIGKILL (worker exit / OOMKilled) |
| FUSE transport error | mountpoint-s3 / goofys / s3fs |
| S3 throttle | 503 / SlowDown / rate-limit |
| Stale file handle | Lustre / GPFS / NFS-v4 |
| DataLoader queue empty | multiprocessing-queue runtime |
| Connection reset by peer | generic transport (catch-all) |

Per-driver — not omnibus — because the operator runbook (pattern-7 spec §"Edge cases") branches on storage class. Resolves spec Open Q#3.
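The per-driver classification can be sketched in Go as an ordered rule list where the first match wins and the generic transport class sits last. This is a minimal illustration, not the shipped OTTL: the regex patterns, `driverRule`, and `classifyLine` are assumptions made up for this sketch, and only the six class strings come from the table above.

```go
package main

import (
	"fmt"
	"regexp"
)

// driverRule pairs one dataloader.error_class value with a regex.
// The patterns are illustrative assumptions, not the recipe's regexes.
type driverRule struct {
	class   string
	pattern *regexp.Regexp
}

// Order matters: specific driver classes first, the catch-all last.
var driverRules = []driverRule{
	{"DataLoader worker killed", regexp.MustCompile(`DataLoader worker .*killed by signal`)},
	{"FUSE transport error", regexp.MustCompile(`Transport endpoint is not connected`)},
	{"S3 throttle", regexp.MustCompile(`SlowDown|Throttling`)},
	{"Stale file handle", regexp.MustCompile(`Stale file handle`)},
	{"DataLoader queue empty", regexp.MustCompile(`queue\.Empty`)},
	{"Connection reset by peer", regexp.MustCompile(`Connection reset by peer`)},
}

// classifyLine returns the class of the first rule that matches the line.
// The boolean is false when no rule matches, so the attribute stays unset.
func classifyLine(line string) (string, bool) {
	for _, r := range driverRules {
		if r.pattern.MatchString(line) {
			return r.class, true
		}
	}
	return "", false
}

func main() {
	lines := []string{
		"RuntimeError: DataLoader worker (pid 4242) is killed by signal: Killed.",
		"OSError: [Errno 116] Stale file handle",
		"epoch 3 finished",
	}
	for _, l := range lines {
		class, ok := classifyLine(l)
		fmt.Printf("%q -> class=%q matched=%v\n", l, class, ok)
	}
}
```

Keeping one rule per class means a line that mentions both a throttle marker and a connection reset resolves to the more specific storage class, which is what lets the runbook branch on it.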

#365: tracecore.alert.training_step_stalled.* bridge contract

docs/integrations/prometheus-scrape.md grows a §"Pattern #7" section mirroring the pattern-5 pcie_rate_collapse and pattern-10 hw.gpu.memory.* shapes. Documents the load-bearing wire format the future emitter MUST honor.

Root-cause status: RFC-0014 documented that OTel-contrib v0.130 cannot emit log records from a metrics pipeline (no contrib transformprocessor config or connector spans metrics→logs). This PR ships the wire contract (frozen via TestPatternDetector_DataLoaderHangTrainingStepStalledWireContract); the emitter half lands when RFC-0014 PR-B (processor.WithMetrics extension) ships or a metricthresholdconnector lands upstream. Not a workaround — the upstream blocker is named.

Tests

Adds two new tests with 9 sub-tests in dataloader_hang_test.go:

  • TestPatternDetector_DataLoaderHangPerDriverErrorClasses (6 sub-tests) — each recipe-stamped dataloader.error_class value drives the detector through to a verdict whose scalar matches the input. Future recipe drift (e.g. omnibus regex collapsing FUSE+S3) trips the table row that lost coverage.
  • TestPatternDetector_DataLoaderHangTrainingStepStalledWireContract (3 sub-tests) — pins the bridge log-record contract: full contract fires, phase=eval suppresses, step<2 suppresses.
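The three bridge guards pinned by the second test can be sketched as a single predicate. This is a hedged illustration rather than the detector's code: `stalledRecord`, `shouldFire`, and `ptr` are hypothetical names, the Go field names stand in for the wire attribute keys documented in prometheus-scrape.md, the `"train"` phase value is invented for the example, and treating any missing attribute as "do not fire" is an assumption about what "full contract" means.

```go
package main

import "fmt"

// stalledRecord models the bridge log record. Pointer fields are nil when
// the attribute is absent. Field names are placeholders, not wire keys.
type stalledRecord struct {
	NoProgressSeconds *float64
	LastStepNs        *int64
	Step              *int64
	Phase             string // empty when absent
}

// shouldFire applies the guards described above: the full contract must be
// present, eval phases are suppressed, and warmup steps (step < 2) are
// suppressed.
func shouldFire(r stalledRecord) bool {
	if r.NoProgressSeconds == nil || r.LastStepNs == nil || r.Step == nil || r.Phase == "" {
		return false // incomplete contract (assumed to suppress)
	}
	if r.Phase == "eval" {
		return false // eval-phase guard
	}
	if *r.Step < 2 {
		return false // warmup guard
	}
	return true
}

// ptr is a small helper for building records with optional fields.
func ptr[T any](v T) *T { return &v }

func main() {
	full := stalledRecord{
		NoProgressSeconds: ptr(120.0),
		LastStepNs:        ptr(int64(1_700_000_000_000_000_000)),
		Step:              ptr(int64(10)),
		Phase:             "train", // hypothetical non-eval phase value
	}
	eval := full
	eval.Phase = "eval"
	warmup := full
	warmup.Step = ptr(int64(1))

	fmt.Println("full:", shouldFire(full))
	fmt.Println("eval:", shouldFire(eval))
	fmt.Println("warmup:", shouldFire(warmup))
}
```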

Closes #364 #365

- `docs/integrations/filelog-container.md` ships per-driver OTTL stanzas (FUSE / S3 / Lustre / multiprocessing-queue / worker-killed / connection-reset) that stamp `dataloader.error_class` for pattern #7's detector.
- `docs/integrations/prometheus-scrape.md` documents the metrics-to-logs bridge contract for `tracecore.alert.training_step_stalled.*` — the load-bearing wire format the future RFC-0014 PR-B emitter (or upstream `metricthresholdconnector`) honors so pattern #7 fires end-to-end.

Test plan

  • make build — tracecore binary builds clean
  • make validator-recipe — 9 of 12 recipes validate locally (3 skipped on darwin; CI ubuntu covers)
  • make fmt — gofumpt clean
  • make lint — golangci-lint 0 issues
  • go vet — clean
  • go test ./processor/patterndetectorprocessor/ — full suite green, 9 new sub-tests pass
  • make doc-check / make alert-check / make rfc-status-check / make chart-appversion-check — all green via pre-commit hook
  • CI green on push

Tri Lam added 3 commits June 1, 2026 15:13
#364: filelog-container ships `transform/dataloader_errors` with
per-driver stanzas (FUSE / S3 / Lustre / multiprocessing-queue /
worker-killed / connection-reset). Each stamps the customer-stable
`dataloader.error_class` value the runbook branches on, plus an
optional `dataloader.worker_pid` int when the line carries one.
Resolves pattern #7 spec Open Q#3.

#365: prometheus-scrape grows a §"Pattern #7" section documenting
the load-bearing wire contract for the training-step-stalled bridge
log record (`tracecore.alert.training_step_stalled.no_progress_seconds`
+ `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014
the emitter half remains upstream-blocked until WithMetrics PR-B or
a `metricthresholdconnector` lands; the detector reads the contract
today.

Adds 9 sub-tests pinning every per-driver class value + every
bridge attribute guard against the live detector wiring.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr commented Jun 1, 2026

Contributor Author

Superseded by #406 — recipe-only redo against current main. The previous branch had accumulated unrelated scope (re-deletions of #389 composite action, re-adds of #379/#381/#386 already shipped) from a merge resolution off stale base. #406 is the recipe-only cherry-pick.

trilamsr closed this Jun 1, 2026
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and
#365 (training-step-stalled metrics-to-logs bridge contract) — both
pattern-7, both touch the same recipe surface. This PR REPLACES #401
(which was opened against pre-wave main and accumulated unrelated
scope-creep via merge-resolution).

### #364 — `dataloader.error_class` per-driver regex

`docs/integrations/filelog-container.md` +
`examples/filelog-container.yaml` ship a `transform/dataloader_errors`
processor with one stanza per driver class. Each stanza stamps the
customer-stable `dataloader.error_class` value the runbook branches on,
plus an optional `dataloader.worker_pid` int when the line carries one.

Six driver classes: `DataLoader worker killed`, `FUSE transport error`,
`S3 throttle`, `Stale file handle`, `DataLoader queue empty`,
`Connection reset by peer`. Resolves pattern #7 spec Open Q#3.
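A rough Go sketch of the optional `dataloader.worker_pid` extraction, assuming the PID appears in a `(pid N)` group as in PyTorch's worker-killed message. The regex and the `workerPID` helper are hypothetical; the recipe's actual OTTL pattern is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// pidPattern is an assumed shape for lines that carry a single worker PID.
var pidPattern = regexp.MustCompile(`\(pid (\d+)\)`)

// workerPID returns the PID as an int when the line carries one.
// The boolean is false otherwise, so the attribute is simply not stamped.
func workerPID(line string) (int64, bool) {
	m := pidPattern.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	pid, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false // e.g. digits that overflow int64
	}
	return pid, true
}

func main() {
	for _, l := range []string{
		"RuntimeError: DataLoader worker (pid 4242) is killed by signal: Killed.",
		"OSError: [Errno 116] Stale file handle",
	} {
		pid, ok := workerPID(l)
		fmt.Printf("pid=%d present=%v\n", pid, ok)
	}
}
```

Returning a presence flag instead of a zero PID keeps "no PID on this line" distinct from a real value, matching the attribute being optional.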

### #365 — training-step-stalled bridge contract

`docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section
documenting the load-bearing wire contract for the bridge log record
(`tracecore.alert.training_step_stalled.no_progress_seconds` +
`last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the
emitter half stays upstream-blocked until WithMetrics PR-B or
`metricthresholdconnector` lands; the detector reads the contract today.

### Test wiring

9 sub-tests in
`module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin
every per-driver class value + every bridge attribute guard against live
detector wiring (eval-phase guard + warmup `step >= 2` guard).

## Why a redo PR

#401 was branched from pre-wave main (commit `636c2a2`). When it merged
main back in, it accumulated re-deletions of recently-merged #389
(composite action) + re-adds of #379/#381 already shipped via #392 +
#386 already shipped via #397. Cleanest path: cherry-pick the
recipe-only commit onto current main + resolve the one true conflict
(DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7
NOTE block).

## Test plan

- [x] `golangci-lint`, `go vet`, `attribute-namespace-check` — green
- [x] `go test ./module/processor/patterndetectorprocessor/... -run
DataLoader` — pass
- [x] Pre-push hook gates all green

Closes #364. Closes #365. Supersedes #401.

```release-notes
feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>