docs: bundled post-wave-audit cleanups (#381 #383 #379) #392
Merged
added 3 commits
June 1, 2026 14:58
Wave-3/4 PRs (PR-I.1a/1b/2) all landed long ago. The 'Contents land in PR-I.1b/I.2' framing and the genesis-tag proxy rationale are no longer load-bearing: the module is alive and well, and the `package module` declaration alone satisfies the proxy requirement.

Closes #381

Signed-off-by: Tri Lam <tri@maydow.com>
Two strikethrough blocks targeted by post-wave-audit finding #15:

- docs/patterns/10-cuda-oom-deceptive.md:74 (referenced closed #337)
- docs/v1-rc1-simplification-audit.md:185-186 (referenced closed #333 / #334)

Replace strikethrough markup with plain prose. The resolution text itself is retained (still useful as cross-link history); only the strike-through visual noise is removed.

Followups/* + MILESTONES.md + RFC-0003 strikethroughs are documented convention markers (per MILESTONES.md:65,69) and are intentionally untouched.

Closes #383

Signed-off-by: Tri Lam <tri@maydow.com>
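The markup change described in this commit (drop the `~~` delimiters, keep the text they wrapped) can be sketched in Go as below. This is an illustration only, not the tooling used for the commit; the function name `Unstrike` is hypothetical, and the sketch assumes each strikethrough span opens and closes on the same line.

```go
package main

import (
	"fmt"
	"regexp"
)

// strikeRe matches one Markdown strikethrough span on a single line.
// The lazy quantifier keeps two spans on the same line separate.
var strikeRe = regexp.MustCompile(`~~(.+?)~~`)

// Unstrike (hypothetical helper) removes the ~~ delimiters and keeps the
// text that was struck through.
func Unstrike(line string) string {
	return strikeRe.ReplaceAllString(line, "$1")
}

func main() {
	in := "~~components/exporters/otlphttp/~~ + ~~components/exporters/stdoutexporter/~~"
	fmt.Println(Unstrike(in))
}
```

A helper like this would be applied only to the two file locations named above; running it over `docs/followups/*`, `MILESTONES.md`, or RFC-0003 would erase the convention markers this commit deliberately leaves alone.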
Post-wave-audit finding #5: config.go carries three knob-naming styles (bare, pattern-prefixed, inconsistent flat). Prefixed is the right shape because it avoids cross-pattern collisions on `*_window` / `*_threshold` axes. But renaming bare→prefixed pre-RC1 would break every existing values.yaml.

Take option 1 from #379: document the convention at top-of-file so future detectors follow the prefixed shape. v2.0 will nest into `<pattern>: { <knob>: ... }` blocks with a migration helper.

Closes #379

Signed-off-by: Tri Lam <tri@maydow.com>
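To make the prefixed shape concrete, here is a minimal Go sketch of a check that a knob name follows the `<pattern>_<knob>` convention. Everything named here is an assumption for illustration: the pattern prefixes `cuda_oom` and `dataloader_hang`, the knob names, and the helper `IsPrefixed` are hypothetical and are not taken from `config.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// patterns lists hypothetical pattern prefixes. The real detector names
// live in config.go and may differ.
var patterns = []string{"cuda_oom", "dataloader_hang"}

// IsPrefixed (hypothetical helper) reports whether knob has the form
// "<pattern>_<knob>" for one of the known patterns.
func IsPrefixed(knob string) bool {
	for _, p := range patterns {
		if strings.HasPrefix(knob, p+"_") && len(knob) > len(p)+1 {
			return true
		}
	}
	return false
}

func main() {
	// A bare "window" says nothing about which pattern it tunes, so two
	// detectors cannot both own it. The prefixed names cannot collide.
	for _, k := range []string{"window", "cuda_oom_window", "dataloader_hang_threshold"} {
		fmt.Printf("%-28s prefixed=%v\n", k, IsPrefixed(k))
	}
}
```

The commit itself only documents the convention in a comment; a check like this is one way a future lint step could enforce it, if the project chose to add one.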
trilamsr added a commit that referenced this pull request on Jun 1, 2026
## Summary

Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and #365 (training-step-stalled metrics-to-logs bridge contract). Both are pattern-7 and both touch the same recipe surface.

This PR REPLACES #401 (which was opened against pre-wave main and accumulated unrelated scope-creep via merge-resolution).

### #364: `dataloader.error_class` per-driver regex

`docs/integrations/filelog-container.md` + `examples/filelog-container.yaml` ship a `transform/dataloader_errors` processor with one stanza per driver class. Each stanza stamps the customer-stable `dataloader.error_class` value the runbook branches on, plus an optional `dataloader.worker_pid` int when the line carries one.

Six driver classes:

- `DataLoader worker killed`
- `FUSE transport error`
- `S3 throttle`
- `Stale file handle`
- `DataLoader queue empty`
- `Connection reset by peer`

Resolves pattern #7 spec Open Q#3.

### #365: training-step-stalled bridge contract

`docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section documenting the load-bearing wire contract for the bridge log record (`tracecore.alert.training_step_stalled.no_progress_seconds` + `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the emitter half stays upstream-blocked until WithMetrics PR-B or `metricthresholdconnector` lands; the detector reads the contract today.

### Test wiring

9 sub-tests in `module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin every per-driver class value + every bridge attribute guard against live detector wiring (eval-phase guard + warmup `step >= 2` guard).

## Why a redo

PR #401 was branched from pre-wave main (commit `636c2a2`). When it merged main back in, it accumulated re-deletions of recently-merged #389 (composite action) + re-adds of #379/#381 already shipped via #392 + #386 already shipped via #397.

Cleanest path: cherry-pick the recipe-only commit onto current main + resolve the one true conflict (DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7 NOTE block).

## Test plan

- [x] `golangci-lint`, `go vet`, `attribute-namespace-check`: green
- [x] `go test ./module/processor/patterndetectorprocessor/... -run DataLoader`: pass
- [x] Pre-push hook gates all green

Closes #364. Closes #365. Supersedes #401.

```release-notes
feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
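The per-driver classification described under #364 can be pictured with the Go sketch below. It is a stand-in for the idea, not the OTTL recipe itself: the shipped stanzas live in `examples/filelog-container.yaml` and are not reproduced here. The regular expressions, the pid format `(pid N)`, and the choice to stamp the class names verbatim are all assumptions made for illustration; only the six class names come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// stanza pairs a hypothetical matcher with the class value it stamps.
type stanza struct {
	re    *regexp.Regexp
	class string
}

// One entry per driver class named in the commit message. The patterns
// are guesses at what such log lines contain, not the recipe's regexes.
var stanzas = []stanza{
	{regexp.MustCompile(`(?i)dataloader worker.*killed`), "DataLoader worker killed"},
	{regexp.MustCompile(`(?i)transport endpoint is not connected`), "FUSE transport error"},
	{regexp.MustCompile(`(?i)slowdown|throttl`), "S3 throttle"},
	{regexp.MustCompile(`(?i)stale file handle`), "Stale file handle"},
	{regexp.MustCompile(`(?i)dataloader queue empty`), "DataLoader queue empty"},
	{regexp.MustCompile(`(?i)connection reset by peer`), "Connection reset by peer"},
}

// pidRe assumes the worker pid appears as "(pid 1234)" in the line.
var pidRe = regexp.MustCompile(`\(pid (\d+)\)`)

// Classify (hypothetical helper) returns the dataloader.error_class value
// for a log line, plus the worker pid when the line carries one. An empty
// class means no stanza matched.
func Classify(line string) (class string, pid int, hasPID bool) {
	for _, s := range stanzas {
		if s.re.MatchString(line) {
			class = s.class
			break
		}
	}
	if class == "" {
		return "", 0, false
	}
	if m := pidRe.FindStringSubmatch(line); m != nil {
		if n, err := strconv.Atoi(m[1]); err == nil {
			return class, n, true
		}
	}
	return class, 0, false
}

func main() {
	class, pid, ok := Classify("RuntimeError: DataLoader worker (pid 4242) is killed by signal: Killed.")
	fmt.Println(class, pid, ok)
}
```

The first matching stanza wins, which mirrors the "one stanza per driver class" layout; the pid stays optional so lines without one still get a class.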
## Summary

Bundle three post-wave-audit docs-cleanup issues (`docs/v1-rc1-post-wave-audit.md` findings #5, #8, #15). One commit per issue for clean revert / history. Fourth audit cleanup (#380) intentionally deferred: precondition unmet.

## Per-issue scope + delta
### `chore(module): trim stale PR-I genesis-tag rationale` (closes #381)

Audit finding #8. `module/doc.go` referenced PR-I.1a/1b/2 as "Contents land in…" and explained the genesis-tag proxy rationale. All three PRs landed long ago; the module is alive. Trimmed to package-doc only.

- `package module` declaration only, retained for proxy compat.
- `go list -m github.com/tracecoreai/tracecore/module` still resolves.

### `docs(rc1): sweep strikethrough resolved-in-issue blocks` (closes #383)

Audit finding #15. The two specific examples called out by the issue:

- `docs/patterns/10-cuda-oom-deceptive.md:74`: `~~DCGM_FI_DEV_FB_* OTTL recipe extension.~~` referenced closed #337, "[rc1-prep] OTTL recipe: project DCGM_FI_DEV_FB_USED/FB_FREE → hw.gpu.memory.{free,total} log shape (pattern #10 wiring)" (verified via `gh issue view 337` → CLOSED).
- `docs/v1-rc1-simplification-audit.md:185-186`: `~~components/exporters/otlphttp/~~` + `~~components/exporters/stdoutexporter/~~` referenced closed #333, "[rc1-prep] Delete in-tree components/exporters/otlphttp/ wrapper (superseded by upstream otlphttpexporter)", and #334, "[rc1-prep] Delete in-tree components/exporters/stdoutexporter/ wrapper (superseded by upstream debugexporter)" (both verified CLOSED).

Resolution text retained as cross-link history; only the strike-through visual noise removed.

Explicitly NOT touched: `docs/followups/M*.md`, `docs/MILESTONES.md`, `docs/rfcs/0003-*.md` strikethroughs. These are documented convention markers per `MILESTONES.md:65,69` ("A PR that completes a follow-up strikes it through (`~~…~~`)"). Sweeping them would delete a load-bearing convention.

### `chore(config): document pattern-prefixed knob naming convention` (closes #379)

Audit finding #5. `config.go` (540 lines) carries three knob styles. The post-wave prefixed shape is right; renaming bare→prefixed pre-RC1 would break every existing `values.yaml`. Take option 1 from the issue: document the convention at top-of-file so future detectors follow it.

- `// Knob-naming convention` block before `package patterndetectorprocessor`, naming bare-name knobs explicitly and pointing forward to v2.0 nesting.

## Why #380 is NOT in this bundle
#380 sweeps 12 "future PR-B" comments referencing the RFC-0014 `WithMetrics` bridge. Its own acceptance criteria say "Closes with PR-B's merge commit."

Precondition check today:

- `git log --all --grep=WithMetrics` → only wave-3 port commits, no bridge.
- `docs/rfcs/0014-metrics-to-logs-pattern-input.md:3` → status `accepted`, blocks issue #260 PR-B.
- `docs/integrations/prometheus-scrape.md:381` → "PR-B has NOT yet shipped."

All 12 comments are factually accurate today. Sweeping prematurely would either delete load-bearing forward-looking documentation, or replace it with stale post-PR-B prose. Issue stays open as tracker per its design; commented on issue noting the deferral.
## Verification

- `go build ./...` (module): clean.
- `go vet ./...`: clean (via pre-commit).
- `go list -m github.com/tracecoreai/tracecore/module`: resolves.
- `golangci-lint run ./...`: 0 issues (via pre-commit).
- `attribute-namespace-check`: 100/100 (via pre-commit).
- `gh issue view 337 333 334`: all CLOSED (justifies the #383 sweep, "docs(rc1-tag-cut): sweep strikethrough resolved-in-issue blocks").

## Test plan
- `docs/followups/*` / `MILESTONES.md` / RFC-0003 left intact (deliberate scope-limit).
- `#380` deferred with on-issue rationale; not in `Closes` trailer.

Closes #383
Closes #381
Closes #379