docs: bundled post-wave-audit cleanups (#381 #383 #379) #392

Merged
trilamsr merged 3 commits into main from worktree-agent-a8885593a38582acc
Jun 1, 2026

Conversation

@trilamsr trilamsr commented Jun 1, 2026

Contributor

Summary

Bundle three post-wave-audit docs-cleanup issues (docs/v1-rc1-post-wave-audit.md findings #5, #8, #15). One commit per issue for clean revert / history. Fourth audit cleanup (#380) intentionally deferred — precondition unmet.

Per-issue scope + delta

chore(module): trim stale PR-I genesis-tag rationale (closes #381)

Audit finding #8. module/doc.go referenced PR-I.1a/1b/2 as "Contents land in…" and explained the genesis-tag proxy rationale. All three PRs landed long ago; the module is alive. Trimmed to package-doc only.

  • Before: 14 lines (8-line historical-artifact block).
  • After: 6 lines (package module declaration only, retained for proxy compat).
  • go list -m github.com/tracecoreai/tracecore/module still resolves.

docs(rc1): sweep strikethrough resolved-in-issue blocks (closes #383)

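Audit finding #15. The issue calls out two specific examples:

  • docs/patterns/10-cuda-oom-deceptive.md:74, which referenced closed #337.
  • docs/v1-rc1-simplification-audit.md:185-186, which referenced closed #333 / #334.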

Resolution text retained as cross-link history; only the strike-through visual noise removed.

Explicitly NOT touched: docs/followups/M*.md, docs/MILESTONES.md, docs/rfcs/0003-*.md strikethroughs — these are documented convention markers per MILESTONES.md:65,69 ("A PR that completes a follow-up strikes it through (~~…~~)"). Sweeping them would delete a load-bearing convention.
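The sweep itself (drop the `~~` markers, keep the resolution text) can be sketched in Go. This is a minimal illustration, not the tooling used for this PR: the function name, the regular expression, and the sample line are all hypothetical, and it assumes single-line strikethrough spans.

```go
package main

import (
	"fmt"
	"regexp"
)

// strikethrough matches a single-line ~~...~~ span and captures the struck text.
// Assumption: spans do not cross line breaks and contain no tilde characters.
var strikethrough = regexp.MustCompile(`~~([^~\n]+)~~`)

// stripStrikethrough removes the ~~ markers but keeps the text between them.
func stripStrikethrough(s string) string {
	return strikethrough.ReplaceAllString(s, "${1}")
}

func main() {
	// Hypothetical line, not copied from the swept files.
	line := "- ~~Resolved in #337.~~ See the runbook for details."
	fmt.Println(stripStrikethrough(line))
}
```

A helper like this would have to be pointed only at the two targeted files, since the convention markers listed above must keep their markup.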

chore(config): document pattern-prefixed knob naming convention (closes #379)

Audit finding #5. config.go (540 lines) carries three knob styles. The post-wave prefixed shape is right; renaming bare→prefixed pre-RC1 would break every existing values.yaml. Take option 1 from the issue: document the convention at top-of-file so future detectors follow it.

  • Before: No top-of-file knob-naming guidance.
  • After: 12-line // Knob-naming convention block before package patterndetectorprocessor, naming bare-name knobs explicitly and pointing forward to v2.0 nesting.
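To make the forward pointer concrete, here is a rough Go sketch of how flat pattern-prefixed knobs could be regrouped into the nested shape planned for v2.0. It is only an illustration under stated assumptions: `nestKnobs`, `splitPrefixed`, and the knob names (`cuda_oom_window`, `window`) are hypothetical and do not come from config.go, and the real migration helper may look quite different.

```go
package main

import (
	"fmt"
	"strings"
)

// nestKnobs groups flat "<pattern>_<knob>" keys under their pattern name.
// Keys that match no known pattern prefix (bare-name knobs) stay at top level.
func nestKnobs(flat map[string]any, patterns []string) map[string]any {
	out := make(map[string]any, len(flat))
	for key, val := range flat {
		pattern, knob, ok := splitPrefixed(key, patterns)
		if !ok {
			out[key] = val
			continue
		}
		group, _ := out[pattern].(map[string]any)
		if group == nil {
			group = map[string]any{}
			out[pattern] = group
		}
		group[knob] = val
	}
	return out
}

// splitPrefixed splits key into pattern and knob when key starts with
// "<pattern>_"; the longest matching pattern wins.
func splitPrefixed(key string, patterns []string) (pattern, knob string, ok bool) {
	for _, p := range patterns {
		rest, found := strings.CutPrefix(key, p+"_")
		if found && rest != "" && len(p) > len(pattern) {
			pattern, knob, ok = p, rest, true
		}
	}
	return pattern, knob, ok
}

func main() {
	// Hypothetical knobs: one bare name, two prefixed with a pattern name.
	flat := map[string]any{
		"window":             "5m",
		"cuda_oom_window":    "30s",
		"cuda_oom_threshold": 3,
	}
	fmt.Println(nestKnobs(flat, []string{"cuda_oom"}))
}
```

The sketch does not handle a bare knob whose name equals a pattern name; a real helper would need to reject or rename that case.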

Why #380 is NOT in this bundle

#380 sweeps 12 "future PR-B" comments referencing the RFC-0014 WithMetrics bridge. Its own acceptance criteria say "Closes with PR-B's merge commit."

Precondition check today:

  • git log --all --grep=WithMetrics → only wave-3 port commits, no bridge.
  • docs/rfcs/0014-metrics-to-logs-pattern-input.md:3 → status accepted, blocks issue #260 PR-B.
  • docs/integrations/prometheus-scrape.md:381 → "PR-B has NOT yet shipped."

All 12 comments are factually accurate today. Sweeping prematurely would either delete load-bearing forward-looking documentation, or replace it with stale post-PR-B prose. Issue stays open as tracker per its design; commented on issue noting the deferral.

Verification

  • go build ./... (module) — clean.
  • go vet ./... — clean (via pre-commit).
  • go list -m github.com/tracecoreai/tracecore/module — resolves.
  • golangci-lint run ./... — 0 issues (via pre-commit).
  • attribute-namespace-check — 100/100 (via pre-commit).
  • gh issue view 337 333 334: all CLOSED (justifies the #383 sweep).

Test plan

  • Module builds + vets.
  • Referenced closed-issue states verified.
  • Strikethrough convention markers in docs/followups/* / MILESTONES.md / RFC-0003 left intact (deliberate scope-limit).
  • #380 deferred with on-issue rationale; not in Closes trailer.
  • CI green.
NONE

Closes #383
Closes #381
Closes #379

Tri Lam added 3 commits June 1, 2026 14:58
Wave-3/4 PRs (PR-I.1a/1b/2) all landed long ago. The 'Contents land
in PR-I.1b/I.2' framing and the genesis-tag proxy rationale are no
longer load-bearing — the module is alive and well, and `package
module` declaration alone satisfies the proxy requirement.

Closes #381

Signed-off-by: Tri Lam <tri@maydow.com>

Two strikethrough blocks targeted by post-wave-audit finding #15:
- docs/patterns/10-cuda-oom-deceptive.md:74 — referenced closed #337
- docs/v1-rc1-simplification-audit.md:185-186 — referenced closed
  #333 / #334

Replace strikethrough markup with plain prose. The resolution text
itself is retained (still useful as cross-link history); only the
strike-through visual noise is removed. Followups/* + MILESTONES.md
+ RFC-0003 strikethroughs are documented convention markers (per
MILESTONES.md:65,69) and are intentionally untouched.

Closes #383

Signed-off-by: Tri Lam <tri@maydow.com>

Post-wave-audit finding #5: config.go carries three knob-naming
styles (bare, pattern-prefixed, inconsistent flat). Prefixed is the
right shape — it avoids cross-pattern collisions on `*_window` /
`*_threshold` axes. But renaming bare→prefixed pre-RC1 would break
every existing values.yaml.

Take option 1 from #379: document the convention at top-of-file so
future detectors follow the prefixed shape. v2.0 will nest into
`<pattern>: { <knob>: ... }` blocks with a migration helper.

Closes #379

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr merged commit 2aa6c0b into main Jun 1, 2026
12 checks passed
@trilamsr trilamsr deleted the worktree-agent-a8885593a38582acc branch June 1, 2026 22:13
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and
#365 (training-step-stalled metrics-to-logs bridge contract) — both
pattern-7, both touch the same recipe surface. This PR REPLACES #401
(which was opened against pre-wave main and accumulated unrelated
scope-creep via merge-resolution).

### #364 — `dataloader.error_class` per-driver regex

`docs/integrations/filelog-container.md` +
`examples/filelog-container.yaml` ship a `transform/dataloader_errors`
processor with one stanza per driver class. Each stanza stamps the
customer-stable `dataloader.error_class` value the runbook branches on,
plus an optional `dataloader.worker_pid` int when the line carries one.

Six driver classes: `DataLoader worker killed`, `FUSE transport error`,
`S3 throttle`, `Stale file handle`, `DataLoader queue empty`,
`Connection reset by peer`. Resolves pattern #7 spec Open Q#3.
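For readers who think in Go rather than OTTL, the per-driver classification can be approximated as below. This is a sketch, not the shipped `transform/dataloader_errors` stanzas: it assumes the six strings listed above serve as both the match text and the stamped class value, and the `(pid N)` pattern and the `classify` function are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// driverClasses holds the six strings from the list above. Using them as
// both match text and class value is an assumption of this sketch.
var driverClasses = []string{
	"DataLoader worker killed",
	"FUSE transport error",
	"S3 throttle",
	"Stale file handle",
	"DataLoader queue empty",
	"Connection reset by peer",
}

// workerPID is a hypothetical pattern for "(pid 1234)"; real driver lines may differ.
var workerPID = regexp.MustCompile(`\(pid (\d+)\)`)

// classify returns the first class whose text appears in line, plus the
// worker pid when the line carries one. An empty class means no match.
func classify(line string) (class string, pid int, hasPID bool) {
	for _, c := range driverClasses {
		if strings.Contains(line, c) {
			class = c
			break
		}
	}
	if class == "" {
		return "", 0, false
	}
	if m := workerPID.FindStringSubmatch(line); m != nil {
		if n, err := strconv.Atoi(m[1]); err == nil {
			return class, n, true
		}
	}
	return class, 0, false
}

func main() {
	// Hypothetical log line for illustration only.
	class, pid, ok := classify("RuntimeError: DataLoader worker killed (pid 4242) by signal")
	fmt.Println(class, pid, ok)
}
```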

### #365 — training-step-stalled bridge contract

`docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section
documenting the load-bearing wire contract for the bridge log record
(`tracecore.alert.training_step_stalled.no_progress_seconds` +
`last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the
emitter half stays upstream-blocked until WithMetrics PR-B or
`metricthresholdconnector` lands; the detector reads the contract today.

### Test wiring

9 sub-tests in
`module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin
every per-driver class value + every bridge attribute guard against live
detector wiring (eval-phase guard + warmup `step >= 2` guard).
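The two guards can be pictured with a small Go sketch. It is not the detector's code: `shouldEvaluate` is a hypothetical name, the bare `phase` key and the `"eval"` value are placeholders for whatever the contract specifies, and the sketch assumes the step arrives as an `int64` and that a missing step fails the guard.

```go
package main

import "fmt"

// Attribute keys named in the bridge contract. The bare "phase" key is a
// placeholder; the contract may qualify it with a namespace.
const (
	attrNoProgress = "tracecore.alert.training_step_stalled.no_progress_seconds"
	attrStep       = "gen_ai.training.step"
	attrPhase      = "phase"
)

// shouldEvaluate reports whether a bridge record passes both guards:
// it is not from the eval phase, and training has reached step 2 or later.
func shouldEvaluate(attrs map[string]any) bool {
	// Eval-phase guard; the "eval" value is an assumption.
	if phase, _ := attrs[attrPhase].(string); phase == "eval" {
		return false
	}
	// Warmup guard: a missing or non-int64 step is treated as not ready.
	step, ok := attrs[attrStep].(int64)
	return ok && step >= 2
}

func main() {
	// Hypothetical record values for illustration only.
	record := map[string]any{
		attrNoProgress: 420.0,
		attrStep:       int64(17),
		attrPhase:      "train",
	}
	fmt.Println(shouldEvaluate(record))
}
```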

## Why a redo PR

#401 was branched from pre-wave main (commit `636c2a2`). When it merged
main back in, it accumulated re-deletions of recently-merged #389
(composite action) + re-adds of #379/#381 already shipped via #392 +
#386 already shipped via #397. Cleanest path: cherry-pick the
recipe-only commit onto current main + resolve the one true conflict
(DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7
NOTE block).

## Test plan

- [x] `golangci-lint`, `go vet`, `attribute-namespace-check` — green
- [x] `go test ./module/processor/patterndetectorprocessor/... -run
DataLoader` — pass
- [x] Pre-push hook gates all green

Closes #364. Closes #365. Supersedes #401.

```release-notes
feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>