Add developer foundation: Git hooks suite, issue templates, editorconfig#11
Merged
Conversation
Tighten the developer feedback loop and codify what "good" looks like so M1 sets a pattern future packages copy rather than diverge from. - Makefile: new `check` target (fmt + tidy-check + lint + test) for the inner loop, complementing `ci` (license-check + vet + lint + test + build) which remains the pre-PR gate. New `tidy-check` runs `go mod tidy -diff` non-mutatingly so dependency drift fails fast. - go.sum: 54 transitive-checksum additions surfaced by tidy-check on first run. Bundled here so the new gate lands green; no source change. - AGENTS.md: new "Exemplar packages" section names internal/pipeline/ (lands with M1) as the canonical runtime shape and lists the rubric (layout, doc comments, error wrapping, slog discipline, concurrency contract, table tests, no speculative abstractions) every package must satisfy to qualify as an exemplar. Verification section now distinguishes `make check` (continuous) from `make ci` (pre-PR). - CONTRIBUTING.md: dev quick-reference surfaces `make check`. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Five reinforcing tweaks surfaced by the line-by-line review: - AGENTS.md: rubric body becomes pointers into STYLE.md (Errors, Logging, Concurrency, Tests, Layout). STYLE.md is the source of truth; the rubric is now the checklist of which sections apply, not a restatement that drifts. Also: "Lands with M1" softened to "designated once M1 clears the rubric"; component path spelled out to match STYLE.md verbatim; "no hidden dependencies on main" reworded to "no reaching into cmd/tracecore globals"; "Two audiences" -> "Two cadences". - Makefile: tidy-check is now part of `ci`, not just `check`, so CI itself catches go.sum drift instead of relying on the contributor to remember the inner-loop command. - .github/workflows/pr-validation.yml: release-notes block skip pattern extended from docs/CI/tests to also cover Makefile, go.mod/go.sum, scripts/, .golangci.yml, .go-version, dotfiles — the paths with no user-visible runtime behavior. - CONTRIBUTING.md / AGENTS.md: "one concern per PR" rule relaxed to allow prerequisite cleanup that the change itself surfaces (e.g., a tidy-drift fix needed for a new gate to land green), provided it's called out in the PR body. This commit is itself such a case. - .claude/skills/fpr/SKILL.md: §9 expanded — when responding to review, also re-check the PR title and Summary/Changes sections against the current diff and rewrite them if scope shifted. Don't just append "Recent fixes" while the body stays stale. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
`go mod tidy -diff` writes the actual diff to stdout, but in fresh environments (CI on a clean checkout) it also writes "go: downloading X" messages to stderr while fetching the module info it needs to compute the diff. Capturing both with `2>&1` made the variable non-empty whenever any download chatter appeared, which falsely failed the check on CI even when go.mod / go.sum were perfectly tidy. Locally the module cache silenced stderr, so the bug was invisible. Caught by CI's `verify` job on this same PR after `tidy-check` was added to the `ci` target. Drop the `2>&1` so only stdout (the real diff) decides pass/fail; let stderr flow to the terminal for visibility but not into the decision. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Four small, durable additions that compound for the life of the
project, bundled per the relaxed one-concern-with-prerequisites rule:
- .githooks/pre-commit + `make hooks` target: opt-in pre-commit hook
that runs `make check` on every commit. Installation is one command
(`make hooks` sets core.hooksPath to .githooks). Bypass with
--no-verify; CI still catches what gets skipped. Encodes the inner
loop as a default, not a discipline contributors have to remember.
- .github/ISSUE_TEMPLATE/{bug_report.md,feature_request.md,config.yml}:
structured bug and feature templates. config.yml disables blank
issues and routes architectural changes to docs/rfcs/, security
reports to private advisories, and questions to Discussions. Makes
the first external issue the *least* badly-formatted instead of
the most.
- .editorconfig: charset, EOL, indent rules for Go (tabs), Makefile
(tabs), go.mod/go.sum (tabs), Markdown (preserve trailing
whitespace for hard breaks), everything else (2 spaces). Locks
down decisions gofumpt doesn't cover so YAML/Markdown contributors
don't accidentally drift.
- .github/PULL_REQUEST_TEMPLATE.md polish: drops the "Keep PRs under
500 lines" line that contradicts the just-relaxed one-concern rule;
adds `make check` to the checklist alongside `make ci`; adds the
"PR title and Summary still reflect the current diff" item that
the /fpr skill §9 now enforces.
- CONTRIBUTING.md: dev quick-reference surfaces `make hooks`.
Depends on PR #10 (the pre-commit hook calls `make check`, which #10
introduces). Diff will collapse once #10 merges.
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
60d8223 to
b0fac1f
Compare
Round out the .githooks/ suite introduced earlier in this PR so the moment of truth for whether a change passes is `git push`, not the CI queue. Three complementary gates: - .githooks/commit-msg: enforces the two rules CI checks per-commit — ≤72-char subject (STYLE.md §Commits) and DCO Signed-off-by trailer (matches .github/scripts/dco-check.sh). Skips git-managed shapes (Merge, fixup!, squash!, amend!) where the rules don't apply. Catches the exact failure mode this branch hit twice today: forgetting -s. - .githooks/pre-push: runs `make ci` so the full gate (license, vet, lint, tidy, test with race, build) fires before the network round-trip. Closes the gap between pre-commit's lighter `make check` and CI's full sweep, which is the gap that surfaced this PR's tidy-check stderr bug (caught only after pushing). - scripts/setup-dev.sh: idempotent one-shot bootstrap — runs `go mod download` and `make hooks`, then prints the hook map. Replaces the two-step manual setup in CONTRIBUTING.md with one command, while `make hooks` stays available for users who already downloaded modules. - CONTRIBUTING.md: dev quick-reference now points at setup-dev.sh and documents the three hooks together. All three hooks honour `--no-verify` for emergencies. CI remains the final authority; locally enforced is a velocity multiplier, not a gate. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Five small tightenings: - Makefile: `make hooks` output now lists all three hooks (was stale, only mentioned pre-commit). Anyone running `make hooks` directly (not through setup-dev.sh) gets an accurate picture. - scripts/setup-dev.sh: pre-flight check for `go` on PATH so the failure mode is a clear error message with install pointer, not a confusing `go: command not found` from `go mod download`. Also trimmed the "Downloading Go modules and tools" message to just "modules" since tools are pulled on-demand by `go tool`, not by `go mod download`. - .githooks/commit-msg: clarified that the DCO check mirrors .github/scripts/dco-check.sh and notes future drift risk. - .github/PULL_REQUEST_TEMPLATE.md: replaced the bare "Fixes #" placeholder (which rendered verbatim if a contributor forgot to edit it) with an explicit "_No linked issue._" default and clearer instructions to either replace or delete the section. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
PR #10 squash-merged to main (3027c1e), bringing PR #10's polished ordering (`make ci` reorder, fpr SKILL.md §9 body-edit rewrite, release-notes skip pattern dropping `tests/*`, CONTRIBUTING.md `tidy-check` consistency) into main as one commit. PR #11's branch still carried the unpolished individual commits, so the same lines were edited on both sides. Resolved by taking main's polished versions in those four files; PR #11's own additions (`hooks` target, hooks docs block in CONTRIBUTING.md) are preserved. Per workflow-no-stacked-prs memory: merging main in instead of rebasing keeps the push non-force. Conflicts resolved: - Makefile .PHONY: keep PR #11's `hooks`. ci order: take main's polished `license-check vet tidy-check lint test build`. - CONTRIBUTING.md: keep PR #11's hooks documentation block, but use main's polished `make ci` line (`tidy-check`, not `tidy`). - .github/workflows/pr-validation.yml: take main's polished case — `scripts/*` with comment, no dead `tests/*`. - .claude/skills/fpr/SKILL.md: take main's §9 rewrite (separate /tmp/pr_title.txt and /tmp/pr_body.md with explicit edit step). Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
May 13, 2026
## What this PR does M1 ships the pipeline runtime contract; M1.5 proves it works end-to-end with one canonical receiver, one canonical exporter, and an integration test that runs the binary. | Area | What lands | |---|---| | **Pipeline contract** (`internal/pipeline/`) | `Component`, `Host`, per-signal `Factory` interfaces; `Runtime` with two-phase shutdown (1s receivers + operator-configurable drain budget); `pipelinetest.New(t)` fixture; `WrapFirstData*` consumers; `ComponentState` mixin | | **Consumer interfaces** (`internal/consumer/`) | Per-signal `Metrics`/`Traces`/`Logs` push interfaces with `Capabilities() Capabilities` (MutatesData flag) | | **Fan-out** (`internal/fanout/`) | Per-signal multi-exporter consumers; mutable/readonly partition, donate-to-last-mutator, `errors.Join` aggregation; mirrors OTel v0.152.0 `fanoutconsumer` | | **safe.Call** (`internal/safe/`) | cgo / vendor-SDK panic-wrapper with named-op error tagging, ctx-respect, `panic(nil)` + `runtime.Goexit` handling | | **Config loader** (`internal/config/`) | YAML loader with line-numbered errors; rejects multi-doc YAML, trailing/multi-slash pipeline IDs, undefined refs, invalid instance names | | **Codegen** (`tools/components-gen/`, `components.yaml`) | Single source of truth for built-in components; `make generate` produces `cmd/tracecore/components.go`; `make generate-check` gates on freshness | | **Receiver** (`components/receivers/clockreceiver/`) | Canonical example: emits `tracecore.clock.now` gauge at configurable interval. M8+ receivers mirror this shape | | **Exporter** (`components/exporters/stdoutexporter/`) | Canonical example: writes one OTLP/JSON line per `ConsumeMetrics` to a configurable `io.Writer` | | **Binary** (`cmd/tracecore/`) | `tracecore collect --config=<path>` boots, builds pipelines bottom-up (exporters → fanout → processors reversed → first-data wrap → receivers), runs until SIGTERM/SIGINT, two-phase shutdown; sysexits.h exit codes | | **Operator UX** (cross-cutting) | Line-numbered YAML errors; named-op `safe.Call`; empty-pipeline boot logs once + idles; first-data log per pipeline; `pipelinetest.New(t)` fixture | | **Integration test** (`cmd/tracecore/integration_test.go`) | Builds the binary, runs with a real config, captures stdout/stderr separately, asserts exit 0 + JSON metrics + lifecycle log lines | | **Acceptance test** (`internal/pipeline/contract_test.go`) | Receiver-author end-to-end story: lifecycle, data flow, first-data log, two-phase shutdown with real component types | | **RFC-0003** | Pipeline runtime + component contract (accepted; Implementation notes record all design deviations) | | **RFC-0004** | clockreceiver + stdoutexporter with Option C adoption (`Capabilities()`, fan-out, `ComponentState`, package-var factory) | | **`docs/STRATEGY.md`** | Long-term repo posture: "OTel-compatible by default. Every divergence is deliberate and documented." Divergence-from-OTel table | | **`docs/research/otel-graph-notes.md`** | Findings from reading OTel v0.152.0's `service/internal/graph` + `testcomponents` + `fanoutconsumer` | | **Receiver-author quickstart** (`internal/pipeline/README.md`) | 6-file layout, package-var Factory, `ComponentState` embedding, `Capabilities`, `pipelinetest.New`, registration, pitfalls (each tied to a `clockreceiver.go` line) | | **`CHANGELOG.md`** | `[Unreleased]` section documenting all user-visible additions and changes | | **`Makefile`** | `generate` + `generate-check` targets; merged with `origin/main`'s `tidy-check`, `check`, `hooks` (PRs #10/#11) | | **`.gitignore`** | Loop-runtime state, claude-mem project DB, local-only personal config files (`*.local.*`), pprof, Goland debug bins | ## Linked issue(s) N/A — foundation work establishing the keystone M8+ receivers plug into. ## Release notes ```release-notes [FEATURE] M1 pipeline runtime + first canonical components. `tracecore collect --config=<path>` boots, runs clockreceiver → stdoutexporter pipelines via factory-based assembly, performs two-phase shutdown on SIGTERM. Adds RFC-0003 (contract) and RFC-0004 (clockreceiver + stdoutexporter). ``` ## Checklist - [x] Tests added or updated — unit, acceptance (`contract_test.go`), and integration (`integration_test.go`) - [x] `make ci` passes locally - [x] Commits are signed off (`git commit -s`) — DCO verified on all 41 commits - [x] Commits cryptographically signed (SSH-based) - [x] New components follow `STYLE.md` §Component layout (clockreceiver + stdoutexporter each ship the 6 files) --------- Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
A+ criteria batch: - TestReceiver_OTLPRoundTrip + _TraceContext (criterion #9 — one CI-exercised backend). Uses plog.ProtoMarshaler / ProtoUnmarshaler so we don't pull grpc into go.sum. Pins body, severity, kernelevents.xid/gpu.id/sequence/source attributes, AND TraceID/SpanID survival across the OTLP/protobuf wire format every OTel-compatible exporter consumes. - README backend compat matrix replaced with the OTLP-round-trip evidence row; the rest of the rows are now framed as "expected via OTLP exporter" instead of unbacked "works". - README SLI/SLO table swaps the TBD aspirations for measured numbers from BenchmarkReceiver_KmsgOnly on M4 Pro: ~400 k events/ sec, p99 580 µs, <1 MiB heap. NORTHSTARS budgets retained as upper bounds. CPU% remains TBD (Carry-forward). - RUNBOOK.md gains a "First 15 minutes" walkthrough: install → verify CAP_SYSLOG → synthesize an Xid 79 line → see it surface in the backend with the expected attribute set. Reproducible by an SRE with no prior tracecore knowledge. - .github/workflows/kernelevents-integration.yml stages the Linux- only integration suite + race-stability `-count=10` run. CAP_SYSLOG is granted to the test binary via `setcap` so /dev/kmsg opens. Bench results upload as a workflow artifact. - MILESTONES.md M9 row trimmed to two Carry-forward bullets (criterion #11): real-host CPU% measurement + first green run of the new workflow. Six previously-Carry-forward items moved to FOLLOWUPS with explicit `Revisit if:` triggers. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Address items #10 (deprecation policy for self-metric names) and #11 (JSON content negotiation on /healthz + /readyz). JSON probes: - /healthz + /readyz now honor `Accept: application/json` and return `{"status":"ok|ready|not ready|shutting down"}` with Content-Type application/json. Default (no Accept or Accept: text/plain) stays text/plain so k8s probes + curl are unchanged. - Hand-rolled JSON to keep encoding/json off the hot scrape path; status strings are operator-set constants so no quoting risk. - Cache-Control: no-store applies to both content types. - Test pins six cases covering both paths, with/without Accept, malformed Accept, and the text-beats-json precedence case. Deprecation policy (RFC-0006 new section): - Pre-1.0 (M2–M7): tracecore reserves the right to rename without procedure; the surface is still settling. CHANGELOG calls out any rename as "no bridge window: pre-1.0 contract." - Post-1.0: any rename gets an RFC, then a bridge window (≥1 minor release emitting BOTH names simultaneously) backed by a `tracecore_deprecated_metric{old_name, new_name}` counter so operators see dashboards still asking for the old name. Removal after the bridge window, flagged loudly in `[Removed]` + release-notes `[CHANGE]`. This is the documented commitment operators alerting on these metrics need before they take a hard dependency. Coverage: internal/telemetry 84.4% → 86.9% via new probe tests. make ci clean. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Reviewer 1 (R3) flagged calibration drift in M9-AGRADE-GAP.md (self ≈ 4.35 vs reviewer-anchored ≈ 4.0) plus tractable items that close all remaining lens-below-4.0 gaps. This commit closes all of them. R2.S4 — pr-validation.yml's no-AI-co-author job now ALSO scans the PR body for AI-attribution patterns (🤖, Generated with Claude/Copilot, Generated by GPT/Gemini/Claude). The commit- trailer scan can't see body content; squash-merge folds the body into the merge commit, so AGENTS.md's no-AI-mention rule needs enforcement at both layers. FOLLOWUPS #8 closed. R3.N1 — config.go:17 stale RFC-0005 reference. The Commit A grep walked .md files only; this round greps .go too and caught three more stale references (config.go:17 + config.go:56 + kernelevents.go:258). All updated to RFC-0007; the deliberate breadcrumb in doc.go:23 remains as renumber history. P2 — TestSourceTemplate_ParsesAsGo. Parser-based gate on source_template.go.example confirms the file stays valid Go syntax and retains its scaffolding identifiers (yourSource, YourConfig, sourceYour, Start/Shutdown/run signatures). Parse- only, no type checking, because the template references internal-package types a copy-paste author renames; full type check would defeat the rename safety. The class of bug an author would actually hit (syntax error, broken func signature, deleted scaffolding identifier) is caught. Self-score recalibration in M9-AGRADE-GAP.md: now reports both self (4.35) and reviewer-anchored (4.05) composites, with explicit note that A/A- is the honest read until C8 (independent reviewer) closes. Per the honest-review-pushback rule, when the author scores above the reviewer, the reviewer's number is the one a third party should size effort against. FOLLOWUPS #11 added: AST walker `resolveIncErrorCall` handles SelectorExpr + Ident but not CallExpr cast form (`selftelemetry.Kind("foo")`). Zero live call sites today (grep-verified); future-tracked. R3 nit on TestRUNBOOK_KindsMatchEmitted_PinsExpectedSet docstring: updated to direct future authors to update `want` DELIBERATELY (i.e. only after writing a RUNBOOK row), not reflexively to silence the gate. Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <trilamsr@gmail.com>
Closed
11 tasks
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Eight 1-page pattern-design specs covering #2 IB link flap, #7 dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout, #10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss spike NaN, #13 silent data corruption. Each carries the standard detector-design shape (symptom, layers, signal sources, evaluation rule, verdict attrs, edge cases, status, open questions) so the next contributor can write a TDD red test directly off the spec. Status: all 8 marked planned. #10 already has issue #303; the spec frames the design alongside. NORTHSTARS Appendix A gains a Spec column; docs/README + patterns README link the new specs. Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Ships **pattern #11 (Checkpointer hang)** detector end-to-end, per `docs/patterns/11-checkpointer-hang.md`. Surfaces training-step stalls that happen inside a checkpoint write (the operator-meaningful failure: "step duration jumped from 30s to 600s on the periodic checkpoint step and never finished"). - **Detector library** (`module/pkg/patterns/checkpointer_hang.go`): joins `TrainingStepStallRecord` + same-pod `CheckpointPhaseRecord` (plan/write/barrier, no matching complete) inside a 90s symmetric correlation window; optional same-node `StoragePressureRecord` promotes `confidence` partial → full. Discriminator vs pattern #7 (dataloader hang): the checkpoint-phase signal — a stall WITHOUT a concurrent in-progress checkpoint is skipped (that's #7/#6 territory). False-positive guards per spec: BarrierTolerance (30s) for normal rank skew, StallThreshold (3min) floor for step jitter, completed-checkpoint suppressor. - **Processor wiring** (`module/processor/patterndetectorprocessor/checkpointer_hang.go`): projects `gen_ai.training.step_duration_seconds` (stall), `tracecore.alert.checkpoint.phase` + path (checkpoint phase), `tracecore.alert.storage_pressure.backend` (pressure) off log records; emits one verdict per match with promoted scalars per issue #270 contract. - **Schema** (`testdata/checkpointer_hang_verdict.schema.json`): pinned by 1 conformance + 10 drift-rejection falsifier tests. - **Config**: three new YAML knobs (`checkpointer_hang_{correlation_window,barrier_tolerance,stall_threshold}`) with sub-1s Validate guards. - **Docs**: spec status flipped to shipped; `docs/patterns/README.md` updated; 8 new attribute keys registered in `docs/ATTRIBUTES.md` (attribute-namespace-check stays clean at 75/75). ```release-notes feat(patterns): ship pattern #11 (Checkpointer hang) detector — joins training-step stalls with same-pod checkpoint-phase records (plan/write/barrier) inside a 90s window; same-node storage-pressure record promotes confidence partial → full. Three YAML knobs surfaced. Discriminator vs pattern #7 (dataloader hang): the in-progress checkpoint-phase signal. ``` ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/...` — all green - [x] `cd module && go vet ./...` — clean - [x] `cd module && go test ./...` — all module tests green - [x] `make doc-check` — clean (links, comment-noise, integration recipes) - [x] `make attribute-namespace-check` — clean (75/75 attributes documented) - [x] Pre-commit hooks (license, license-fix, gofumpt, golangci-lint, govulncheck, doc-check, attribute-namespace-check, no-autoupdate, DCO, AI-trailer) — all pass - [x] 14 patterns-lib tests cover: positive full (write + pressure + Lustre), positive partial (no pressure), negative stall-without-checkpoint (pattern-#7 discriminator), negative completed-checkpoint suppressor, BarrierTolerance guard, StallThreshold guard, cross-pod isolation, out-of-window, post-stall window-tolerant join, multi-rank cohort, deterministic order, 6-case storage-backend inference (lustre/fsx/weka/nfs/s3/unknown), cross-node pressure ignored, window-configurable - [x] 4 wiring tests cover: full verdict + promoted scalars, partial (no pressure), suppression with `emit_partial_verdicts=false`, silent-on-stall-without-checkpoint - [x] 10-falsifier schema drift battery + 1 conformance assertion - [x] 3-row Validate guard battery for the new YAML knobs ## Files - `module/pkg/patterns/checkpointer_hang.go` (+434) - `module/pkg/patterns/checkpointer_hang_test.go` (+584) - `module/pkg/patterns/testdata/checkpointer_hang_verdict.schema.json` (new) - `module/processor/patterndetectorprocessor/checkpointer_hang.go` (+292) - `module/processor/patterndetectorprocessor/checkpointer_hang_test.go` (+289) - `module/processor/patterndetectorprocessor/{config.go,patterndetector.go,example_config.yaml}` (knobs + wire-up + example) - `docs/patterns/{11-checkpointer-hang.md,README.md}` (status flip) - `docs/ATTRIBUTES.md` (+8 attributes registered) --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary - Ships **pattern #7 (dataloader hang)** end-to-end: detector library, JSON-schema verdict shape, processor wiring, integration tests, and docs flip from `☐ planned` to `☑ shipped`. - Conservative two-discriminator rule per `docs/patterns/07-dataloader-hang.md`: a stall alone never fires; the verdict requires *either* a same-pod `dataloader.error_class` log (`worker_killed`) *or* a same-node `FailedMount` / `VolumeMountFailure` Kubernetes Event (`storage_event`) within the correlation window. - Three false-positive guards land with the detector: step-0/1 warmup-prefetch, `phase=="eval"` validation pauses, and sub-threshold cadence variance. ```release-notes patterns: pattern #7 (dataloader hang) detector ships with worker_killed + storage_event discriminators, conservative warmup / eval-phase / sub-threshold guards, and a pinned JSON-schema verdict shape. ``` ## Tests ``` cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -run DataLoader ``` - **Library** (`module/pkg/patterns/dataloader_hang_test.go`, 14 cases): canonical worker-killed + storage-event triggers, the three false-positive guards (warmup, eval, sub-threshold), per-pod / per-node join scopes, window semantics, discriminator priority when both layers join, configurable threshold, deterministic ordering, schema conformance + 10 drift-rejection falsifiers. - **Processor** (`module/processor/patterndetectorprocessor/dataloader_hang_test.go`, 5 cases): worker-killed wiring with promoted scalars, storage-event wiring, stall-alone-no-fire conservative-guard wiring, YAML threshold knob, sub-second validation rejection. - `make check` clean (gofumpt, golangci-lint 0 issues, go vet, go mod verify, attribute-namespace-check 73/73 documented). ## Schema `module/pkg/patterns/testdata/dataloader_hang_verdict.schema.json` — pins: - `pattern.id` const `"7"` - `discriminator` enum `{worker_killed, storage_event}` - `confidence` enum `{full, partial}` (v1 emits only `full`; `partial` reserved for the future py-spy degraded path per spec Open Q#2) - `tracecore.alert.dataloader_hang.stall_seconds` integer ≥ 0 - `evidence_trail.kind` enum `{training_step_stall, dataloader_error, storage_event}` with `additionalProperties: false` ## Spec ambiguity choices (per PRINCIPLES.md: maximize operator UX, minimize false positives) - **Open Q#2 (py-spy degraded path)**: deferred. Spec notes the py-spy / parca-agent receiver is undecided at v1 (RFC-0013 §7 slates pyspy for delete; parca-agent contract not yet authored). v1 detector emits `confidence=full` only when a discriminator joins; `confidence=partial` is reserved in the schema enum so we can land the py-spy branch additively. Trade-off: stalls without storage / worker evidence are silent — operators triage off pattern #6 (stragglers) and pattern #11 (checkpointer) instead. - **Open Q#1 (step-progress signal source)**: detector consumes a `tracecore.alert.training_step_stalled.no_progress_seconds` bridge attribute, mirroring the pattern-5 `tracecore.alert.pcie_rate_collapse.*` shape. OTTL recipe wire-up follows the metrics→logs convention from RFC-0014; the recipe itself is the integration-gap follow-up (out of scope for the detector PR). - **Open Q#3 (FUSE error vocabulary)**: deferred to the OTTL recipe — detector reads a single `dataloader.error_class` string; the recipe owns per-driver regex stanzas. - **Open Q#4 (`gen_ai.training.phase` upstream)**: detector reads it as a tracecore-ext alpha attribute; documented in `docs/ATTRIBUTES.md`. Switches to upstream-semconv when O4 lands it. - **Discriminator priority**: when both a same-pod worker-killed log *and* a same-node storage event join the stall, `worker_killed` wins — pod-scoped is more precise (storage events on the node may be unrelated to the failing pod's mount). Pinned by `TestDataLoaderHangDetector_PrefersWorkerKilledOverStorage`. ## Closes None — no GitHub issue exists for pattern #7. Closes the spec-vs-detector gap from `docs/patterns/07-dataloader-hang.md` (status flips from `☐ planned` to `☑ shipped`). --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but the join shape varies across patterns and the rationale lived only in code comments + PR review threads. Operators tuning windows had to read source per detector. Audit finding: five distinct shapes are load-bearing (chosen by the causal physics of each signal), not bugs: - One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect. - Asymmetric two-sided (#11): pre-stall covers concurrent-start checkpoints; post-stall covers OTTL-bridge logger latency. - Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window could be cause OR consequence. - Job-window bounded (#13): SDC counter rise must fall in the bounded eval-cycle's owning job; no operator knob is meaningful. - Trailing-window rate / freshness (#2 #4 #8): rolling window anchored at `now` or the most-recent record. Decision: document the existing reality, do not converge. Forcing every detector to the asymmetric two-knob form would silently zero one leg for the one-sided detectors (footgun on clock skew) and would not apply to #13 at all. Adds: - 'Why this correlation shape' section in docs/patterns/07, 11, 13 (the three shapes the issue called out by name). - 'Correlation-window semantics' table in docs/patterns/README.md covering ALL 11 detectors with the predicate, anchor, and shape rationale, plus cross-links to the per-pattern sections. No code changes; no detector behavior changes. Closes #367. Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Three v1 detectors use three different correlation-window shapes; the rationale lived only in code comments and review threads. Operators tuning windows hit these without warning. - pattern #7 (dataloader_hang): one-sided look-back — strict cause→symptom - pattern #11 (checkpointer_hang): asymmetric (-30s, +60s) — log/record race - pattern #13 (silent_data_corruption): symmetric job window — order-free attribution ## Decision Document the existing reality. Each shape matches a distinct physical event-ordering. A unified asymmetric two-knob form (issue suggestion) would silently zero one leg for #7 and not apply to #13 — operators would tune knobs with no physical meaning for their pattern. ## Changes - "Why this correlation shape" subsection in each of `docs/patterns/07-dataloader-hang.md`, `11-checkpointer-hang.md`, `13-silent-data-corruption.md` - "Correlation-window semantics" comparison table in `docs/patterns/README.md` cross-linking the three rationales - No code changes; no detector behavior changes ## Test plan - [x] `golangci-lint`, `go vet`, `attribute-namespace-check` (pre-push hook) — green - [x] Each pattern doc cross-links to the comparison table - [x] Each comparison-table row links back to the per-pattern rationale section Closes #367. ```release-notes docs: document correlation-window semantics for patterns #7, #11, #13 — three shapes (one-sided, asymmetric, symmetric) match distinct physical event-orderings; no behavior change. ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Five small, durable additions that compound for the life of the project — the developer-experience foundation that catches problems locally before they hit CI, and makes the first contributor's first commit the least badly-formatted instead of the most.
.githooks/suite +make hooks+scripts/setup-dev.sh— three opt-in Git hooks installed by one command:pre-commitrunsmake check(fmt, tidy-check, lint, race-tested tests).commit-msgenforces ≤72-char subject + DCOSigned-off-by:(matches the rules CI gates on). Skips git-managed shapes (Merge, fixup!, squash!, amend!).pre-pushrunsmake ci— the full license + vet + lint + tidy + test + build gate, locally, before the network round-trip.scripts/setup-dev.shis the one-shot bootstrap (idempotent):go mod download+make hooks+ prints what each hook does.make hooksremains available for users who already downloaded modules. Bypass any hook with--no-verify; CI is still the final authority..github/ISSUE_TEMPLATE/(bug, feature, config) — structured templates plus aconfig.ymlthat disables blank issues and routes architectural changes todocs/rfcs/, security reports to private advisories, and questions to Discussions..editorconfig— locks down charset / EOL / indent for the files gofumpt doesn't cover (Markdown, YAML, Makefile, go.mod/go.sum). Write-once, no maintenance..github/PULL_REQUEST_TEMPLATE.mdpolish — drops the "Keep PRs under 500 lines" line that contradicts the just-relaxed one-concern rule; surfacesmake checkalongsidemake ci; adds the "PR title and Summary still reflect the current diff" checklist item the/fprskill §9 now enforces.CONTRIBUTING.md— dev quick-reference now points atsetup-dev.shand documents the three hooks together.Linked issue(s)
Refs #10 — depends on #10 (the pre-commit hook calls
make check, introduced there; the pre-push hook callsmake ci, which #10 also updated). Branched off #10's tip; targetingmain. Once #10 merges, GitHub auto-rebases this PR and the diff collapses to only this PR's files.Changes
.githooks/pre-commit(new, executable) —exec make check..githooks/commit-msg(new, executable) — subject length + DCO sign-off check..githooks/pre-push(new, executable) —exec make ci.scripts/setup-dev.sh(new, executable) — idempotent bootstrap.Makefile—hookstarget..editorconfig(new)..github/ISSUE_TEMPLATE/config.yml(new) — blank issues off; contact links for RFC / security / Discussions..github/ISSUE_TEMPLATE/bug_report.md(new)..github/ISSUE_TEMPLATE/feature_request.md(new)..github/PULL_REQUEST_TEMPLATE.md— line-cap removed,make checkand title-sync items added to checklist.CONTRIBUTING.md—setup-dev.sh+ three-hook documentation.Test plan
scripts/setup-dev.shruns idempotently and reports hook installationmake hooksinstalls and reports the hook directory is activegit commitoutput)eb1b170(fullmake ciran green before the network call)make checkgreen;make cigreen (includestidy-checkafter Tighten developer and PR feedback loops #10's stderr-redirect fix)Recent fixes (2026-05-13)
scripts/setup-dev.shineb1b170— closes the gap betweenmake checkand CI by running the full gate atgit pushtime.CONTRIBUTING.mdquick-reference to surfacesetup-dev.shand document all three hooks together./fpr§9).Checklist
make checkruns green continuously while editing;make cipasses before pushinggit commit -s)Assisted-by:trailer (seeAGENTS.md)STYLE.md(N/A)