Skip to content

Add developer foundation: Git hooks suite, issue templates, editorconfig#11

Merged
trilamsr merged 7 commits into
mainfrom
chore/dev-foundation
May 13, 2026
Merged

Add developer foundation: Git hooks suite, issue templates, editorconfig#11
trilamsr merged 7 commits into
mainfrom
chore/dev-foundation

Conversation

@trilamsr

@trilamsr trilamsr commented May 13, 2026

Copy link
Copy Markdown
Contributor

What this PR does

Five small, durable additions that compound for the life of the project — the developer-experience foundation that catches problems locally before they hit CI, and makes the first contributor's first commit the least badly-formatted instead of the most.

  • .githooks/ suite + make hooks + scripts/setup-dev.sh — three opt-in Git hooks installed by one command:

    • pre-commit runs make check (fmt, tidy-check, lint, race-tested tests).
    • commit-msg enforces ≤72-char subject + DCO Signed-off-by: (matches the rules CI gates on). Skips git-managed shapes (Merge, fixup!, squash!, amend!).
    • pre-push runs make ci — the full license + vet + lint + tidy + test + build gate, locally, before the network round-trip.

    scripts/setup-dev.sh is the one-shot bootstrap (idempotent): go mod download + make hooks + prints what each hook does. make hooks remains available for users who already downloaded modules. Bypass any hook with --no-verify; CI is still the final authority.

  • .github/ISSUE_TEMPLATE/ (bug, feature, config) — structured templates plus a config.yml that disables blank issues and routes architectural changes to docs/rfcs/, security reports to private advisories, and questions to Discussions.

  • .editorconfig — locks down charset / EOL / indent for the files gofumpt doesn't cover (Markdown, YAML, Makefile, go.mod/go.sum). Write-once, no maintenance.

  • .github/PULL_REQUEST_TEMPLATE.md polish — drops the "Keep PRs under 500 lines" line that contradicts the just-relaxed one-concern rule; surfaces make check alongside make ci; adds the "PR title and Summary still reflect the current diff" checklist item the /fpr skill §9 now enforces.

  • CONTRIBUTING.md — dev quick-reference now points at setup-dev.sh and documents the three hooks together.

Linked issue(s)

Refs #10 — depends on #10 (the pre-commit hook calls make check, introduced there; the pre-push hook calls make ci, which #10 also updated). Branched off #10's tip; targeting main. Once #10 merges, GitHub auto-rebases this PR and the diff collapses to only this PR's files.

Changes

  • .githooks/pre-commit (new, executable) — exec make check.
  • .githooks/commit-msg (new, executable) — subject length + DCO sign-off check.
  • .githooks/pre-push (new, executable) — exec make ci.
  • scripts/setup-dev.sh (new, executable) — idempotent bootstrap.
  • Makefilehooks target.
  • .editorconfig (new).
  • .github/ISSUE_TEMPLATE/config.yml (new) — blank issues off; contact links for RFC / security / Discussions.
  • .github/ISSUE_TEMPLATE/bug_report.md (new).
  • .github/ISSUE_TEMPLATE/feature_request.md (new).
  • .github/PULL_REQUEST_TEMPLATE.md — line-cap removed, make check and title-sync items added to checklist.
  • CONTRIBUTING.mdsetup-dev.sh + three-hook documentation.

Test plan

  • scripts/setup-dev.sh runs idempotently and reports hook installation
  • make hooks installs and reports the hook directory is active
  • Pre-commit hook fired automatically during both commits on this branch (visible in git commit output)
  • Commit-msg hook tested on 4 cases: valid passes, 73-char subject rejected, missing DCO rejected, merge-commit exempted
  • Pre-push hook fired automatically during the push of eb1b170 (full make ci ran green before the network call)
  • make check green; make ci green (includes tidy-check after Tighten developer and PR feedback loops #10's stderr-redirect fix)
  • CI on PR fully green
  • After Tighten developer and PR feedback loops #10 merges, this PR's diff narrows to just the new files
NONE

Recent fixes (2026-05-13)

  • Added: commit-msg + pre-push hooks and scripts/setup-dev.sh in eb1b170 — closes the gap between make check and CI by running the full gate at git push time.
  • Updated: CONTRIBUTING.md quick-reference to surface setup-dev.sh and document all three hooks together.
  • Updated: PR title from "pre-commit hook, issue templates, editorconfig" to "Git hooks suite, issue templates, editorconfig" to reflect the expanded scope (per /fpr §9).

Checklist

  • Tests added or updated (N/A — config/template/script-only)
  • make check runs green continuously while editing; make ci passes before pushing
  • Commits are signed off (git commit -s)
  • If AI-assisted, commit message includes Assisted-by: trailer (see AGENTS.md)
  • For new components, follows the layout required by STYLE.md (N/A)
  • PR title and Summary still reflect the current diff

trilamsr added 4 commits May 13, 2026 12:44
Tighten the developer feedback loop and codify what "good" looks like
so M1 sets a pattern future packages copy rather than diverge from.

- Makefile: new `check` target (fmt + tidy-check + lint + test) for the
  inner loop, complementing `ci` (license-check + vet + lint + test +
  build) which remains the pre-PR gate. New `tidy-check` runs `go mod
  tidy -diff` non-mutatingly so dependency drift fails fast.
- go.sum: 54 transitive-checksum additions surfaced by tidy-check on
  first run. Bundled here so the new gate lands green; no source change.
- AGENTS.md: new "Exemplar packages" section names internal/pipeline/
  (lands with M1) as the canonical runtime shape and lists the rubric
  (layout, doc comments, error wrapping, slog discipline, concurrency
  contract, table tests, no speculative abstractions) every package
  must satisfy to qualify as an exemplar. Verification section now
  distinguishes `make check` (continuous) from `make ci` (pre-PR).
- CONTRIBUTING.md: dev quick-reference surfaces `make check`.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Five reinforcing tweaks surfaced by the line-by-line review:

- AGENTS.md: rubric body becomes pointers into STYLE.md (Errors,
  Logging, Concurrency, Tests, Layout). STYLE.md is the source of
  truth; the rubric is now the checklist of which sections apply, not
  a restatement that drifts. Also: "Lands with M1" softened to
  "designated once M1 clears the rubric"; component path spelled out
  to match STYLE.md verbatim; "no hidden dependencies on main"
  reworded to "no reaching into cmd/tracecore globals"; "Two audiences"
  -> "Two cadences".
- Makefile: tidy-check is now part of `ci`, not just `check`, so CI
  itself catches go.sum drift instead of relying on the contributor
  to remember the inner-loop command.
- .github/workflows/pr-validation.yml: release-notes block skip
  pattern extended from docs/CI/tests to also cover Makefile,
  go.mod/go.sum, scripts/, .golangci.yml, .go-version, dotfiles — the
  paths with no user-visible runtime behavior.
- CONTRIBUTING.md / AGENTS.md: "one concern per PR" rule relaxed to
  allow prerequisite cleanup that the change itself surfaces (e.g., a
  tidy-drift fix needed for a new gate to land green), provided it's
  called out in the PR body. This commit is itself such a case.
- .claude/skills/fpr/SKILL.md: §9 expanded — when responding to
  review, also re-check the PR title and Summary/Changes sections
  against the current diff and rewrite them if scope shifted. Don't
  just append "Recent fixes" while the body stays stale.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
`go mod tidy -diff` writes the actual diff to stdout, but in fresh
environments (CI on a clean checkout) it also writes "go: downloading
X" messages to stderr while fetching the module info it needs to
compute the diff. Capturing both with `2>&1` made the variable
non-empty whenever any download chatter appeared, which falsely failed
the check on CI even when go.mod / go.sum were perfectly tidy.

Locally the module cache silenced stderr, so the bug was invisible.
Caught by CI's `verify` job on this same PR after `tidy-check` was
added to the `ci` target.

Drop the `2>&1` so only stdout (the real diff) decides pass/fail; let
stderr flow to the terminal for visibility but not into the decision.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Four small, durable additions that compound for the life of the
project, bundled per the relaxed one-concern-with-prerequisites rule:

- .githooks/pre-commit + `make hooks` target: opt-in pre-commit hook
  that runs `make check` on every commit. Installation is one command
  (`make hooks` sets core.hooksPath to .githooks). Bypass with
  --no-verify; CI still catches what gets skipped. Encodes the inner
  loop as a default, not a discipline contributors have to remember.

- .github/ISSUE_TEMPLATE/{bug_report.md,feature_request.md,config.yml}:
  structured bug and feature templates. config.yml disables blank
  issues and routes architectural changes to docs/rfcs/, security
  reports to private advisories, and questions to Discussions. Makes
  the first external issue the *least* badly-formatted instead of
  the most.

- .editorconfig: charset, EOL, indent rules for Go (tabs), Makefile
  (tabs), go.mod/go.sum (tabs), Markdown (preserve trailing
  whitespace for hard breaks), everything else (2 spaces). Locks
  down decisions gofumpt doesn't cover so YAML/Markdown contributors
  don't accidentally drift.

- .github/PULL_REQUEST_TEMPLATE.md polish: drops the "Keep PRs under
  500 lines" line that contradicts the just-relaxed one-concern rule;
  adds `make check` to the checklist alongside `make ci`; adds the
  "PR title and Summary still reflect the current diff" item that
  the /fpr skill §9 now enforces.

- CONTRIBUTING.md: dev quick-reference surfaces `make hooks`.

Depends on PR #10 (the pre-commit hook calls `make check`, which #10
introduces). Diff will collapse once #10 merges.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr force-pushed the chore/dev-foundation branch from 60d8223 to b0fac1f Compare May 13, 2026 20:11
Round out the .githooks/ suite introduced earlier in this PR so the
moment of truth for whether a change passes is `git push`, not the CI
queue. Three complementary gates:

- .githooks/commit-msg: enforces the two rules CI checks per-commit —
  ≤72-char subject (STYLE.md §Commits) and DCO Signed-off-by trailer
  (matches .github/scripts/dco-check.sh). Skips git-managed shapes
  (Merge, fixup!, squash!, amend!) where the rules don't apply. Catches
  the exact failure mode this branch hit twice today: forgetting -s.

- .githooks/pre-push: runs `make ci` so the full gate (license, vet,
  lint, tidy, test with race, build) fires before the network
  round-trip. Closes the gap between pre-commit's lighter `make check`
  and CI's full sweep, which is the gap that surfaced this PR's
  tidy-check stderr bug (caught only after pushing).

- scripts/setup-dev.sh: idempotent one-shot bootstrap — runs `go mod
  download` and `make hooks`, then prints the hook map. Replaces the
  two-step manual setup in CONTRIBUTING.md with one command, while
  `make hooks` stays available for users who already downloaded
  modules.

- CONTRIBUTING.md: dev quick-reference now points at setup-dev.sh and
  documents the three hooks together.

All three hooks honour `--no-verify` for emergencies. CI remains the
final authority; locally enforced is a velocity multiplier, not a
gate.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr changed the title Add developer foundation: pre-commit hook, issue templates, editorconfig Add developer foundation: Git hooks suite, issue templates, editorconfig May 13, 2026
trilamsr added 2 commits May 13, 2026 13:34
Five small tightenings:

- Makefile: `make hooks` output now lists all three hooks (was stale,
  only mentioned pre-commit). Anyone running `make hooks` directly
  (not through setup-dev.sh) gets an accurate picture.
- scripts/setup-dev.sh: pre-flight check for `go` on PATH so the
  failure mode is a clear error message with install pointer, not a
  confusing `go: command not found` from `go mod download`. Also
  trimmed the "Downloading Go modules and tools" message to just
  "modules" since tools are pulled on-demand by `go tool`, not by
  `go mod download`.
- .githooks/commit-msg: clarified that the DCO check mirrors
  .github/scripts/dco-check.sh and notes future drift risk.
- .github/PULL_REQUEST_TEMPLATE.md: replaced the bare "Fixes #"
  placeholder (which rendered verbatim if a contributor forgot to
  edit it) with an explicit "_No linked issue._" default and clearer
  instructions to either replace or delete the section.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
PR #10 squash-merged to main (3027c1e), bringing PR #10's polished
ordering (`make ci` reorder, fpr SKILL.md §9 body-edit rewrite,
release-notes skip pattern dropping `tests/*`, CONTRIBUTING.md
`tidy-check` consistency) into main as one commit. PR #11's branch
still carried the unpolished individual commits, so the same lines
were edited on both sides. Resolved by taking main's polished
versions in those four files; PR #11's own additions (`hooks` target,
hooks docs block in CONTRIBUTING.md) are preserved.

Per workflow-no-stacked-prs memory: merging main in instead of
rebasing keeps the push non-force.

Conflicts resolved:
- Makefile .PHONY: keep PR #11's `hooks`. ci order: take main's
  polished `license-check vet tidy-check lint test build`.
- CONTRIBUTING.md: keep PR #11's hooks documentation block, but use
  main's polished `make ci` line (`tidy-check`, not `tidy`).
- .github/workflows/pr-validation.yml: take main's polished case —
  `scripts/*` with comment, no dead `tests/*`.
- .claude/skills/fpr/SKILL.md: take main's §9 rewrite (separate
  /tmp/pr_title.txt and /tmp/pr_body.md with explicit edit step).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr enabled auto-merge (squash) May 13, 2026 20:44
@trilamsr trilamsr merged commit dd90bb7 into main May 13, 2026
9 checks passed
@trilamsr trilamsr deleted the chore/dev-foundation branch May 13, 2026 20:45
trilamsr added a commit that referenced this pull request May 13, 2026
## What this PR does

M1 ships the pipeline runtime contract; M1.5 proves it works end-to-end
with one canonical receiver, one canonical exporter, and an integration
test that runs the binary.

| Area | What lands |
|---|---|
| **Pipeline contract** (`internal/pipeline/`) | `Component`, `Host`,
per-signal `Factory` interfaces; `Runtime` with two-phase shutdown (1s
receivers + operator-configurable drain budget); `pipelinetest.New(t)`
fixture; `WrapFirstData*` consumers; `ComponentState` mixin |
| **Consumer interfaces** (`internal/consumer/`) | Per-signal
`Metrics`/`Traces`/`Logs` push interfaces with `Capabilities()
Capabilities` (MutatesData flag) |
| **Fan-out** (`internal/fanout/`) | Per-signal multi-exporter
consumers; mutable/readonly partition, donate-to-last-mutator,
`errors.Join` aggregation; mirrors OTel v0.152.0 `fanoutconsumer` |
| **safe.Call** (`internal/safe/`) | cgo / vendor-SDK panic-wrapper with
named-op error tagging, ctx-respect, `panic(nil)` + `runtime.Goexit`
handling |
| **Config loader** (`internal/config/`) | YAML loader with
line-numbered errors; rejects multi-doc YAML, trailing/multi-slash
pipeline IDs, undefined refs, invalid instance names |
| **Codegen** (`tools/components-gen/`, `components.yaml`) | Single
source of truth for built-in components; `make generate` produces
`cmd/tracecore/components.go`; `make generate-check` gates on freshness
|
| **Receiver** (`components/receivers/clockreceiver/`) | Canonical
example: emits `tracecore.clock.now` gauge at configurable interval. M8+
receivers mirror this shape |
| **Exporter** (`components/exporters/stdoutexporter/`) | Canonical
example: writes one OTLP/JSON line per `ConsumeMetrics` to a
configurable `io.Writer` |
| **Binary** (`cmd/tracecore/`) | `tracecore collect --config=<path>`
boots, builds pipelines bottom-up (exporters → fanout → processors
reversed → first-data wrap → receivers), runs until SIGTERM/SIGINT,
two-phase shutdown; sysexits.h exit codes |
| **Operator UX** (cross-cutting) | Line-numbered YAML errors; named-op
`safe.Call`; empty-pipeline boot logs once + idles; first-data log per
pipeline; `pipelinetest.New(t)` fixture |
| **Integration test** (`cmd/tracecore/integration_test.go`) | Builds
the binary, runs with a real config, captures stdout/stderr separately,
asserts exit 0 + JSON metrics + lifecycle log lines |
| **Acceptance test** (`internal/pipeline/contract_test.go`) |
Receiver-author end-to-end story: lifecycle, data flow, first-data log,
two-phase shutdown with real component types |
| **RFC-0003** | Pipeline runtime + component contract (accepted;
Implementation notes record all design deviations) |
| **RFC-0004** | clockreceiver + stdoutexporter with Option C adoption
(`Capabilities()`, fan-out, `ComponentState`, package-var factory) |
| **`docs/STRATEGY.md`** | Long-term repo posture: "OTel-compatible by
default. Every divergence is deliberate and documented."
Divergence-from-OTel table |
| **`docs/research/otel-graph-notes.md`** | Findings from reading OTel
v0.152.0's `service/internal/graph` + `testcomponents` +
`fanoutconsumer` |
| **Receiver-author quickstart** (`internal/pipeline/README.md`) |
6-file layout, package-var Factory, `ComponentState` embedding,
`Capabilities`, `pipelinetest.New`, registration, pitfalls (each tied to
a `clockreceiver.go` line) |
| **`CHANGELOG.md`** | `[Unreleased]` section documenting all
user-visible additions and changes |
| **`Makefile`** | `generate` + `generate-check` targets; merged with
`origin/main`'s `tidy-check`, `check`, `hooks` (PRs #10/#11) |
| **`.gitignore`** | Loop-runtime state, claude-mem project DB,
local-only personal config files (`*.local.*`), pprof, Goland debug bins
|

## Linked issue(s)

N/A — foundation work establishing the keystone M8+ receivers plug into.

## Release notes

```release-notes
[FEATURE] M1 pipeline runtime + first canonical components. `tracecore collect --config=<path>` boots, runs clockreceiver → stdoutexporter pipelines via factory-based assembly, performs two-phase shutdown on SIGTERM. Adds RFC-0003 (contract) and RFC-0004 (clockreceiver + stdoutexporter).
```

## Checklist

- [x] Tests added or updated — unit, acceptance (`contract_test.go`),
and integration (`integration_test.go`)
- [x] `make ci` passes locally
- [x] Commits are signed off (`git commit -s`) — DCO verified on all 41
commits
- [x] Commits cryptographically signed (SSH-based)
- [x] New components follow `STYLE.md` §Component layout (clockreceiver
+ stdoutexporter each ship the 6 files)

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
A+ criteria batch:

- TestReceiver_OTLPRoundTrip + _TraceContext (criterion #9 — one
  CI-exercised backend). Uses plog.ProtoMarshaler /
  ProtoUnmarshaler so we don't pull grpc into go.sum. Pins body,
  severity, kernelevents.xid/gpu.id/sequence/source attributes,
  AND TraceID/SpanID survival across the OTLP/protobuf wire
  format every OTel-compatible exporter consumes.
- README backend compat matrix replaced with the OTLP-round-trip
  evidence row; the rest of the rows are now framed as "expected
  via OTLP exporter" instead of unbacked "works".
- README SLI/SLO table swaps the TBD aspirations for measured
  numbers from BenchmarkReceiver_KmsgOnly on M4 Pro: ~400 k events/
  sec, p99 580 µs, <1 MiB heap. NORTHSTARS budgets retained as
  upper bounds. CPU% remains TBD (Carry-forward).
- RUNBOOK.md gains a "First 15 minutes" walkthrough: install →
  verify CAP_SYSLOG → synthesize an Xid 79 line → see it surface
  in the backend with the expected attribute set. Reproducible by
  an SRE with no prior tracecore knowledge.
- .github/workflows/kernelevents-integration.yml stages the Linux-
  only integration suite + race-stability `-count=10` run. CAP_SYSLOG
  is granted to the test binary via `setcap` so /dev/kmsg opens.
  Bench results upload as a workflow artifact.
- MILESTONES.md M9 row trimmed to two Carry-forward bullets
  (criterion #11): real-host CPU% measurement + first green run
  of the new workflow. Six previously-Carry-forward items moved to
  FOLLOWUPS with explicit `Revisit if:` triggers.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Address items #10 (deprecation policy for self-metric names) and
#11 (JSON content negotiation on /healthz + /readyz).

JSON probes:
- /healthz + /readyz now honor `Accept: application/json` and
  return `{"status":"ok|ready|not ready|shutting down"}` with
  Content-Type application/json. Default (no Accept or Accept:
  text/plain) stays text/plain so k8s probes + curl are unchanged.
- Hand-rolled JSON to keep encoding/json off the hot scrape path;
  status strings are operator-set constants so no quoting risk.
- Cache-Control: no-store applies to both content types.
- Test pins six cases covering both paths, with/without Accept,
  malformed Accept, and the text-beats-json precedence case.

Deprecation policy (RFC-0006 new section):
- Pre-1.0 (M2–M7): tracecore reserves the right to rename without
  procedure; the surface is still settling. CHANGELOG calls out
  any rename as "no bridge window: pre-1.0 contract."
- Post-1.0: any rename gets an RFC, then a bridge window (≥1 minor
  release emitting BOTH names simultaneously) backed by a
  `tracecore_deprecated_metric{old_name, new_name}` counter so
  operators see dashboards still asking for the old name. Removal
  after the bridge window, flagged loudly in
  `[Removed]` + release-notes `[CHANGE]`.

This is the documented commitment operators alerting on these
metrics need before they take a hard dependency.

Coverage: internal/telemetry 84.4% → 86.9% via new probe tests.
make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Reviewer 1 (R3) flagged calibration drift in M9-AGRADE-GAP.md
(self ≈ 4.35 vs reviewer-anchored ≈ 4.0) plus tractable items
that close all remaining lens-below-4.0 gaps. This commit
closes all of them.

R2.S4 — pr-validation.yml's no-AI-co-author job now ALSO scans
the PR body for AI-attribution patterns (🤖, Generated with
Claude/Copilot, Generated by GPT/Gemini/Claude). The commit-
trailer scan can't see body content; squash-merge folds the
body into the merge commit, so AGENTS.md's no-AI-mention rule
needs enforcement at both layers. FOLLOWUPS #8 closed.

R3.N1 — config.go:17 stale RFC-0005 reference. The Commit A
grep walked .md files only; this round greps .go too and
caught three more stale references (config.go:17 + config.go:56
+ kernelevents.go:258). All updated to RFC-0007; the deliberate
breadcrumb in doc.go:23 remains as renumber history.

P2 — TestSourceTemplate_ParsesAsGo. Parser-based gate on
source_template.go.example confirms the file stays valid Go
syntax and retains its scaffolding identifiers (yourSource,
YourConfig, sourceYour, Start/Shutdown/run signatures). Parse-
only, no type checking, because the template references
internal-package types a copy-paste author renames; full type
check would defeat the rename safety. The class of bug an
author would actually hit (syntax error, broken func signature,
deleted scaffolding identifier) is caught.

Self-score recalibration in M9-AGRADE-GAP.md: now reports both
self (4.35) and reviewer-anchored (4.05) composites, with
explicit note that A/A- is the honest read until C8
(independent reviewer) closes. Per the honest-review-pushback
rule, when the author scores above the reviewer, the reviewer's
number is the one a third party should size effort against.

FOLLOWUPS #11 added: AST walker `resolveIncErrorCall` handles
SelectorExpr + Ident but not CallExpr cast form
(`selftelemetry.Kind("foo")`). Zero live call sites today
(grep-verified); future-tracked.

R3 nit on TestRUNBOOK_KindsMatchEmitted_PinsExpectedSet
docstring: updated to direct future authors to update `want`
DELIBERATELY (i.e. only after writing a RUNBOOK row), not
reflexively to silence the gate.

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Eight 1-page pattern-design specs covering #2 IB link flap, #7
dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout,
#10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss
spike NaN, #13 silent data corruption. Each carries the standard
detector-design shape (symptom, layers, signal sources, evaluation
rule, verdict attrs, edge cases, status, open questions) so the next
contributor can write a TDD red test directly off the spec.

Status: all 8 marked planned. #10 already has issue #303; the spec
frames the design alongside.

NORTHSTARS Appendix A gains a Spec column; docs/README + patterns
README link the new specs.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Ships **pattern #11 (Checkpointer hang)** detector end-to-end, per
`docs/patterns/11-checkpointer-hang.md`. Surfaces training-step stalls
that happen inside a checkpoint write (the operator-meaningful failure:
"step duration jumped from 30s to 600s on the periodic checkpoint step
and never finished").

- **Detector library** (`module/pkg/patterns/checkpointer_hang.go`):
joins `TrainingStepStallRecord` + same-pod `CheckpointPhaseRecord`
(plan/write/barrier, no matching complete) inside a 90s symmetric
correlation window; optional same-node `StoragePressureRecord` promotes
`confidence` partial → full. Discriminator vs pattern #7 (dataloader
hang): the checkpoint-phase signal — a stall WITHOUT a concurrent
in-progress checkpoint is skipped (that's #7/#6 territory).
False-positive guards per spec: BarrierTolerance (30s) for normal rank
skew, StallThreshold (3min) floor for step jitter, completed-checkpoint
suppressor.
- **Processor wiring**
(`module/processor/patterndetectorprocessor/checkpointer_hang.go`):
projects `gen_ai.training.step_duration_seconds` (stall),
`tracecore.alert.checkpoint.phase` + path (checkpoint phase),
`tracecore.alert.storage_pressure.backend` (pressure) off log records;
emits one verdict per match with promoted scalars per issue #270
contract.
- **Schema** (`testdata/checkpointer_hang_verdict.schema.json`): pinned
by 1 conformance + 10 drift-rejection falsifier tests.
- **Config**: three new YAML knobs
(`checkpointer_hang_{correlation_window,barrier_tolerance,stall_threshold}`)
with sub-1s Validate guards.
- **Docs**: spec status flipped to shipped; `docs/patterns/README.md`
updated; 8 new attribute keys registered in `docs/ATTRIBUTES.md`
(attribute-namespace-check stays clean at 75/75).

```release-notes
feat(patterns): ship pattern #11 (Checkpointer hang) detector — joins training-step stalls with same-pod checkpoint-phase records (plan/write/barrier) inside a 90s window; same-node storage-pressure record promotes confidence partial → full. Three YAML knobs surfaced. Discriminator vs pattern #7 (dataloader hang): the in-progress checkpoint-phase signal.
```

## Test plan

- [x] `cd module && go test ./pkg/patterns/...
./processor/patterndetectorprocessor/...` — all green
- [x] `cd module && go vet ./...` — clean
- [x] `cd module && go test ./...` — all module tests green
- [x] `make doc-check` — clean (links, comment-noise, integration
recipes)
- [x] `make attribute-namespace-check` — clean (75/75 attributes
documented)
- [x] Pre-commit hooks (license, license-fix, gofumpt, golangci-lint,
govulncheck, doc-check, attribute-namespace-check, no-autoupdate, DCO,
AI-trailer) — all pass
- [x] 14 patterns-lib tests cover: positive full (write + pressure +
Lustre), positive partial (no pressure), negative
stall-without-checkpoint (pattern-#7 discriminator), negative
completed-checkpoint suppressor, BarrierTolerance guard, StallThreshold
guard, cross-pod isolation, out-of-window, post-stall window-tolerant
join, multi-rank cohort, deterministic order, 6-case storage-backend
inference (lustre/fsx/weka/nfs/s3/unknown), cross-node pressure ignored,
window-configurable
- [x] 4 wiring tests cover: full verdict + promoted scalars, partial (no
pressure), suppression with `emit_partial_verdicts=false`,
silent-on-stall-without-checkpoint
- [x] 10-falsifier schema drift battery + 1 conformance assertion
- [x] 3-row Validate guard battery for the new YAML knobs

## Files

- `module/pkg/patterns/checkpointer_hang.go` (+434)
- `module/pkg/patterns/checkpointer_hang_test.go` (+584)
- `module/pkg/patterns/testdata/checkpointer_hang_verdict.schema.json`
(new)
- `module/processor/patterndetectorprocessor/checkpointer_hang.go`
(+292)
- `module/processor/patterndetectorprocessor/checkpointer_hang_test.go`
(+289)
-
`module/processor/patterndetectorprocessor/{config.go,patterndetector.go,example_config.yaml}`
(knobs + wire-up + example)
- `docs/patterns/{11-checkpointer-hang.md,README.md}` (status flip)
- `docs/ATTRIBUTES.md` (+8 attributes registered)

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

- Ships **pattern #7 (dataloader hang)** end-to-end: detector library,
JSON-schema verdict shape, processor wiring, integration tests, and docs
flip from `☐ planned` to `☑ shipped`.
- Conservative two-discriminator rule per
`docs/patterns/07-dataloader-hang.md`: a stall alone never fires; the
verdict requires *either* a same-pod `dataloader.error_class` log
(`worker_killed`) *or* a same-node `FailedMount` / `VolumeMountFailure`
Kubernetes Event (`storage_event`) within the correlation window.
- Three false-positive guards land with the detector: step-0/1
warmup-prefetch, `phase=="eval"` validation pauses, and sub-threshold
cadence variance.

```release-notes
patterns: pattern #7 (dataloader hang) detector ships with worker_killed + storage_event discriminators, conservative warmup / eval-phase / sub-threshold guards, and a pinned JSON-schema verdict shape.
```

## Tests

```
cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -run DataLoader
```

- **Library** (`module/pkg/patterns/dataloader_hang_test.go`, 14 cases):
canonical worker-killed + storage-event triggers, the three
false-positive guards (warmup, eval, sub-threshold), per-pod / per-node
join scopes, window semantics, discriminator priority when both layers
join, configurable threshold, deterministic ordering, schema conformance
+ 10 drift-rejection falsifiers.
- **Processor**
(`module/processor/patterndetectorprocessor/dataloader_hang_test.go`, 5
cases): worker-killed wiring with promoted scalars, storage-event
wiring, stall-alone-no-fire conservative-guard wiring, YAML threshold
knob, sub-second validation rejection.
- `make check` clean (gofumpt, golangci-lint 0 issues, go vet, go mod
verify, attribute-namespace-check 73/73 documented).

## Schema

`module/pkg/patterns/testdata/dataloader_hang_verdict.schema.json` —
pins:
- `pattern.id` const `"7"`
- `discriminator` enum `{worker_killed, storage_event}`
- `confidence` enum `{full, partial}` (v1 emits only `full`; `partial`
reserved for the future py-spy degraded path per spec Open Q#2)
- `tracecore.alert.dataloader_hang.stall_seconds` integer ≥ 0
- `evidence_trail.kind` enum `{training_step_stall, dataloader_error,
storage_event}` with `additionalProperties: false`

## Spec ambiguity choices (per PRINCIPLES.md: maximize operator UX,
minimize false positives)

- **Open Q#2 (py-spy degraded path)**: deferred. Spec notes the py-spy /
parca-agent receiver is undecided at v1 (RFC-0013 §7 slates pyspy for
delete; parca-agent contract not yet authored). v1 detector emits
`confidence=full` only when a discriminator joins; `confidence=partial`
is reserved in the schema enum so we can land the py-spy branch
additively. Trade-off: stalls without storage / worker evidence are
silent — operators triage off pattern #6 (stragglers) and pattern #11
(checkpointer) instead.
- **Open Q#1 (step-progress signal source)**: detector consumes a
`tracecore.alert.training_step_stalled.no_progress_seconds` bridge
attribute, mirroring the pattern-5
`tracecore.alert.pcie_rate_collapse.*` shape. OTTL recipe wire-up
follows the metrics→logs convention from RFC-0014; the recipe itself is
the integration-gap follow-up (out of scope for the detector PR).
- **Open Q#3 (FUSE error vocabulary)**: deferred to the OTTL recipe —
detector reads a single `dataloader.error_class` string; the recipe owns
per-driver regex stanzas.
- **Open Q#4 (`gen_ai.training.phase` upstream)**: detector reads it as
a tracecore-ext alpha attribute; documented in `docs/ATTRIBUTES.md`.
Switches to upstream-semconv when O4 lands it.
- **Discriminator priority**: when both a same-pod worker-killed log
*and* a same-node storage event join the stall, `worker_killed` wins —
pod-scoped is more precise (storage events on the node may be unrelated
to the failing pod's mount). Pinned by
`TestDataLoaderHangDetector_PrefersWorkerKilledOverStorage`.

## Closes

None — no GitHub issue exists for pattern #7. Closes the
spec-vs-detector gap from `docs/patterns/07-dataloader-hang.md` (status
flips from `☐ planned` to `☑ shipped`).

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but
the join shape varies across patterns and the rationale lived only
in code comments + PR review threads. Operators tuning windows had to
read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the
causal physics of each signal), not bugs:

- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start
  checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window
  could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the
  bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window
  anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing
every detector to the asymmetric two-knob form would silently
zero one leg for the one-sided detectors (footgun on clock skew)
and would not apply to #13 at all.

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13
  (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md
  covering ALL 11 detectors with the predicate, anchor, and shape
  rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Three v1 detectors use three different correlation-window shapes; the
rationale lived only in code comments and review threads. Operators
tuning windows hit these without warning.

- pattern #7 (dataloader_hang): one-sided look-back — strict
cause→symptom
- pattern #11 (checkpointer_hang): asymmetric (-30s, +60s) — log/record
race
- pattern #13 (silent_data_corruption): symmetric job window —
order-free attribution

## Decision

Document the existing reality. Each shape matches a distinct physical
event-ordering. A unified asymmetric two-knob form (issue suggestion)
would silently zero one leg for #7 and not apply to #13 — operators
would tune knobs with no physical meaning for their pattern.

## Changes

- "Why this correlation shape" subsection in each of
`docs/patterns/07-dataloader-hang.md`, `11-checkpointer-hang.md`,
`13-silent-data-corruption.md`
- "Correlation-window semantics" comparison table in
`docs/patterns/README.md` cross-linking the three rationales
- No code changes; no detector behavior changes

## Test plan

- [x] `golangci-lint`, `go vet`, `attribute-namespace-check` (pre-push
hook) — green
- [x] Each pattern doc cross-links to the comparison table
- [x] Each comparison-table row links back to the per-pattern rationale
section

Closes #367.

```release-notes
docs: document correlation-window semantics for patterns #7, #11, #13 — three shapes (one-sided, asymmetric, symmetric) match distinct physical event-orderings; no behavior change.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant