[docs] follow-up curation: validate Next-up, promote to Issues, strike shipped #142

Merged: trilamsr merged 1 commit into main from worktree-followups-curation on May 20, 2026

trilamsr commented May 20, 2026

Contributor

Summary

Curation pass on docs/followups/opportunistic.md "Next up" rows. Builds on #132 (the shard split, now merged) and lands alongside #133 (RUNBOOK + chart-appversion audit, also merged).

  • Validated every Next-up row against the current code tree. Per the pr-workflow.md "audit stale rows before implementing" lesson: any row naming a specific file/symbol gets a grep/ls check before pickup.
  • Item 5 (DCGM adopt lifecycle.Lifecycle): SHIPPED. components/receivers/dcgm/receiver.go L67-70 + L81 already use lifecycle.Lifecycle. Row replaced with a "Closed" marker citing the validation date.
  • Item 8 (resolveIncErrorCall CallExpr): anchor corrected. Cited as dcgm/runbook_kinds_test.go; correct path is kernelevents/runbook_kinds_test.go. Also: trigger fired — selftelemetry.Kind("...") call sites now exist (capturing_test.go, classify_internal_test.go).
  • Items 2, 7, 8, 10, 11, 12, 13: promoted to GitHub Issues (#135, #136, #137, #138, #139, #140, #141) with enhancement + help wanted labels. Each opportunistic row gains a *Tracked:* #NN back-link.
  • Item 4 (tracecore.dev/schemas migration): no Issue. Trigger is external hosting, not contributor pickup. Stays in the shard.

Resource-bucket re-audit

Scanned every milestone shard for follow-ups gated on GPU hardware or production data; zero net moves. Other shards' GPU/production mentions are trigger conditions ("operator reports X") or describe what the receiver observes, not resource gates on the follow-up itself. M14 explicitly self-notes this. Documented as "Coverage scan (2026-05-20)" notes in _needs-prod-data.md and _needs-gpu.md so the next curator knows the audit ran.

Files changed

  • docs/followups/opportunistic.md — 7 Tracked-links, Item 5 strike, Item 8 anchor + trigger update.
  • docs/followups/_needs-prod-data.md — coverage-scan note.
  • docs/followups/_needs-gpu.md — coverage-scan note.

Test plan

  • bash scripts/doc-check.sh green locally (436 markdown links resolve, em-dash gate clean, 7 baseline unverified markers non-growing).
  • All 7 created Issues exist on GitHub (linked above).
  • Item 5 ship-verification re-confirmed: grep -n lifecycle.Lifecycle components/receivers/dcgm/receiver.go shows the field at L81.
  • CI green on this branch.

Rebase note

Branch was originally based on the pre-merge worktree-followups-split branch (#132). After #132 was squash-merged and #133 landed, a straight rebase tried to replay #132's already-merged commits and conflicted with their squashed form under docs/notes/. Resolution: reset the branch to origin/main and cherry-pick only the self-contained curation commit (fdebda8 → 897acc8). Lesson: branch follow-on work off the same base the upstream PR targets (main), not off the feature branch.

🤖 Generated with Claude Code

…k Item 5 shipped

Validation pass on docs/followups/opportunistic.md "Next up" items
against the current code tree (per the pr-workflow.md "audit stale
rows before implementing" lesson):

- Item 5 (DCGM adopt lifecycle.Lifecycle): SHIPPED. receiver.go
  L67-70 + L81 confirm the helper is in use. Replaced the row with
  a "Closed" marker citing the validation date.
- Item 8 (resolveIncErrorCall CallExpr): anchor corrected
  (kernelevents/runbook_kinds_test.go, not dcgm/). Trigger fired:
  selftelemetry.Kind("...") call sites now exist in
  capturing_test.go and classify_internal_test.go.
- Items 2, 7, 8, 10, 11, 12, 13: still active. Promoted to GitHub
  Issues (#135-#141) with enhancement + help wanted labels. Each
  row in opportunistic.md gains a *Tracked:* link.
- Item 4 (tracecore.dev/schemas migration): still active but
  blocked on external hosting, not contributor pickup. No Issue.

Resource-bucket re-audit across every milestone shard: zero net
moves. Other shards' GPU/production mentions are trigger conditions
or describe what the receiver observes, not resource gates on the
follow-up itself. Documented as "Coverage scan" notes in
_needs-prod-data.md and _needs-gpu.md so a future curator knows
the audit ran without leaving the buckets empty-looking.

This builds on PR #132 (which split the monolithic FOLLOWUPS.md
into per-milestone shards); merging order does not strictly matter
since the changes are append-style under the union-merge driver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr force-pushed the worktree-followups-curation branch from fdebda8 to 897acc8 on May 20, 2026 22:56
trilamsr enabled auto-merge (squash) May 20, 2026 23:04
trilamsr merged commit 723babf into main May 20, 2026
9 checks passed
trilamsr deleted the worktree-followups-curation branch May 20, 2026 23:07
trilamsr added a commit that referenced this pull request May 21, 2026
…ards (#143)

## Summary

Cross-shard anchor + trigger audit of `docs/followups/*.md`. Targets the
same staleness class that the opportunistic.md curation pass (#142)
addressed, applied to every other milestone shard.

**Method:** for each of 192 rows across 14 milestone/component shards,
extracted backticked paths/symbols/Make-targets and verified each anchor
against the current tree. For rows with explicit `*Trigger:*`
conditions, evaluated whether the condition has fired.
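
A minimal sketch of the anchor-extraction step described above, assuming a plain regular expression over each row and a simple existence check for path-like anchors. The names `extractAnchors` and `anchorExists` are hypothetical and are not taken from the repository; symbols and Make targets would need a grep-style check instead of a stat.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
)

// backticked matches inline code spans such as `scripts/doc-check.sh`.
var backticked = regexp.MustCompile("`([^`]+)`")

// extractAnchors returns every backticked token found in a follow-up row.
// (Hypothetical helper; the real sweep may tokenize differently.)
func extractAnchors(row string) []string {
	var out []string
	for _, m := range backticked.FindAllStringSubmatch(row, -1) {
		out = append(out, m[1])
	}
	return out
}

// anchorExists reports whether an anchor resolves to a file or directory
// under root. It only makes sense for path-like anchors.
func anchorExists(root, anchor string) bool {
	_, err := os.Stat(filepath.Join(root, anchor))
	return err == nil
}

func main() {
	// Illustrative row text, not copied from any shard.
	row := "- [ ] Wire `scripts/doc-check.sh` into `make ci`."
	for _, a := range extractAnchors(row) {
		fmt.Printf("%-24s exists=%v\n", a, anchorExists(".", a))
	}
}
```

Checking the full path rather than the base name is what avoids the false positives that arise when the same file name appears in several packages.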

## Findings applied

- **M13 §pyspy `Operator RUNBOOK.md expansion`: SHIPPED.** PR #133
created `components/receivers/pyspy/RUNBOOK.md` with per-kind triage for
all 12 RFC-0009 IncError kinds. Row marked `[x]` with ship reference;
original ask text struck through.
- **M3 §`chart-appversion` drift gate: PARTIALLY SHIPPED.** PR #133's
`scripts/chart-appversion-check.sh` compares `Chart.yaml.appVersion`
against `internal/version/version.go::Version` (in-tree drift). Original
row asked for drift against the actual binary release tag (`gh release
view`). In-tree half is now covered; binary-tag half still gated on M21
publishing a real tag. Row kept open with explicit partial-ship line.
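
The in-tree half of the drift gate can be illustrated with a short sketch. This is not the contents of `scripts/chart-appversion-check.sh` (which is a shell script); it is a Go rendering of the same comparison, and it assumes that `Chart.yaml` carries a top-level `appVersion:` key and that `version.go` assigns a string literal to `Version`. The function name `checkDrift` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Assumed shape: a top-level `appVersion: "x.y.z"` line (quotes optional).
	chartRe = regexp.MustCompile(`(?m)^appVersion:\s*"?([^"\s]+)"?\s*$`)
	// Assumed shape: `Version = "x.y.z"`, optionally prefixed by const or var.
	goRe = regexp.MustCompile(`(?m)^\s*(?:const\s+|var\s+)?Version\s*=\s*"([^"]+)"`)
)

// checkDrift returns an error when the chart's appVersion and the Go
// Version constant disagree, or when either cannot be located.
func checkDrift(chartYAML, versionGo string) error {
	c := chartRe.FindStringSubmatch(chartYAML)
	g := goRe.FindStringSubmatch(versionGo)
	if c == nil || g == nil {
		return fmt.Errorf("could not locate appVersion or Version")
	}
	if c[1] != g[1] {
		return fmt.Errorf("drift: chart appVersion %q != Version %q", c[1], g[1])
	}
	return nil
}

func main() {
	// Illustrative file contents, not copied from the repository.
	chart := "name: tracecore\nappVersion: \"0.3.0\"\n"
	src := "package version\n\nconst Version = \"0.2.0\"\n"
	fmt.Println(checkDrift(chart, src))
}
```

The binary-tag half that remains open would compare against a published release tag instead of `version.go`, which this sketch does not attempt.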

## Findings not applied (with reasoning)

Total: 72 candidate rows surfaced by the sweep, partitioned 29 + 9 + 34.
Each bucket's rejection reasoning below.

- **29 candidate "all anchors exist" rows**: anchors resolved because
they're common files (`receiver.go`, `config.go`), not because the
specific feature shipped. Examples: `docs/followups/M14.md` L70
concurrent ingest race-detector test (counters exist, test does not);
`docs/followups/otlphttp.md` L25 `MaxBodyBytes` config (zero grep hits
for the feature); `docs/followups/M8.md` L35 `tracecore validate
--show-defaults` (subcommand absent from `cmd/tracecore/`).
- **9 trigger candidates** (5 + 4):
  - **5 rows triggered by "M11 NVML lands"**: M11 in `MILESTONES.md` is
    the NCCL_fr receiver, not NVML. NVML is a future trigger.
  - **4 rows triggered by "cgo client lands"**: `pkg/dcgm/client_cgo.go`
    is an explicit placeholder ("PLACEHOLDER" caps in the doc comment;
    exposes `cgo-placeholder` variant string in `tracecore receivers list`).
- **34 "stale path" candidates**: most resolved to common Go identifiers
(`process.pid`, `plog.LogRecord`) or files with the same name in
multiple packages — false positives from path-stripped greps; the rows'
true anchors are correct in context.

## Root cause

The two findings applied came from PR #133 shipping work that closed two
specific follow-up rows but didn't update the matching shards. The
`MILESTONES.md § "Keeping this document current"` rule (updated in #132)
now mandates this, but #133 was authored against the older single-file
convention. Going forward, the rule covers `docs/followups/*.md`
directly.

## Files changed

- `docs/followups/M13.md` — strike-through `Operator RUNBOOK.md
expansion` row.
- `docs/followups/M3.md` — partial-ship note on chart-appversion drift
gate.

## Test plan

- [x] `bash scripts/doc-check.sh` green locally (436 markdown links
resolve, em-dash + en-dash diff gate clean, comment-noise diff gate
clean).
- [x] Strike target verified on `main`:
`components/receivers/pyspy/RUNBOOK.md` (blob `2e189a2`) exists and
covers all 12 IncError kinds.
- [x] Partial-ship target verified on `main`:
`scripts/chart-appversion-check.sh` (blob `9d8e2a5`) exists and is wired
into `make doc-check` (Makefile L227-231).
- [x] CI green on this branch (8/8 checks passing as of body-edit time:
verify, verify-test, verify-lint, verify-static, build, pr-lint, CodeQL
Analyze, CodeQL).

## Sequencing

Builds on PR #132 (shard split, merged) and PR #133 (RUNBOOK +
chart-appversion gate, merged). Independent of #142 (opportunistic.md
curation, currently auto-merging) since they touch different shards.
Merging order does not matter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 21, 2026
#147)

## Summary

Single-PR bundle of 10 low-risk follow-up actions. Each row was
anchor-verified on `main` before editing; no production behavior change.
Diff is 10 files, +91/-37, dominated by markdown.

**Breakdown:**
- 3 strikes (anchor shipped, row was stale)
- 1 test-only struct add (k8sevents `NodeWatchErrors`)
- 1 bash test add (`no-autoupdate-check` hit-line format lock)
- 5 doc-only clarifications / partial-ship / audit notes

## Items applied

### Strikes — anchor on `main` confirms shipped

1. **M3.md L188 `make doc-check`.** `scripts/doc-check.sh` header reads
"verify every Test\*/Fuzz\*/Benchmark\* name referenced in docs"; wired
into `make doc-check` and `make ci`.
2. **M8.md L103 `docs/HARDWARE-TESTING.md` libdcgm + nv-hostengine
setup.** File exists (28 hits for libdcgm/dcgm/nv-hostengine); covers
Ubuntu 22.04 driver / `libdcgm-dev` / `nv-hostengine` provisioning,
x86_64 + aarch64-SBSA build matrix, and the `//go:build dcgm,hardware`
distinction. Doc shipped ahead of cgo client to unblock GPU-less
contributors.
3. **M19.md L18 `nodeWatchErrCount` not in SnapshotCounters.** Closed by
item 4 below: added `NodeWatchErrors` field symmetrically.

### Test-only struct add

4. **components/receivers/k8sevents/export_test.go.** Added
`NodeWatchErrors int64` field on `CountersForTest`; `SnapshotCounters`
now reads `rr.nodeWatchErrCount.Load()` symmetrically with
`rr.watchErrCount.Load()`. Both call sites are keyed-init inside the
same file; no external positional callers to break (grep confirmed: only
2 hits, both in `export_test.go`).
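
The shape of the change in item 4 can be sketched as follows. Only `CountersForTest`, `NodeWatchErrors`, `SnapshotCounters`, `nodeWatchErrCount`, and `watchErrCount` come from the description above; the `receiver` type, the `WatchErrors` field name, and the use of `atomic.Int64` are assumptions made so the example compiles on its own.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// receiver is a hypothetical stand-in for the k8sevents receiver; only the
// two counter names are taken from the PR description.
type receiver struct {
	watchErrCount     atomic.Int64
	nodeWatchErrCount atomic.Int64
}

// CountersForTest is the test-only snapshot struct. WatchErrors is an
// assumed name for the pre-existing field; NodeWatchErrors is the new one.
type CountersForTest struct {
	WatchErrors     int64
	NodeWatchErrors int64
}

// SnapshotCounters reads both counters with keyed initialization, so adding
// a field cannot silently shift positional arguments at a call site.
func SnapshotCounters(rr *receiver) CountersForTest {
	return CountersForTest{
		WatchErrors:     rr.watchErrCount.Load(),
		NodeWatchErrors: rr.nodeWatchErrCount.Load(),
	}
}

func main() {
	var rr receiver
	rr.watchErrCount.Add(1)
	rr.nodeWatchErrCount.Add(3)
	fmt.Printf("%+v\n", SnapshotCounters(&rr))
}
```

Keyed initialization is what makes the extra field safe to add: existing keyed literals keep compiling and simply leave the new field at its zero value.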

### Bash test add (M23 grep-gate format lock)

5. **scripts/no-autoupdate-check_test.sh "hit-line-format-stable".** New
assertion that runs the gate against the hyphenated-go-update fixture,
captures stdout (existing tests discard it), and asserts at least one
line matches `^[^:]+:[0-9]+:`. Locks the parseable hit-line shape
*before* the first automation consumer (CI summary, dashboard, Slack
notifier) wires up — a cosmetic tweak to the gate's message body now
fails CI instead of silently breaking downstream parsers. M23.md row
struck.
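
The format lock in item 5 amounts to asserting that at least one stdout line matches `^[^:]+:[0-9]+:`. The real test is a bash assertion; the sketch below restates the same check in Go for illustration. The helper name `hasParseableHit` and the sample gate output are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// hitLine is the pattern quoted in the PR: path, colon, line number, colon.
var hitLine = regexp.MustCompile(`^[^:]+:[0-9]+:`)

// hasParseableHit reports whether any line of the gate's stdout has the
// file:line: prefix that downstream parsers would rely on.
func hasParseableHit(stdout string) bool {
	sc := bufio.NewScanner(strings.NewReader(stdout))
	for sc.Scan() {
		if hitLine.MatchString(sc.Text()) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up gate output; the real fixture's message body may differ.
	out := "checking for auto-update code\ncmd/tool/main.go:42: go-update import found\n"
	fmt.Println(hasParseableHit(out))
}
```

Because the pattern pins only the `file:line:` prefix, the message text after the second colon can still change without failing the assertion.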

### Doc-only clarifications

6. **M15.md L185 falsifying-check backfill.** Anchored the
"/var/lib/tracecore/ subdir governance" row's grep-falsifying-check to
RFC-0010 §Proposal — `docs/rfcs/0010-containerstdout-receiver-scope.md`
L177/L217/L274/L393/L407 already carry the convention ("M15 owns
`/var/lib/tracecore/container_stdout/`. Future siblings reserve their
own subdirectories."). Row marked `[x]`.

7. **M15.md L192 + RFC-0010 §Pod-attribution forward-pointer.** Appended
one-line cross-reference at RFC-0010 L158 → `docs/followups/M15.md`
"Cross-receiver rank-label reconciliation" so the deferred audit trail
is discoverable from the RFC. Row marked `[x]`.

8. **M8.md L30 `tracecore debug dump` partial-ship.**
`cmd/tracecore/debug.go::runDebugDump` already writes version + revision
+ branch + build date + Go runtime stats + registered components +
redacted config to `tracecore debug dump > diagnostic.txt`. Remaining
gap is "last N samples" — needs receiver-side ring buffer (M2
carry-forward). Row kept open with partial-ship line +
remaining-trigger.

9. **M3.md L153 SUPPLY-CHAIN-IDENTITY.md scope clarification.** Added
one sentence noting the consolidation is a copy-and-deduplicate pass
against existing `release.yml` comment blocks (cosign-sign-blob,
gh-attestation-sign), not net-new authoring — so the next reader sees
the actual scope of work, not a misleading "30-min write" estimate that
implies green-field.

10. **otlphttp.md L182 workflow paths audit + M14.md L88 test pointer.**
    - **otlphttp**: inlined audit findings (2026-05-20). `chart.yml` and
      `install-bench.yml` are substrate-aware (include `cmd/tracecore/**`,
      `internal/**`); `kernelevents-integration.yml` and
      `pyspy-integration.yml` cover only `components/receivers/<name>/**` +
      `internal/runtime/lifecycle/**`, so a `cmd/tracecore` factory wiring or
      `internal/pipeline` contract change can land without re-running these
      integration jobs. `chaos.yml` covers `tools/failure-inject/**` +
      `internal/synthesis/**` only (indirect coupling, acceptable). Remaining:
      6-line YAML edit per integration workflow.
    - **M14**: added inline pointer from the multi-retry slow-write fixture
      row to the existing single-retry baseline at
      `components/receivers/kineto/shutdown_test.go::TestIngest_RetryOnTruncated`
      so the future author has the test-shape anchor.

## Files changed

| File | LOC | Kind |
|---|---|---|
| `components/receivers/k8sevents/export_test.go` | +2 | test struct
field |
| `scripts/no-autoupdate-check_test.sh` | +20 | bash test add |
| `docs/rfcs/0010-containerstdout-receiver-scope.md` | +1/-1 | inline
cross-ref |
| `docs/followups/M3.md` | +9/-5 | strike + scope clarification |
| `docs/followups/M8.md` | +16/-5 | strike + partial-ship |
| `docs/followups/M14.md` | +1/-1 | test pointer |
| `docs/followups/M15.md` | +15/-8 | 2 strikes |
| `docs/followups/M19.md` | +5/-9 | strike (anchored to test add) |
| `docs/followups/M23.md` | +9/-7 | strike |
| `docs/followups/otlphttp.md` | +13/-1 | audit findings inline |

## Test plan

- [x] `go test ./components/receivers/k8sevents/...` green.
- [x] `bash scripts/no-autoupdate-check_test.sh` 10/10 assertions pass
(added "hit-line-format-stable" — the 10th).
- [x] `bash scripts/doc-check.sh` green (437 markdown links resolve,
em-dash + en-dash diff gate clean, comment-noise diff gate clean).
- [x] Pre-commit hook ran full `make check` + `make ci` (all package
tests cached/passing).
- [ ] CI green on this branch.

## Release notes

```release-notes
NONE
```

## Sequencing

Builds on `main` after PRs #132 (shard split), #133 (RUNBOOK +
chart-appversion), #142 (opportunistic curation), #134 (chaos.yml row),
#143 (cross-shard audit). Independent of currently-open PRs #144 (m6
integration recipes) and #145 (m3 GHCR image publish).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr pushed a commit that referenced this pull request May 21, 2026
Sync feature branch with main per the merge-not-rebase policy
documented in CONTRIBUTING.md (commit ddf86f7).

Main moved 6 PRs ahead during this branch's lifetime:
- PR #143 (followups sweep)
- PR #134 (chaos.yml pattern-pod-evicted)
- PR #142 (follow-up curation)
- PR #144 (M6 integration recipes)
- PR #146 (kineto MaxEvents stub)
- PR #147 (followups bundle)

Conflicts expected in CHANGELOG.md and docs/followups/M3.md
(both additive).

# Conflicts:
#	CHANGELOG.md