Bump the gh-actions group across 1 directory with 4 updates #2

Merged
trilamsr merged 1 commit into main from dependabot/github_actions/gh-actions-fd3a928c96
May 13, 2026
dependabot bot commented on behalf of github on May 8, 2026

Bumps the gh-actions group with 4 updates in the / directory: actions/checkout, actions/setup-go, actions/upload-artifact and github/codeql-action.

Updates actions/checkout from 4 to 6

Release notes

Sourced from actions/checkout's releases.

v6.0.0

What's Changed

Full Changelog: actions/checkout@v5.0.0...v6.0.0

v6-beta

What's Changed

Updated persist-credentials to store the credentials under $RUNNER_TEMP instead of directly in the local git config.

This requires a minimum Actions Runner version of v2.329.0 to access the persisted credentials for Docker container action scenarios.

v5.0.1

What's Changed

Full Changelog: actions/checkout@v5...v5.0.1

v5.0.0

What's Changed

⚠️ Minimum Compatible Runner Version

v2.327.1
Release Notes

Make sure your runner is updated to this version or newer to use this release.

Full Changelog: actions/checkout@v4...v5.0.0

v4.3.1

What's Changed

Full Changelog: actions/checkout@v4...v4.3.1

v4.3.0

What's Changed

... (truncated)

Changelog

Sourced from actions/checkout's changelog.

Changelog

v6.0.2

v6.0.1

v6.0.0

v5.0.1

v5.0.0

v4.3.1

v4.3.0

v4.2.2

v4.2.1

v4.2.0

v4.1.7

v4.1.6

... (truncated)

Commits

Updates actions/setup-go from 5 to 6

Release notes

Sourced from actions/setup-go's releases.

v6.0.0

What's Changed

Breaking Changes

Make sure your runner is on version v2.327.1 or later to ensure compatibility with this release. See Release Notes

Dependency Upgrades

New Contributors

Full Changelog: actions/setup-go@v5...v6.0.0

v5.6.0

What's Changed

Full Changelog: actions/setup-go@v5...v5.6.0

v5.5.0

What's Changed

Bug fixes:

Dependency updates:

New Contributors

Full Changelog: actions/setup-go@v5...v5.5.0

... (truncated)

Commits

Updates actions/upload-artifact from 4 to 7

Release notes

Sourced from actions/upload-artifact's releases.

v7.0.0

v7 What's new

Direct Uploads

Adds support for uploading single files directly (unzipped). Callers can set the new archive parameter to false to skip zipping the file during upload. Right now, we only support single files. The action will fail if the glob passed resolves to multiple files. The name parameter is also ignored with this setting. Instead, the name of the artifact will be the name of the uploaded file.

ESM

To support new versions of the @actions/* packages, we've upgraded the package to ESM.

What's Changed

New Contributors

Full Changelog: actions/upload-artifact@v6...v7.0.0

v6.0.0

v6 - What's new

Important: actions/upload-artifact@v6 now runs on Node.js 24 (runs.using: node24) and requires a minimum Actions Runner version of 2.327.1. If you are using self-hosted runners, ensure they are updated before upgrading.

Node.js 24

This release updates the runtime to Node.js 24. v5 had preliminary support for Node.js 24, however this action was by default still running on Node.js 20. Now this action by default will run on Node.js 24.

What's Changed

Full Changelog: actions/upload-artifact@v5.0.0...v6.0.0

v5.0.0

What's Changed

BREAKING CHANGE: this update supports Node v24.x. This is not a breaking change per-se but we're treating it as such.

... (truncated)

Commits
  • 043fb46 Merge pull request #797 from actions/yacaovsnc/update-dependency
  • 634250c Include changes in typespec/ts-http-runtime 0.3.5
  • e454baa Readme: bump all the example versions to v7 (#796)
  • 74fad66 Update the readme with direct upload details (#795)
  • bbbca2d Support direct file uploads (#764)
  • 589182c Upgrade the module to ESM and bump dependencies (#762)
  • 47309c9 Merge pull request #754 from actions/Link-/add-proxy-integration-tests
  • 02a8460 Add proxy integration test
  • b7c566a Merge pull request #745 from actions/upload-artifact-v6-release
  • e516bc8 docs: correct description of Node.js 24 support in README
  • Additional commits viewable in compare view

Updates github/codeql-action from 3 to 4

Release notes

Sourced from github/codeql-action's releases.

v3.35.4

  • Update default CodeQL bundle version to 2.25.4. #3881

v3.35.3

  • Upcoming breaking change: Add a deprecation warning for customers using CodeQL version 2.19.3 and earlier. These versions of CodeQL were discontinued on 9 April 2026 alongside GitHub Enterprise Server 3.15, and will be unsupported by the next minor release of the CodeQL Action. #3837
  • Configurations for private registries that use Cloudsmith or GCP OIDC are now accepted. #3850
  • Best-effort connection tests for private registries now use GET requests instead of HEAD for better compatibility with various registry implementations. For NuGet feeds, the test is now always performed against the service index. #3853
  • Fixed a bug where two diagnostics produced within the same millisecond could overwrite each other on disk, causing one of them to be lost. #3852
  • Update default CodeQL bundle version to 2.25.3. #3865

v3.35.2

  • The undocumented TRAP cache cleanup feature that could be enabled using the CODEQL_ACTION_CLEANUP_TRAP_CACHES environment variable is deprecated and will be removed in May 2026. If you are affected by this, we recommend disabling TRAP caching by passing the trap-caching: false input to the init Action. #3795
  • The Git version 2.36.0 requirement for improved incremental analysis now only applies to repositories that contain submodules. #3789
  • Python analysis on GHES no longer extracts the standard library, relying instead on models of the standard library. This should result in significantly faster extraction and analysis times, while the effect on alerts should be minimal. #3794
  • Fixed a bug in the validation of OIDC configurations for private registries that was added in CodeQL Action 4.33.0 / 3.33.0. #3807
  • Update default CodeQL bundle version to 2.25.2. #3823

v3.35.1

v3.35.0

v3.34.1

  • Downgrade default CodeQL bundle version to 2.24.3 due to issues with a small percentage of Actions and JavaScript analyses. #3762

v3.34.0

  • Added an experimental change which disables TRAP caching when improved incremental analysis is enabled, since improved incremental analysis supersedes TRAP caching. This will improve performance and reduce Actions cache usage. We expect to roll this change out to everyone in March. #3569
  • We are rolling out improved incremental analysis to C/C++ analyses that use build mode none. We expect this rollout to be complete by the end of April 2026. #3584
  • Update default CodeQL bundle version to 2.25.0. #3585

v3.33.0

  • Upcoming change: Starting April 2026, the CodeQL Action will skip collecting file coverage information on pull requests to improve analysis performance. File coverage information will still be computed on non-PR analyses. Pull request analyses will log a warning about this upcoming change. #3562 To opt out of this change:
    • Repositories owned by an organization: Create a custom repository property with the name github-codeql-file-coverage-on-prs and the type "True/false", then set this property to true in the repository's settings. For more information, see Managing custom properties for repositories in your organization. Alternatively, if you are using an advanced setup workflow, you can set the CODEQL_ACTION_FILE_COVERAGE_ON_PRS environment variable to true in your workflow.
    • User-owned repositories using default setup: Switch to an advanced setup workflow and set the CODEQL_ACTION_FILE_COVERAGE_ON_PRS environment variable to true in your workflow.
    • User-owned repositories using advanced setup: Set the CODEQL_ACTION_FILE_COVERAGE_ON_PRS environment variable to true in your workflow.
  • Fixed a bug which caused the CodeQL Action to fail loading repository properties if a "Multi select" repository property was configured for the repository. #3557
  • The CodeQL Action now loads custom repository properties on GitHub Enterprise Server, enabling the customization of features such as github-codeql-disable-overlay that was previously only available on GitHub.com. #3559
  • Once private package registries can be configured with OIDC-based authentication for organizations, the CodeQL Action will now be able to accept such configurations. #3563
  • Fixed the retry mechanism for database uploads. Previously this would fail with the error "Response body object should not be disturbed or locked". #3564
  • A warning is now emitted if the CodeQL Action detects a repository property whose name suggests that it relates to the CodeQL Action, but which is not one of the properties recognised by the current version of the CodeQL Action. #3570

v3.32.6

  • Update default CodeQL bundle version to 2.24.3. #3548

v3.32.5

  • Repositories owned by an organization can now set up the github-codeql-disable-overlay custom repository property to disable improved incremental analysis for CodeQL. First, create a custom repository property with the name github-codeql-disable-overlay and the type "True/false" in the organization's settings. Then in the repository's settings, set this property to true to disable improved incremental analysis. For more information, see Managing custom properties for repositories in your organization. This feature is not yet available on GitHub Enterprise Server. #3507
  • Added an experimental change so that when improved incremental analysis fails on a runner — potentially due to insufficient disk space — the failure is recorded in the Actions cache so that subsequent runs will automatically skip improved incremental analysis until something changes (e.g. a larger runner is provisioned or a new CodeQL version is released). We expect to roll this change out to everyone in March. #3487

... (truncated)

Changelog

Sourced from github/codeql-action's changelog.

4.32.3 - 13 Feb 2026

  • Added experimental support for testing connections to private package registries. This feature is not currently enabled for any analysis. In the future, it may be enabled by default for Default Setup. #3466

4.32.2 - 05 Feb 2026

  • Update default CodeQL bundle version to 2.24.1. #3460

4.32.1 - 02 Feb 2026

  • A warning is now shown in Default Setup workflow logs if a private package registry is configured using a GitHub Personal Access Token (PAT), but no username is configured. #3422
  • Fixed a bug which caused the CodeQL Action to fail when repository properties cannot successfully be retrieved. #3421

4.32.0 - 26 Jan 2026

  • Update default CodeQL bundle version to 2.24.0. #3425

4.31.11 - 23 Jan 2026

  • When running a Default Setup workflow with Actions debugging enabled, the CodeQL Action will now use more unique names when uploading logs from the Dependabot authentication proxy as workflow artifacts. This ensures that the artifact names do not clash between multiple jobs in a build matrix. #3409
  • Improved error handling throughout the CodeQL Action. #3415
  • Added experimental support for automatically excluding generated files from the analysis. This feature is not currently enabled for any analysis. In the future, it may be enabled by default for some GitHub-managed analyses. #3318
  • The changelog extracts that are included with releases of the CodeQL Action are now shorter to avoid duplicated information from appearing in Dependabot PRs. #3403

4.31.10 - 12 Jan 2026

  • Update default CodeQL bundle version to 2.23.9. #3393

4.31.9 - 16 Dec 2025

No user facing changes.

4.31.8 - 11 Dec 2025

  • Update default CodeQL bundle version to 2.23.8. #3354

4.31.7 - 05 Dec 2025

  • Update default CodeQL bundle version to 2.23.7. #3343

4.31.6 - 01 Dec 2025

No user facing changes.

4.31.5 - 24 Nov 2025

  • Update default CodeQL bundle version to 2.23.6. #3321

4.31.4 - 18 Nov 2025

... (truncated)

Commits

dependabot Bot commented on behalf of github May 8, 2026

Labels

The following labels could not be found: dependencies, github-actions. Please create them before Dependabot can add them to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

dependabot bot changed the title from "ci(deps): bump the gh-actions group across 1 directory with 4 updates" to "Bump the gh-actions group across 1 directory with 4 updates" on May 8, 2026
dependabot bot force-pushed the dependabot/github_actions/gh-actions-fd3a928c96 branch 2 times, most recently from 5e9ad36 to b2e2c9e on May 8, 2026 06:53
Bumps the gh-actions group with 4 updates in the / directory: [actions/checkout](https://github.com/actions/checkout), [actions/setup-go](https://github.com/actions/setup-go), [actions/upload-artifact](https://github.com/actions/upload-artifact) and [github/codeql-action](https://github.com/github/codeql-action).


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v5...v6)

Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v7)

Updates `github/codeql-action` from 3 to 4
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: github/codeql-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot bot force-pushed the dependabot/github_actions/gh-actions-fd3a928c96 branch from b2e2c9e to 036a0ef on May 11, 2026 07:32
trilamsr merged commit 51b03ba into main on May 13, 2026
8 of 9 checks passed
dependabot bot deleted the dependabot/github_actions/gh-actions-fd3a928c96 branch on May 13, 2026 10:02
trilamsr added a commit that referenced this pull request May 14, 2026
Pushes toward the prompt's self-eval gate by closing every gap
that does NOT genuinely require Linux + libdcgm at build time.

- metrics.go::fieldEmitters grows from 6 to all 13 metric
  families: + hw.gpu.io (PCIe Tx/Rx), hw.energy, hw.gpu.nvlink.io
  (per-link Tx/Rx), hw.gpu.clock.frequency (sm/memory/video
  domains), hw.gpu.xid.errors. ECC aggregate counters keep their
  dedicated drop tier. The receiver-side pipeline is now complete
  for every metric in the README's design table; only the SOURCE
  of samples remains gated on the cgo client.

- pkg/dcgm/types.go: new well-known FieldID constants for SM /
  memory / video clock (100/101/102), NVLink L0 Tx/Rx
  (1040/1041), throttle reasons bitmask (112). RHS will switch to
  go-dcgm constants when client_cgo.go lands.

- components/receivers/dcgm/integration_hardware_test.go:
  //go:build dcgm,hardware skeleton. Skips with a clear reason
  when DCGM is unreachable; runs end-to-end against a real GPU on
  a Linux host where both build tags are active. Hardware
  reviewers have the test to fill in; macOS CI doesn't run it.

- emit_bench_test.go: BenchmarkEmit_TypicalScrape pins the
  per-scrape cost at 37 microseconds for 8 GPUs x 12 fields.
  At 15s collection_interval that's 0.00025 percent CPU --
  about 200x (two orders of magnitude) under the 0.05% O2 budget.

- resetSession() helper extracted from ensureConnected + scrape
  so the connection-loss state-reset doesn't drift between two
  call sites. Closes Loop-4 P3 nit on duplicated reset logic.

- docs/agents/RECEIVER-PATTERNS.md: new "Pattern selection" table
  -- five source-type rows with constructor / lifecycle / pattern
  reference per row -- so M9 (streaming/subprocess), M10 (failure-
  triggered), M11 (vendor-SDK like dcgm) authors know which shape
  fits their work. Closes Loop-4 P3 question on the doc gap.

- FOLLOWUPS.md created at repo root (was referenced repeatedly,
  never written): 4 opportunistic items, 4 "considered and
  explicitly skipped" items with Revisit-if predicates.

- README.md: metric table no longer split into "emitted vs
  deferred" -- the table is the truth, and a single paragraph
  notes that the data SOURCE waits on the cgo client.

- Smoke-tested the binary: `tracecore collect --config=
  example_config.yaml` boots, logs "dcgm receiver started",
  attempts Connect, fails with "dcgm: SDK unavailable", enters
  degraded mode (reason=init), shuts down within the 1s budget.
  End-to-end happy path verified on this host.

Self-eval criterion #3 (metric set) lifts from 3 to 4. Criterion
#2 (cgo wrapper) stays at 3 because client_cgo.go itself is the
only remaining gate -- a Linux GPU host is required to compile
the cgo bindings. The MILESTONES Carry-forward bullet commits to
that work.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
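
The CPU-budget arithmetic in the commit message above can be checked with a few lines of Go. This is only a sketch of the calculation: `cpuPercent` is a hypothetical helper, and the 37 microsecond cost, 15s interval, and 0.05% budget are the figures quoted in the message, not values measured here.

```go
package main

import (
	"fmt"
	"time"
)

// cpuPercent returns the share of one CPU core, in percent, used by a task
// that costs perScrape of CPU time once every interval.
// Hypothetical helper, not part of the repository.
func cpuPercent(perScrape, interval time.Duration) float64 {
	return float64(perScrape) / float64(interval) * 100
}

func main() {
	// Figures quoted in the commit message.
	perScrape := 37 * time.Microsecond
	interval := 15 * time.Second
	const budgetPercent = 0.05

	used := cpuPercent(perScrape, interval)
	fmt.Printf("CPU used: %.5f%%\n", used)
	fmt.Printf("headroom factor: %.0fx\n", budgetPercent/used)
}
```

The headroom factor is the budget divided by the measured share, which is the number to compare against any "orders of magnitude" claim.
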
trilamsr added a commit that referenced this pull request May 14, 2026
Address item #2: the rolling-window failure-rate math was baked
into AggregateSLOSource (~80 lines of ring-buffer + underflow
guard + 2× window pruning + maxSamples cap). When the future queue
mechanism and runtime-restart mechanism land — both with their
own SLI gauges — they'd need the same math.

Extract into a standalone WindowedRate primitive in
internal/telemetry/windowed_rate.go. AggregateSLOSource is now
a thin walker over the exporter registry that delegates the math:

  rate := s.rate.Observe(failure, success+failure)

Public API:

  NewWindowedRate(window) → *WindowedRate
  (*WindowedRate).Observe(numerator, denominator) → float64

Same semantics as before (warming-up returns 0, underflow returns 0,
zero-delta returns 0), now with five focused tests pinning each
contract (warming up, rate-over-window, underflow safety,
zero-delta, default window). AggregateSLOSource shrinks from ~120
lines to ~25 lines of glue.

When queue.depth_ratio gets a real source in a future milestone,
its callback drops in `NewWindowedRate(...)` + `Observe(depth,
capacity)` and inherits the same bounded-memory + monotonic-safe
behavior for free.

make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
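
A minimal sketch of what a windowed-rate primitive with the contract described above might look like. It is not the code in internal/telemetry/windowed_rate.go. The injectable `now` clock, the `defaultWindow` and `maxSamples` values, the choice of baseline sample, and the `demo` helper are all assumptions made for illustration. Only the `NewWindowedRate` / `Observe` shape and the three return-0 cases come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed tuning values; the commit message does not state the real ones.
const (
	defaultWindow = 5 * time.Minute
	maxSamples    = 1024
)

// sample is one observation of two cumulative counters.
type sample struct {
	at       time.Time
	num, den uint64
}

type WindowedRate struct {
	window  time.Duration
	now     func() time.Time // injectable clock (assumption, eases testing)
	samples []sample
}

func NewWindowedRate(window time.Duration) *WindowedRate {
	if window <= 0 {
		window = defaultWindow
	}
	return &WindowedRate{window: window, now: time.Now}
}

// Observe records cumulative counters and returns delta(num)/delta(den)
// measured against the newest sample that is at least one window old.
func (w *WindowedRate) Observe(num, den uint64) float64 {
	now := w.now()
	cur := sample{at: now, num: num, den: den}
	w.samples = append(w.samples, cur)

	// Prune samples older than 2x the window, then enforce the hard cap.
	cutoff := now.Add(-2 * w.window)
	for len(w.samples) > 1 && w.samples[0].at.Before(cutoff) {
		w.samples = w.samples[1:]
	}
	if extra := len(w.samples) - maxSamples; extra > 0 {
		w.samples = w.samples[extra:]
	}

	edge := now.Add(-w.window)
	var base *sample
	for i := range w.samples {
		if w.samples[i].at.After(edge) {
			break
		}
		base = &w.samples[i]
	}
	if base == nil {
		return 0 // warming up: no sample is a full window old yet
	}
	if cur.num < base.num || cur.den < base.den {
		return 0 // underflow: a counter went backwards (e.g. restart)
	}
	if cur.den == base.den {
		return 0 // zero delta: nothing happened in the window
	}
	return float64(cur.num-base.num) / float64(cur.den-base.den)
}

// demo drives the sketch with a fake clock.
func demo() (warming, rate, reset float64) {
	clock := time.Unix(0, 0)
	r := NewWindowedRate(time.Minute)
	r.now = func() time.Time { return clock }

	warming = r.Observe(0, 0)
	clock = clock.Add(time.Minute)
	rate = r.Observe(5, 100) // 5 failures out of 100 requests
	clock = clock.Add(time.Minute)
	reset = r.Observe(1, 10) // counters went backwards
	return warming, rate, reset
}

func main() {
	fmt.Println(demo())
}
```

Taking cumulative counters rather than deltas keeps the caller stateless, which matches the `s.rate.Observe(failure, success+failure)` call shown in the message.
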
trilamsr added a commit that referenced this pull request May 14, 2026
Closes 5 of 7 new A+ criteria from the recursive self-review:

#1 — e2e-otelcontrib now verifies the collector PARSED the record,
not just that it accepted bytes. Workflow rewritten to docker-run
otelcol-contrib with a custom config (file + debug exporters,
detailed verbosity). After the e2e POST, the bash step greps
/tmp/otelout/logs.json for the canonical body, the kernelevents.xid
attribute, and the gpu.id attribute. Empty file or missing
attributes → workflow fails.

#2 — TestIntegration_KmsgWriteReadBehavioral (//go:build linux)
writes a synthetic <6>NVRM Xid 79 line to /dev/kmsg, uses a marker
string in a regex_filter to isolate from ring-buffer noise, then
asserts the receiver emits a plog.LogRecord with kernelevents.xid=79
+ gpu.id=0000:65:00.0 within 3s. A regression in parse/build/emit
fails this on Linux CI.

#3 — prometheus_alerts_test.go validates the alert YAML structure
(every group has interval, every rule has expr/severity/summary/
description) AND cross-references the metric + label-filter names
against the receiver's actual SelfTelemetry surface. A typo in
the alert would silently never fire; this catches it before merge.

#5 — runbook_test.go executes the RUNBOOK's "First 15 minutes"
step 1 (`tracecore validate --config=...`) and step 2
(`tracecore debug dump`) as real commands. Documentation rot
becomes a test failure, not a silent SRE-time discovery.

#4 — sustained_test.go (`//go:build sustained`) feeds 1000
events/sec for 5 minutes (300k records), samples heap every 30s,
asserts ≤10 MiB growth and p99 emit latency tail bounded. New
`sustained-load` workflow job runs it on push-to-main + schedule
(not PR — 5 minutes is too slow for the inner loop).

The seventh criterion (two-week soak + external operator) requires
elapsed time + a human; nothing in-session can close it.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Two independent reviews of PR #18 surfaced a stack of blockers,
strong findings, and quality lifts. This commit lands the tractable
items (defers documented in docs/FOLLOWUPS.md M8 section).

Operator-visible drift (closes Reviewer 2 #13-#17):
 - RUNBOOK kind-triage row `consume` → `downstream` (last commit
   renamed the kind but missed the doc table).
 - README config table: `initial_delay` range updated from "0..
   collection_interval" to "≥ 0" (code already relaxed for DGX
   cold-starts; doc was stale).
 - prometheus-alerts: DCGMReceiverHighErrorRate threshold changed
   from `rate > 0.1/sec` (unreachable: 15s default tick caps at
   0.067/sec) to `increase > 5 in 5m`. Stale "M2 has not landed"
   caveat removed.
 - example_config: single `mode:` line with a clarifying comment
   (was showing both `mode: standalone` and `# mode: embedded`,
   inviting operators to uncomment both → YAML duplicate-key).
 - cmd/tracecore receivers list now prints `dcgm [stub]` / `dcgm
   [cgo]` so operators can verify deploy shape without reading
   go.mod. Build-tag-conditional via three small files in
   cmd/tracecore (receiver_variants*.go) — pattern extends to M11
   NVML.

Correctness bugs (closes Reviewer 2 #2, #3, #4, #6, #7):
 - receiver.Shutdown: `r.running.CompareAndSwap(true, false)`
   gates teardown so a second Shutdown is a no-op (cgo libdcgm
   `dcgmShutdown` is not documented idempotent). Same CAS provides
   the happens-before for `r.cancel` publish that Pass-1 flagged.
 - receiver.ensureWatched: zero-entities path now emits
   IncError(KindEnumerate) + degraded rather than returning true.
   Without this, a misconfigured host (no GPUs visible, ACL blocks
   /dev/nvidia*) had the receiver looking healthy while emitting
   nothing.
 - receiver: new `warnOnce` helper gates the 7 per-tick failure-
   path Warn logs to fire only on the first failure after recovery.
   Closes the log-storm bug (4 errors/min × 60 min = 240 lines).
   Counter (`receiver_errors_total`) still ticks every failure.
 - metrics.applyCardinalityCap: parameter `cap` → `maxSeries`
   (cap shadowed the Go builtin).
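
The CompareAndSwap gate on Shutdown can be illustrated in isolation. This is a sketch under assumptions: the `receiver` type here is a hypothetical stand-in, and the `teardowns` counter replaces the real SDK teardown call so the effect is observable.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// receiver is a hypothetical stand-in for the real receiver type.
type receiver struct {
	running   atomic.Bool
	teardowns atomic.Int32 // counts calls to the non-idempotent teardown
}

func (r *receiver) Start() {
	r.running.Store(true)
}

// Shutdown runs the teardown at most once per Start. Only the caller that
// flips running from true to false proceeds; every other call returns early.
func (r *receiver) Shutdown() {
	if !r.running.CompareAndSwap(true, false) {
		return // never started, or already shut down
	}
	r.teardowns.Add(1) // stand-in for the SDK shutdown call
}

// demo shuts the same receiver down twice and reports the teardown count.
func demo() int32 {
	r := &receiver{}
	r.Start()
	r.Shutdown()
	r.Shutdown() // no-op
	return r.teardowns.Load()
}

func main() {
	fmt.Println("teardown calls:", demo())
}
```
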

Quality / contract lifts:
 - metrics.emit now returns a `stale` count for StatusStale and
   StatusError samples. pushSamples calls IncError(KindRead) once
   per tick when stale > 0 — surfaces DCGM serving slow/faulty
   data, which is precisely what StatusStale exists for. Per-tick
   not per-sample so GPU count doesn't inflate the rate.
   (StatusNoData and StatusFieldNotSupported still silent.)
 - docs_parity_test.go: new TestRUNBOOK_KindsMatchEmitted walks
   every emitted IncError/failedTick kind against the RUNBOOK
   per-kind triage table in both directions. This is the
   structural fix for the bug class — RUNBOOK can never again
   drift from emitted kinds without CI failure.
 - receiver.go: promoted `watchUpdateDivisor` /
   `watchKeepForMultiplier` / `watchUpdateEveryMinimum` constants
   for the previously-magic DCGM watch-cadence ratios.

Documentation + dedup:
 - dcgm README: new "Privacy + data residency considerations"
   subsection (compliance-auditor ask). Flags hw.id / pci.bdf /
   NVLink peer IDs as quasi-identifying; provides two mitigation
   patterns (attr-drop processor, salt-hash pseudonymization).
 - docs/agents/examples/constructor_options.go: `WithTelemetry`
   renamed to `WithSelfTelemetry` to match the real in-tree
   receiver API. M9+ authors copying the example no longer drift.
 - RUNBOOK kind enumeration line restored with both watch and mig
   (dcgm-local kinds) per the last commit's promotion.
 - Repo-root `FOLLOWUPS.md` consolidated into `docs/FOLLOWUPS.md`
   (M8-opportunistic + M8-skipped sections). Single source of
   truth; 17 deferred items pulled forward with falsifiable
   triggers (cgo client landing, M11 sibling-receiver shape,
   operator-report thresholds, file-size triggers).
 - All bare `FOLLOWUPS.md` references updated to
   `docs/FOLLOWUPS.md`.

Honest pushback documented:
 - I disagree with the M8-AGRADE-GAP claim of Operator UX 3.7→4.0.
   The drift findings above are exactly the class of bugs that
   rubric criterion was supposed to prevent — the alerts-vs-RUNBOOK
   parity test existed but didn't check kind values against
   emitted call sites. The new TestRUNBOOK_KindsMatchEmitted
   closes that gap; future operator-UX claims should pin to a test
   like this.
 - Deferred: split receiver.go (475 LOC) into 3 files, hoist
   dcgmtest.BaseClient, SECURITY.md for receiver, dcgm_info
   join-target, libdcgm setup in CONTRIBUTING — all logged with
   triggers. Reviewer 1's "construct receiver via M9-style
   primary-Option" inconsistency goes in the queue for M9-close
   review.

`make ci` passes; dcgm coverage 86.0% (essentially flat — new
tests offset by widened emit signature in test paths).

Assisted-by: Claude Opus 4.7
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Round-3 review (two passes) raised 5 strong findings against the
round-2 fix wave I shipped. This commit closes them AND adds a test gate
per bug class so the same class can't re-ship silently.

N1 — CAS-pair memory-model claim was incorrect:
 - Earlier RECEIVER-PATTERNS entry claimed Start's CAS publishes
   the subsequent `r.cancel = cancel` write via the Go memory
   model. It doesn't — the CAS HB edge only covers writes
   sequenced-BEFORE the CAS. In practice this worked because the
   OTel runtime serializes Start→Shutdown, but that's a runtime
   contract, not memory-model coverage, and the pattern doc
   would have taught M9/M11 authors the wrong invariant.
 - Fix: `r.cancel` is now `atomic.Pointer[context.CancelFunc]`.
   Store in Start, Load in Shutdown. This makes the publish
   memory-model-correct in all contexts (not just OTel-runtime
   ones). Pattern doc rewritten honestly: CAS pairs are for
   *idempotence*; the cancel publish is its own atomic.
 - Gate: `TestReceiver_CancelIsAtomicPointer` parses receiver.go
   via go/ast and refuses any non-atomic.Pointer shape on the
   cancel field. Future refactors that revert to bare
   CancelFunc fail at CI.
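
A sketch of the shape N1 describes, assuming a simplified `receiver` type: the cancel function is published through `atomic.Pointer[context.CancelFunc]`, and the CAS on `running` is kept only for idempotence. The `Start` signature, the `demo` helper, and the use of `Swap(nil)` rather than a plain `Load` are illustrative choices, not the repository's code.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// receiver is a hypothetical, simplified stand-in for the real type.
type receiver struct {
	running atomic.Bool
	cancel  atomic.Pointer[context.CancelFunc]
}

// Start derives a cancellable context and publishes its cancel function
// atomically, so Shutdown can read it safely from any goroutine.
// Calling Start twice without Shutdown is out of scope for this sketch.
func (r *receiver) Start(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	r.cancel.Store(&cancel)
	r.running.Store(true)
	return ctx
}

// Shutdown is idempotent via the CAS; the cancel read is its own atomic.
func (r *receiver) Shutdown() {
	if !r.running.CompareAndSwap(true, false) {
		return
	}
	if c := r.cancel.Swap(nil); c != nil {
		(*c)()
	}
}

// demo reports the context error before and after Shutdown.
func demo() (before, after error) {
	r := &receiver{}
	ctx := r.Start(context.Background())
	before = ctx.Err()
	r.Shutdown()
	r.Shutdown() // second call is a no-op
	return before, ctx.Err()
}

func main() {
	before, after := demo()
	fmt.Println("before:", before, "after:", after)
}
```
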

N2 — Example contradicts its own header:
 - `docs/agents/examples/non_blocking_start.go` used
   `IncError(Kind("panic"))` casts even though the file's header
   claims typos are caught at compile time. `Kind("typoo")`
   compiles fine — defeating the entire point of the typed Kind.
 - Fix: declared per-receiver `const KindConnect Kind = "connect"`
   etc. in the example body; replaced all `Kind("…")` casts with
   the constants.
 - Gate: `TestExamples_NoUntypedKindCasts` walks
   `docs/agents/examples/*.go` and refuses (a) bare string
   literals to IncError AND (b) `Kind("literal")` casts. M9+
   contributors can't accidentally copy the broken shape.
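
The failure mode in N2 is easy to reproduce. In the sketch below, the `telemetry` type and the misspelled kind are hypothetical. The point is that both a `Kind("...")` cast and a bare string literal compile, so only named constants (plus a lint gate like the one described) give typo protection.

```go
package main

import "fmt"

// Kind is a typed error kind, as in the example file the commit discusses.
type Kind string

// Named constants: a misspelled identifier here is a compile error.
const (
	KindConnect Kind = "connect"
	KindWatch   Kind = "watch"
)

// telemetry is a hypothetical stand-in that counts errors per kind.
type telemetry struct {
	errors map[Kind]int
}

func (t *telemetry) IncError(k Kind) { t.errors[k]++ }

// demo records one correct kind and two misspelled ones.
func demo() map[Kind]int {
	t := &telemetry{errors: map[Kind]int{}}
	t.IncError(KindConnect)    // checked by the compiler
	t.IncError(Kind("conect")) // compiles: the cast hides the typo
	t.IncError("conect")       // also compiles: untyped constant converts
	return t.errors
}

func main() {
	for kind, n := range demo() {
		fmt.Printf("%s=%d\n", kind, n)
	}
}
```
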

N3 — Alert #1 still had the for+increase pairing B5 fixed on
alert #2:
 - `DCGMReceiverDegraded` had `for: 5m` paired with
   `increase(...[5m])`, doubling its effective window to ~10m.
   Same bug class as B5; I only fixed one of the two alerts.
 - Fix: dropped `for: 5m` on DCGMReceiverDegraded with the same
   comment explaining the rationale.
 - Gate: `TestPrometheusAlerts_NoDwellDoubling` parses the
   alerts YAML and asserts no rule pairs `increase(...[N])` with
   `for: N` without an explicit allowlist label. The future
   alert author proposing both must opt in deliberately.

N5 — `warnOnce` lost kind-transition breadcrumbs:
 - The previous shape `if r.degraded { return }` suppressed
   ALL warn-level logs after first failure, including a
   different failure kind on the next tick (connect→watch
   transition mid-degraded-cycle). Operators lose the
   breadcrumb trail.
 - Fix: `warnOnce(kind, msg, args...)` keys on
   `(degraded, kind)` — log fresh when the kind changes, even
   if still degraded. Threaded the kind through all 7 callers.
 - Gate: `TestWarnOnce_RelogsOnKindTransition` exercises the
   helper directly: first kind=K1 logs; repeat-K1 silenced;
   kind=K2 logs fresh. The exact behavior an operator cares
   about, pinned by a unit test.
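
A minimal sketch of a `warnOnce` keyed on `(degraded, kind)`, assuming a simplified receiver. The `logged` slice stands in for a real logger, `markHealthy` is a hypothetical recovery hook, and the variadic `args` from the real signature are omitted for brevity.

```go
package main

import "fmt"

type Kind string

// receiver is a hypothetical stand-in; logged replaces a real logger.
type receiver struct {
	degraded bool
	lastKind Kind
	logged   []string
}

// warnOnce logs on the first failure after recovery and again whenever the
// failure kind changes; repeats of the same kind stay silent.
func (r *receiver) warnOnce(kind Kind, msg string) {
	if r.degraded && r.lastKind == kind {
		return
	}
	r.degraded = true
	r.lastKind = kind
	r.logged = append(r.logged, fmt.Sprintf("WARN kind=%s %s", kind, msg))
}

// markHealthy resets the state so the next failure logs again.
func (r *receiver) markHealthy() {
	r.degraded = false
	r.lastKind = ""
}

// demo walks through repeat, transition, and recovery.
func demo() []string {
	r := &receiver{}
	r.warnOnce("connect", "dial failed") // logs
	r.warnOnce("connect", "dial failed") // silenced
	r.warnOnce("watch", "watch failed")  // kind changed: logs
	r.markHealthy()
	r.warnOnce("watch", "watch failed") // first failure after recovery: logs
	return r.logged
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```
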

N4 — K8s manifest in README was broken multiple ways:
 - telemetry default-off → probes fail → CrashLoop on apply
 - "DaemonSet + anti-affinity" was contradictory
 - SYS_ADMIN/hostPID claimed required for standalone mode (not
   needed; only embedded mode needs them)
 - only `/dev/nvidia0` mounted (need nvidiactl + nvidia-uvm +
   per-GPU device files)
 - Fix: section now ships a paired ConfigMap that enables
   telemetry and binds on 0.0.0.0; DaemonSet drops the
   unnecessary privileges; the section is marked
   "illustrative — not production-ready" and explicitly defers
   workload-specific privilege layering to the Helm chart (M6).
 - Gate: `TestReadme_K8sExampleParsesAndEnablesTelemetry`
   extracts the YAML block, parses both docs (ConfigMap +
   DaemonSet), asserts (a) `enabled: true` AND `0.0.0.0` in the
   config, (b) both liveness + readiness probes exist pointing
   at /healthz + /readyz. A future doc author can't ship a
   manifest that would CrashLoop on apply.

Nits:
 - N6: reverted `watchUpdateDivisor` / `watchKeepForMultiplier`
   to untyped consts (the canonical Go shape for unitless
   ratios; typing them as time.Duration was dimensionally
   confused).
 - N9: anchored regex `\b` on the metric-value match in the M2
   wiring test — `} 1` was accidentally matching `} 12` /
   `} 100`.
 - N10: clarified `client_cgo.go` comment that Close() returns
   nil (consistent with stub, but the previous comment misled
   casual readers).
 - Cgo placeholder operator-deception risk: variant string now
   `cgo-placeholder` not `cgo` until the real binding lands.
   `tracecore receivers list` shows `dcgm [cgo-placeholder]`
   so operators on a real GPU host can't deploy a stub binary
   thinking it's the real one. Legend in the receivers-list
   output explains the three values.
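
The N9 regex fix can be demonstrated with Go's standard `regexp` package, which supports `\b` as an ASCII word boundary. The metric lines below are made up for illustration; the actual pattern in the M2 wiring test is not shown in this commit message.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose matches "} 1" anywhere, including inside "} 12" or "} 100".
	loose = regexp.MustCompile(`\} 1`)
	// anchored requires a word boundary after the 1.
	anchored = regexp.MustCompile(`\} 1\b`)
)

func main() {
	// Hypothetical exposition lines.
	lines := []string{
		`example_metric{receiver="dcgm"} 1`,
		`example_metric{receiver="dcgm"} 12`,
		`example_metric{receiver="dcgm"} 100`,
	}
	for _, l := range lines {
		fmt.Printf("%-40s loose=%-5v anchored=%v\n",
			l, loose.MatchString(l), anchored.MatchString(l))
	}
}
```
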

S19 partial (wire build-tags into make ci):
 - `make ci` now depends on `build-tags`. Every `make ci` run
   (local + GitHub Actions) gates on the cgo vs default build
   compiling cleanly. Pre-existing target now actually fires in
   the standard CI surface.

FOLLOWUPS additions (deferred but tracked with trigger predicates):
 - S18 `pkg/dcgm.Probe(…)` library helper — when a second
   external consumer materializes.
 - N7 AST walker resolve-map by reflection — when selftelemetry
   adds a new canonical Kind.
 - N8 AST walker globs *.go non-test — paired with the
   receiver.go split FOLLOWUP.
 - Promote `make build-tags` into the pr-validation shortcut
   workflow — opportunistic next CI sweep.

`make ci` passes; dcgm coverage steady at 86.0%; the build-tag
matrix is now part of every CI run.

Assisted-by: Claude Opus 4.7
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Four R1 findings folded into one commit (docs/CI surface).

#1 — README config table missed the top-level `enabled *bool`
kill-switch. Added the row at the top of the table with its
nil-means-active semantics so operators can grep the table for
the field and find it (config.go:27 has been there since the
initial M9 work; the README just didn't surface it).

#2 — README forward-reference to "the container realities
section above" pointed at nothing. Added the actual section
("Container realities") with four operator-actionable bullets:
mount the host /dev/kmsg (not the empty pod-local one),
CAP_SYSLOG instead of root, multi-tenant blast-radius warning,
and the namespaced-kmsg 5.10+ posture. Section anchors a
follow-on ready-to-paste DaemonSet manifest (see commit F).
TOC updated; threat-model table now links by anchor instead of
prose.

R1.S3 — alert-check.sh regex too narrow. The previous regex
required a suffix in {Receiver,Source,Pipeline,Exporter,Processor}
and would miss future alerts named after a domain (e.g.
`KernelEventsXidBurst`). Broadening to "any TitleCase identifier
≥12 chars" produced false positives (Go identifiers like
`OTLPRoundTrip`, `AmbientCapabilities`). Final shape: drop
direction-2 lexicon-based extraction entirely, keep only
direction-1 (alerts-yaml is source of truth → MUST appear in
the runbook). Direction-2 ("stale runbook reference to a
deleted alert") is rare and self-revealing (the alert just
doesn't fire), so the cost of false positives outweighs the
benefit of catching it pre-merge.

#7 — RUNBOOK preamble for receiver-local error kinds. The C
commit already added the per-kind triage section; this commit
ties it into the error-message index and explicitly states the
"why no page alert" rationale so a reviewer doesn't ask the
question again.

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
The previous gate exited at the sha256 mismatch, which left no
diagnostic trail for triaging which bytes diverged between Build #1
and Build #2. Inverting the control flow: run diffoscope on a
mismatch, capture its text report, then exit non-zero. On a match,
run diffoscope --exit-code as the load-bearing assertion. Either
way diffoscope output ends up in the job log.

Also upload both binaries as a "failed-build-pair" artifact when the
job fails — needed for offline triage when the on-runner diff isn't
enough (e.g. comparing across two failed runs).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Diffoscope on test tag v0.0.0-m3test-2 surfaced the actual delta: two
runtime/debug.BuildInfo entries differed across builds — vcs.modified
flipped from false to true, and the +dirty suffix appeared in the
embedded module version. Cascading: that fed a different action-ID into
the Go linker, which changed NT_GNU_BUILD_ID, which changed the file
hash.

Root cause: Build #1 created build1/ inside the worktree and moved
the binary into it. By the time Build #2 ran `go build`, the worktree
contained untracked files (build1/tracecore_linux_amd64 + .sha256), so
`git status --porcelain` was non-empty. `go build -buildvcs=true`
(default) reads that and sets vcs.modified=true for Build #2.

Fix: build each iteration into `mktemp -d` outside the source tree.
The worktree stays clean; Go's VCS probe sees identical state on both
runs; build IDs match; binaries match. The canonical artifact is then
staged from BUILD1_DIR into ./release/ for the rest of the workflow.
Failure-triage upload still grabs both builds when the gate trips.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
…rage

Four parallel reviews landed seven actionable changes:

- Cold rebuild: both builds now use isolated $(mktemp -d) GOCACHE dirs
  so build #2 can't pass by replaying build #1's cached object files.
  The assertion we want is cold-vs-cold byte-equality — which is what a
  third party with a fresh checkout reproduces.
- Cosign cert-identity-regexp tightened to pin this exact workflow file
  on a tag-ref. The previous `^https://github.com/<repo>/` regex would
  have accepted a Sigstore bundle minted by any workflow on any branch
  in the same repo; the new pattern rejects sibling workflows.
- SBOM coverage gate now walks every `Indirect != true` entry in
  go.mod and asserts a matching `pkg:golang/<path>@…` purl exists in
  the CycloneDX components[]. M3's "covers every module" rubric and
  M21's "≥1 component per direct module" rubric now have a falsifiable
  check; the previous `components ≥ 1` gate was a placeholder.
- Recipe step 6 switched from `slsa-verifier verify-artifact` (legacy
  slsa-github-generator format) to `gh attestation verify` (the
  reference verifier for actions/attest-build-provenance's Sigstore
  bundle output). slsa-verifier ≥ 2.7.0 with `verify-github-attestation`
  is documented as the alternate path; earlier versions don't parse
  Bundle v0.3 and would have failed silently or noisily.
- Recipe step 4 dropped `--exit-code` to match the CI fix; step 5
  inherits the tightened cert-identity-regexp; the diffoscope-failure
  diagnostic row points at Go-toolchain drift (the actual common
  cause) rather than "compiler upgrade or -trimpath regression".
- CHANGELOG entry added under [Unreleased] / Added; MILESTONES.md M3
  flipped from ☐ to ⧗ with a flip-to-☑-on-merge note; top-level
  README.md routing table grew a row for auditors / supply-chain
  verifiers pointing at docs/reproducibility.md.
- Dropped two unused job-level outputs (source_date_epoch, build_date)
  that no downstream job consumed; removed a vestigial `make clean`
  between builds (does nothing when artifacts live in mktemp dirs).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 18, 2026
Ratify the current posture as a permanent stance: the tracecore
binary contains no in-binary self-update mechanism, no background
fetcher, no remote control plane. Operators pull releases via their
existing delivery tooling (Flux / Argo CD / RenovateBot / kubectl
set image); the trust root is the operator's, not ours.

RFC-0008 at Status: accepted, covering:
- which component classes may auto-update (none, in-binary)
- the supported update path (operator-pulled artifacts with
  cosign / SBOM / SLSA verification on the operator side)
- what the collector commits to (immutable digests, lockstep
  appVersion / binary, no mid-version mutation)
- what it explicitly does not commit to (remote channel,
  phoning-home, vendored update library)
- five rejected alternatives with one-sentence rationale each
- a CI grep gate enforcing the no-fetcher invariant

Adjacent changes in the same PR (per M23 rubrics):
- NORTHSTARS Open Question #2 closed; pointer to RFC-0008
- scripts/no-autoupdate-check.sh wired into `make ci` to fail
  build on `go-update` / `self-update` / `auto-update` / `AutoUpdate`
  / `UpdateCheck` / `FetchLatest` identifiers under cmd|components|
  internal
- install/kubernetes/tracecore/README.md § "Upgrade posture" points
  operators at RFC-0008 for the contract
- MILESTONES.md M23 flipped to ☑ with per-rubric ☑ prefixes
  (matches the convention adopted in PR #53)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
Phase 4 of 5-phase rigorous review. Two A+ aspiration reviewers
graded independently against the M23 rubric.

## Grades

- Reviewer 1: A. "Exemplary M-milestone work; the 5-phase review cycle
  caught and fixed real issues (case-insensitivity, scope coverage,
  rollback verification)."
- Reviewer 2: A. "Comprehensive, falsifiable RFC that closes
  NORTHSTARS OQ #2 with three load-bearing enforcement gates."

Synthesized grade: **A**. Both reviewers explicitly state the PR is
mergeable at A and that A+ criteria are optional polish, not blocking
on the M23 rubric.

## A+ criteria proposed and triaged

| ID    | Proposed by    | Criterion                                                                | Cost          | Action                                                     |
|-------|----------------|--------------------------------------------------------------------------|---------------|------------------------------------------------------------|
| P4.1  | aplus-1        | `make verify-rfc-claims` target — RFC Commitments → CI gate dependency map | TASTE-CALL    | explicitly-skipped (RFC body already documents the gates) |
| P4.2  | aplus-1        | Stable / parseable grep gate output format for automation                 | FUTURE-WORK   | deferred — no automation consumer today; revisit at v1.0   |
| P4.3  | aplus-1        | FOLLOWUPS entry gating removal of `no-autoupdate-check.sh`                | TASTE-CALL    | explicitly-skipped (RFC § Migration / rollout owns the bar) |
| P4.4  | aplus-2        | Operator CVE response time SLA (≤30 min patch-to-production)              | TASTE-CALL    | deferred — quantifying requires timing measurements; chart README already documents the commands |
| P4.5  | aplus-2        | Explicit false-positive override path (anchor comment / allow-list)       | LOAD-BEARING-IF-NEEDED | deferred — no false positive observed today; `_test.go` exclusion handles main case; revisit on first false-positive incident |
| P4.6  | aplus-2        | Audit trail for depguard rule additions (cite vendor + rationale in PR)   | FUTURE-WORK   | deferred — operational discipline; capture as MEMORY rule if pattern recurs |

## Validation cycle for each criterion

For each proposed criterion, I asked: does it survive contradict?
i.e., is there a *concrete* reproducer where this criterion's absence
causes a measurable failure today?

- P4.1: no — manual inspection currently sufficient; no recurring drift
- P4.2: no — no machine consumer today
- P4.3: no — RFC body adequately documents the bar; FOLLOWUPS duplicate would rot
- P4.4: no — chart README documents the path; SLA quantification needs measurement
- P4.5: no — no false-positive incident observed; depguard catches by import path independently
- P4.6: no — depguard list rarely changes; vendor-citation discipline is a soft norm

None survived contradict to load-bearing. All deferred or skipped.

## Edge-case hunt for phase 4 (≥1 required)

What if `--exclude='*_test.go'` were removed? Many existing test files
(in this repo and others) mention these identifiers as negative-test
fixtures. The existing `test-file-excluded` regression test already
covers this — mutation-verified in phase 1. Edge case handled.

## Rubric additions promoted to .claude/ralph-loop.local.md

None. All A+ criteria are deferred or skipped.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
)

## Summary

Files RFC-0008 at `Status: accepted`, ratifying tracecore's current
posture as a permanent stance: the binary contains no in-binary
self-update mechanism, no background fetcher, no remote update channel.
Operators pull releases via their existing delivery tooling — Flux, Argo
CD, RenovateBot, `kubectl set image` from CI — and the cryptographic
trust root (cosign keyless verification, SBOM, SLSA v1.0 Build L1
provenance from M3) is theirs, not ours.

Closes NORTHSTARS § "Open questions tracked as RFCs" entry 2
("Auto-update boundary").

## What this PR changes

- **New RFC:** `docs/rfcs/0008-auto-update-boundary.md` (Status:
accepted) — concrete proposal across receiver / processor / exporter /
runtime / binary classes; five rejected alternatives with one-sentence
rationale each; risks led by RFC-number-collision per `STYLE-docs.md`
§3; crosslinks to PRINCIPLES §1 §2 §6 §11 to show the boundary does not
weaken any of them.
- **NORTHSTARS.md:** Open Question #2 closed; replaced with pointer to
RFC-0008 + supersession bar ("a production-operator ask that
operator-side delivery automation cannot serve").
- **CI grep gate:** `scripts/no-autoupdate-check.sh` greps `cmd/
components/ internal/` for banned identifiers (`go-update`,
`self-update`, `auto-update`, `AutoUpdate`, `UpdateCheck`,
`FetchLatest`); wired into `make ci`. Run locally: green.
- **Chart README:** `install/kubernetes/tracecore/README.md` adds an
"Upgrade posture" subsection under § Upgrade pointing operators at
RFC-0008 for the contract.
- **MILESTONES.md:** M23 flipped `☐` → `☑ delivered`; every functional +
non-functional rubric bullet carries `☑` (rubric-preservation convention
adopted in PR #53).

## Why

The "default off until a real ask appears" stance was a placeholder.
Operators in this segment already run delivery pipelines with
cryptographic provenance gates they control. Replicating that machinery
inside a workload-adjacent collector duplicates an existing strength,
badly. PRINCIPLES §2 ("Reversibility before optionality") settles the
trade: prefer no mechanism over an off-by-default mechanism, because an
off-by-default fetcher still has to exist in the binary, and an opt-out
flag is a frequent supply-chain accident.

## Test plan

- [x] `bash scripts/no-autoupdate-check.sh` exits 0 on this branch
- [x] `bash scripts/doc-check.sh` passes — link integrity green,
unverified-marker count stable
- [ ] RFC renders correctly on GitHub
- [ ] CI green (`make ci` includes both gates above + license-check +
lint + build)

## Note on PR ordering

The MILESTONES.md edit here uses the per-rubric `☑` convention
introduced in PR #53. If PR #53 lands first, this merges clean. If this
merges first, PR #53's "How to read" updates remain compatible — the
convention reads correctly with or without the preamble already in
place.

🤖 Generated with [Claude Code](https://claude.com/claude-code)


```release-notes
NONE
```

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…pe; reject framing of bench-correction as regression

Phase-3 adversarial deep review (2 fresh subagents, independent of
the 8 lens reviews). The author's completion claim was treated as
a hypothesis to falsify.

Adversarial #1: APPROVED, no falsifiable findings.

Adversarial #2: returned CONCERNS-REQUIRE-FIX with two findings.
After the validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P3.1 | adversarial-2 | repo-long-term | BLOCKER → DEFER | "k8sevents BenchmarkEmitOne allocs jumped 21→28; not gated by bench-check." | Read Makefile:40-44 — bench-check is scoped to ./internal/telemetry/. Confirmed k8sevents has no baseline. | The 21→28 jump is the WHOLE POINT of group F: the previous bench reused one plog chain across iters and under-reported production cost. `git diff origin/main...HEAD -- components/receivers/k8sevents/receiver.go components/receivers/k8sevents/emit.go` shows production allocation paths in r.emit are unchanged from main; only the bench measurement shape changed. Reviewer conflated bench-output change with production regression. | n/a — no production change to test | no — finding rejected as framed, but underlying observation kept | deferred FOLLOWUPS.md (Component-level benchmarks ungated by `make bench-check`) |
| P3.2 | adversarial-2 | repo-long-term | NIT | Missing explicit symlink-to-directory test for kubeconfig path. | A new TestConfig_RejectsSymlinkToDirectoryAsKubeconfigPath would pass without code change. | Reviewer themselves note "would pass with the current code." TestConfig_RejectsDirectoryAsKubeconfigPath already exercises the IsDir() path; symlinks go through the same code (os.Stat follows symlinks intentionally). No unique coverage added. | n/a | no | explicitly-skipped (taste-call; redundant coverage) |

Reproducibility:
  $ grep -n "components" Makefile | grep bench   # only internal/telemetry covered
  $ git diff origin/main...HEAD -- components/receivers/k8sevents/receiver.go components/receivers/k8sevents/emit.go   # zero production-allocation changes

Validation-cycle stats:
  Findings rejected during contradict (framing of BLOCKER as regression): 1
  Findings that survived as DEFERRED to FOLLOWUPS:                        1
  Findings explicitly-skipped (taste-call):                               1

Beneficiary: repo-long-term. The underlying gap (component benches
ungated) is real and worth a follow-up; the immediate framing as
a regression in this PR is not.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…l ordering rationale

Phase-4 A+ aspiration review (2 fresh subagents). Reviewer #1 graded
B+ with 7 documentation-of-already-true-invariants criteria;
reviewer #2 graded A with 3 falsifiable proposals. Two surviving
load-bearing criteria after validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 | aplus-2 | repo-long-term | CONCERN | populateAttributes / attrPutter cap check (`attrs.Len() >= maxAttrs`) is exercised only at production maxAttrs floor (9). The exported BuildLogRecordForBench helper can be called with arbitrary values; a future refactor flipping `>=` to `>` would silently allow one attribute through at maxAttrs=0 and slip past every existing test. | TestBuildLogRecord_BoundaryMaxAttrs covers maxAttrs=0 and maxAttrs=-1; mutation-verified red→green: changing `>=` to `>` in attrPutter.putStr/putInt fails the maxAttrs=0 subtest, then restoration passes. | Production Validate floors maxAttrs at 9 (TestConfig_RejectsTooLowMaxAttributes pins this). But internal callers (bench, future refactor) can bypass Validate. | red (mutation) → green → mutation-verify recorded in this commit | yes — P4-aplus-2 in .claude/ralph-loop.local.md | applied this commit |
| P4.2 | aplus-2 | repo-long-term | NIT | validateKubeconfigPath ordering rationale lives only in the Phase-1 commit body and FOLLOWUPS closure; a future maintainer reordering Validate's pipeline would break TestConfig_AmbiguousAuth_* tests without warning at the call site. | Added the rationale to the validateKubeconfigPath docstring (source-level). | n/a — comment-only; existing tests catch a bad reorder regardless. | n/a | no | applied this commit (config.go) |

Rejected/deferred:

- P4.3 (aplus-1 #1) — "Bench allocs/op ≤30 threshold gate." Already
  covered by Phase-3 deferred FOLLOWUPS entry on component-bench
  scope. DEFER (duplicate).
- P4.4 (aplus-2 #2) — Cross-receiver SchemaURL pattern lint. Out of
  scope; trigger is third in-tree schema URL. DEFER to FOLLOWUPS.
- P4.5 (aplus-1 #2-7) — Document already-met invariants. Per
  feedback_anti_bureaucracy, criteria that document truths without a
  falsifiable hook are bloat. REJECT.

Reproducibility:
  $ go test -run TestBuildLogRecord_BoundaryMaxAttrs -v ./components/receivers/k8sevents/   # passes
  $ sed -i.bak 's/a.attrs.Len() >= a.maxAttrs/a.attrs.Len() > a.maxAttrs/g' components/receivers/k8sevents/emit.go && \
    go test -run TestBuildLogRecord_BoundaryMaxAttrs/maxAttrs=0 -v ./components/receivers/k8sevents/   # fails
  $ mv components/receivers/k8sevents/emit.go.bak components/receivers/k8sevents/emit.go   # restore

Letter-grade outcome:
  Reviewer #1 starting grade: B+ → target A+ via documentation
  Reviewer #2 starting grade: A → target A+ via P4.1 + P4.2
  After this commit: A+ on the falsifiable axis (every C1-C6 + F
  change has a mutation-catching test; the boundary cap is now
  explicitly pinned; ordering rationale lives at source).

Beneficiary: repo-long-term. Falsifiable tests survive refactors;
documentation-of-truths does not.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…+ threat-root trace on go-mod-verify

Phase-4 A+ aspiration review (2 fresh subagents; both graded A,
diverged on which gates to apply). Validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 (aplus-1 #2, also P2.6) | aplus | operator | CONCERN | A workflow_dispatch run with `inputs.tag` set but `github.ref` ≠ refs/tags/$INPUT_TAG passes Build and fails the OIDC smoke check 15-30 minutes later. Operator wastes runner time and sees the misuse late. | New "Verify dispatch ref matches tag (pre-flight)" step exit-1s within seconds with the documented workaround. | Reviewer noted the smoke check already enforces this — but at job-end, not at job-start. Fail-fast IS the load-bearing property. | n/a — workflow YAML, actionlint clean | yes — P4-aplus-1 | applied this commit; closes P2.6 deferral. |
| P4.4 (aplus-2 #2) | aplus | repo-long-term | NIT | go-mod-verify comment says "defense in depth against a compromised GOPROXY mirror" but doesn't name the trust root or the orthogonal threat (a poisoned go.sum itself). | Comment now states "Trust root: the go.sum at this tag commit" and cross-references the tag-protection FOLLOWUPS entry. | A future maintainer might over-attribute the protection. | n/a | no | applied this commit |

Rejected/deferred:

- P4.2 (aplus-1 #4) — Structured diff lint for release.yml ↔
  docs/reproducibility.md. DEFER to FOLLOWUPS.md (real value, but
  manual review caught both drift directions in Phases 2 + 3;
  automate when next edit happens).
- P4.3 (aplus-1 #6) — Release artifact manifest validation before
  upload. REJECT. Per anti-bureaucracy: reviewer concedes `needs:`
  dependency already gates malformed artifacts from reaching the
  release job. Adding defensive validation against a CI-bug
  scenario is bloat.
- P4.5 (aplus-1 #3) — docs/SUPPLY-CHAIN-IDENTITY.md consolidated
  reference. DEFER to FOLLOWUPS.md; ~30-min write, scope creep
  beyond release.yml. M21 release-checklist is the natural trigger.
- P4.6 (aplus-1 #5, aplus-2 #3) — Formal threat-model document +
  M21 alignment narrative. DEFER to M21.
- P4.7 (aplus-2 #5) — Cross-link health lint. Duplicate of P4.2;
  same deferral.

Reproducibility:
  $ make actionlint zizmor   # exit 0
  $ grep -A1 "workflow_dispatch with inputs.tag" .github/workflows/release.yml
    # pre-flight gate present

Letter-grade outcome:
  Reviewer #1 starting: A → A+ via criteria 2, 4, 6 (we applied 2 + threat-model comment)
  Reviewer #2 starting: A → APPROVED-AS-IS (already strong)
  After this commit: A on the falsifiable axis (one operator-UX gate
  + one comment clarification), with the broader doc/lint work
  scoped to follow-ups.

Beneficiary: operator. The pre-flight gate cites a specific
operator-facing surface (15-30 minute waste on workflow_dispatch
misuse) and turns it into a seconds-fast named error.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…ask + gh attestation verify (#69)

## Summary

Release-pipeline supply-chain hardening + a workflow_dispatch pre-flight
gate. No operator-visible release-artifact shape change; the gates fail
loudly at tag-push time before any artifact is signed and published.

**Hardening:**
- `go mod download && go mod verify` step before the reproducible-build
pair. Catches a poisoned GOPROXY mirror returning module bytes that
don't match `go.sum`. Trust root: the `go.sum` at the tag commit; a
poisoned `go.sum` itself is tracked separately under M3 tag-protection.
- `LC_ALL=C` + `TZ=UTC` env + `umask 022` inside the run script of both
Build #1 and Build #2. Canonical reproducible-builds.org stanza; today's
`-trimpath`+`SOURCE_DATE_EPOCH` carry the load for Go output, but the
stanza is cheap insurance against future cgo or non-Go release
artifacts.
- New "Smoke-check `gh attestation verify`" step in the provenance job.
Local-bundle mode (offline trust chain — cert + SCT + Rekor proof are
embedded). Flag set matches `docs/reproducibility.md` step 6:
`--signer-workflow` + `--predicate-type` + `--repo` + `--source-ref` +
`--source-digest`. Pins the OIDC subject path so a different workflow in
the repo with `attestations: write` cannot satisfy it; pins the source
claims so an attestation from a non-tag dispatch is refused.
- `docs/reproducibility.md` step 6 tightened from `--owner` (org-wide)
to `--repo` (org/repo). Adopters following the documented walkthrough
now exercise the same scope CI enforces.
- New "Verify dispatch ref matches tag" pre-flight step. On
`workflow_dispatch` with `inputs.tag` set, asserts `github.ref ==
refs/tags/$INPUT_TAG` and fails fast with the named workaround. Saves
15-30 minutes of runner time on misuse.
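
The comparison behind the pre-flight gate can be sketched as follows. This is a Go illustration only: the real step is a shell test inside `release.yml`, and the function name `checkDispatchRef`, the `INPUT_TAG` environment variable wiring, and the error wording are assumptions rather than the workflow's actual contents.

```go
package main

import (
	"fmt"
	"os"
)

// checkDispatchRef mirrors the pre-flight comparison described above:
// when a tag input is supplied, the run's ref must be refs/tags/<tag>.
// The function name and error text are illustrative, not the workflow's.
func checkDispatchRef(ref, inputTag string) error {
	if inputTag == "" {
		return nil // no tag input: nothing to verify
	}
	want := "refs/tags/" + inputTag
	if ref != want {
		return fmt.Errorf("ref %q does not match %q", ref, want)
	}
	return nil
}

func main() {
	// GITHUB_REF is set by the Actions runner; INPUT_TAG is assumed to be
	// exported by the step from inputs.tag.
	if err := checkDispatchRef(os.Getenv("GITHUB_REF"), os.Getenv("INPUT_TAG")); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

Running the check as the first step is what moves the failure from job-end to job-start; the comparison itself is the same one the later smoke check enforces.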

**FOLLOWUPS hygiene:**
Closed five rows: `go mod verify`, build-env sanitization,
cosign+gh-attestation flag tightening (cosign half had already shipped),
Rekor log-index URL (already shipped), and workflow_dispatch pre-flight
gate.

Opened three rows: flag-parity lint between release.yml and
reproducibility.md; consolidated `docs/SUPPLY-CHAIN-IDENTITY.md`
reference; component-bench gating scope (tracked from the parallel
k8sevents review).

## Verification

- `make actionlint zizmor` clean on the head commit (zizmor: 0
findings).
- `gh attestation verify --bundle` + `--repo` + `--source-ref` +
`--source-digest` combination verified end-to-end against a public
sigstore bundle (`github/codeql-action v2.25.4`); gh CLI source maps the
flags to Fulcio cert OIDs 1.3.6.1.4.1.57264.1.14 / .13, populated from
OIDC `ref` / `sha` claims at sign time.
- Pre-flight gate is a stand-alone shell test; it exits 1 with a clear
error and the named workaround when `github.ref` and `inputs.tag`
disagree.

## Test plan

- [ ] PR CI green on the head commit.
- [ ] Next real release tag (M21) exercises all four new gates
end-to-end against a real Sigstore bundle.
- [ ] If `gh attestation verify --bundle` rejects the flag combination
at release time, the failure is loud (job fails) and the fix is a
one-line follow-up.

```release-notes
Tightened release-workflow supply chain: defensive `go mod verify`, canonical LC_ALL / TZ / umask reproducible-build stanza, and a local-bundle `gh attestation verify` smoke check pinned to the source tag + commit SHA and the signing workflow. `docs/reproducibility.md` now uses `--repo` so adopter verification matches CI strictness. Workflow_dispatch with `inputs.tag` fails fast if the ref doesn't match. Operator-visible release shape unchanged.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Eight 1-page pattern-design specs covering #2 IB link flap, #7
dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout,
#10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss
spike NaN, #13 silent data corruption. Each carries the standard
detector-design shape (symptom, layers, signal sources, evaluation
rule, verdict attrs, edge cases, status, open questions) so the next
contributor can write a TDD red test directly off the spec.

Status: all 8 marked planned. #10 already has issue #303; the spec
frames the design alongside.

NORTHSTARS Appendix A gains a Spec column; docs/README + patterns
README link the new specs.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Pattern #2 — InfiniBand link flap — per NORTHSTARS Appendix A row #2
and the design spec at docs/patterns/02-ib-link-flap.md.

Detector evaluation rule
- bucket IB port-state transitions by (node, HCA, port) within
  CorrelationWindow (default 2min)
- fire when transitions >= MinTransitions (default 2)
- promote Confidence to full when a stuck NCCL FR cohort
  (>= MinHangingRanks ranks, non-completed-state) lands on the
  same node within the same window; otherwise partial

Cross-rank correlation primitive
- groupStuckNCCLByNode lifted as an inline helper inside the
  ib_link_flap detector; same shape will recur in pattern #7
  (dataloader-hang) and #9 (nccl-bootstrap-timeout). Refactor to a
  shared module follows in the next commit.

Wiring
- NCCLFRRecord.Node added so the cross-rank correlation can join on
  node identity (k8sattributes resource attr); existing nccl_hang
  detector ignores it (collective-scoped, not node-scoped)
- projectIBPortStateRecord reads hw.network.ib.port.state +
  hw.network.ib.device + hw.network.ib.port.num — the customer-stable
  namespace declared in docs/patterns/02-ib-link-flap.md
- appendIBLinkFlapVerdict promotes (k8s.node.name,
  hw.network.ib.device, hw.network.ib.port.num,
  tracecore.alert.ib_link_flap.transition_count,
  nccl.fr.collective_seq_id) per the issue #270 scalar-promotion
  contract; pattern.confidence is full|partial
- Config gains ib_link_flap_window + ib_link_flap_min_transitions
  with Validate floors (>=1s, >=2)

Tests
- 8 library tests (ib_link_flap_test.go): full correlation, partial
  on IB-alone, single-transition no-fire, transitions-outside-window
  no-fire, different-ports-do-not-combine, NCCL-on-different-node
  does not join, configurable transition threshold, deterministic
  ordering
- 5 processor tests (ib_link_flap_test.go): full verdict + promoted
  scalars, partial on IB-alone, partial-suppressed toggle, window
  validation floor, min-transitions validation floor

Cross-link to spec: docs/patterns/02-ib-link-flap.md (authored in
parallel; lands first or same-PR).

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ts (#338)

## Summary

15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing
horizon backlog. 31 commits, 81 files, +8650/-180.

**Code (7 detectors / features):**
- `feat(iblinkflap)` pattern #2 IB link flap detector — 13 tests,
cross-rank helper extracted for reuse by patterns #7/#9
- `feat(cudaoom)` pattern #10 CUDA OOM detector +
fragmentation-vs-true-OOM discriminator — 35 tests, 0/6 false-positive
rate on fixture corpus (#303 wiring — recipe gap tracked at #337)
- `feat(verdict)` deprecate EvictedPod, co-emit PodName + PodNamespace
(#277) with regression-pinning test
- `feat(chart)` opt-in default-deny NetworkPolicy + cert-manager mTLS
reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt
UX warnings for empty-egress / cross-ns scraper traps
- `feat(bench)` per-detector allocs/event harness + soft ratchet gate,
graduation criterion documented (#302)
- `feat(patterndetector)` verdict counter metric for dashboard panels
(#261)
- `fix(slo-rules)` correct otelcol_* label set + drop silent-no-op
`unless on (instance)` join (#298)

**8 pattern design specs (`docs/patterns/{02,07-13}-*.md`):**
- Per pattern: symptom, layers crossed, signal sources, detector
evaluation rule, verdict attrs, edge cases, open questions.
- 7 load-bearing spec gaps flagged for future TDD red-test work
(multi-vendor SDC signal, cohort grouping, processor metrics path, etc).

**9 v1.0-rc1 audit / knowledge-gap docs:**
- `docs/v1-rc1-cut-criteria.md` — 12 falsifiable cut gates derived from
O1-O7
- `docs/v1-rc1-operational-gaps.md` — SLSA L3 + air-gap +
upgrade-rollback audit (8 issues filed #314-#321)
- `docs/v1-rc1-governance-gaps.md` — CODEOWNERS 0%, lint-principles
4/16, retros, `make ci` 148s (5 issues #322-#325, #327)
- `docs/v1-rc1-test-audit.md` — 82.9% coverage, fuzz harness inventory
(5 issues #328-#332)
- `docs/v1-rc1-simplification-audit.md` — top deletion candidates ~9.6K
LOC (3 issues #333-#335)
- `docs/threat-model.md` — STRIDE per trust boundary + audit RFP scope
(#336)
- `docs/reference-environments.md` — Tier 1 kind + Tier 2 32×H100
binding spec for O2 hero KPI
- `docs/adoption-pipeline.md` — S0-S3 funnel + comms templates for O5
hero KPI
- `docs/standards-roadmap.md` — 10 `gen_ai.training.*` attributes
proposed upstream (#326)

**Doc-drift cleanup:** 11 issues closed (#265, #268, #269, #276, #283,
#287, #292-295, #299).

**OTTL recipe wiring:** 6 issues closed (#260, #261, #273, #282, #284,
#285); #272 deferred to standards-roadmap.

**Multi-cluster auth:** bearer-token + mTLS examples (#297).

**Merge resolution + reviewer fixes:**
- Resolved 5 conflicts post-PR #310/#312/#313 (factory.go delete,
VerdictAttr* unexport, MILESTONES.md → docs/, FOLLOWUPS, patterns
README)
- Adversarial reviewer found 1 BLOCKER + 6 MAJOR; all addressed before
push:
  - Renamed 16 `VerdictAttr*` → `verdictAttr*` per #310 convention
  - Re-ported selftel wiring (#261) into main's merged `createLogs`
- Fixed case-mismatch `docs/THREAT-MODEL.md` → `docs/threat-model.md`
(Linux CI is case-sensitive)
- 8 pattern specs schema drift: `pattern.id` slug → numeric (`"2"`,
`"7"`...`"13"`), `pattern.confidence` `high` → `full`
- `02-ib-link-flap.md` attribute drift: spec said
`tracecore.alert.ib_link_flap.{hca_device,port}`, code emits
`hw.network.ib.{device,port.num}`
- `v1-rc1-cut-criteria` criterion #1 status stale-on-arrival ("6
patterns shipped" → "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warns when `enabled=true` with empty
`allowedEgressEndpoints` (silently kills OTLP) or cross-ns Prometheus
- Filed #337 for missing OTTL recipe projecting `DCGM_FI_DEV_FB_*` →
`hw.gpu.memory.{free,total}` (the CUDA OOM detector consumes these
attributes, but no recipe emits them yet)
- Post-merge stale-relative-path sweep: 6 wave docs + NORTHSTARS.md +
MILESTONES.md (`docs/`, `../`, `docs/docs/` drift after MILESTONES +
NORTHSTARS moved to docs/)
- Documented 5 newly-emitted attributes in ATTRIBUTES.md (drop_ratio +
IB tier — `attribute-namespace-check` now 67/67)

## Test plan

- [x] `go test ./module/processor/patterndetectorprocessor/...
./module/pkg/patterns/...` — ok
- [x] `make lint` (golangci-lint via goreleaser-style gate) — 0 issues
- [x] `go vet ./...` — clean
- [x] `make doc-check` — passes after stale-link sweep
- [x] `scripts/attribute-namespace-check.sh` — 67/67 documented
- [x] `helm lint install/kubernetes/tracecore` — 0 chart(s) failed
- [x] `promtool check rules` on slo-rules.yaml — 13 rules / SUCCESS
- [ ] CI compat-matrix (rc1 criterion #6) — gated on next wave
- [ ] manual smoke install on real cluster — owner clearance pending

```release-notes
Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM
fragmentation-vs-true discriminator), 8 pattern design specs for the
remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy
+ Prometheus Operator ServiceMonitor on the Helm chart, the
EvictedPod → PodName/PodNamespace verdict-attribute deprecation
co-emit, per-detector allocs/event bench harness, SLO-rules label
fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps,
governance gaps, test audit, simplification audit, threat model,
reference envs, adoption pipeline, standards roadmap).
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but
the join shape varies across patterns and the rationale lived only
in code comments + PR review threads. Operators tuning windows had to
read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the
causal physics of each signal), not bugs:

- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start
  checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window
  could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the
  bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window
  anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing
every detector to the asymmetric two-knob form would silently
zero one leg for the one-sided detectors (footgun on clock skew)
and would not apply to #13 at all.

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13
  (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md
  covering ALL 11 detectors with the predicate, anchor, and shape
  rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #300 — adds the operator-facing walkthrough for NORTHSTARS
Appendix A pattern #2 (InfiniBand link flap). The pattern-2 detector
library + processor wiring landed earlier
(`module/pkg/patterns/ib_link_flap.go`,
`module/processor/patterndetectorprocessor/ib_link_flap.go`), but only
the engineering-facing design spec (`docs/patterns/02-ib-link-flap.md`)
existed. Operators hitting an IB-link-flap incident had no walkthrough
analogous to pattern-1/3/4/5. This PR adds that walkthrough and fixes a
small wire-type doc bug surfaced while cross-checking attribute names
against the projector.

## Files

- `docs/patterns/pattern-2-ib-link-flap.md` (new, 1418 words) — Symptom
/ Why node_exporter sees it / Receiver-emitted signal / PromQL / Alert /
Escalation / Replay / Detector status / Verdict shape / Integration gap.
Mirrors the sibling-pattern structure exactly.
- `docs/patterns/README.md` — pattern #2 added to the "operator
walkthroughs (shipped)" table; design-spec row flipped from `☐ planned`
to `☑ shipped` with a forward link to the new walkthrough; count copy
updated (four → five).
- `docs/patterns/02-ib-link-flap.md` — status banner flipped from `☐
planned (no detector implementation yet)` to `☑ shipped` since the
detector library + wiring landed; cross-link to the new operator
walkthrough.
- `docs/ATTRIBUTES.md` — `hw.network.ib.port.state` row corrected from
`string` ("ACTIVE"/"DOWN") to `int` (IBA-spec phys_state ID:
`1=Down`/`2=Init`/`3=Armed`/`4=Active`). The projector at
`module/processor/patterndetectorprocessor/ib_link_flap.go:30` reads it
with `state.Int()`; the wiring test stamps it with `a.PutInt(...)`. The
previous doc claim was a wire-type bug.

## Claims verified against detector code

Every load-bearing fact in the walkthrough was grepped against the
in-tree detector + tests:

- Attribute names `hw.network.ib.port.state` / `hw.network.ib.device` /
`hw.network.ib.port.num` — verified in the projector
(`module/processor/patterndetectorprocessor/ib_link_flap.go`).
- Defaults `ib_link_flap_window=2m`, `ib_link_flap_min_transitions=2` —
verified in `module/processor/patterndetectorprocessor/config.go`
(`DefaultIBLinkFlapWindow`, `DefaultIBLinkFlapMinTransitions`) and the
example config.
- Validate floors (window ≥ 1s, min_transitions ≥ 2) — verified in
`config.go` `Validate()`.
- Promoted scalars (`k8s.node.name`, `hw.network.ib.device`,
`hw.network.ib.port.num`,
`tracecore.alert.ib_link_flap.transition_count`,
`nccl.fr.collective_seq_id`, `pattern.confidence`) — verified in
`extractIBLinkFlapPromotedAttrs` in `ib_link_flap_test.go`.
- IBA phys_state ID values (1/2/3/4) — verified against
`patterns.IBPortStateDown/Initialize/Armed/Active` constants.
- `emit_partial_verdicts=false` suppression behavior — verified in
`TestPatternDetector_IBLinkFlapWiringPartialSuppressed`.

## Follow-up filed

The walkthrough's "Integration gap" section names a concrete blocker:
the metrics→logs OTTL recipe that maps `node_infiniband_port_state_id` →
`hw.network.ib.*` log records has not landed. Filed as **#393** (sibling
to #284 / #285, gated on RFC-0014 PR-B). The walkthrough
cross-references #393 directly so a reader can trace the path from
"alert dashboards say zero verdicts" → "recipe blocker tracked".

## Test plan

- [x] `golangci-lint run ./...` — 0 issues (pre-commit).
- [x] `go vet ./...` — pass (pre-commit).
- [x] `go mod verify` — all modules verified (pre-commit).
- [x] `attribute-namespace-check` — 100 unique attribute literals, 100
documented (pre-commit).
- [x] DCO sign-off + ≤72-char subject — commit-msg hook passed on both
commits.
- [ ] CI lint + markdown link check — observed via `gh pr checks`
(changes / pr-lint already pass; build / verify-* still running at
PR-update time).

```release-notes
NONE
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Lands replay corpora for the seven pattern detectors that lacked one,
closing the Path-A test gap named in the v1-rc1 audit (#366). Each
pattern now ships
`module/pkg/replay/<pattern>/{canonical,_negative,_real_world}/`
fixtures plus a `*_replay_test.go` runner that JSON-eqs
detector output against the on-disk golden.

Detectors covered: `hbm_ecc` (#3), `thermal_throttle` (#4),
`pcie_aer` (#5), `ib_link_flap` (#2), `nccl_hang` (#15), `cuda_oom`
(#10), `xid_correlation` (#16). `pod_evicted` (#14) corpus is
unchanged.

## Design

- Each detector takes a different input shape (e.g. `HBMECCRecord +
  XidRecord` for `hbm_ecc`, `ThermalThrottleRecord` for
  `thermal_throttle`, ...), so the existing `LoadFixturesUnder` helper
  (typed on `Record + NodeRecord`) cannot be reused. Each detector
  gets its own `*_replay_test.go` that inlines the per-detector JSON
  read; shared fixture-discovery and golden-assert helpers live in
  `helpers_test.go`.
- Two detectors (`nccl_hang`, `ib_link_flap`) take a `Now` reference.
  Tests pin `Now` to a fixed timestamp matching the fixture's
  `started_ns` so hang-age and flap-window inclusion stay
  deterministic across replay runs (otherwise wall-clock drift would
  silently flip the verdicts as the fixture aged).
- Goldens were generated from the live detectors via
  `UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/...` and pinned;
  future drift in detector output (headline / remediation prose,
  evidence-trail UID shape, scalar-field rename) surfaces as a
  `JSONEq` diff against the fixture. Operators can also eyeball the
  golden to assert what they EXPECT vs what fires.
- Negative fixtures each exercise a distinct discriminator (wrong Xid
  code, single GPU, no AER, no eviction, completed state, single
  transition, no OOM log) so a regression in one false-positive guard
  lights up the corresponding row only.
- Flipped `run_corpus: true` on every row of the chaos.yml
  pattern-detectors matrix now that every detector has a corpus.

## Test plan

- [x] `make check` — clean
- [x] `go test -race -count=1 ./pkg/replay/...` — 35 tests pass
  (28 new + 7 pre-existing pod_evicted)
- [x] `go test -count=1 ./processor/patterndetectorprocessor/...` —
  unchanged, still green
- [x] `make verify` (pre-push) — clean
- [ ] CI: chaos.yml `pattern-detectors` matrix — 8 rows, each runs
  hermetic regex + replay-corpus step

Closes #366.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #393. Ships the metric-side OTTL projection from
`node_exporter --collector.infiniband`'s `node_infiniband_port_state_id`
Gauge onto the customer-stable `hw.network.ib.*` namespace
(`hw.network.ib.port.state` int + `hw.network.ib.device` +
`hw.network.ib.port.num`) so pattern #2's `IBLinkFlapDetector`
consumes the same vendor-neutral wire shape regardless of whether the
underlying source is node_exporter, a Mellanox exporter, or the
`mlx5_core` journald stream.

Detector library + processor wiring already shipped in #391 (closed
#300). Only the metric-side input recipe was missing — pattern #2 was
configured-but-quiet on real deployments. This PR closes that gap.

## Wire contract (node_exporter raw → hw.network.ib.*)

```
node_infiniband_port_state_id{device="mlx5_0", port="1"} = 4
                                                          (IBA phys_state ID)
            ↓ transform/ib_to_hw_semconv
Gauge metric "hw.network.ib.port.state" with datapoint attrs:
  hw.network.ib.device  = "mlx5_0"        (str, from `device` label)
  hw.network.ib.port.num = 1              (int, from Int(`port` label))
  value                  = 4              (int, the phys_state ID)
```

The future RFC-0014 PR-B metrics→logs bridge emitter (shared with
patterns #3/#4/#5/#10) will lift these three attributes onto a log
record at emit time. The bridge log-record schema for pattern #2 is
pinned in `docs/integrations/prometheus-scrape.md §Pattern #2 —
hw.network.ib.port.state (issue #393)` so PR-B has no per-pattern
reconstruction work to do.

The companion series `node_infiniband_state{state="<name>"}` (string
label) is intentionally NOT mapped — the detector
(`module/processor/patterndetectorprocessor/ib_link_flap.go`) compares
`state.Int()` against `patterns.IBPortState*` integer constants, so
the string variant would round-trip wrong.

## No detector code change required

The detector reads three attribute names off a log record:
`hw.network.ib.port.state`, `hw.network.ib.device`,
`hw.network.ib.port.num`. The recipe stanza stamps the exact same
three names on the metric datapoint. The wire format `port.Int()`
expects (the projector at
`module/processor/patterndetectorprocessor/ib_link_flap.go` line 39
calls `int(port.Int())`) is satisfied because the OTTL `Int()` cast
on the Prometheus `port` string label produces a pdata int Value.
Confirmed by the new `TestRecipe_IBLinkFlap_RoundTripFiresVerdict`
test.
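
The label-to-attribute conversion can be illustrated outside OTTL. This sketch is not the recipe or the projector: the `ibPortState` type and `project` function are hypothetical, and `strconv.ParseInt` stands in for the OTTL `Int()` cast on the `port` label.

```go
package main

import (
	"fmt"
	"strconv"
)

// ibPortState is a hypothetical container for the projected datapoint; the
// field comments name the attributes from the wire contract above.
type ibPortState struct {
	Device  string // hw.network.ib.device
	PortNum int64  // hw.network.ib.port.num
	State   int64  // hw.network.ib.port.state (IBA phys_state ID)
}

// project converts one node_infiniband_port_state_id sample into the
// projected shape. Prometheus labels are always strings, so the port label
// needs an explicit integer conversion.
func project(labels map[string]string, value float64) (ibPortState, error) {
	port, err := strconv.ParseInt(labels["port"], 10, 64)
	if err != nil {
		return ibPortState{}, fmt.Errorf("port label %q is not an integer: %w", labels["port"], err)
	}
	return ibPortState{Device: labels["device"], PortNum: port, State: int64(value)}, nil
}

func main() {
	got, err := project(map[string]string{"device": "mlx5_0", "port": "1"}, 4)
	fmt.Printf("%+v %v\n", got, err)
}
```

Without the conversion the port number would travel as the string `"1"`, and a consumer that reads it as an integer would not see the value it expects.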

## Root cause + scope

- **Root cause of #393**: missing metric-side OTTL stanza. Fixed.
- **Out of scope (separate blocker, tracked under #260 PR-B)**: the
  metrics→logs bridge emitter. Upstream-blocked at OTel-contrib v0.130
  — `transformprocessor`'s `metric_statements` cannot reference
  `log.*` paths and no contrib connector emits log records from a
  metrics pipeline (per
  [RFC-0014](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0014-metrics-to-logs-pattern-input.md)).
  The recipe doc explicitly documents this gating relationship; PR-B
  is shared with patterns #3/#4/#5/#10 and lands the bridge once.

## Files changed

- `docs/integrations/examples/prometheus-scrape.yaml` — new
  `transform/ib_to_hw_semconv` processor; wired into the
  `metrics/scrape` pipeline. Validates with `./_build/tracecore
  validate` (exit 0).
- `docs/integrations/prometheus-scrape.md` — new "Pattern #2 —
  InfiniBand link flap" projection section, intro updated from "Two"
  to "Three OTTL transforms", and bridge log-record contract
  subsection added under the "Metrics-to-logs bridge contract"
  section.
- `docs/patterns/pattern-2-ib-link-flap.md` — deleted "Integration
  gap" section, replaced with "Integration recipe" pointing at the
  shipped stanza; updated "Why node_exporter sees it" prose to drop
  the "pending" hedge.
- `module/processor/patterndetectorprocessor/ib_link_flap_recipe_test.go`
  — new file. `TestRecipe_IBLinkFlap_StanzaPinsWireContract` parses
  the example YAML and asserts every load-bearing token is present
  (source metric name, three `hw.network.ib.*` attrs, the `Int()`
  cast on the port label, the transform name, and the pipeline
  wiring). `TestRecipe_IBLinkFlap_RoundTripFiresVerdict` simulates
  the end-to-end path: builds `plog.Logs` with the exact attribute
  shape the recipe stamps and asserts the processor emits a flap
  verdict.

## Test plan

- [x] `./_build/tracecore validate
--config=docs/integrations/examples/prometheus-scrape.yaml` → exit 0
- [x] `bash scripts/validator-recipe.sh` → 9 validated, 3 skipped
(non-linux host)
- [x] `bash scripts/doc-check.sh` → clean (no orphan test refs)
- [x] `go test ./module/processor/patterndetectorprocessor/... -count=1`
→ PASS (incl. the two new tests + all 5 existing IB tests)
- [x] `go build ./...` and `go vet ./...` → clean
- [x] Pre-commit hooks: golangci-lint 0 issues, go mod verify,
attribute-namespace-check 100/100
- [x] Mutation-verified: dropping `hw.network.ib.port.state` from the
recipe yaml fails `TestRecipe_IBLinkFlap_StanzaPinsWireContract` with
the expected remediation message naming the missing identifier
- [ ] CI on the PR (waiting on push)

```release-notes
feat(recipe): InfiniBand link-flap OTTL stanza projecting node_exporter's `node_infiniband_port_state_id` onto the tracecore-canonical `hw.network.ib.*` namespace (`hw.network.ib.port.state` int + `hw.network.ib.device` + `hw.network.ib.port.num`). Pattern #2's `IBLinkFlapDetector` now has its metric-side input wired; metrics→logs bridge emitter remains gated on RFC-0014 PR-B (#260).
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…451)

## Summary

- Adds the `transform/cuda_oom` OTTL processor to
`docs/integrations/examples/filelog-container.yaml`, stamping
`cuda_oom.tried_alloc_bytes` (Int, bytes; unit-normalized
KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index` (Int) off PyTorch's canonical
`RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>. GPU N
has a total capacity of ...` stderr line.
- Closes the integration gap pattern #10's detector (PR #338) carried
since merge: `projectCUDAOOMLogRecord`
(`module/processor/patterndetectorprocessor/cuda_oom.go`) gates on
`cuda_oom.tried_alloc_bytes` + `gpu.id` but no upstream recipe stamped
them, so the compiled detector received no real input at runtime.

## Root cause

Issue #303's deliverable list included `projectCUDAOOMLogRecord`
(shipped in PR #338) but explicitly deferred the filelog OTTL stanza to
a sibling follow-up (issue #285 / #436). The detector compiled green and
its wiring tests passed against synthetic plog input, but production
stderr never carried the customer-stable attributes the projector reads.
This PR is the missing link — a recipe-only change with zero
detector-source edits.

## Recipe design

- **Per-unit-branch shape** (KiB / MiB / GiB / TiB) because OTTL has no
capture-group-conditional dispatch — the multiplier must be a literal
`int64` per stanza.
- **Unit normalization via OTTL Math Expressions**: `Int(whole)*UNIT +
Int(frac)*(UNIT/100)` against PyTorch's `%.2f` `format_size` shape
(verified against `c10/cuda/CUDACachingAllocator.cpp`).
Integer-divide-by-100 floors per-frac-unit precision loss at <1% of the
unit base — three orders of magnitude under the detector's 5%
fragmentation threshold.
- **`gpu.id` is NOT stamped here**: the CUDA-runtime ordinal
`cuda_oom.gpu_index` is not a PCI BDF. The recipe markdown documents two
operator paths: (a) k8sattributesprocessor +
`nvidia.com/gpu-PCIDeviceBusID` device-plugin annotation, or (b) DCGM
BDF-lookup transform indexed by `cuda_oom.gpu_index`. The detector's
resource-attr fallback reads `gpu.id` off the log resource either way.
- **Tight `where IsMatch` guard** on `CUDA out of memory\. Tried to
allocate` — generic CUDA errors (illegal memory access, NCCL watchdog,
DataLoader worker killed) do not trip the stanza.
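
The unit-normalization arithmetic can be checked with a small Go sketch. The regular expression below is illustrative and is not the recipe's actual pattern; `parseOOM` and `unitBytes` are hypothetical names, and a single lookup table replaces the recipe's per-unit branches because Go has no equivalent restriction.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomRe is an illustrative pattern, not the recipe's actual regex. It
// captures the whole part, the two-digit fraction, the unit, and the GPU index.
var oomRe = regexp.MustCompile(
	`CUDA out of memory\. Tried to allocate (\d+)\.(\d{2}) (KiB|MiB|GiB|TiB)\. GPU (\d+)`)

var unitBytes = map[string]int64{
	"KiB": 1 << 10,
	"MiB": 1 << 20,
	"GiB": 1 << 30,
	"TiB": 1 << 40,
}

// parseOOM returns the requested allocation in bytes and the GPU index,
// using the integer arithmetic whole*UNIT + frac*(UNIT/100).
func parseOOM(line string) (nbytes, gpu int64, ok bool) {
	m := oomRe.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, false
	}
	whole, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	frac, err := strconv.ParseInt(m[2], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	idx, err := strconv.ParseInt(m[4], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	unit := unitBytes[m[3]]
	return whole*unit + frac*(unit/100), idx, true
}

func main() {
	line := "RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB. " +
		"GPU 0 has a total capacity of 79.15 GiB"
	n, gpu, ok := parseOOM(line)
	fmt.Println(n, gpu, ok)
}
```

For `2.50 GiB` the integer form yields 2684354548 bytes, 12 bytes short of the exact 2684354560. The truncation in `UNIT/100` loses less than one byte per hundredth, so the total shortfall stays under 99 bytes for any unit.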

## Tests

TDD red → green via three new tests in
`module/processor/patterndetectorprocessor/cuda_oom_recipe_test.go`:

- `TestRecipe_CUDAOOM_StanzaPinsWireContract` — pins 7 load-bearing
tokens (`cuda_oom.tried_alloc_bytes`, `cuda_oom.gpu_index`,
KiB/MiB/GiB/TiB, `transform/cuda_oom`) + pipeline-wiring against the
live projector.
- `TestRecipe_CUDAOOM_RoundTripFiresVerdict` — end-to-end gate:
recipe-shaped log records flow through `CUDAOOMDetector` and emit a
`kind=fragmentation` verdict with the expected scalar-promotion
contract.
- `TestRecipe_CUDAOOM_RegexCoversCanonicalPyTorchMessages` — 5 canonical
positives (KiB / MiB / GiB / GiB-fractional / TiB) + 3 negatives
(DataLoader worker killed, NCCL watchdog, illegal memory access).
Exceeds the ≥3-positive A-tier acceptance criterion from #436.

## Self-grade: **A+**

- B: YAML syntactically valid OTel (`tracecore validate` exit 0); regex
extracts bytes + GPU index with unit normalization; documented. ✓
- A: integration test green; `make validator-recipe` covers this file;
regex tested against ≥3 canonical messages (5 positives total); negative
cases verified. ✓
- A+: edge cases handled (multi-line traceback flattening via filelog
container parser, mixed-unit messages, OOM without GPU index via tight
`IsMatch` guard); cross-linked from
`docs/patterns/10-cuda-oom-deceptive.md` §"Signal sources" + Open
Question #2; new §`cuda_oom.*` attribute stanza in
`docs/integrations/filelog-container.md` with unit-normalization
arithmetic table, two `gpu.id` source paths, and a Failure-modes row. ✓

## Cross-references

- Detector source (untouched per hard rule):
`module/processor/patterndetectorprocessor/cuda_oom.go`.
- Sibling DCGM metric-side recipe: PR #337 /
`docs/integrations/examples/prometheus-scrape.yaml`.
- Pattern doc: `docs/patterns/10-cuda-oom-deceptive.md` — Open Q#2
resolved.
- Convention: PR #431 (recipe stanzas placement under
`docs/integrations/examples/<target>.yaml`).

## Test plan

- [x] `go test ./processor/patterndetectorprocessor/ -run
TestRecipe_CUDAOOM -count=1 -v` — PASS (3 tests, 8 sub-tests)
- [x] `go test ./processor/patterndetectorprocessor/ -count=1` — PASS
(no regressions)
- [x] `make build` — `_build/tracecore` compiles via OCB
- [x] `./_build/tracecore validate
--config=docs/integrations/examples/filelog-container.yaml` — exit 0
- [x] `make validator-recipe` — 9 validated, 3 skipped (non-linux host)
of 12 recipe(s)
- [x] `make doc-check` — PASS (new cross-link resolves)
- [x] `make ci-fast` — PASS (lint, vet, mod-verify,
attribute-namespace-check, doc-check)

```release-notes
**Pattern #10 (CUDA OOM, deceptive allocator)** — filelogreceiver + OTTL recipe lands. The `transform/cuda_oom` stanza in `docs/integrations/examples/filelog-container.yaml` projects PyTorch's `RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit>` stderr line onto `cuda_oom.tried_alloc_bytes` (unit-normalized to bytes across KiB/MiB/GiB/TiB) and `cuda_oom.gpu_index`, closing the load-bearing input gap left by the v0.3 detector ship (PR #338).
```

Closes #436.
Refs #338, #303, #337.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…ard) (#477)

## Summary

Closes the `docs/MILESTONES.md` §M6 carry-forward: *"every fenced block
in `docs/getting-started.md` is exercised by `scripts/smoke.sh`"*.

The ≤5-count gate shipped with the M6 wave; the binding half was tracked
carry-forward because `smoke.sh` ran a parallel hand-written
hostmetrics→debug config rather than the doc's actual YAML.

## Root cause

Two files owned the "first OTLP byte" config: `smoke.sh` rendered one
inline, `docs/getting-started.md` carried another. They happened to
agree, but nothing forced them to. The carry-forward existed because the
binding was *correct by inspection*, not *correct by construction*.

The fix is to make the doc the single source: `smoke.sh` extracts the
YAML from `docs/getting-started.md`'s `## Walkthrough` heredoc at
runtime. If the doc grows a typo, a renamed receiver, or a different
scraper, `smoke.sh` exercises the change automatically. If the heredoc
disappears, the extractor fails loud with a named error.

## Changes

- `scripts/smoke.sh` — extracts the Walkthrough heredoc via a perl
one-liner, writes it to a tempfile, then runs `tracecore validate
--config=` + `tracecore --config=` against it (Walkthrough steps 3 + 4).
Lifecycle-log assertions retained, with `"Shutdown complete"` now
load-bearing against the doc's post-walkthrough prose.
- `scripts/doc-check.sh` — new gate (right after the existing ≤5-count
gate) asserts the smoke↔doc binding with four mutation-verified clauses:
Walkthrough scope, `"$BIN" validate --config=` invocation, `"$BIN"
--config=` run invocation, `docs/getting-started.md` path reference.
- `scripts/smoke_test.sh` — new mutation-verify harness mirroring the
gate at runtime, plus an inline mutant-doc test that proves the
extractor exits 1 and the wrapper emits the named error when the heredoc
is removed.
- `Makefile` — `make smoke` now also runs `smoke_test.sh`; wired into
`ci-full` alongside the existing `smoke-quickstart` target.
- `docs/MILESTONES.md` — §M6 status `⧗ partial` → `☑ delivered`;
getting-started rubric `⧗` → `☑`; carry-forward bullet rewritten
(remaining work is operator-config branch-protection only).

## Runtime

End-to-end `bash scripts/smoke.sh` on darwin/arm64: **~2.2s** (extract +
validate + 1.5s run window + lifecycle-log assertions). Well under the
120s ci-fast budget. No hardware required — uses the `hostmetrics` load
scraper, portable across linux/darwin/windows.

## Test plan

```release-notes
ci(smoke): scripts/smoke.sh now extracts its YAML config from docs/getting-started.md '## Walkthrough' instead of carrying a parallel hand-written config; doc-check.sh gates the doc↔smoke binding with four mutation-verified clauses. Closes the M6 carry-forward.
```

- [x] `bash scripts/smoke.sh` exits 0 on clean main (verified locally,
~2.2s).
- [x] `bash scripts/smoke_test.sh` all assertions pass.
- [x] `bash scripts/doc-check.sh` reports `scripts/smoke.sh binds to
docs/getting-started.md (M6: every block exercised by smoke.sh)`.
- [x] Mutation test #1: `sed -i 's/"$BIN" validate --config=/"$BIN" XXX
--config=/' scripts/smoke.sh` → doc-check exits 1 naming "validate
--config= invocation (Walkthrough step 3)".
- [x] Mutation test #2: `sed -i 's/"$BIN" --config=/"$BIN" XXX=/'
scripts/smoke.sh` → doc-check exits 1 naming "run invocation
(Walkthrough step 4)".
- [x] Mutation test #3: `sed -i 's/Walkthrough/Section/'
scripts/smoke.sh` → doc-check exits 1 naming "extraction scope lost".
- [x] Mutation test #4: `sed -i
's|docs/getting-started.md|docs/SOMEWHERE-ELSE.md|' scripts/smoke.sh` →
doc-check exits 1 naming "binding source missing".
- [x] Mutation test #5: getting-started.md with no `## Walkthrough`
heredoc → smoke.sh exits 1 with named error message (covered by
`smoke_test.sh`).
- [x] `make lint` 0 issues; `make vet` clean; `make doc-check` clean
(all 18 gates pass).
- [x] `make smoke` end-to-end including `smoke_test.sh` passes.

## Related

- Refs `docs/MILESTONES.md` §M6 (Documentation scaffold).
- Sibling #460 (`fix(doc-check): drop unconditional exit 0`) made this
carry-forward visible — before #460, the new gate would have been
silently skipped by the line-99 short-circuit.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

- Replace the `ErrPending` stub at `tools/failure-inject/ncclhang/` with
a deterministic wrapper over `module/pkg/nccl/fr_parser.Synthesize`.
Output is one of the canonical M11 hang fixtures (`nccl-2.29.x-hang` /
`nccl-2.30.x-hang`), selected by `--seed mod 2`; bytes round-trip
through `frparser.Parse`, and re-synthesizing yields byte-identical output. Closes
**M4b carry-forward #1**.
- Pin the new SHA in `tools/failure-inject/testdata/golden.sha256` so
`chaos.yml`'s `harness-determinism` job (matrix `linux/amd64` +
`linux/arm64`) replays the same argv on both arches and enforces
cross-arch SHA equality — closes **M4b carry-forward #2**.
- Flip ⧗ → ☑ on the two M4b functional rubrics (round-trip,
safe-opcodes) and the M4b determinism non-functional rubric, plus the
M11 synthetic-fixture-generator rubric. Remove the `failure-inject
nccl-hang` follow-up from `docs/followups/M4b.md` and from M11's
carry-forward list.

## Root cause

M4b shipped at v0.1 with the `nccl-hang` subcommand stubbed
(`ErrPending`, exit 70) because `pkg/nccl/fr_parser/synthesize.go` was
still pending under M11. M11 landed the synthesizer plus the canonical
hang fixtures (`fixture229Hang`, `fixture230Hang`) in
`module/pkg/nccl/fr_parser/`. The CLI shim was carry-forward — this PR
is the wiring.

## What's in the diff

- `tools/failure-inject/ncclhang/ncclhang.go` — `Options{Seed uint64}`;
`Run` selects a hang variant by `Seed % len(hangVariants)`, calls
`FixtureSpec.Bytes()` (which delegates to `frparser.Synthesize`), writes
to `w`. `ErrPending` deleted; `ctx.Err()` honoured before any write.
- `tools/failure-inject/main.go` — pass `Options{Seed: *c.flagSeed}`
through to `ncclhang.Run`; drop the `errors.Is(err, ncclhang.ErrPending)
→ exit 70` branch.
- `tools/failure-inject/ncclhang/ncclhang_test.go` — RED → GREEN:
`TestRun_RoundTrip` (synthesize → parse → re-synthesize byte-identical),
`TestRun_SeedDeterminism` (same seed → same bytes, 4 seeds),
`TestRun_SafeOpcodesOnly` (delegates to `frparser.Parse` as the
safe-opcode oracle — a naive byte scan false-positives on opcode bytes
inside `SHORT_BINUNICODE` string literals), `TestRun_CtxCancelled`.
- `tools/failure-inject/main_test.go` — replace
`TestRun_NCCLHangReturnsNotImplemented` with `TestRun_NCCLHangRoundTrip`
+ `TestRun_NCCLHangSeedDeterminism` so the contract is pinned through
the actual argv path too.
- `tools/failure-inject/testdata/golden.sha256` — add `failure-inject
--seed=0 nccl-hang → e6f49920…`. The existing `TestRun_GoldenSHA` loop
in `main_test.go` and the `Golden SHA pin` step in `chaos.yml` pick it
up automatically.
- `docs/MILESTONES.md` — flip §M4b rubrics ⧗ → ☑ (round-trip,
safe-opcodes, cross-arch determinism) and §M11 synthetic-fixture rubric;
trim carry-forward list.
- `docs/followups/M4b.md` — mark the `nccl-hang` entry closed with the
wiring-PR pointer.
- `tools/failure-inject/README.md` — add a `nccl-hang` section; remove
`nccl-hang` from carve-outs (now only `pod-evict --allow-cluster-write`
carves).
- `module/receiver/ncclfrreceiver/README.md` — replace stale `tracecore
failure-inject` invocation with the actual `go run
./tools/failure-inject` path.

## Test plan

- [x] `go test -race -count=1 ./tools/failure-inject/...` — green (4
packages).
- [x] `(cd module && go test -race -count=1 ./pkg/nccl/fr_parser/...)` —
green (no semantic change here, gate against accidental drift).
- [x] `go build ./... && (cd module && go build ./...)` — clean.
- [x] Pre-commit gates: `golangci-lint`, `go vet`, `go mod verify`,
`attribute-namespace-check` — all 0 issues.
- [x] End-to-end determinism: `failure-inject --seed=0 nccl-hang |
sha256sum` reproduces the pinned SHA (`e6f49920…`) twice in a row.
- [x] Seed variance: `--seed=1` produces a distinct SHA (`2788a726…`);
`--seed=42` (42 mod 2 = 0) matches `--seed=0` per the documented modulo
mapping.
- [x] `failure-inject nccl-hang --help` documents `--seed` and `--out`
and the round-trip-through-`fr_parser` purpose.

## Self-grade

**A+**: round-trip green, determinism golden-SHA pinned, safe-opcode set
verified via parser oracle, cross-arch SHA equality wired into existing
`chaos.yml` matrix, MILESTONES.md flipped on four ⧗ rubrics, `M4b.md`
follow-up closed with a pointer, doc drift swept.

```release-notes
tools(failure-inject): `nccl-hang` subcommand now produces parseable byte-deterministic NCCL FlightRecorder bytes via `pkg/nccl/fr_parser` (was a stub returning `ErrPending`). `--seed` flag selects variant + deterministic synthesis; cross-arch SHA enforced in `chaos.yml` (linux/amd64 + linux/arm64). Closes M4b carry-forward #1 + #2.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 4, 2026
## Summary

Patterns 1-5 in `docs/patterns/` carried `pattern-N-slug.md` while
patterns 7-13 used the lexsort-stable `NN-slug.md` prefix — two schemes
side-by-side. Pattern #2 carried **both** an engineering design spec
(`02-ib-link-flap.md`) AND an operator walkthrough
(`pattern-2-ib-link-flap.md`); these look like dup-naming but are
intentionally distinct doc types per the `docs/patterns/README.md`
two-table split (operator walkthroughs vs. design specs / TDD red-test
inputs).

This PR unifies the numeric-prefix scheme across the directory while
preserving the spec/walkthrough type distinction via a filename suffix:

- `NN-slug.md`              = engineering design spec
- `NN-slug-walkthrough.md`  = operator-facing runbook

### Renames (5)

| Old | New |
|---|---|
| `pattern-1-nvlink-degradation.md` | `01-nvlink-degradation-walkthrough.md` |
| `pattern-2-ib-link-flap.md` | `02-ib-link-flap-walkthrough.md` |
| `pattern-3-hbm-ecc.md` | `03-hbm-ecc-walkthrough.md` |
| `pattern-4-thermal-throttle.md` | `04-thermal-throttle-walkthrough.md` |
| `pattern-5-pcie-aer.md` | `05-pcie-aer-walkthrough.md` |
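
The rename rule in the table is mechanical, so it can be expressed as a small Go function. This is a sketch for checking the mapping, not a script from the PR; `walkthroughName` and the regex are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// legacyName matches the old operator-walkthrough scheme pattern-N-slug.md.
var legacyName = regexp.MustCompile(`^pattern-(\d+)-(.+)\.md$`)

// walkthroughName maps pattern-N-slug.md to NN-slug-walkthrough.md.
// ok is false for names that do not follow the legacy scheme.
func walkthroughName(old string) (renamed string, ok bool) {
	m := legacyName.FindStringSubmatch(old)
	if m == nil {
		return "", false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return "", false
	}
	return fmt.Sprintf("%02d-%s-walkthrough.md", n, m[2]), true
}

func main() {
	for _, old := range []string{
		"pattern-1-nvlink-degradation.md",
		"pattern-4-thermal-throttle.md",
		"02-ib-link-flap.md", // already a design spec; left untouched
	} {
		renamed, ok := walkthroughName(old)
		fmt.Println(old, renamed, ok)
	}
}
```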

### Pattern #2 dup investigation (not a dup)

`02-ib-link-flap.md` (engineering design spec) and
`pattern-2-ib-link-flap.md` (operator walkthrough with PromQL alert
+ escalation runbook) are distinct doc types that cross-reference
each other. `docs/patterns/README.md` already lists them in separate
tables (operator walkthroughs vs design specs). Both retained;
walkthrough renamed to `02-ib-link-flap-walkthrough.md` per the
unified convention.

### `recipes-path-check*.sh` retained (not the dup-scheme validator)

The original task plan flagged `scripts/recipes-path-check.sh` +
`_test.sh` for deletion as "the validator policing both schemes".
On inspection: those scripts implement the **issue #427** convention
gate that lints commit subjects / PR titles for references to a
non-existent `recipes/pattern-N/` *directory* layout. They have
nothing to do with `docs/patterns/` filenames. Retained.

### Inbound-ref updates (9 entries, 11 files)

- `docs/MILESTONES.md`, `docs/NORTHSTARS.md`
- `docs/integrations/prometheus-scrape.md`
- `docs/rfcs/0014-metrics-to-logs-pattern-input.md`
- `docs/followups/M4b.md` (forward-ref to planned
  `14-pod-evicted-walkthrough.md`)
- `docs/patterns/README.md` (table rows + new "Filename convention"
  section documenting the NN- / NN-walkthrough split)
- `docs/patterns/02-ib-link-flap.md` (spec's cross-link to its
  walkthrough)
- `module/pkg/patterns/{hbm_ecc,thermal_throttle,pcie_aer}.go`
  (doc-comment references)
- `module/pkg/replay/thermal_throttle/canonical/manifest.json`
  (replay-fixture description text)

### Why this shape (vs collapsing both schemes into one)

The original task framing assumed the two schemes were unintended
divergence — but the README's two-table layout treats them as a
deliberate audience split (engineering TDD-spec readers vs.
operators triaging incidents). Collapsing the walkthroughs into the
spec namespace would have destroyed that signal. The
`-walkthrough` suffix preserves the semantic distinction while
giving the directory the lexsort-stable numeric prefix the task
wanted.

## Test plan

- [x] `make doc-check` exit 0 **pre-change** (217 anchors + 1105
      markdown links + 239 non-md intra-repo links resolve)
- [x] `make doc-check` exit 0 **post-change** (same counts; zero
      broken refs introduced)
- [x] `rg 'pattern-[1-5]-' docs/ install/ .github/ module/ scripts/`
      returns only in-page heading anchors (`#pattern-2--…`,
      `#m17-pattern-1-…`), no stale filename refs
- [x] Pre-commit hook: `attribute-namespace-check` clean (100
      attributes documented), `slo-rules-check` 13 rules OK,
      `chart-appversion-check` matches, all module verify pass
- [x] Pre-push hook: `no-autoupdate-check_test` clean

```release-notes
docs: unify `docs/patterns/` filename convention to a single
`NN-slug.md` / `NN-slug-walkthrough.md` scheme. Operator walkthroughs
for patterns 1-5 renamed; design-spec files keep the `NN-slug.md`
shape; pattern #2 retains both (spec + walkthrough).
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 4, 2026
## Summary

Wave-end audit flagged the patterndetectorprocessor fanout site as an
unmet refactor: `ConsumeLogs` hand-rolled dispatch for every shipped
detector (12 today: 7 inline + 5 wrapped), so adding pattern #13
required editing the fanout body rather than registering a new entry. With 12
hand-rolled call sites, that is 4x the rule-of-three threshold.

This PR introduces a minimal Detector registry seam:

- `module/pkg/patterns/detector.go`: new `Detector` interface
(`PatternID() string`) + `Registered` slice that pins all 12 detector
pointers. Each `*Detector` struct gets a one-line `PatternID()` method.
- `module/pkg/patterns/detector_test.go`:
`TestRegistered_PinsAllPatterns` (exact PatternID set + count),
`TestRegistered_UniquePatternIDs`, `TestRegistered_NonEmptyPatternIDs`.
Drift gate — accidental drops fail in CI.
- `patterndetector.go`: introduces `detectorRunners []detectorRunner`
closure list iterated by `ConsumeLogs`. `ConsumeLogs` body drops from
~77 lines to 12. Adding pattern #13 = one append to `Registered` + one
append to `detectorRunners`, no fanout-site edit.

### Design decision: metadata-only interface

The `Detector` interface is intentionally `PatternID() string` only —
not a uniform `Evaluate` method. Each detector's Evaluate signature is
intrinsically heterogeneous (different input record shapes —
events+nodeConds, ncclRecs, xidRecs+events, etc. — and different verdict
types). A uniform Evaluate would force a lossy `any`-typed contract that
the typed test suite has been fighting for 12 patterns. The
closure-per-detector approach keeps the typed Evaluate calls at their
concrete-runner sites while letting the registry pin identity +
iteration.
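
A minimal Go sketch of the metadata-only interface plus closure-per-detector shape follows. The two detectors, their pattern IDs, the `batch` type, and the `ConsumeLogs` signature are hypothetical and far simpler than the real processor; only the `Detector`, `Registered`, and `detectorRunners` names come from the description above.

```go
package main

import "fmt"

// Detector is metadata-only: identity, no uniform Evaluate.
type Detector interface{ PatternID() string }

// Two hypothetical detectors with deliberately different Evaluate signatures.
type xidDetector struct{}

func (*xidDetector) PatternID() string        { return "example-xid" }
func (*xidDetector) Evaluate(codes []int) int { return len(codes) }

type stallDetector struct{}

func (*stallDetector) PatternID() string { return "example-stall" }

// Evaluate reports the rank with the lowest sequence number when it trails the maximum.
func (*stallDetector) Evaluate(seqByRank []int64) (int, bool) {
	if len(seqByRank) == 0 {
		return 0, false
	}
	minIdx, maxSeq := 0, seqByRank[0]
	for i, s := range seqByRank {
		if s < seqByRank[minIdx] {
			minIdx = i
		}
		if s > maxSeq {
			maxSeq = s
		}
	}
	return minIdx, seqByRank[minIdx] < maxSeq
}

var (
	xid   = &xidDetector{}
	stall = &stallDetector{}
)

// Registered pins detector identity; a pin test can assert the exact ID set.
var Registered = []Detector{xid, stall}

// batch carries the per-detector inputs (hypothetical shape).
type batch struct {
	xidCodes  []int
	seqByRank []int64
}

// detectorRunner keeps each typed Evaluate call at its own site.
type detectorRunner func(b batch, emit func(patternID, verdict string))

var detectorRunners = []detectorRunner{
	func(b batch, emit func(string, string)) {
		if n := xid.Evaluate(b.xidCodes); n > 0 {
			emit(xid.PatternID(), fmt.Sprintf("%d xid event(s)", n))
		}
	},
	func(b batch, emit func(string, string)) {
		if r, ok := stall.Evaluate(b.seqByRank); ok {
			emit(stall.PatternID(), fmt.Sprintf("rank %d lagging", r))
		}
	},
}

// ConsumeLogs iterates the runner list in declaration order.
func ConsumeLogs(b batch) []string {
	var out []string
	emit := func(id, v string) { out = append(out, id+": "+v) }
	for _, run := range detectorRunners {
		run(b, emit)
	}
	return out
}

func main() {
	fmt.Println(ConsumeLogs(batch{xidCodes: []int{79}, seqByRank: []int64{5, 5, 3}}))
}
```

Adding a detector in this sketch means one append to `Registered` and one closure appended to `detectorRunners`; `ConsumeLogs` itself does not change.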

### Behavior preservation

- Same telemetry vocabulary: PodEvicted and IBLinkFlap still
`IncVerdict` with `string(v.Confidence)` (they gate on partial); the
other 5 inline detectors still pass `""`. The 5 wrapped runners still
tick inside their own helpers (unchanged).
- Same emission order: `detectorRunners` is declared in the legacy
emission order.
- Same partial-confidence gating: `emitPodEvicted` and `emitIBLinkFlap`
preserve the `!emitPartial` skip.

### Test plan

- [x] `cd module && go build ./...` clean
- [x] `cd module && go test ./pkg/patterns/` green (incl. 3 new pin
tests)
- [x] `cd module && go test ./processor/patterndetectorprocessor/` green
except pre-existing #497
(`TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure`,
fixed in Lane J)
- [x] `make lint` clean (0 issues)
- [x] `make vet`, `go mod verify`, attribute-namespace-check all green
(pre-push hook)

### LoC delta

- +321 / -79 across 3 files.
- `ConsumeLogs` body: 77 → 12 lines.
- Growth is in: registry plumbing (164 lines, mostly comments + the pin
tests), runner closures (one per detector). The seam earns its bytes —
adding pattern #N is now O(append) instead of O(edit-fanout).

### Closes-the-loop

Closes wave-end-audit next-wave item #2 (pattern registry seam).

```release-notes
- refactor(patterns): introduce `patterns.Detector` interface + `patterns.Registered` slice. The patterndetectorprocessor now iterates a registry-driven runner list instead of hand-rolled fanout — adding a new pattern is one append, not a processor edit. Behavior-preserving; no operator-facing change.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>