ci(deps): bump the gh-actions group with 5 updates#1

Closed
dependabot[bot] wants to merge 1 commit into main from dependabot/github_actions/gh-actions-000bb7bef9

dependabot Bot commented on behalf of github May 8, 2026

Bumps the gh-actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| actions/checkout | 4 | 6 |
| actions/setup-go | 5 | 6 |
| golangci/golangci-lint-action | 6 | 9 |
| actions/upload-artifact | 4 | 7 |
| github/codeql-action | 3 | 4 |

Updates actions/checkout from 4 to 6

Release notes

Sourced from actions/checkout's releases.

v6.0.0

What's Changed

Full Changelog: actions/checkout@v5.0.0...v6.0.0

v6-beta

What's Changed

Updated persist-credentials to store the credentials under $RUNNER_TEMP instead of directly in the local git config.

This requires a minimum Actions Runner version of v2.329.0 to access the persisted credentials for Docker container action scenarios.

v5.0.1

What's Changed

Full Changelog: actions/checkout@v5...v5.0.1

v5.0.0

What's Changed

⚠️ Minimum Compatible Runner Version

v2.327.1
Release Notes

Make sure your runner is updated to this version or newer to use this release.

Full Changelog: actions/checkout@v4...v5.0.0

v4.3.1

What's Changed

Full Changelog: actions/checkout@v4...v4.3.1

v4.3.0

What's Changed

... (truncated)

Changelog

Sourced from actions/checkout's changelog.

Changelog

v6.0.2

v6.0.1

v6.0.0

v5.0.1

v5.0.0

v4.3.1

v4.3.0

v4.2.2

v4.2.1

v4.2.0

v4.1.7

v4.1.6

... (truncated)

Commits

Updates actions/setup-go from 5 to 6

Release notes

Sourced from actions/setup-go's releases.

v6.0.0

What's Changed

Breaking Changes

Make sure your runner is on version v2.327.1 or later to ensure compatibility with this release. See Release Notes

Dependency Upgrades

New Contributors

Full Changelog: actions/setup-go@v5...v6.0.0

v5.6.0

What's Changed

Full Changelog: actions/setup-go@v5...v5.6.0

v5.5.0

What's Changed

Bug fixes:

Dependency updates:

New Contributors

Full Changelog: actions/setup-go@v5...v5.5.0

... (truncated)

Commits

Updates golangci/golangci-lint-action from 6 to 9

Release notes

Sourced from golangci/golangci-lint-action's releases.

v9.0.0

In the scope of this release, we change Nodejs runtime from node20 to node24 (https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/).

What's Changed

Changes

Full Changelog: golangci/golangci-lint-action@v8.0.0...v9.0.0

v8.0.0

Requires golangci-lint version >= v2.1.0

What's Changed

Changes

Full Changelog: golangci/golangci-lint-action@v7...v8.0.0

v7.0.1

What's Changed

Documentation

Dependencies

New Contributors

Full Changelog: golangci/golangci-lint-action@v7.0.0...v7.0.1

v7.0.0

... (truncated)

Commits

Updates actions/upload-artifact from 4 to 7

Release notes

Sourced from actions/upload-artifact's releases.

v7.0.0

v7 What's new

Direct Uploads

Adds support for uploading single files directly (unzipped). Callers can set the new archive parameter to false to skip zipping the file during upload. Right now, we only support single files. The action will fail if the glob passed resolves to multiple files. The name parameter is also ignored with this setting. Instead, the name of the artifact will be the name of the uploaded file.

ESM

To support new versions of the @actions/* packages, we've upgraded the package to ESM.

What's Changed

New Contributors

Full Changelog: actions/upload-artifact@v6...v7.0.0

v6.0.0

v6 - What's new

[!IMPORTANT] actions/upload-artifact@v6 now runs on Node.js 24 (runs.using: node24) and requires a minimum Actions Runner version of 2.327.1. If you are using self-hosted runners, ensure they are updated before upgrading.

Node.js 24

This release updates the runtime to Node.js 24. v5 had preliminary support for Node.js 24, however this action was by default still running on Node.js 20. Now this action by default will run on Node.js 24.

What's Changed

Full Changelog: actions/upload-artifact@v5.0.0...v6.0.0

v5.0.0

What's Changed

BREAKING CHANGE: this update supports Node v24.x. This is not a breaking change per-se but we're treating it as such.

... (truncated)

Commits
  • 043fb46 Merge pull request #797 from actions/yacaovsnc/update-dependency
  • 634250c Include changes in typespec/ts-http-runtime 0.3.5
  • e454baa Readme: bump all the example versions to v7 (#796)
  • 74fad66 Update the readme with direct upload details (#795)
  • bbbca2d Support direct file uploads (#764)
  • 589182c Upgrade the module to ESM and bump dependencies (#762)
  • 47309c9 Merge pull request #754 from actions/Link-/add-proxy-integration-tests
  • 02a8460 Add proxy integration test
  • b7c566a Merge pull request #745 from actions/upload-artifact-v6-release
  • e516bc8 docs: correct description of Node.js 24 support in README
  • Additional commits viewable in compare view

Updates github/codeql-action from 3 to 4

Release notes

Sourced from github/codeql-action's releases.

v3.35.3

  • Upcoming breaking change: Add a deprecation warning for customers using CodeQL version 2.19.3 and earlier. These versions of CodeQL were discontinued on 9 April 2026 alongside GitHub Enterprise Server 3.15, and will be unsupported by the next minor release of the CodeQL Action. #3837
  • Configurations for private registries that use Cloudsmith or GCP OIDC are now accepted. #3850
  • Best-effort connection tests for private registries now use GET requests instead of HEAD for better compatibility with various registry implementations. For NuGet feeds, the test is now always performed against the service index. #3853
  • Fixed a bug where two diagnostics produced within the same millisecond could overwrite each other on disk, causing one of them to be lost. #3852
  • Update default CodeQL bundle version to 2.25.3. #3865

v3.35.2

  • The undocumented TRAP cache cleanup feature that could be enabled using the CODEQL_ACTION_CLEANUP_TRAP_CACHES environment variable is deprecated and will be removed in May 2026. If you are affected by this, we recommend disabling TRAP caching by passing the trap-caching: false input to the init Action. #3795
  • The Git version 2.36.0 requirement for improved incremental analysis now only applies to repositories that contain submodules. #3789
  • Python analysis on GHES no longer extracts the standard library, relying instead on models of the standard library. This should result in significantly faster extraction and analysis times, while the effect on alerts should be minimal. #3794
  • Fixed a bug in the validation of OIDC configurations for private registries that was added in CodeQL Action 4.33.0 / 3.33.0. #3807
  • Update default CodeQL bundle version to 2.25.2. #3823

v3.35.1

v3.35.0

v3.34.1

  • Downgrade default CodeQL bundle version to 2.24.3 due to issues with a small percentage of Actions and JavaScript analyses. #3762

v3.34.0

  • Added an experimental change which disables TRAP caching when improved incremental analysis is enabled, since improved incremental analysis supersedes TRAP caching. This will improve performance and reduce Actions cache usage. We expect to roll this change out to everyone in March. #3569
  • We are rolling out improved incremental analysis to C/C++ analyses that use build mode none. We expect this rollout to be complete by the end of April 2026. #3584
  • Update default CodeQL bundle version to 2.25.0. #3585

v3.33.0

  • Upcoming change: Starting April 2026, the CodeQL Action will skip collecting file coverage information on pull requests to improve analysis performance. File coverage information will still be computed on non-PR analyses. Pull request analyses will log a warning about this upcoming change. #3562 To opt out of this change:
    • Repositories owned by an organization: Create a custom repository property with the name github-codeql-file-coverage-on-prs and the type "True/false", then set this property to true in the repository's settings. For more information, see Managing custom properties for repositories in your organization. Alternatively, if you are using an advanced setup workflow, you can set the CODEQL_ACTION_FILE_COVERAGE_ON_PRS environment variable to true in your workflow.
    • User-owned repositories using default setup: Switch to an advanced setup workflow and set the CODEQL_ACTION_FILE_COVERAGE_ON_PRS environment variable to true in your workflow.
    • User-owned repositories using advanced setup: Set the CODEQL_ACTION_FILE_COVERAGE_ON_PRS environment variable to true in your workflow.
  • Fixed a bug which caused the CodeQL Action to fail loading repository properties if a "Multi select" repository property was configured for the repository. #3557
  • The CodeQL Action now loads custom repository properties on GitHub Enterprise Server, enabling the customization of features such as github-codeql-disable-overlay that was previously only available on GitHub.com. #3559
  • Once private package registries can be configured with OIDC-based authentication for organizations, the CodeQL Action will now be able to accept such configurations. #3563
  • Fixed the retry mechanism for database uploads. Previously this would fail with the error "Response body object should not be disturbed or locked". #3564
  • A warning is now emitted if the CodeQL Action detects a repository property whose name suggests that it relates to the CodeQL Action, but which is not one of the properties recognised by the current version of the CodeQL Action. #3570

v3.32.6

  • Update default CodeQL bundle version to 2.24.3. #3548

v3.32.5

  • Repositories owned by an organization can now set up the github-codeql-disable-overlay custom repository property to disable improved incremental analysis for CodeQL. First, create a custom repository property with the name github-codeql-disable-overlay and the type "True/false" in the organization's settings. Then in the repository's settings, set this property to true to disable improved incremental analysis. For more information, see Managing custom properties for repositories in your organization. This feature is not yet available on GitHub Enterprise Server. #3507
  • Added an experimental change so that when improved incremental analysis fails on a runner — potentially due to insufficient disk space — the failure is recorded in the Actions cache so that subsequent runs will automatically skip improved incremental analysis until something changes (e.g. a larger runner is provisioned or a new CodeQL version is released). We expect to roll this change out to everyone in March. #3487
  • The minimum memory check for improved incremental analysis is now skipped for CodeQL 2.24.3 and later, which has reduced peak RAM usage. #3515
  • Reduced log levels for best-effort private package registry connection check failures to reduce noise from workflow annotations. #3516
  • Added an experimental change which lowers the minimum disk space requirement for improved incremental analysis, enabling it to run on standard GitHub Actions runners. We expect to roll this change out to everyone in March. #3498

... (truncated)

Changelog

Sourced from github/codeql-action's changelog.

4.32.3 - 13 Feb 2026

  • Added experimental support for testing connections to private package registries. This feature is not currently enabled for any analysis. In the future, it may be enabled by default for Default Setup. #3466

4.32.2 - 05 Feb 2026

  • Update default CodeQL bundle version to 2.24.1. #3460

4.32.1 - 02 Feb 2026

  • A warning is now shown in Default Setup workflow logs if a private package registry is configured using a GitHub Personal Access Token (PAT), but no username is configured. #3422
  • Fixed a bug which caused the CodeQL Action to fail when repository properties cannot successfully be retrieved. #3421

4.32.0 - 26 Jan 2026

  • Update default CodeQL bundle version to 2.24.0. #3425

4.31.11 - 23 Jan 2026

  • When running a Default Setup workflow with Actions debugging enabled, the CodeQL Action will now use more unique names when uploading logs from the Dependabot authentication proxy as workflow artifacts. This ensures that the artifact names do not clash between multiple jobs in a build matrix. #3409
  • Improved error handling throughout the CodeQL Action. #3415
  • Added experimental support for automatically excluding generated files from the analysis. This feature is not currently enabled for any analysis. In the future, it may be enabled by default for some GitHub-managed analyses. #3318
  • The changelog extracts that are included with releases of the CodeQL Action are now shorter to avoid duplicated information from appearing in Dependabot PRs. #3403

4.31.10 - 12 Jan 2026

  • Update default CodeQL bundle version to 2.23.9. #3393

4.31.9 - 16 Dec 2025

No user facing changes.

4.31.8 - 11 Dec 2025

  • Update default CodeQL bundle version to 2.23.8. #3354

4.31.7 - 05 Dec 2025

  • Update default CodeQL bundle version to 2.23.7. #3343

4.31.6 - 01 Dec 2025

No user facing changes.

4.31.5 - 24 Nov 2025

  • Update default CodeQL bundle version to 2.23.6. #3321

4.31.4 - 18 Nov 2025

... (truncated)

Commits
  • 68bde55 Merge pull request #3885 from github/update-v4.35.4-803d9e8c3
  • 9739ad2 Update changelog for v4.35.4
  • 803d9e8 Merge pull request #3883 from github/mbg/test/macro-wrapper
  • 0fd9c7d Merge pull request #3882 from github/dependabot/github_actions/dot-github/wor...
  • 922d6fb Use makeMacro instead of test.macro
  • df77e87 Update test macro snippet
  • 6e3f985 Add wrapper for test.macro
  • e7a347d Merge pull request #3881 from github/update-bundle/codeql-bundle-v2.25.4
  • 17eabb2 Rebuild
  • aaef09c Bump ruby/setup-ruby
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore <dependency name> major version will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
  • @dependabot ignore <dependency name> minor version will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
  • @dependabot ignore <dependency name> will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
  • @dependabot unignore <dependency name> will remove all of the ignore conditions of the specified dependency
  • @dependabot unignore <dependency name> <ignore condition> will remove the ignore condition of the specified dependency and ignore conditions

Bumps the gh-actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
| [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) | `6` | `9` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [github/codeql-action](https://github.com/github/codeql-action) | `3` | `4` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v5...v6)

Updates `golangci/golangci-lint-action` from 6 to 9
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v6...v9)

Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v7)

Updates `github/codeql-action` from 3 to 4
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: golangci/golangci-lint-action
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: github/codeql-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
...

Signed-off-by: dependabot[bot] <support@github.com>

dependabot Bot commented on behalf of github May 8, 2026

Labels

The following labels could not be found: dependencies, github-actions. Please create them before Dependabot can add them to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

dependabot Bot commented on behalf of github May 8, 2026

Looks like these dependencies are updatable in another way, so this is no longer needed.

dependabot Bot closed this May 8, 2026
dependabot Bot deleted the dependabot/github_actions/gh-actions-000bb7bef9 branch May 8, 2026 06:23
trilamsr added a commit that referenced this pull request May 14, 2026
Closes PR-13 review #1: assembly was in cmd/tracecore where it could
only be exercised by spawning the binary. Now it lives in its own
package + can be reused by anyone building pipelines (future plugin
surface, `tracecore validate` as a library, etc.).

Why a sibling and not under internal/pipeline: internal/config
already imports internal/pipeline (it returns pipeline.Signal /
pipeline.NewType). Putting the builder INSIDE internal/pipeline would
create a cycle (pipeline → config → pipeline). pipelinebuilder
sibling sidesteps it; both directions stay one-way.

Move scope:
- cmd/tracecore/build.go         → internal/pipelinebuilder/builder.go
- cmd/tracecore/signalops.go     → internal/pipelinebuilder/signalops.go
- cmd/tracecore/fuzz_test.go     → internal/pipelinebuilder/fuzz_test.go
- buildPipelines (unexported)    → BuildPipelines (exported entry point)
- helpers stay package-private inside pipelinebuilder
- cmd/tracecore/{collect,validate}.go call pipelinebuilder.BuildPipelines

cmd/tracecore main.go remains the place where kingpin wires CLI →
runCollect/runValidate → pipelinebuilder + components(). Generated
components() stays in cmd/tracecore because it's the binary's
registry-of-choice.

Coverage tooling fixes that follow from the move:
- `make coverage` now uses -coverpkg=./cmd/...,./components/...,
  ./internal/... so cross-package coverage is correctly attributed
  (cmd/tracecore tests exercise pipelinebuilder; coverage credits
  pipelinebuilder, not cmd/tracecore).
- tools/coverage-check now deduplicates duplicate file:range entries
  in coverage.out (Go writes one row per test run per instrumented
  line when -coverpkg is active; raw sum would multiply the denominator
  by run-count). Test coverage holds:
    pipelinebuilder 74%, pipeline 94.5%, fanout 100%, config 94.4%

- internal/pipelinebuilder/builder_test.go added: a processor-stage
  test using fake echoReceiver / noopProcessor / sinkExporter
  factories. No in-tree component exercises buildProcessors today;
  without this, those 80+ lines of code would be uncovered.
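
The deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the actual tools/coverage-check code: the function name `dedupeCoverage` and the `block` type are hypothetical, and the sketch assumes the standard Go coverage-profile row shape (`file:start,end numStmts count`). Rows that share a file:range key are merged, so each block contributes to the denominator once, and a block counts as covered if any row reports a non-zero count.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// block is one instrumented range: its statement count and whether any
// row for that range recorded a hit. (Hypothetical helper type.)
type block struct {
	stmts   int
	covered bool
}

// dedupeCoverage merges rows with the same file:range key and returns
// covered and total statement counts.
func dedupeCoverage(profile string) (int, int, error) {
	blocks := map[string]*block{}
	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "mode:") {
			continue // skip the header and blank lines
		}
		fields := strings.Fields(line)
		if len(fields) != 3 {
			return 0, 0, fmt.Errorf("malformed row: %q", line)
		}
		stmts, err := strconv.Atoi(fields[1])
		if err != nil {
			return 0, 0, fmt.Errorf("bad statement count in %q: %w", line, err)
		}
		count, err := strconv.Atoi(fields[2])
		if err != nil {
			return 0, 0, fmt.Errorf("bad hit count in %q: %w", line, err)
		}
		b, ok := blocks[fields[0]]
		if !ok {
			b = &block{stmts: stmts}
			blocks[fields[0]] = b
		}
		if count > 0 {
			b.covered = true
		}
	}
	if err := sc.Err(); err != nil {
		return 0, 0, err
	}
	covered, total := 0, 0
	for _, b := range blocks {
		total += b.stmts
		if b.covered {
			covered += b.stmts
		}
	}
	return covered, total, nil
}

func main() {
	profile := "mode: set\na.go:1.1,2.2 3 1\na.go:1.1,2.2 3 0\nb.go:3.1,4.2 2 0\n"
	covered, total, err := dedupeCoverage(profile)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d of %d statements covered\n", covered, total)
}
```

Summing the raw rows instead would count the repeated `a.go` block twice in the denominator, which is the inflation the commit describes.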

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Reviewer-C (security + failure modes) returned 3 blockers + 13 strong;
Reviewer-D (docs + adoption) returned 4 strong. Dispositions in
docs/loops/m9-review-notes.md.

Blocker fixes:

- Attribute value sanitization (parser.go:289). Every attribute value
  passes through sanitizeAttrValue: strip non-printable control
  bytes (<0x20 except \t \n \r, plus 0x7f DEL); cap length at
  4 KiB with `...` truncation. Defends against attacker-controlled
  syslog/kmsg payloads breaking downstream JSON/Loki/Elastic
  parsers (research § Pass-3 B5.2).
- kmsg `bufio.ErrTooLong` no longer crashes the source. Scanner
  errors now distinguish `kmsg_oversized` (record exceeded the
  1 MiB ceiling) from `kmsg_overflow` (EPIPE/ENOTRECOV from the
  ring buffer) — operators alert on the right kind.
- decodeJournaldMessage range-checks byte-array values (0-255);
  out-of-range now returns a parse error instead of silently
  truncating high bytes to byte() — data integrity invariant.
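
A minimal sketch of the sanitization rule from the first blocker fix, for illustration only. The name `sanitizeAttrValue` comes from the commit message, but this body is an assumption rather than the code in parser.go. It assumes the 4 KiB cap applies to the kept prefix, with `...` appended afterwards, and that truncation is byte-level (it may split a multi-byte UTF-8 sequence).

```go
package main

import (
	"fmt"
	"strings"
)

// maxAttrLen is the assumed cap on the kept prefix (4 KiB).
const maxAttrLen = 4 * 1024

// sanitizeAttrValue drops control bytes below 0x20 (except tab, newline
// and carriage return) and 0x7f DEL, then truncates over-long values.
func sanitizeAttrValue(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c == 0x7f || (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
			continue // strip the control byte
		}
		b.WriteByte(c)
	}
	out := b.String()
	if len(out) > maxAttrLen {
		// Byte-level cut: may split a multi-byte UTF-8 sequence.
		return out[:maxAttrLen] + "..."
	}
	return out
}

func main() {
	fmt.Printf("%q\n", sanitizeAttrValue("NVRM: Xid\x00 79\x1b[0m\tdone"))
}
```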

Strong fixes:

- journalctl --version probe at supervise start; degrade once with
  an actionable message when systemd<200 lacks --output=json
  support.
- journald arg-building sorts map keys before emitting Matches
  entries — argv is now deterministic (PRINCIPLES.md §12).
- JournalctlPath rejected at Validate if not absolute.
- parseKmsgRecord now errors on a malformed sequence number.
- Source goroutines wrap their hot loop in safeRun / safeSupervise
  with defer/recover + telemetry.IncError("panic") + markDegraded.
- Removed dead `var _ = errors.Is; _ = io.EOF` block from kmsg.go.
- example_config.yaml default min_severity changed from `warning`
  to `info` so NVLink-down notes (priority 6, the canonical Xid 79
  signal for Pattern #1) are not silently filtered out.
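
The deterministic-argv fix can be illustrated with a short sketch. The function name `buildMatchArgs` and the map shape are assumptions, not the receiver's actual code. The sketch relies only on two facts: Go map iteration order is unspecified, and journalctl accepts matches as `FIELD=value` arguments.

```go
package main

import (
	"fmt"
	"sort"
)

// buildMatchArgs turns a field->value map into journalctl match
// arguments in sorted key order, so the resulting argv is stable
// across runs. (Hypothetical helper name.)
func buildMatchArgs(matches map[string]string) []string {
	keys := make([]string, 0, len(matches))
	for k := range matches {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	args := make([]string, 0, len(keys))
	for _, k := range keys {
		args = append(args, k+"="+matches[k])
	}
	return args
}

func main() {
	args := buildMatchArgs(map[string]string{
		"_TRANSPORT": "kernel",
		"PRIORITY":   "3",
	})
	fmt.Println(args)
}
```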

Deferred (FOLLOWUPS / Carry-forward M9): subprocess env scrubbing,
journalctl stderr capture, facilityNumToName O(1) reverse map,
field map cap before attribute construction, maxRetries variable
rename, clean-exit-as-crash recovery, goroutine close-race
tightening.

Coverage stays >70% with new sanitization + truncation + bad-
sequence + rejected-byte-array + version-probe tests.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
The previous AggregateSLOSource computed lifetime cumulative failure
ratio. After a single failure on call #1, the gauge stayed > 0
forever — useless for SLO alerting that targets a recent window.

Replace with a sliding-window source: maintain a ring of timestamped
(success, failure) snapshots; on each scrape, find the latest sample
≥ window-old as the anchor and compute (Δfailure / Δtotal) since.
Returns 0 while warming up (no anchor yet) and on zero in-window
calls. DefaultSLOWindow is 60s — matches typical k8s probe cadence.
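
A rough sketch of the windowing logic described above, under stated assumptions. The type `windowedRate`, its `Observe` method, and the helper `at` are hypothetical names, not the AggregateSLOSource API. The sketch takes cumulative, monotonically non-decreasing success/failure counters on each scrape and stores samples in a plain slice rather than a true ring. It also folds in the pruning to 2× the window that the commit mentions later.

```go
package main

import (
	"fmt"
	"time"
)

// defaultWindow mirrors the 60s default named in the commit message.
const defaultWindow = 60 * time.Second

// sample is a snapshot of the cumulative counters at one scrape.
type sample struct {
	at               time.Time
	success, failure uint64
}

// windowedRate keeps recent samples, oldest first. (Hypothetical type.)
type windowedRate struct {
	window  time.Duration
	samples []sample
}

func newWindowedRate(window time.Duration) *windowedRate {
	return &windowedRate{window: window}
}

// at builds a timestamp from whole seconds; used by the demo below.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// Observe records the cumulative counters at now and returns the failure
// ratio since the anchor: the latest sample at least one window old.
// It returns 0 while warming up and when no calls fall in the window.
func (w *windowedRate) Observe(now time.Time, success, failure uint64) float64 {
	var anchor sample
	found := false
	for _, s := range w.samples {
		if now.Sub(s.at) < w.window {
			break // samples are ordered; the rest are too recent
		}
		anchor, found = s, true
	}

	// Prune anything older than 2x the window, then append the new sample.
	keep := w.samples[:0]
	for _, s := range w.samples {
		if now.Sub(s.at) <= 2*w.window {
			keep = append(keep, s)
		}
	}
	w.samples = append(keep, sample{at: now, success: success, failure: failure})

	if !found {
		return 0
	}
	dFail := failure - anchor.failure
	dTotal := dFail + (success - anchor.success)
	if dTotal == 0 {
		return 0
	}
	return float64(dFail) / float64(dTotal)
}

func main() {
	w := newWindowedRate(defaultWindow)
	fmt.Println(w.Observe(at(0), 0, 1))    // warming up
	fmt.Println(w.Observe(at(60), 10, 1))  // the old failure is outside the window
	fmt.Println(w.Observe(at(120), 15, 6)) // 5 failures in the last 10 calls
}
```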

API change: AggregateSLOSource gains state and a constructor
(NewAggregateSLOSource). cmd/tracecore updated; tests rewritten to
exercise the windowing semantics:

- TestAggregateSLOSource_WindowedRate: signal in window appears as
  the expected rate; subsequent signal at the same in-window ratio
  stays at the same rate.
- TestAggregateSLOSource_WindowedRate_LifetimeRatioNotReflected:
  the bug-driving case — a long-ago single failure doesn't pin the
  gauge above 0 once it rolls out of the window.

Ring buffer is pruned to 2× window of samples per scrape so memory
stays bounded under fast scrape cadence.

Coverage: internal/telemetry up to 83.6%. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Address item #1: the 292-line internal/selftelemetry/impl.go mixed
three concerns (receiver impl + exporter impl + init-error tracking),
forcing readers to context-switch across responsibilities. Split:

  receiver_impl.go  (~200 lines) — NewReceiver, receiverImpl, the
                                   five-method binding, degraded-
                                   seconds bookkeeping
  exporter_impl.go  (~80 lines)  — NewExporter, exporterImpl,
                                   FailureRateReader satisfaction
  init_errors.go    (~40 lines)  — RecordInitError

Common state hoisted to receiver_impl.go: ErrNilMeterProvider
sentinel and `instrumentationScope` constant (the package-stable
Meter scope name shared across all three call sites).

No API change; tests pass without modification. Each file is now a
single coherent unit and a future maintainer reading "what does
NewExporter do?" doesn't have to scroll past Receiver internals.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Closes 5 of 7 new A+ criteria from the recursive self-review:

#1 — e2e-otelcontrib now verifies the collector PARSED the record,
not just that it accepted bytes. Workflow rewritten to docker-run
otelcol-contrib with a custom config (file + debug exporters,
detailed verbosity). After the e2e POST, the bash step greps
/tmp/otelout/logs.json for the canonical body, the kernelevents.xid
attribute, and the gpu.id attribute. Empty file or missing
attributes → workflow fails.

#2 — TestIntegration_KmsgWriteReadBehavioral (//go:build linux)
writes a synthetic <6>NVRM Xid 79 line to /dev/kmsg, uses a marker
string in a regex_filter to isolate from ring-buffer noise, then
asserts the receiver emits a plog.LogRecord with kernelevents.xid=79
+ gpu.id=0000:65:00.0 within 3s. A regression in parse/build/emit
fails this on Linux CI.

#3 — prometheus_alerts_test.go validates the alert YAML structure
(every group has interval, every rule has expr/severity/summary/
description) AND cross-references the metric + label-filter names
against the receiver's actual SelfTelemetry surface. A typo in
the alert would silently never fire; this catches it before merge.

#5 — runbook_test.go executes the RUNBOOK's "First 15 minutes"
step 1 (`tracecore validate --config=...`) and step 2
(`tracecore debug dump`) as real commands. Documentation rot
becomes a test failure, not a silent SRE-time discovery.

#4 — sustained_test.go (`//go:build sustained`) feeds 1000
events/sec for 5 minutes (300k records), samples heap every 30s,
asserts ≤10 MiB growth and p99 emit latency tail bounded. New
`sustained-load` workflow job runs it on push-to-main + schedule
(not PR — 5 minutes is too slow for the inner loop).

The seventh criterion (two-week soak + external operator) requires
elapsed time + a human; nothing in-session can close it.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Round-3 review (two passes) caught 5 strongs I shipped in the
round-2 fix wave. This commit closes them AND adds a test gate
per bug class so the same class can't re-ship silently.

N1 — CAS-pair memory-model claim was incorrect:
 - Earlier RECEIVER-PATTERNS entry claimed Start's CAS publishes
   the subsequent `r.cancel = cancel` write via the Go memory
   model. It doesn't — the CAS HB edge only covers writes
   sequenced-BEFORE the CAS. In practice this worked because the
   OTel runtime serializes Start→Shutdown, but that's a runtime
   contract, not memory-model coverage, and the pattern doc
   would have taught M9/M11 authors the wrong invariant.
 - Fix: `r.cancel` is now `atomic.Pointer[context.CancelFunc]`.
   Store in Start, Load in Shutdown. This makes the publish
   memory-model-correct in all contexts (not just OTel-runtime
   ones). Pattern doc rewritten honestly: CAS pairs are for
   *idempotence*; the cancel publish is its own atomic.
 - Gate: `TestReceiver_CancelIsAtomicPointer` parses receiver.go
   via go/ast and refuses any non-atomic.Pointer shape on the
   cancel field. Future refactors that revert to bare
   CancelFunc fail at CI.
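
 A minimal sketch of the N1 shape, assuming Go 1.19+ for `atomic.Pointer` and `atomic.Bool`. The `receiver` type here is a stand-in, not the DCGM receiver. `Start` takes no parent context purely to keep the example short; a real receiver would accept one. The CAS flags only make Start and Shutdown idempotent; publishing the cancel function is done by its own atomic Store/Load pair.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// receiver is an illustrative stand-in with only lifecycle state.
type receiver struct {
	started atomic.Bool
	stopped atomic.Bool
	cancel  atomic.Pointer[context.CancelFunc]
}

// Start is idempotent via CAS. The cancel func is published through
// its own atomic Store, not through the CAS.
func (r *receiver) Start() (context.Context, bool) {
	if !r.started.CompareAndSwap(false, true) {
		return nil, false // already started
	}
	ctx, cancel := context.WithCancel(context.Background())
	r.cancel.Store(&cancel)
	return ctx, true
}

// Shutdown is idempotent via CAS and Loads the cancel func atomically.
// It reports whether it actually cancelled a running context.
func (r *receiver) Shutdown() bool {
	if !r.stopped.CompareAndSwap(false, true) {
		return false // already shut down
	}
	if c := r.cancel.Load(); c != nil {
		(*c)()
		return true
	}
	return false // Shutdown before Start published a cancel func
}

func main() {
	r := &receiver{}
	ctx, _ := r.Start()
	fmt.Println("before shutdown:", ctx.Err())
	r.Shutdown()
	fmt.Println("after shutdown:", ctx.Err())
}
```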

N2 — Example contradicts its own header:
 - `docs/agents/examples/non_blocking_start.go` used
   `IncError(Kind("panic"))` casts even though the file's header
   claims typos are caught at compile time. `Kind("typoo")`
   compiles fine — defeating the entire point of the typed Kind.
 - Fix: declared per-receiver `const KindConnect Kind = "connect"`
   etc. in the example body; replaced all `Kind("…")` casts with
   the constants.
 - Gate: `TestExamples_NoUntypedKindCasts` walks
   `docs/agents/examples/*.go` and refuses (a) bare string
   literals to IncError AND (b) `Kind("literal")` casts. M9+
   contributors can't accidentally copy the broken shape.

N3 — Alert #1 still had the for+increase pairing B5 fixed on
alert #2:
 - `DCGMReceiverDegraded` had `for: 5m` paired with
   `increase(...[5m])`, doubling its effective window to ~10m.
   Same bug class as B5; I only fixed one of the two alerts.
 - Fix: dropped `for: 5m` on DCGMReceiverDegraded with the same
   comment explaining the rationale.
 - Gate: `TestPrometheusAlerts_NoDwellDoubling` parses the
   alerts YAML and asserts no rule pairs `increase(...[N])` with
   `for: N` without an explicit allowlist label. The future
   alert author proposing both must opt in deliberately.

N5 — `warnOnce` lost kind-transition breadcrumbs:
 - The previous shape `if r.degraded { return }` suppressed
   ALL warn-level logs after first failure, including a
   different failure kind on the next tick (connect→watch
   transition mid-degraded-cycle). Operators lose the
   breadcrumb trail.
 - Fix: `warnOnce(kind, msg, args...)` keys on
   `(degraded, kind)` — log fresh when the kind changes, even
   if still degraded. Threaded the kind through all 7 callers.
 - Gate: `TestWarnOnce_RelogsOnKindTransition` exercises the
   helper directly: first kind=K1 logs; repeat-K1 silenced;
   kind=K2 logs fresh. The exact behavior an operator cares
   about, pinned by a unit test.
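
 A small sketch of the `(degraded, kind)` keying, for illustration. The `receiver` fields, the `markHealthy` helper, and the boolean return value are assumptions, and the sketch is not safe for concurrent use. It logs when the receiver first degrades and again whenever the failure kind changes, and stays silent on repeats of the same kind.

```go
package main

import "fmt"

// receiver holds only the state needed for warn suppression.
type receiver struct {
	degraded bool
	lastKind string
}

// warnOnce logs msg unless the receiver is already degraded with the
// same kind. It reports whether a log line was emitted.
func (r *receiver) warnOnce(kind, msg string, args ...any) bool {
	if r.degraded && r.lastKind == kind {
		return false // same failure kind: stay quiet
	}
	r.degraded = true
	r.lastKind = kind
	fmt.Printf("WARN kind=%s "+msg+"\n", append([]any{kind}, args...)...)
	return true
}

// markHealthy resets the state so the next failure logs again.
// (Hypothetical helper.)
func (r *receiver) markHealthy() {
	r.degraded = false
	r.lastKind = ""
}

func main() {
	r := &receiver{}
	r.warnOnce("connect", "dial failed: %s", "refused") // logs
	r.warnOnce("connect", "dial failed: %s", "refused") // silenced
	r.warnOnce("watch", "watch setup failed")           // logs: kind changed
}
```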

N4 — K8s manifest in README was broken multiple ways:
 - telemetry default-off → probes fail → CrashLoop on apply
 - "DaemonSet + anti-affinity" was contradictory
 - SYS_ADMIN/hostPID claimed required for standalone mode (not
   needed; only embedded mode needs them)
 - only `/dev/nvidia0` mounted (need nvidiactl + nvidia-uvm +
   per-GPU device files)
 - Fix: section now ships a paired ConfigMap that enables
   telemetry and binds on 0.0.0.0; DaemonSet drops the
   unnecessary privileges; the section is marked
   "illustrative — not production-ready" and explicitly defers
   workload-specific privilege layering to the Helm chart (M6).
 - Gate: `TestReadme_K8sExampleParsesAndEnablesTelemetry`
   extracts the YAML block, parses both docs (ConfigMap +
   DaemonSet), asserts (a) `enabled: true` AND `0.0.0.0` in the
   config, (b) both liveness + readiness probes exist pointing
   at /healthz + /readyz. A future doc author can't ship a
   manifest that would CrashLoop on apply.

Nits:
 - N6: reverted `watchUpdateDivisor` / `watchKeepForMultiplier`
   to untyped consts (the canonical Go shape for unitless
   ratios; typing them as time.Duration was dimensionally
   confused).
 - N9: anchored regex `\b` on the metric-value match in the M2
   wiring test — `} 1` was accidentally matching `} 12` /
   `} 100`.
 - N10: clarified `client_cgo.go` comment that Close() returns
   nil (consistent with stub, but the previous comment misled
   casual readers).
 - Cgo placeholder operator-deception risk: variant string now
   `cgo-placeholder` not `cgo` until the real binding lands.
   `tracecore receivers list` shows `dcgm [cgo-placeholder]`
   so operators on a real GPU host can't deploy a stub binary
   thinking it's the real one. Legend in the receivers-list
   output explains the three values.
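
To illustrate the N9 nit above, here is a small sketch of why the
unanchored pattern over-matched. The metric name and label in the sample
lines are made up for the example; only the `} 1` / `} 12` / `} 100`
shapes come from the note.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose matches "} 1" anywhere, so it also matches "} 12" and "} 100".
	loose = regexp.MustCompile(`\} 1`)
	// anchored requires a word boundary after the 1.
	anchored = regexp.MustCompile(`\} 1\b`)
)

func main() {
	lines := []string{
		`errors_total{kind="read"} 1`,
		`errors_total{kind="read"} 12`,
		`errors_total{kind="read"} 100`,
	}
	for _, line := range lines {
		fmt.Printf("%-32s loose=%v anchored=%v\n",
			line, loose.MatchString(line), anchored.MatchString(line))
	}
}
```

Note that `\b` still treats `} 1.5` as a match, because `.` is not a word
character; an end-of-line anchor would be stricter if float values can
appear.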

S19 partial (wire build-tags into make ci):
 - `make ci` now depends on `build-tags`. Every `make ci` run
   (local + GitHub Actions) gates on the cgo vs default build
   compiling cleanly. Pre-existing target now actually fires in
   the standard CI surface.

FOLLOWUPS additions (deferred but tracked with trigger predicates):
 - S18 `pkg/dcgm.Probe(…)` library helper — when a second
   external consumer materializes.
 - N7 AST walker resolve-map by reflection — when selftelemetry
   adds a new canonical Kind.
 - N8 AST walker globs *.go non-test — paired with the
   receiver.go split FOLLOWUP.
 - Promote `make build-tags` into the pr-validation shortcut
   workflow — opportunistic next CI sweep.

`make ci` passes; dcgm coverage steady at 86.0%; the build-tag
matrix is now part of every CI run.

Assisted-by: Claude Opus 4.7
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
## Summary

Lands the M8 DCGM receiver scaffold — vendor-SDK isolation, the
receiver itself, full operator surface, and a documented path to
A+ on the universal receiver rubric.

The cgo Client (`pkg/dcgm/client_cgo.go`) and the hardware
integration test runs are deferred to a follow-up PR on a Linux
GPU runner (this PR was authored on a macOS host without
`libdcgm-dev`). The full sequencing of follow-up work lives in
[`docs/M8-NEXT.md`](docs/M8-NEXT.md).

## Release notes

```release-notes
[FEATURE] DCGM receiver (alpha). Ships in build-tag-isolated
stub-only mode for safe cross-platform deployment; cgo path lands
in a Linux+GPU follow-up. `tracecore receivers list` shows the
deployed variant as `dcgm [stub]` / `dcgm [cgo]` so operators can
verify the binary's hardware-binding without reading go.mod.
```

## What landed

**Core receiver:**

- `components/receivers/dcgm/` — config (19-case Validate),
  factory (mirrors clockreceiver), receiver lifecycle (non-
  blocking Start, reconnect loop, idempotent double-Start +
  double-Shutdown, panic-recovery on the scrape goroutine),
  Sample→pmetric emission for all 13 metric families with
  cardinality cap + deterministic drop order + NaN/Inf guard,
  kind-aware resource attribution (GPU vs MIG Instance vs NVSwitch).
- `pkg/dcgm/` — `Client` interface (no build tag), sentinel
  errors, pure-Go types (Entity, Sample, FieldGroup, EntityKind,
  Version), `client_stub.go` (`//go:build !dcgm`) returning
  `ErrDCGMUnavailable`.
- Centralised metric NAMES + attribute KEYS + well-known values
  as constants in `metric_names.go` — rename is one-file.
- Wired into the binary via `components.yaml` + `make generate`;
  `tracecore validate` accepts a dcgm config; `tracecore
  receivers list` prints `dcgm [stub]`; `make smoke` boots-
  degrades-shuts-down end-to-end.

**Operator docs:**

- `README.md` (configuration table, Configuration-errors table,
  what-it-emits with PROPOSED-extension flags, lifecycle state
  diagram, SLI/SLO targets marked "target, not measured",
  cardinality budget with the worst-case math, backend
  compatibility matrix incl. Prom `.→_` rewrite caveat,
  Quickstart, Privacy + data residency considerations,
  "Want to add a sibling receiver?")
- `example_config.yaml` (minimal) + `example_config_full.yaml`
  (every knob)
- `RUNBOOK.md` keyed by alert name + per-kind triage table
- `prometheus-alerts.example.yaml` (3 starter alerts;
  thresholds chosen so they're reachable at default 15s tick)
- `HARDWARE-TESTING.md` (Linux+GPU walkthrough)
- `.github/ISSUE_TEMPLATE/component-bug-dcgm.yml`
- `docs/patterns/` — four pattern walkthroughs (NVLink
  degradation, HBM ECC, thermal throttle, PCIe AER) with
  PromQL alerts and replay tests
- `docs/rfcs/0005-dcgm-receiver-scope.md`

**Process artifacts:**

- `docs/AGRADE-RECEIVER-RUBRIC.md` — universal A+ rubric (37
  criteria, 6 lenses) for vendor-SDK receivers
- `docs/M8-AGRADE-GAP.md` — scoring vs the rubric + what's gated
- `docs/M8-NEXT.md` — consolidated index of all 30 deferred /
  follow-up items
- `docs/retros/M8-fourloop.md` — what the four-loop process
  caught (9 blockers) vs missed (3 self-review finds) with 5
  concrete process changes for M9+
- `docs/proposals/semconv-hw-gpu-extensions.md` — staged
  upstream PR body for the four PROPOSED semconv extensions
- `docs/FOLLOWUPS.md` — opportunistic + skipped items with
  falsifiable trigger predicates (M8 section absorbed from
  the formerly-separate repo-root file)
- `MILESTONES.md` M8 row updated; STRATEGY.md gets four new
  divergence rows (one resolved via M2 self-telemetry landing)
- `docs/agents/RECEIVER-PATTERNS.md` gets six patterns +
  Pattern-selection table; `docs/agents/examples/` ships six
  runnable Go files (`//go:build ignore`) per pattern

**Tests:**

- Lifecycle: non-blocking Start, idempotent double-Start +
  double-Shutdown, panic recovery, recover-from-degraded, MIG
  re-enumerate, ConnectionLost reset, healthy-end-to-end,
  ConsumerGPU partial-field path.
- Per-sentinel fault injection: 4 error sentinels table-driven
  via `injectingClient`; `StatusStale` now surfaces as
  `IncError(KindRead)` (was a silent drop); `StatusNoData` stays
  silent (transient by spec); panic injection via `panicClient`.
- Metric emission: per-family pin, kind-aware resource decoration,
  NaN/Inf guard, group-by-metric-name, cardinality cap
  determinism, fuzz-based invariants over 200 random inputs.
- Stress: 100-cycle Start/Shutdown asserts no goroutine leak;
  10-repeat Shutdown asserts idempotence.
- End-to-end: capturingConsumer asserts the full
  downstream-visible shape (Resource attrs, scope name, metric
  kinds, units, OTLP/JSON marshalling).
- Coexistence: `exporterPreemptedClient` proves the
  dcgm-exporter co-deployment constraint is test-pinned.
- Pattern replay: 4 tests reproducing each NORTHSTARS Appendix
  A pattern's signature.
- Docs parity: README references every shipped ancillary;
  example configs cover every README-documented knob;
  Validate's error substrings appear in README/RUNBOOK; alerts
  in YAML have RUNBOOK headings.
- **`TestRUNBOOK_KindsMatchEmitted`** — new structural test
  walks every emitted IncError/failedTick kind against the
  RUNBOOK per-kind triage table in both directions. Closes the
  drift bug class (`consume` vs `downstream`) at CI time.
- **`TestReceiver_M2WiringFromMeterProvider`** — new test
  pins the M2 canonical self-telemetry wiring; a future
  refactor that deletes the 6-line wiring block would not be
  caught without this (noop fallback hides regressions).
- Symmetric drop-order pins: every emitter group in dropOrder;
  every dropOrder entry has an emitter or an allowlisted
  placeholder.
- Performance budget: `TestEmit_StaysUnderBudget` fails the
  build if emit() regresses past 1ms (today: ~165µs under -race).
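
The "both directions" check behind `TestRUNBOOK_KindsMatchEmitted` can be
pictured as two set differences. The sketch below is an assumption about
the shape, not the test itself: `missingFrom` is a hypothetical helper,
and the two kind lists are hard-coded, whereas the real test derives them
from the source and the RUNBOOK table.

```go
package main

import (
	"fmt"
	"sort"
)

// missingFrom returns the entries of a that are absent from b, sorted.
func missingFrom(a, b []string) []string {
	seen := make(map[string]bool, len(b))
	for _, s := range b {
		seen[s] = true
	}
	var out []string
	for _, s := range a {
		if !seen[s] {
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative data reproducing the consume-vs-downstream drift.
	emitted := []string{"connect", "read", "downstream"}
	runbook := []string{"connect", "read", "consume"}

	fmt.Println("emitted but undocumented:", missingFrom(emitted, runbook))
	fmt.Println("documented but never emitted:", missingFrom(runbook, emitted))
}
```

A one-directional check would catch only one of the two drifted names;
running the difference both ways flags both.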

Coverage on `components/receivers/dcgm/`: ~86%.

## Carry-forward (must land before alpha → beta)

Single index: [`docs/M8-NEXT.md`](docs/M8-NEXT.md). High points:

- `pkg/dcgm/client_cgo.go` via `NVIDIA/go-dcgm`
- `//go:build dcgm,hardware` integration test running in CI
- Linux GPU runner provisioned;
  `.github/workflows/ci-hardware.yml.staged` renamed to `.yml`
- Cardinality cap validated against three reference fleets
- Measured overhead numbers in the README's SLI/SLO table
- Upstream OTel semconv PR for the four PROPOSED extensions
- External operator pilots the receiver in production

## Loops

- Loop 1 (Research, 5 passes): 4 parallel research agents →
  citation-backed Findings + Candidate Designs A/B/C → Design C
  (mode-toggle, default standalone) chosen via scoring matrix.
- Loop 2 (Scrutinization, 3 passes): 18 questions; key revision
  reversed the cardinality drop order to preserve NVLink
  profiling for pattern-#1 diagnosis.
- Loop 3 (Coding): atomic commits, one per work item +
  fix-up commits.
- Loop 4 (Review): 6 reviewer subagents across 3 passes
  surfaced 9 blockers + 26 strong + nits. Every finding
  dispositioned in `docs/loops/m8-review-notes.md` (worktree-
  local).
- Post-merge passes after M2 landed: typed `selftelemetry.Kind`
  refactor catches the kind-rename bug class at compile time;
  external review findings (operator-drift fixes, double-Close
  bug, log-storm gating, StatusStale signalling) addressed in
  the final commits with a structural drift test.
- A+ rubric scored M8 at composite ~3.85 / 5 (A-). Real A+
  requires hardware + future-milestone evidence.

## Test plan

- [x] `make ci` clean (cmd/tracecore integration tests flake
      under -race on macOS-arm64 in parallel; retry-once
      pattern logged in `docs/FLAKY-TESTS.md`)
- [x] `make smoke` runs in CI — validates example config, boots
      binary, asserts lifecycle log lines
- [x] `tracecore validate --config=example_config.yaml` accepts
- [x] `tracecore receivers list` shows `dcgm [stub]` (or
      `dcgm [cgo]` when built with `-tags dcgm`) + `clockreceiver`
- [x] Coverage ≥60% per components/ floor (actual: ~86%)
- [x] Goroutine-leak stress (100 cycles), cardinality fuzz (200
      trials), end-to-end shape pinning
- [x] Every Go file carries the SPDX-License-Identifier header
- [ ] Hardware path — gated by the cgo follow-up PR on a Linux
      GPU runner

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Four R1 findings folded into one commit (docs/CI surface).

#1 — README config table missed the top-level `enabled *bool`
kill-switch. Added the row at the top of the table with its
nil-means-active semantics so operators can grep the table for
the field and find it (config.go:27 has been there since the
initial M9 work; the README just didn't surface it).

#2 — README forward-reference to "the container realities
section above" pointed at nothing. Added the actual section
("Container realities") with four operator-actionable bullets:
mount the host /dev/kmsg (not the empty pod-local one),
CAP_SYSLOG instead of root, multi-tenant blast-radius warning,
and the namespaced-kmsg 5.10+ posture. Section anchors a
follow-on ready-to-paste DaemonSet manifest (see commit F).
TOC updated; threat-model table now links by anchor instead of
prose.

R1.S3 — alert-check.sh regex too narrow. The previous regex
required a suffix in {Receiver,Source,Pipeline,Exporter,Processor}
and would miss future alerts named after a domain (e.g.
`KernelEventsXidBurst`). Broadening to "any TitleCase identifier
≥12 chars" produced false positives (Go identifiers like
`OTLPRoundTrip`, `AmbientCapabilities`). Final shape: drop
direction-2 lexicon-based extraction entirely, keep only
direction-1 (alerts-yaml is source of truth → MUST appear in
the runbook). Direction-2 ("stale runbook reference to a
deleted alert") is rare and self-revealing (the alert just
doesn't fire), so the cost of false positives outweighs the
benefit of catching it pre-merge.

#7 — RUNBOOK preamble for receiver-local error kinds. The C
commit already added the per-kind triage section; this commit
ties it into the error-message index and explicitly states the
"why no page alert" rationale so a reviewer doesn't ask the
question again.

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
The previous gate exited at the sha256 mismatch, which left no
diagnostic trail for triaging which bytes diverged between Build #1
and Build #2. This commit inverts the control flow: on a mismatch,
run diffoscope, capture its text report, then exit non-zero. On a
match, run diffoscope --exit-code as the load-bearing assertion.
Either way, diffoscope output ends up in the job log.

Also upload both binaries as a "failed-build-pair" artifact when the
job fails — needed for offline triage when the on-runner diff isn't
enough (e.g. comparing across two failed runs).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Diffoscope on test tag v0.0.0-m3test-2 surfaced the actual delta: two
runtime/debug.BuildInfo entries differed across builds — vcs.modified
flipped from false to true, and the +dirty suffix appeared in the
embedded module version. Cascading: that fed a different action-ID into
the Go linker, which changed NT_GNU_BUILD_ID, which changed the file
hash.

Root cause: Build #1 created build1/ inside the worktree and moved
the binary into it. By the time Build #2 ran `go build`, the worktree
contained untracked files (build1/tracecore_linux_amd64 + .sha256), so
`git status --porcelain` was non-empty. `go build -buildvcs=true`
(default) reads that and sets vcs.modified=true for Build #2.

Fix: build each iteration into `mktemp -d` outside the source tree.
The worktree stays clean; Go's VCS probe sees identical state on both
runs; build IDs match; binaries match. The canonical artifact is then
staged from BUILD1_DIR into ./release/ for the rest of the workflow.
Failure-triage upload still grabs both builds when the gate trips.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
…rage

Four parallel reviews landed seven actionable changes:

- Cold rebuild: both builds now use isolated $(mktemp -d) GOCACHE dirs
  so build #2 can't pass by replaying build #1's cached object files.
  The assertion we want is cold-vs-cold byte-equality — which is what a
  third party with a fresh checkout reproduces.
- Cosign cert-identity-regexp tightened to pin this exact workflow file
  on a tag-ref. The previous `^https://github.com/<repo>/` regex would
  have accepted a Sigstore bundle minted by any workflow on any branch
  in the same repo; the new pattern rejects sibling workflows.
- SBOM coverage gate now walks every `Indirect != true` entry in
  go.mod and asserts a matching `pkg:golang/<path>@…` purl exists in
  the CycloneDX components[]. M3's "covers every module" rubric and
  M21's "≥1 component per direct module" rubric now have a falsifiable
  check; the previous `components ≥ 1` gate was a placeholder.
- Recipe step 6 switched from `slsa-verifier verify-artifact` (legacy
  slsa-github-generator format) to `gh attestation verify` (the
  reference verifier for actions/attest-build-provenance's Sigstore
  bundle output). slsa-verifier ≥ 2.7.0 with `verify-github-attestation`
  is documented as the alternate path; earlier versions don't parse
  Bundle v0.3 and would have failed silently or noisily.
- Recipe step 4 dropped `--exit-code` to match the CI fix; step 5
  inherits the tightened cert-identity-regexp; the diffoscope-failure
  diagnostic row points at Go-toolchain drift (the actual common
  cause) rather than "compiler upgrade or -trimpath regression".
- CHANGELOG entry added under [Unreleased] / Added; MILESTONES.md M3
  flipped from ☐ to ⧗ with a flip-to-☑-on-merge note; top-level
  README.md routing table grew a row for auditors / supply-chain
  verifiers pointing at docs/reproducibility.md.
- Dropped two unused job-level outputs (source_date_epoch, build_date)
  that no downstream job consumed; removed a vestigial `make clean`
  between builds (does nothing when artifacts live in mktemp dirs).
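
As a rough sketch of the SBOM coverage gate described above: take the
`Require` entries from `go mod edit -json`, skip the indirect ones, and
look for a `pkg:golang/<path>@` purl prefix among the CycloneDX
components. The function name, the struct subsets, and the sample inputs
are assumptions for illustration; the actual gate may be implemented
differently (for example in shell with jq), and real purls may need
normalization before comparison.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// goMod mirrors the subset of `go mod edit -json` output used here.
type goMod struct {
	Require []struct {
		Path     string
		Version  string
		Indirect bool
	}
}

// sbom mirrors the subset of a CycloneDX JSON document used here.
type sbom struct {
	Components []struct {
		Purl string `json:"purl"`
	} `json:"components"`
}

// missingDirectModules returns the direct module paths that have no
// matching pkg:golang/<path>@... purl in the SBOM.
func missingDirectModules(modJSON, sbomJSON []byte) ([]string, error) {
	var m goMod
	if err := json.Unmarshal(modJSON, &m); err != nil {
		return nil, fmt.Errorf("parse go.mod json: %w", err)
	}
	var s sbom
	if err := json.Unmarshal(sbomJSON, &s); err != nil {
		return nil, fmt.Errorf("parse sbom: %w", err)
	}
	var missing []string
	for _, r := range m.Require {
		if r.Indirect {
			continue // the gate only covers direct requirements
		}
		prefix := "pkg:golang/" + r.Path + "@"
		found := false
		for _, c := range s.Components {
			if strings.HasPrefix(c.Purl, prefix) {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, r.Path)
		}
	}
	return missing, nil
}

func main() {
	// Made-up module paths; example.com/absent is deliberately uncovered.
	mod := []byte(`{"Require":[
		{"Path":"example.com/direct","Version":"v1.2.0"},
		{"Path":"example.com/transitive","Version":"v0.3.1","Indirect":true},
		{"Path":"example.com/absent","Version":"v2.0.0"}]}`)
	bom := []byte(`{"components":[{"purl":"pkg:golang/example.com/direct@v1.2.0"}]}`)

	missing, err := missingDirectModules(mod, bom)
	if err != nil {
		panic(err)
	}
	fmt.Println("direct modules missing from SBOM:", missing)
}
```

Unlike the earlier `components ≥ 1` placeholder, a check of this shape
fails as soon as any single direct module is absent from the SBOM.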

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Five P1 items from the second-pass parallel review:

- Recipe step 6 now reads the bundle from disk (--bundle "$ATTEST")
  instead of pulling it from GitHub's attestation API, and pins
  --signer-workflow + --predicate-type. Two practical wins: the
  verification works offline / air-gapped, and a sibling workflow
  elsewhere in the repo cannot mint an attestation that passes the
  documented command — cosign step 5 and gh-attest step 6 now anchor
  to the same workflow-on-tag-ref identity.
- "If a step fails" row 6 label switched from `slsa-verifier` to
  `gh attestation verify` so the diagnostic table matches the verb
  the walkthrough uses.
- Recipe prerequisites paragraph dropped its dangling `slsa-verifier
  ≥ 2.7.0` alternate-path promise. The walkthrough never showed the
  alternate command; adding it would have doubled the recipe surface
  for marginal benefit. `gh attestation verify` is the single
  documented verifier.
- SBOM job's checkout now pins to ${{ github.sha }} (the commit that
  triggered the workflow) instead of the tag. A force-push to the tag
  between the build and sbom jobs cannot produce an SBOM for a
  different tree than was signed.
- MILESTONES.md M3 status line dropped the m3test-4 reference (stale
  after subsequent test tags landed); replaced with "across the
  v0.0.0-m3test-* series" so future test-tag iterations don't restale
  the line. docs/FOLLOWUPS.md gains an M21 release-asset-shape
  reconciliation bullet (raw binary vs tar.gz, .cosign.bundle vs .sig,
  the .intoto.jsonl extension on Sigstore bundle JSON).
- Build #1 comment trimmed from an essay block to two sentences;
  rationale lives in the commit history.

Deferred from Pass-2 (P2/L, not M3-blocking):
- diffoscope local exit-status wrapping (verifier copy-pasting one
  block at a time can miss a non-zero exit; recipe polish, not gate
  break)
- Repo tag-protection ruleset (org-policy decision, not PR scope)
- `go mod verify` in the build job (cheap hardening; defer to
  separate supply-chain PR)
- Rekor log-index URL in release notes (post-fact audit polish)
- Caching `diffoscope-minimal` apt install (~10s, marginal on a
  tag-triggered workflow)

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 18, 2026
)

## What this PR does

Bundles 8 ready-now items from `docs/FOLLOWUPS.md` whose triggers were
satisfied; closes 2 more as already-shipped or already-satisfied. Three
themes: CI/release-pipeline hardening, code-quality sweeps, and
FOLLOWUPS hygiene.

**CI / release-pipeline hardening:**
- SHA-pin every GitHub Actions ref across all workflows. Dependabot's
  `github-actions` group keeps these bumped weekly as one grouped PR.
- Reconcile `actions/upload-artifact` major-version drift: callsites
  were split between `@v5` and `@v7.0.1`. Unified on v7.0.1.
- Tighten `cosign verify-blob` smoke check with
`--certificate-github-workflow-ref refs/tags/$TAG` and `--trigger push`.
Strictly tighter than the prior `IDENTITY_REGEXP`-only check.
- Mirror tightened flags in `docs/reproducibility.md` step 5; add
`--source-ref` / `--source-digest` to step 6's `gh attestation verify`.
- Emit Rekor `logIndex` URL into release notes so transparency-log
audits don't require bundle archaeology.
- Wire `make mod-verify` into `make ci`.

**Code-quality sweeps:**
- Convert ~49 C-style `for i := 0; i < N; i++` loops to Go 1.22+ `for i
:= range N` (or `for range N` when the index is unused). 6 holdouts have
non-convertible conditions (compound `&&`, `i += 2`, or non-`i`
predicate).
- Backfill 18 raw `"Normal"`/`"Warning"` sites in k8sevents tests to use
`EventTypeNormal` / `EventTypeWarning` constants.
- Export `k8sevents.ComponentType = "k8s_events"`; convert 8 test
callsites.
- Lock the no-`Server`-header invariant in `internal/telemetry` with a
test. (Audit finding: Go's `net/http` does not emit a default `Server`
header in any path; the FOLLOWUPS row had nothing to strip — the test
prevents future regression.)

**Closed-as-stale (no code change, FOLLOWUPS updated with rationale):**
- Next-up #1 `make doc-check`: already shipped (Makefile:192, in `make
ci` chain).
- M8 opportunistic "promote build-tags to pr-validation.yml":
`ci.yml:37` already runs `make build-tags` directly; no
`pr-validation.yml` exists.

## Linked issue(s)

_No linked issue._

## Release notes

```release-notes
[SECURITY] All GitHub Actions are now SHA-pinned; cosign and gh attestation verification flags are tightened to bind to the exact release tag and `push` trigger.
[ENHANCEMENT] Release notes include a Rekor transparency-log entry URL for after-the-fact audit.
```

## Checklist

- [x] Tests added or updated (`TestServer_NoServerHeader`; existing
tests continue to pass)
- [x] `make ci` passes on the worktree branch (exit 0)
- [x] Commits are signed off
- [x] No new components; existing component STYLE.md layout untouched

## Test plan

- [x] `make ci` exit 0 (coverage above floor, govulncheck clean,
doc-check + alert-check pass, vet clean across default + `dcgm` build
tags)
- [ ] CI green on this PR
- [ ] Release dry-run not exercised — release workflow only fires on tag
push; flag-tightening + Rekor URL emission verified by inspection rather
than e2e. Worth a manual `workflow_dispatch` once merged or at next tag
cut.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…pe; reject framing of bench-correction as regression

Phase-3 adversarial deep review (2 fresh subagents, independent of
the 8 lens reviews). The author's completion claim was treated as
a hypothesis to falsify.

Adversarial #1: APPROVED, no falsifiable findings.

Adversarial #2: returned CONCERNS-REQUIRE-FIX with two findings.
After the validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P3.1 | adversarial-2 | repo-long-term | BLOCKER → DEFER | "k8sevents BenchmarkEmitOne allocs jumped 21→28; not gated by bench-check." | Read Makefile:40-44 — bench-check is scoped to ./internal/telemetry/. Confirmed k8sevents has no baseline. | The 21→28 jump is the WHOLE POINT of group F: the previous bench reused one plog chain across iters and under-reported production cost. `git diff origin/main...HEAD -- components/receivers/k8sevents/receiver.go components/receivers/k8sevents/emit.go` shows production allocation paths in r.emit are unchanged from main; only the bench measurement shape changed. Reviewer conflated bench-output change with production regression. | n/a — no production change to test | no — finding rejected as framed, but underlying observation kept | deferred FOLLOWUPS.md (Component-level benchmarks ungated by `make bench-check`) |
| P3.2 | adversarial-2 | repo-long-term | NIT | Missing explicit symlink-to-directory test for kubeconfig path. | A new TestConfig_RejectsSymlinkToDirectoryAsKubeconfigPath would pass without code change. | The reviewer notes that the test "would pass with the current code." TestConfig_RejectsDirectoryAsKubeconfigPath already exercises the IsDir() path; symlinks go through the same code (os.Stat follows symlinks intentionally). No unique coverage added. | n/a | no | explicitly-skipped (taste-call; redundant coverage) |

Reproducibility:
  $ grep -n "components" Makefile | grep bench   # only internal/telemetry covered
  $ git diff origin/main..HEAD -- components/receivers/k8sevents/receiver.go components/receivers/k8sevents/emit.go   # zero production-allocation changes

Validation-cycle stats:
  Findings rejected during contradict (framing of BLOCKER as regression): 1
  Findings that survived as DEFERRED to FOLLOWUPS:                        1
  Findings explicitly-skipped (taste-call):                               1

Beneficiary: repo-long-term. The underlying gap (component benches
ungated) is real and worth a follow-up; the immediate framing as
a regression in this PR is not.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…l ordering rationale

Phase-4 A+ aspiration review (2 fresh subagents). Reviewer #1 graded
B+ with 7 documentation-of-already-true-invariants criteria;
reviewer #2 graded A with 3 falsifiable proposals. Two surviving
load-bearing criteria after validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 | aplus-2 | repo-long-term | CONCERN | populateAttributes / attrPutter cap check (`attrs.Len() >= maxAttrs`) is exercised only at production maxAttrs floor (9). The exported BuildLogRecordForBench helper can be called with arbitrary values; a future refactor flipping `>=` to `>` would silently allow one attribute through at maxAttrs=0 and slip past every existing test. | TestBuildLogRecord_BoundaryMaxAttrs covers maxAttrs=0 and maxAttrs=-1; mutation-verified red→green: changing `>=` to `>` in attrPutter.putStr/putInt fails the maxAttrs=0 subtest, then restoration passes. | Production Validate floors maxAttrs at 9 (TestConfig_RejectsTooLowMaxAttributes pins this). But internal callers (bench, future refactor) can bypass Validate. | red (mutation) → green → mutation-verify recorded in this commit | yes — P4-aplus-2 in .claude/ralph-loop.local.md | applied this commit |
| P4.2 | aplus-2 | repo-long-term | NIT | validateKubeconfigPath ordering rationale lives only in the Phase-1 commit body and FOLLOWUPS closure; a future maintainer reordering Validate's pipeline would break TestConfig_AmbiguousAuth_* tests without warning at the call site. | Added the rationale to the validateKubeconfigPath docstring (source-level). | n/a — comment-only; existing tests catch a bad reorder regardless. | n/a | no | applied this commit (config.go) |

Rejected/deferred:

- P4.3 (aplus-1 #1) — "Bench allocs/op ≤30 threshold gate." Already
  covered by Phase-3 deferred FOLLOWUPS entry on component-bench
  scope. DEFER (duplicate).
- P4.4 (aplus-2 #2) — Cross-receiver SchemaURL pattern lint. Out of
  scope; trigger is third in-tree schema URL. DEFER to FOLLOWUPS.
- P4.5 (aplus-1 #2-7) — Document already-met invariants. Per
  feedback_anti_bureaucracy, criteria that document truths without a
  falsifiable hook are bloat. REJECT.

Reproducibility:
  $ go test -run TestBuildLogRecord_BoundaryMaxAttrs -v ./components/receivers/k8sevents/   # passes
  $ sed -i.bak 's/a.attrs.Len() >= a.maxAttrs/a.attrs.Len() > a.maxAttrs/g' components/receivers/k8sevents/emit.go && \
    go test -run TestBuildLogRecord_BoundaryMaxAttrs/maxAttrs=0 -v ./components/receivers/k8sevents/   # fails
  $ mv components/receivers/k8sevents/emit.go.bak components/receivers/k8sevents/emit.go   # restore

Letter-grade outcome:
  Reviewer #1 starting grade: B+ → target A+ via documentation
  Reviewer #2 starting grade: A → target A+ via P4.1 + P4.2
  After this commit: A+ on the falsifiable axis (every C1-C6 + F
  change has a mutation-catching test; the boundary cap is now
  explicitly pinned; ordering rationale lives at source).

Beneficiary: repo-long-term. Falsifiable tests survive refactors;
documentation-of-truths does not.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…+ threat-root trace on go-mod-verify

Phase-4 A+ aspiration review (2 fresh subagents; both graded A,
diverged on which gates to apply). Validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 (aplus-1 #2, also P2.6) | aplus | operator | CONCERN | A workflow_dispatch run with `inputs.tag` set but `github.ref` ≠ refs/tags/$INPUT_TAG passes Build and fails the OIDC smoke check 15-30 minutes later. Operator wastes runner time and sees the misuse late. | New "Verify dispatch ref matches tag (pre-flight)" step exit-1s within seconds with the documented workaround. | Reviewer noted the smoke check already enforces this — but at job-end, not at job-start. Fail-fast IS the load-bearing property. | n/a — workflow YAML, actionlint clean | yes — P4-aplus-1 | applied this commit; closes P2.6 deferral. |
| P4.4 (aplus-2 #2) | aplus | repo-long-term | NIT | go-mod-verify comment says "defense in depth against a compromised GOPROXY mirror" but doesn't name the trust root or the orthogonal threat (a poisoned go.sum itself). | Comment now states "Trust root: the go.sum at this tag commit" and cross-references the tag-protection FOLLOWUPS entry. | A future maintainer might over-attribute the protection. | n/a | no | applied this commit |

Rejected/deferred:

- P4.2 (aplus-1 #4) — Structured diff lint for release.yml ↔
  docs/reproducibility.md. DEFER to FOLLOWUPS.md (real value, but
  manual review caught both drift directions in Phases 2 + 3;
  automate when next edit happens).
- P4.3 (aplus-1 #6) — Release artifact manifest validation before
  upload. REJECT. Per anti-bureaucracy: reviewer concedes `needs:`
  dependency already gates malformed artifacts from reaching the
  release job. Adding defensive validation against a CI-bug
  scenario is bloat.
- P4.5 (aplus-1 #3) — docs/SUPPLY-CHAIN-IDENTITY.md consolidated
  reference. DEFER to FOLLOWUPS.md; ~30-min write, scope creep
  beyond release.yml. M21 release-checklist is the natural trigger.
- P4.6 (aplus-1 #5, aplus-2 #3) — Formal threat-model document +
  M21 alignment narrative. DEFER to M21.
- P4.7 (aplus-2 #5) — Cross-link health lint. Duplicate of P4.2;
  same deferral.

Reproducibility:
  $ make actionlint zizmor   # exit 0
  $ grep -A1 "workflow_dispatch with inputs.tag" .github/workflows/release.yml
    # pre-flight gate present

Letter-grade outcome:
  Reviewer #1 starting: A → A+ via criteria 2, 4, 6 (we applied 2 + threat-model comment)
  Reviewer #2 starting: A → APPROVED-AS-IS (already strong)
  After this commit: A on the falsifiable axis (one operator-UX gate
  + one comment clarification), with the broader doc/lint work
  scoped to follow-ups.

Beneficiary: operator. The pre-flight gate cites a specific
operator-facing surface (15-30 minute waste on workflow_dispatch
misuse) and turns it into a seconds-fast named error.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…ask + gh attestation verify (#69)

## Summary

Release-pipeline supply-chain hardening + a workflow_dispatch pre-flight
gate. No operator-visible release-artifact shape change; the gates fail
loudly at tag-push time before any artifact is signed and published.

**Hardening:**
- `go mod download && go mod verify` step before the reproducible-build
pair. Catches a poisoned GOPROXY mirror returning module bytes that
don't match `go.sum`. Trust root: the `go.sum` at the tag commit; a
poisoned `go.sum` itself is tracked separately under M3 tag-protection.
- `LC_ALL=C` + `TZ=UTC` env + `umask 022` inside the run script of both
Build #1 and Build #2. Canonical reproducible-builds.org stanza; today's
`-trimpath`+`SOURCE_DATE_EPOCH` carry the load for Go output, but the
stanza is cheap insurance against future cgo or non-Go release
artifacts.
- New "Smoke-check `gh attestation verify`" step in the provenance job.
Local-bundle mode (offline trust chain — cert + SCT + Rekor proof are
embedded). Flag set matches `docs/reproducibility.md` step 6:
`--signer-workflow` + `--predicate-type` + `--repo` + `--source-ref` +
`--source-digest`. Pins the OIDC subject path so a different workflow in
the repo with `attestations: write` cannot satisfy it; pins the source
claims so an attestation from a non-tag dispatch is refused.
- `docs/reproducibility.md` step 6 tightened from `--owner` (org-wide)
to `--repo` (org/repo). Adopters following the documented walkthrough
now exercise the same scope CI enforces.
- New "Verify dispatch ref matches tag" pre-flight step. On
`workflow_dispatch` with `inputs.tag` set, asserts `github.ref ==
refs/tags/$INPUT_TAG` and fails fast with the named workaround. Saves
15-30 minutes of runner time on misuse.
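
The dispatch-ref comparison described in the last bullet can be modelled in a few lines. This is a minimal Go sketch, not the shipped step (which is a shell test inside the workflow); the function name `checkDispatchRef` and the error text are hypothetical, and `INPUT_TAG` is assumed to be exported from `inputs.tag` by the workflow.

```go
package main

import (
	"fmt"
	"os"
)

// checkDispatchRef mirrors the pre-flight comparison: when a tag input is
// supplied, the workflow ref must be exactly refs/tags/<tag>.
// The function name and error text are hypothetical.
func checkDispatchRef(ref, inputTag string) error {
	if inputTag == "" {
		return nil // the gate applies only when inputs.tag is set
	}
	if want := "refs/tags/" + inputTag; ref != want {
		return fmt.Errorf("dispatch ref %q does not match %q", ref, want)
	}
	return nil
}

func main() {
	// GITHUB_REF is set by the runner; INPUT_TAG is assumed to be exported
	// by the workflow step from inputs.tag.
	if err := checkDispatchRef(os.Getenv("GITHUB_REF"), os.Getenv("INPUT_TAG")); err != nil {
		fmt.Fprintln(os.Stderr, "pre-flight failed:", err)
		os.Exit(1)
	}
	fmt.Println("pre-flight ok")
}
```

The check is a plain string equality, so a dispatch from a branch ref with a tag input fails in seconds instead of after the build jobs have run.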

**FOLLOWUPS hygiene:**
Closed five rows: `go mod verify`, build-env sanitization,
cosign+gh-attestation flag tightening (cosign half had already shipped),
Rekor log-index URL (already shipped), and workflow_dispatch pre-flight
gate.

Opened three rows: flag-parity lint between release.yml and
reproducibility.md; consolidated `docs/SUPPLY-CHAIN-IDENTITY.md`
reference; component-bench gating scope (tracked from the parallel
k8sevents review).

## Verification

- `make actionlint zizmor` clean on the head commit (zizmor: 0
findings).
- `gh attestation verify --bundle` + `--repo` + `--source-ref` +
`--source-digest` combination verified end-to-end against a public
sigstore bundle (`github/codeql-action v2.25.4`); gh CLI source maps the
flags to Fulcio cert OIDs 1.3.6.1.4.1.57264.1.14 / .13, populated from
OIDC `ref` / `sha` claims at sign time.
- Pre-flight gate is a stand-alone shell test; it exits 1 with a clear
error and the named workaround when `github.ref` and `inputs.tag`
disagree.

## Test plan

- [ ] PR CI green on the head commit.
- [ ] Next real release tag (M21) exercises all four new gates
end-to-end against a real Sigstore bundle.
- [ ] If `gh attestation verify --bundle` rejects the flag combination
at release time, the failure is loud (job fails) and the fix is a
one-line follow-up.

```release-notes
Tightened release-workflow supply chain: defensive `go mod verify`, canonical LC_ALL / TZ / umask reproducible-build stanza, and a local-bundle `gh attestation verify` smoke check pinned to the source tag + commit SHA and the signing workflow. `docs/reproducibility.md` now uses `--repo` so adopter verification matches CI strictness. Workflow_dispatch with `inputs.tag` fails fast if the ref doesn't match. Operator-visible release shape unchanged.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Adds two load-bearing lessons to `AGENTS.md` from this session's CI
work. Both prevent a future contributor from repeating the same trap.

**Aggregator bypass.** GitHub Actions marks an aggregator job SKIPPED
when any job in its `needs:` fails, and treats SKIPPED required
checks as satisfied. PR #73 silently merged past a failed `verify-test`
because the aggregator from PR #72's verify split was SKIPPED rather
than FAILURE. The fix shape (`if: always()` + `needs.*.result` check)
shipped in PR #74; this lesson documents the trap and the fix so anyone
splitting CI jobs in the future doesn't repeat it.
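
The trap can be modelled as a small truth table. The Go sketch below is only a model of the semantics described in this lesson, not GitHub's implementation; the type and function names are hypothetical, and treating every non-success result as a failure in the `always()` branch is an assumption.

```go
package main

import "fmt"

// Result models a job conclusion as GitHub Actions reports it.
type Result string

const (
	Success Result = "success"
	Failure Result = "failure"
	Skipped Result = "skipped"
)

// aggregate models an aggregator job. Without always(), a failed need
// causes the job to be skipped; with always(), the job runs and inspects
// each needs.*.result itself.
func aggregate(needs []Result, alwaysRun bool) Result {
	allOK := true
	for _, r := range needs {
		if r != Success {
			allOK = false
		}
	}
	switch {
	case allOK:
		return Success
	case alwaysRun:
		return Failure // explicit needs.*.result check fails the job
	default:
		return Skipped // job never runs
	}
}

// satisfiesRequiredCheck models branch protection: a skipped required
// check does not block the merge.
func satisfiesRequiredCheck(r Result) bool {
	return r == Success || r == Skipped
}

func main() {
	needs := []Result{Success, Failure} // one sub-job ok, verify-test failed
	fmt.Println(satisfiesRequiredCheck(aggregate(needs, false))) // true: merge slips through
	fmt.Println(satisfiesRequiredCheck(aggregate(needs, true)))  // false: merge blocked
}
```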

**Perf-budget regex flake class.** `require.Regexp` assertions with
implicit upper bounds (e.g. `0\.0[0-9]+`) on values whose only invariant
is `>0` flake on slow CI runners. Two of these hit in one session:
`TestReceiver_SLIBudget` (emit-latency, observed 539ms) and
`TestReceiver_SetDegraded` (degraded-seconds, observed 0.126s). The fix
shape is the same in both — relax to any positive value
(`\d+\.[0-9]*[1-9]`) or use baseline-relative comparisons.
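
The difference between the two patterns is easy to check with the standard `regexp` package. A minimal sketch, assuming the metric value is rendered as a bare decimal string in seconds (the real tests match against a larger metric line, and the sample values other than 0.539 and 0.126 are made up):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// strict carries an implicit upper bound: the first fractional digit
	// must be 0, so only values below 0.1 match.
	strict = regexp.MustCompile(`0\.0[0-9]+`)
	// relaxed requires only a non-zero digit somewhere after the point.
	relaxed = regexp.MustCompile(`\d+\.[0-9]*[1-9]`)
)

func main() {
	// 0.539 and 0.126 are the observed slow-runner values from the lesson;
	// 0.042 and 0.000 are illustrative.
	for _, v := range []string{"0.042", "0.539", "0.126", "0.000"} {
		fmt.Printf("%s strict=%t relaxed=%t\n", v, strict.MatchString(v), relaxed.MatchString(v))
	}
}
```

Note that the strict pattern also accepts `0.000`, so it never enforced the `>0` invariant in the first place; it only bounded the value from above.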

File goes from 128 to 148 lines (cap is 150, with 2 lines of remaining
headroom — next addition should consider demoting an older entry to a
topic note per the file's own promotion rule).

## Test plan

- [x] `wc -l AGENTS.md` reports 148, under the 150-line cap.
- [x] `make doc-check` clean (banned-phrase lint, 250 links resolve,
`(unverified)` count = 7 baseline).
- [x] Capture-flow format check (`learn-from-mistakes` skill): banned
vocabulary absent, no first-person AI phrasing, no AI attribution, both
entries carry `Anchor:` citations.
- [ ] CI on this PR exercises the same gates plus the aggregator that's
now itself an anchor of lesson #1.

```release-notes
NONE — documentation only. Adds two load-bearing lessons to `AGENTS.md` covering GitHub Actions aggregator semantics and a recurring perf-budget regex flake class. No runtime behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
Two independent adversarial reviewers ran fresh against the post-
Phase-2 state. Convergent findings on three substantive issues plus
four smaller polish items. All findings survived contradiction.

Findings applied (7):

P3.F1 [CONCERN/applied] Version-skew "matrix" promised in evolved
  rubric [P2-sre] was delivered only as a label, not a table; the
  §Wire protocol paragraph ended on a dangling colon promising
  content that never appeared. Now ships a four-row receiver-action
  table covering (helper=receiver), (receiver_newer), (helper_newer),
  (unknown), with the Phase 4 alert binding called out explicitly.
  Beneficiary: operator (named: Phase 4 alert rule binds to a
  sustained non-zero `helper_newer` rate after a chart rollout).

P3.F2 [CONCERN/applied] An orphaned "Length-prefix framing eliminates
  any need for an in-band terminator" paragraph sat between the new
  version-skew lead and the versioning detail, breaking the topic
  flow. Hoisted into the framing-ceiling paragraph where it belongs.
  Beneficiary: repo-long-term (cold-reader test).

P3.F3 [CONCERN/applied] Helper lifecycle reorder fixed the
  filesystem-cleanup race but not the original "indeterminate UDS
  state" race the section's lead sentence names: the accepted
  connection fd was never closed during shutdown, only the listening
  socket. Added explicit step 3 closing the accepted-conn fd via a
  lock-protected `_active_conn` slot before join. Step ordering now
  closes accepted-conn → unlink → chain-prior-handler → join, so
  every operator-visible commitment survives a hung dump.
  Beneficiary: operator (named: SIGTERM cleanup completes without
  relying on the daemon-flag-forced-exit backstop; the indeterminate-
  UDS-state hazard the section opens with is now actually closed).

P3.F4 [CONCERN/applied] §Wire protocol RSS reconciliation paragraph
  enumerated three Phase-3 options without picking one. v0.1 alpha
  now pins defaults at `max_threads_per_dump=64` and
  `max_frames_per_stack=128` (worst-case ≈0.8 MiB, comfortable under
  the 10 MB RSS budget). Operators with wider workloads raise the
  caps explicitly with an acknowledged RSS waiver in site config.
  The 32 MiB framing ceiling stands as a protocol-level upper bound
  independent of the default caps. Beneficiary: repo-long-term
  (the rubric "must reconcile" promise now matches what the RFC
  delivers).

P3.F5 [CONCERN/applied] §Supply chain committed to four artifacts
  with no Phase mapping. Phase 2 deliverable list now names PyPI
  trusted-publisher OIDC config, PEP 740 attestation, and typosquat-
  reservation stubs. §Supply chain itself opens with "all four
  artifacts below are Phase 2 deliverables; none of the listed
  registry actions, signing setup, or hash-pinning is in place
  today" — making the forward-commitment status explicit for
  reviewers who only read that section.
  Beneficiary: repo-long-term (per NORTHSTARS O3).

P3.F6 [CONCERN/applied] §Design overview "Cadence pairing with M18"
  paragraph self-described its own silent-breakage hazard ("if
  either side moves, this derivation breaks silently") and deferred
  to a Phase 3 fixture-test with no recorded deliverable. Phase 3
  deliverable now explicitly carries "M13-cadence × M18-threshold
  cross-link fixture asserting the 45s sustained-state derivation
  holds at every build." The paragraph rewrite drops the hand-wave
  framing. Beneficiary: repo-long-term.

P3.F7 [NIT/applied] Defensive "not by oversight" phrasing in the
  `gen_ai.training.rank` attribute-table cell presupposed a prior
  accusation a six-months-cold reader would not understand.
  Rephrased to "Attributes carry this namespace per the NORTHSTARS
  O4 shepherding commitment."

Findings explicitly-skipped (not deferred, judged):

P3.NIT.SR1-citation: Phase 2's P2.SR.1 cited "operator dashboard"
  as a generic surface. Phase 3's new version-skew table now binds
  the metric to a specific Phase 4 alert rule (sustained non-zero
  `helper_newer` rate). Citation upgraded to specific via the
  applied finding above.

P3.NIT.soft-triggers: Two of the new FOLLOWUPS rows ("first operator
  report of X") have non-falsifiable triggers. Accepted as known
  limitation for v0.1; tracecore has no inbound issue-label channel
  to detect this in CI today. Revisit when the operator-feedback
  channel exists.

Validation-cycle stats:
- Findings raised by adversarial #1: 6
- Findings raised by adversarial #2: 6
- Convergent (same defect, both reviewers): 1 (version-skew matrix)
- Findings rejected during contradict: 0
- Findings whose hard-proof did not reproduce: 0
- Findings applied: 7
- Findings explicitly-skipped: 2 (citation upgraded inline; soft-trigger known limitation)
- Findings deferred to FOLLOWUPS: 0 (everything load-bearing was applied)

TDD discipline stats:
- New code changes landed via failing-test-first: 0 (doc-only PR)
- Hard-proof commands executed during validation: 7

Rubric additions accepted in Phase 3 (.claude/ralph-loop.local.md):
- [P3] When promising a "matrix" or "must reconcile" in the evolved
  rubric, deliver the artifact not the language.
- [P3] Lifecycle reorder fixes must close every named race in the
  section's lead sentence, not just the named filesystem cleanup.
- [P3] Sections committing artifacts that span multiple Phases need
  an explicit "all forward, none in place today" disclaimer at the
  section head — the global scope-disambiguation paragraph at the
  top of §Proposal only covers CI gates.
- [P3] Self-described silent-breakage hazards demand an explicit
  enforcement deliverable in the same paragraph.

Adversarial verdicts:
- Adversarial #1: CONCERNS-REQUIRE-FIX (6 findings, 6 surviving)
- Adversarial #2: CONCERNS-REQUIRE-FIX (6 findings, 6 surviving)

Beneficiary tally (applied / skipped):
- Operator:        3 / 0
- Repo long-term:  4 / 0

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

The `docs/adrs/` directory held one file
(`0001-metrics-to-logs-pattern-input.md`). A single-file decision-record
directory sitting parallel to `docs/rfcs/` (13 RFCs + README + template)
is taxonomy drift — operators and contributors had to learn two
near-identical conventions for "load-bearing architectural decision in
tree."

This PR collapses the split.

### What changed

- ADR-0001 → **RFC-0014**
(`docs/rfcs/0014-metrics-to-logs-pattern-input.md`). Content reformatted
to match the RFC template's section headings (Summary / Motivation /
Proposal / Alternatives / Open questions / Migration / References). The
substance (Option A vs Option B vs Option C analysis, the v0.130 contrib
survey, PR-A landed + PR-B pending sequencing) is preserved verbatim —
this is active design for issue #260 PR-B, not archaeology.
- `docs/adrs/` directory removed.
- 6 cross-references repointed at the new RFC path:
  - `docs/README.md` (subdirectories table — `adrs/` row removed)
  - `docs/ATTRIBUTES.md` (2 spots: `tracecore.alert.pcie_rate_collapse.*`
    row + "See also" link)
  - `docs/integrations/prometheus-scrape.md` (2 spots)
  - `docs/patterns/pattern-4-thermal-throttle.md`
  - `docs/patterns/pattern-5-pcie-aer.md`
- 5 source-code comments repointed
(`module/processor/patterndetectorprocessor/{patterndetector.go,thermal_throttle_test.go}`,
`module/pkg/patterns/{pcie_aer.go,thermal_throttle.go}`).
- `docs/rfcs/README.md` status-index gains an RFC-0014 row (`accepted`,
2026-05-31).

### Why convert (not delete)

ADR-0001 was evaluated against the "delete if RFC-0013 already covers
it" bias. It is not covered:

- RFC-0013 §5 mentions a `metricthresholdconnector` as a contribution
slot in one bullet. It does **not** evaluate Option A vs Option B vs
Option C, cite the contrib v0.130 evidence, or sequence the PR-A/PR-B
split for the metric-sourced detectors.
- ADR-0001 is the binding design contract for patterns #1 / #3 / #4 / #5
(4 of the next NORTHSTAR detectors). Source code (`pcie_aer.go`,
`thermal_throttle.go`, `patterndetector.go`) cites it as the reason for
the staged-but-quiet wire-up.

Delete would orphan 5 source comments and break the audit trail for an
active design decision. Convert keeps the record load-bearing without
preserving the parallel taxonomy.

### Verification

- `grep -rn "docs/adrs\|adrs/0001"` returns 0 hits.
- `grep -rn "ADR-0001\|ADR 0001"` returns 0 hits.
- `ls docs/ | grep -i adr` returns empty.
- pre-commit golangci-lint + go vet + go mod verify clean.

```release-notes
docs: collapse single-file `docs/adrs/` into `docs/rfcs/`. ADR-0001 (metrics-sourced pattern inputs) is promoted to RFC-0014 verbatim; cross-references across docs and module source repointed.
```

## Test plan

- [x] `grep -rn "docs/adrs"` returns 0 hits
- [x] `grep -rn "ADR-0001"` returns 0 hits
- [x] `docs/adrs/` directory removed
- [x] golangci-lint + go vet clean (pre-commit hook)
- [ ] CI green

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Extends prometheus-scrape.md with the bridge attribute contract for the
four metrics-derived patterns:

- pattern #1 NVLink (#260) — the `hw.gpu.nvlink.io` OTTL transform
  already lands in commit 0baa557; this PR closes #260's recipe-half.
- pattern #3 HBM ECC (#273) — `hw.errors.delta` + error.{type,
  subtype,persistence} + gpu.id contract.
- pattern #4 thermal throttle (#282) — `hw.gpu.throttle.duration.delta`
  in integer seconds + reason=thermal + gpu.id contract.
- pattern #5 PCIe AER Layer 2 (#284) — the `tracecore.alert.
  pcie_rate_collapse.*` namespace contract.

OTTL metrics->logs emission stays upstream-blocked at OTel-contrib
v0.130 (RFC-0014): no contrib processor or connector emits log records
from a metrics pipeline. The bridge contract documented here is the
load-bearing wire format any future emitter (an upstream
metricthresholdconnector OR the WithMetrics extension to
patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector
projections at module/processor/patterndetectorprocessor/
patterndetector.go gate on this contract today.

last-verified marker bumped to 2026-06-01.

Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed
under #285 in a prior commit; Layer 2 contract documented here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A
row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md.

Detector evaluation rule
- per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow
  (default 2min, forward-only - fb.Timestamp <= oom.Timestamp)
- if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) ->
  kind=fragmentation (raise max_split_size_mb, empty_cache)
- if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard)
- if no FB sample joins -> kind=unknown, confidence=partial

Discriminator value
- fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM
- without DCGM cross-check the operator retries with same batch, hits
  same OOM, wastes a slot
- partial-confidence verdict surfaces the OOM even when DCGM scrape lags,
  so the operator branches on concurrent pod_evicted / xid_correlation
  rather than silence

Files
- module/pkg/patterns/cuda_oom.go - detector + verdict + records
- module/processor/patterndetectorprocessor/cuda_oom.go - projections,
  collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector
- module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring
  tests + 2 Validate guards
- module/processor/patterndetectorprocessor/example_config.yaml -
  cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold
  knobs
- docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries

Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*,
cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes,
cuda_oom.fb_free_ratio, pattern.confidence.

Window-edge fenced both sides per PR #255 lesson. Threshold-boundary
fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors
xid_correlation / pcie_aer / hbm_ecc.

Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per RFC-0014, formerly ADR-0001 (PR-B)

Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in
  red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor
  suites green with -race; make check + make build green

Refs #303

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Resolve 5 conflicts post-PR #310 / #312 / #313:
- factory.go deleted on main (merged into patterndetector.go);
  port wave's selftel wiring (#261) into the merged createLogs
- VerdictAttr* unexported per #310; rename 16 wave-added consts
  + all callers across cuda_oom + ib_link_flap + pcie_aer tests
- docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content
  reconcile after MILESTONES.md moved to docs/

Address reviewer findings before PR:
- docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md
  (Linux CI is case-sensitive)
- pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`,
  code emits "2"/"10"/.../"13"; rewrite spec attribute tables to
  match shipped customer-stable namespace
- pattern.confidence: 8 specs said `high|partial`, code emits
  `full|partial`; rewrite
- 02-ib-link-flap.md attribute drift: spec said
  tracecore.alert.ib_link_flap.{hca_device,port}, code emits
  hw.network.ib.{device,port.num}; align spec to shipped code
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival
  ("6 patterns shipped" -> "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true
  with empty allowedEgressEndpoints (silently kills OTLP exporter)
  + warning when ServiceMonitor scraper in different namespace
- File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE
  -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector
  consumes but recipe gap means it ships dark)

Tests: ./module/processor/patterndetectorprocessor/... +
./module/pkg/patterns/... both ok.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ts (#338)

## Summary

15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing
horizon backlog. 31 commits, 81 files, +8650/-180.

**Code (7 detectors / features):**
- `feat(iblinkflap)` pattern #2 IB link flap detector — 13 tests,
cross-rank helper extracted for reuse by patterns #7/#9
- `feat(cudaoom)` pattern #10 CUDA OOM detector +
fragmentation-vs-true-OOM discriminator — 35 tests, 0/6 false-positive
rate on fixture corpus (#303 wiring — recipe gap tracked at #337)
- `feat(verdict)` deprecate EvictedPod, co-emit PodName + PodNamespace
(#277) with regression-pinning test
- `feat(chart)` opt-in default-deny NetworkPolicy + cert-manager mTLS
reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt
UX warnings for empty-egress / cross-ns scraper traps
- `feat(bench)` per-detector allocs/event harness + soft ratchet gate,
graduation criterion documented (#302)
- `feat(patterndetector)` verdict counter metric for dashboard panels
(#261)
- `fix(slo-rules)` correct otelcol_* label set + drop silent-no-op
`unless on (instance)` join (#298)

**8 pattern design specs (`docs/patterns/{02,07-13}-*.md`):**
- Per pattern: symptom, layers crossed, signal sources, detector
evaluation rule, verdict attrs, edge cases, open questions.
- 7 load-bearing spec gaps flagged for future TDD red-test work
(multi-vendor SDC signal, cohort grouping, processor metrics path, etc).

**9 v1.0-rc1 audit / knowledge-gap docs:**
- `docs/v1-rc1-cut-criteria.md` — 12 falsifiable cut gates derived from
O1-O7
- `docs/v1-rc1-operational-gaps.md` — SLSA L3 + air-gap +
upgrade-rollback audit (8 issues filed #314-#321)
- `docs/v1-rc1-governance-gaps.md` — CODEOWNERS 0%, lint-principles
4/16, retros, `make ci` 148s (5 issues #322-#325, #327)
- `docs/v1-rc1-test-audit.md` — 82.9% coverage, fuzz harness inventory
(5 issues #328-#332)
- `docs/v1-rc1-simplification-audit.md` — top deletion candidates ~9.6K
LOC (3 issues #333-#335)
- `docs/threat-model.md` — STRIDE per trust boundary + audit RFP scope
(#336)
- `docs/reference-environments.md` — Tier 1 kind + Tier 2 32×H100
binding spec for O2 hero KPI
- `docs/adoption-pipeline.md` — S0-S3 funnel + comms templates for O5
hero KPI
- `docs/standards-roadmap.md` — 10 `gen_ai.training.*` attributes
proposed upstream (#326)

**Doc-drift cleanup:** 11 issues closed (#265, #268, #269, #276, #283,
#287, #292-295, #299).

**OTTL recipe wiring:** 6 issues closed (#260, #261, #273, #282, #284,
#285); #272 deferred to standards-roadmap.

**Multi-cluster auth:** bearer-token + mTLS examples (#297).

**Merge resolution + reviewer fixes:**
- Resolved 5 conflicts post-PR #310/#312/#313 (factory.go delete,
VerdictAttr* unexport, MILESTONES.md → docs/, FOLLOWUPS, patterns
README)
- Adversarial reviewer found 1 BLOCKER + 6 MAJOR; all addressed before
push:
  - Renamed 16 `VerdictAttr*` → `verdictAttr*` per #310 convention
  - Re-ported selftel wiring (#261) into main's merged `createLogs`
  - Fixed case-mismatch `docs/THREAT-MODEL.md` → `docs/threat-model.md`
    (Linux CI is case-sensitive)
  - 8 pattern specs schema drift: `pattern.id` slug → numeric (`"2"`,
    `"7"`...`"13"`), `pattern.confidence` `high` → `full`
  - `02-ib-link-flap.md` attribute drift: spec said
    `tracecore.alert.ib_link_flap.{hca_device,port}`, code emits
    `hw.network.ib.{device,port.num}`
  - `v1-rc1-cut-criteria` criterion #1 status stale-on-arrival ("6
    patterns shipped" → "8 patterns shipped, 4 remaining")
  - NetPol UX trap: NOTES.txt warns when `enabled=true` with empty
    `allowedEgressEndpoints` (silently kills OTLP) or cross-ns Prometheus
  - Filed #337 for missing OTTL recipe projecting `DCGM_FI_DEV_FB_*` →
    `hw.gpu.memory.{free,total}` (CUDA OOM detector consumes but recipe gap)
- Post-merge stale-relative-path sweep: 6 wave docs + NORTHSTARS.md +
MILESTONES.md (`docs/`, `../`, `docs/docs/` drift after MILESTONES +
NORTHSTARS moved to docs/)
- Documented 5 newly-emitted attributes in ATTRIBUTES.md (drop_ratio +
IB tier — `attribute-namespace-check` now 67/67)

## Test plan

- [x] `go test ./module/processor/patterndetectorprocessor/...
./module/pkg/patterns/...` — ok
- [x] `make lint` (golangci-lint via goreleaser-style gate) — 0 issues
- [x] `go vet ./...` — clean
- [x] `make doc-check` — passes after stale-link sweep
- [x] `scripts/attribute-namespace-check.sh` — 67/67 documented
- [x] `helm lint install/kubernetes/tracecore` — 0 chart(s) failed
- [x] `promtool check rules` on slo-rules.yaml — 13 rules / SUCCESS
- [ ] CI compat-matrix (rc1 criterion #6) — gated on next wave
- [ ] manual smoke install on real cluster — owner clearance pending

```release-notes
Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM
fragmentation-vs-true discriminator), 8 pattern design specs for the
remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy
+ Prometheus Operator ServiceMonitor on the Helm chart, the
EvictedPod → PodName/PodNamespace verdict-attribute deprecation
co-emit, per-detector allocs/event bench harness, SLO-rules label
fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps,
governance gaps, test audit, simplification audit, threat model,
reference envs, adoption pipeline, standards roadmap).
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #337.

The CUDA OOM detector (`projectFBMemoryRecord` at
`module/processor/patterndetectorprocessor/cuda_oom.go:114`) gates on
`hw.gpu.memory.{free,total}` log-record attributes, but nothing in
the recipe layer produced them: `dcgm-exporter` emits
`DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` as Prometheus gauges and
no OTTL transform projected them onto the customer-stable namespace.
The detector compiled but never fired on a real install, a sibling gap
to #273 (pattern #3), #282 (#4), #284 (#5).

This PR closes the gap on the metric-side projection and pins the
load-bearing log-shape contract the bridge layer MUST honor.

- `docs/integrations/examples/prometheus-scrape.yaml`:
  - `DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used` (Gauge, unit `By`)
  - `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free` (Gauge, unit `By`)
  - Identity-preserving rename only; `hw.gpu.memory.total = used+free`
    deferred to the bridge layer per the named upstream limit (see
    below).
- `docs/integrations/prometheus-scrape.md`:
  - New `### Pattern #10 — CUDA OOM (framebuffer)` metric-side
    projection section with raw-series → semconv table.
  - New `#### Pattern #10 — hw.gpu.memory.{free,total}` bridge-
    contract subsection with full log-record schema (yaml-shaped)
    matching what `projectFBMemoryRecord` reads, plus MIG caveat
    and unit-test cross-link.
  - Intro + bridge-contract header bumped to include pattern #10.
- `docs/patterns/10-cuda-oom-deceptive.md`:
  - Signal-source line links to the recipe sections.
  - Open Question #1 (`DCGM_FI_DEV_FB_*` OTTL extension) struck
    through; resolution recorded.
- `docs/ATTRIBUTES.md`:
  - `hw.gpu.memory.free` / `.total` rows updated to distinguish
    metric vs log shape and to cross-link to the recipe section.
  - New `hw.gpu.memory.used` row (now projected on the metrics
    pipeline by this PR — dashboard evidence context).

## Root cause + named upstream limit

**Root cause (fixed in this PR):** the prometheus-scrape OTTL
transform had no stanza projecting the DCGM FB series onto
`hw.gpu.memory.*`. The detector's projection gate could not be
satisfied on a real install. Fixed by adding the rename stanza in
`transform/dcgm_to_hw_semconv` (same processor the #1/#3/#4/#5
projections already live in — no new processor surface).

**Named upstream limit (NOT worked around — tracked):** OTel-contrib
`transformprocessor` v0.130 `metric_statements` cannot perform
cross-series arithmetic — there is no OTTL path to compute
`hw.gpu.memory.total = hw.gpu.memory.used + hw.gpu.memory.free`
on a metrics pipeline ([upstream
README](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)).
Per RFC-0014 §Alternatives, the metrics→logs primitive does not
exist in contrib at v0.130 either. The total scalar lives at the
bridge layer (RFC-0014 PR-B `WithMetrics` extension to
`patterndetectorprocessor`, tracked under #260). The recipe pins
the load-bearing wire format the bridge MUST honor so PR-B lands
without a contract change.
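
For orientation, the derivation that OTTL cannot express is a one-line sum once both series are available in the same place. The Go sketch below is a hypothetical illustration of what a bridge-layer helper might compute; `fbTotals` is not a function in this repository, and rejecting negative or zero inputs is an assumption.

```go
package main

import "fmt"

// fbTotals derives the total framebuffer size and the free ratio from the
// two series the recipe renames. It reports ok=false when the inputs
// cannot yield a meaningful ratio (assumed behavior).
func fbTotals(usedBytes, freeBytes float64) (total, freeRatio float64, ok bool) {
	total = usedBytes + freeBytes
	if usedBytes < 0 || freeBytes < 0 || total <= 0 {
		return 0, 0, false
	}
	return total, freeBytes / total, true
}

func main() {
	const gib = 1 << 30
	total, ratio, ok := fbTotals(60*gib, 20*gib)
	fmt.Println(total/gib, ratio, ok) // 80 0.25 true
}
```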

## Adopt-over-build posture

Every new OTTL statement uses upstream functions only (`set`,
`==` equality). No new transformprocessor extension. Mirrors the
existing #1/#3/#4/#5 stanzas.

## Test plan

- [x] `make build` — clean
- [x] `./scripts/validator-recipe.sh` — 9 validated, 3 skipped
      (non-linux), 0 fail; prometheus-scrape example passes
      `tracecore validate`.
- [x] `./scripts/doc-check.sh` — 721 markdown links resolve, all
      new cross-links + anchor refs included; banned-phrase lint
      clean; recipe markers (`tested-against`, `last-verified`)
      present.
- [x] `./scripts/attribute-namespace-check.sh` — 67/67 attribute
      literals documented (no new undocumented attrs introduced).
- [x] golangci-lint, go vet, go mod verify via commit hook — clean.
- [ ] CI linux runner exercises journald + k8sobjects skip-paths
      we couldn't run locally (validator-recipe ubuntu job).

```release-notes
recipe(ottl): project DCGM `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` onto the customer-stable `hw.gpu.memory.{used,free}` namespace and pin the metrics-to-logs bridge log-shape spec the pattern #10 (CUDA OOM) detector consumes via `projectFBMemoryRecord`. `hw.gpu.memory.total = used + free` derivation is deferred to the RFC-0014 PR-B `WithMetrics` bridge layer because OTTL `transformprocessor` v0.130 has no cross-series arithmetic on metrics pipelines.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Ships pattern-#9 (NCCL bootstrap timeout) detector end-to-end — the
first job-start-time pattern in the library (sibling to pattern #8 which
fires mid-run). A training-job cohort whose pods are Ready past
`BootstrapDeadline` (default 5min) but where at least one rank never
emitted any NCCL FlightRecorder record is stuck in NCCL bootstrap; a
same-namespace K8s CNI / network-readiness event in the correlation
window promotes the verdict to `confidence=full` and stamps
`discriminator=cni_error`.

Spec:
[`docs/patterns/09-nccl-bootstrap-timeout.md`](docs/patterns/09-nccl-bootstrap-timeout.md).
Status flipped from `planned` → `shipped`; `Implementation notes`
section captures how each spec open-question resolved with the
most-conservative reading.

## What landed

- `module/pkg/patterns/nccl_bootstrap.go` — detector +
`TrainingPodRecord` / `CNINetworkEventRecord` /
`NCCLBootstrapTimeoutVerdict` types. Reuses `NCCLFRRecord` from
`nccl_hang.go`.
- `module/pkg/patterns/nccl_bootstrap_test.go` — 11 detector tests +
schema-conformance + 10-falsifier drift battery. Covers:
full-correlation fires, partial-when-no-CNI, normal-startup-no-fire,
deadline-not-yet-reached, heterogeneous failure, multi-job cohorts don't
merge, namespace-only fallback, cross-namespace CNI doesn't join,
deadline-configurable, deterministic ordering, max(ReadyAt) drives age.
- `module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json` —
JSON Schema with `additionalProperties:false` and full enum guards.
- `module/processor/patterndetectorprocessor/nccl_bootstrap.go` —
projections (`projectTrainingPodRecord` gates on `k8s.pod.ready_time` +
`gen_ai.training.rank`; `projectCNINetworkEventRecord` gates on
`k8s.event.reason` ∈ `{FailedCreatePodSandBox, NetworkNotReady,
CNIError}`), verdict writer with promoted scalars (issue #270 contract),
and runner that consumes NCCL FR records from the existing cross-cutting
`collectInputs` (no double-projection).
- `module/processor/patterndetectorprocessor/nccl_bootstrap_test.go` — 6
wiring tests (full verdict, partial verdict, partial-suppressed-by-flag,
normal-startup-no-fire, sub-1s deadline rejection, sub-1s window
rejection).
- `Config.NCCLBootstrapDeadline` +
`Config.NCCLBootstrapCorrelationWindow` with Validate guards (≥1s) and
`withDefaults` / `defaultConfig` wiring; `example_config.yaml` updated.
- `docs/ATTRIBUTES.md` — 3 new
`tracecore.alert.nccl_bootstrap_timeout.*` rows, new
`k8s.pod.ready_time` row, updated `gen_ai.training.job_id` row (now
consumed with fallback), new per-pattern matrix row for
`nccl_bootstrap`.
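
The two knobs and their ≥1s guard can be pictured with a minimal Go sketch. The field names and defaults (5m, 10m) come from this description; the trimmed `Config` struct, the error wording, and the assumption that `withDefaults` runs before `Validate` are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config carries only the two knobs discussed here; the real struct has more fields.
type Config struct {
	NCCLBootstrapDeadline          time.Duration
	NCCLBootstrapCorrelationWindow time.Duration
}

// Defaults as stated in the release note.
const (
	defaultDeadline = 5 * time.Minute
	defaultWindow   = 10 * time.Minute
)

// withDefaults fills zero-valued knobs so an omitted YAML key stays valid.
// Treating zero as "unset" is an assumption of this sketch.
func (c Config) withDefaults() Config {
	if c.NCCLBootstrapDeadline == 0 {
		c.NCCLBootstrapDeadline = defaultDeadline
	}
	if c.NCCLBootstrapCorrelationWindow == 0 {
		c.NCCLBootstrapCorrelationWindow = defaultWindow
	}
	return c
}

// Validate rejects any knob below one second and reports every violation at once.
func (c Config) Validate() error {
	var errs []error
	if c.NCCLBootstrapDeadline < time.Second {
		errs = append(errs, fmt.Errorf("nccl_bootstrap_deadline must be >= 1s, got %s", c.NCCLBootstrapDeadline))
	}
	if c.NCCLBootstrapCorrelationWindow < time.Second {
		errs = append(errs, fmt.Errorf("nccl_bootstrap_correlation_window must be >= 1s, got %s", c.NCCLBootstrapCorrelationWindow))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	fmt.Println(Config{}.withDefaults().Validate())
	fmt.Println(Config{NCCLBootstrapDeadline: 500 * time.Millisecond}.withDefaults().Validate())
}
```

Collecting both violations with `errors.Join` is a design choice of the sketch: an operator who mistypes both knobs sees both messages in one run.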

## Design calls (load-bearing)

- **Cohort key.** `(gen_ai.training.job_id, k8s.namespace.name)` when
stamped; `(k8s.namespace.name)`-only fallback when job_id is absent
(spec open question #1). Empty `gen_ai.training.job_id` on the verdict
signals the fallback path to operators.
- **Bootstrap-failed-rank index key.** `(node, rank)` not `(namespace,
rank)` — avoids cross-cohort contamination when two jobs in the same
namespace land on different nodes. FR records with empty Node are
skipped from the index (a wiring gap should NOT cause false-negatives —
i.e. mask real bootstrap failures — even at the cost of cross-job
false-positives that are unlikely in practice).
- **CNI vocab.** v0 ships the K8s-control-plane vocabulary only
(`FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError`). Per-CNI
raw-error parsing (Cilium / Calico / multus distinct strings) is the
discriminator-branch follow-up that lights up `socket_ifname_mismatch` /
`rendezvous_unreachable`.
- **Cohort size.** Count of distinct ranks the detector observed
pod-Ready signals for. Pods that never reached Ready (image-pull stuck)
don't enter the cohort; they belong to pattern #15. This matches the spec's
"slow image pull" edge case: no false-positive.
- **`max(ReadyAt)` drives deadline.** A late-joining rank pushes the
effective ready timestamp forward, preventing false-positives during
rolling pod-Ready scenarios on cold-cache clusters.
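
A minimal Go sketch of the cohort key and the `max(ReadyAt)` deadline anchor described above. `TrainingPod` is a trimmed, hypothetical stand-in for `TrainingPodRecord`, the helper names are invented for illustration, and the strict `>` comparison is an assumption rather than the detector's confirmed boundary.

```go
package main

import (
	"fmt"
	"time"
)

// TrainingPod is a hypothetical, trimmed stand-in for TrainingPodRecord.
type TrainingPod struct {
	JobID     string
	Namespace string
	Rank      int
	ReadyAt   time.Time
}

// CohortKey groups pods by (job_id, namespace). An empty JobID is the
// namespace-only fallback and is left visible on purpose.
type CohortKey struct {
	JobID     string
	Namespace string
}

func keyFor(p TrainingPod) CohortKey {
	return CohortKey{JobID: p.JobID, Namespace: p.Namespace}
}

// at builds a UTC timestamp from Unix seconds (helper for the example).
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// latestReady returns max(ReadyAt) across the cohort.
func latestReady(pods []TrainingPod) time.Time {
	var latest time.Time
	for _, p := range pods {
		if p.ReadyAt.After(latest) {
			latest = p.ReadyAt
		}
	}
	return latest
}

// pastDeadline measures the cohort's age from the latest pod-Ready, so a
// late-joining rank pushes the deadline forward.
func pastDeadline(pods []TrainingPod, now time.Time, deadline time.Duration) bool {
	if len(pods) == 0 {
		return false
	}
	return now.Sub(latestReady(pods)) > deadline
}

func main() {
	pods := []TrainingPod{
		{Namespace: "train", Rank: 0, ReadyAt: at(0)},
		{Namespace: "train", Rank: 1, ReadyAt: at(240)}, // late joiner
	}
	fmt.Println(keyFor(pods[0]) == keyFor(pods[1]))
	fmt.Println(pastDeadline(pods, at(400), 5*time.Minute))
	fmt.Println(pastDeadline(pods, at(541), 5*time.Minute))
}
```

In the example, anchoring on the first rank's Ready time would already exceed 5 minutes at t=400s; anchoring on the latest one does not, which is the false-positive the design call is meant to avoid.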

## Test plan

- [x] `cd module && go test ./pkg/patterns/...
./processor/patterndetectorprocessor/...` — clean
- [x] `cd module && go vet ./...` — clean
- [x] Pre-commit hook: `golangci-lint run ./...` — 0 issues;
`attribute-namespace-check` — 72/72 documented
- [x] TDD discipline: `test(nccl-boot): RED` → `feat(nccl-boot): GREEN`
commits
- [ ] CI green on PR (full matrix)

```release-notes
feat(patterns): pattern-9 (NCCL bootstrap timeout) detector — fires when a training-job cohort has at least one rank with no NCCL FR record past `BootstrapDeadline` from pod-ready (default 5min); a same-namespace `FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError` event promotes to `confidence=full` with `discriminator=cni_error`. New YAML knobs: `nccl_bootstrap_deadline` (default 5m), `nccl_bootstrap_correlation_window` (default 10m). Verdict shape pinned by `nccl_bootstrap_verdict.schema.json`.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but
the join shape varies across patterns and the rationale lived only
in code comments + PR review threads. Operators tuning windows had to
read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the
causal physics of each signal), not bugs:

- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start
  checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window
  could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the
  bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window
  anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing
every detector to the asymmetric two-knob form would silently
zero one leg for the one-sided detectors (footgun on clock skew)
and would not apply to #13 at all.
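
For readers comparing the shapes, here is a minimal Go sketch of the first three predicates. The function names and the inclusive boundaries are assumptions made for illustration; they are not taken from the detector code.

```go
package main

import (
	"fmt"
	"time"
)

// at builds a UTC timestamp from Unix seconds (helper for the example).
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// oneSidedLookback: the cause must precede the effect by at most window.
func oneSidedLookback(cause, effect time.Time, window time.Duration) bool {
	d := effect.Sub(cause)
	return d >= 0 && d <= window
}

// asymmetric: separate pre and post legs around an anchor.
func asymmetric(event, anchor time.Time, pre, post time.Duration) bool {
	d := event.Sub(anchor)
	return d >= -pre && d <= post
}

// symmetric: anchor ± window, for signals that may be cause or consequence.
func symmetric(event, anchor time.Time, window time.Duration) bool {
	return asymmetric(event, anchor, window, window)
}

func main() {
	w := 30 * time.Second
	fmt.Println(oneSidedLookback(at(90), at(100), w)) // cause 10s before effect
	fmt.Println(symmetric(at(101), at(100), w))       // 1s after the anchor
	// Expressing a one-sided lookback as the two-knob form leaves post=0,
	// so an event stamped 1s late by clock skew is dropped without warning.
	fmt.Println(asymmetric(at(101), at(100), w, 0))
}
```

The last call illustrates the footgun named in the decision: a zeroed leg looks like a valid configuration but rejects skewed timestamps silently.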

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13
  (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md
  covering ALL 11 detectors with the predicate, anchor, and shape
  rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…ard) (#477)

## Summary

Closes the `docs/MILESTONES.md` §M6 carry-forward: *"every fenced block
in `docs/getting-started.md` is exercised by `scripts/smoke.sh`"*.

The ≤5-count gate shipped with the M6 wave; the binding half was tracked
carry-forward because `smoke.sh` ran a parallel hand-written
hostmetrics→debug config rather than the doc's actual YAML.

## Root cause

Two files owned the "first OTLP byte" config: `smoke.sh` rendered one
inline, `docs/getting-started.md` carried another. They happened to
agree, but nothing forced them to. The carry-forward existed because the
binding was *correct by inspection*, not *correct by construction*.

The fix is to make the doc the single source: `smoke.sh` extracts the
YAML from `docs/getting-started.md`'s `## Walkthrough` heredoc at
runtime. If the doc grows a typo, a renamed receiver, or a different
scraper, `smoke.sh` exercises the change automatically. If the heredoc
disappears, the extractor fails loud with a named error.
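
The extraction idea can be sketched in Go. This is not the perl one-liner that `smoke.sh` uses; it is a stand-in under stated assumptions: the config sits in a shell heredoc (for example `<<'EOF'` ... `EOF`) somewhere under the `## Walkthrough` heading, and the function and error names are hypothetical.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// ErrNoWalkthroughHeredoc is the named error returned when nothing can be extracted.
var ErrNoWalkthroughHeredoc = errors.New("no heredoc found under '## Walkthrough'")

// heredocStart captures the delimiter tag of a shell heredoc such as <<'EOF'.
var heredocStart = regexp.MustCompile(`<<-?\s*['"]?([A-Za-z_][A-Za-z0-9_]*)['"]?`)

// extractWalkthroughHeredoc returns the body of the first heredoc that starts
// inside the "## Walkthrough" section. An unterminated heredoc also yields the
// named error. Limitation: any line starting with "## " outside a heredoc is
// treated as a heading, including shell comments in code fences.
func extractWalkthroughHeredoc(markdown string) (string, error) {
	var (
		inSection bool
		tag       string
		body      []string
	)
	sc := bufio.NewScanner(strings.NewReader(markdown))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case tag != "": // inside the heredoc: collect until the closing tag
			if strings.TrimSpace(line) == tag {
				return strings.Join(body, "\n") + "\n", nil
			}
			body = append(body, line)
		case strings.HasPrefix(line, "## "): // a level-2 heading opens or closes the section
			inSection = strings.TrimSpace(line) == "## Walkthrough"
		case inSection:
			if m := heredocStart.FindStringSubmatch(line); m != nil {
				tag = m[1]
			}
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", ErrNoWalkthroughHeredoc
}

func main() {
	doc := strings.Join([]string{
		"# Getting started",
		"## Walkthrough",
		"cat > config.yaml <<'EOF'",
		"receivers:",
		"  hostmetrics: {}",
		"EOF",
		"## Next steps",
	}, "\n")
	cfg, err := extractWalkthroughHeredoc(doc)
	fmt.Printf("%q %v\n", cfg, err)
	_, err = extractWalkthroughHeredoc("## Elsewhere\nno heredoc here\n")
	fmt.Println(err)
}
```

Returning a sentinel error rather than an empty string is what lets a wrapper script distinguish "doc changed shape" from "doc carries an empty config".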

## Changes

- `scripts/smoke.sh` — extracts the Walkthrough heredoc via a perl
one-liner, writes it to a tempfile, then runs `tracecore validate
--config=` + `tracecore --config=` against it (Walkthrough steps 3 + 4).
Lifecycle-log assertions retained, with `"Shutdown complete"` now
load-bearing against the doc's post-walkthrough prose.
- `scripts/doc-check.sh` — new gate (right after the existing ≤5-count
gate) asserts the smoke↔doc binding with four mutation-verified clauses:
Walkthrough scope, `"$BIN" validate --config=` invocation, `"$BIN"
--config=` run invocation, `docs/getting-started.md` path reference.
- `scripts/smoke_test.sh` — new mutation-verify harness mirroring the
gate at runtime, plus an inline mutant-doc test that proves the
extractor exits 1 and the wrapper emits the named error when the heredoc
is removed.
- `Makefile` — `make smoke` now also runs `smoke_test.sh`; wired into
`ci-full` alongside the existing `smoke-quickstart` target.
- `docs/MILESTONES.md` — §M6 status `⧗ partial` → `☑ delivered`;
getting-started rubric `⧗` → `☑`; carry-forward bullet rewritten
(remaining work is operator-config branch-protection only).

## Runtime

End-to-end `bash scripts/smoke.sh` on darwin/arm64: **~2.2s** (extract +
validate + 1.5s run window + lifecycle-log assertions). Well under the
120s ci-fast budget. No hardware required — uses the `hostmetrics` load
scraper, portable across linux/darwin/windows.

## Test plan

```release-notes
ci(smoke): scripts/smoke.sh now extracts its YAML config from docs/getting-started.md '## Walkthrough' instead of carrying a parallel hand-written config; doc-check.sh gates the doc↔smoke binding with four mutation-verified clauses. Closes the M6 carry-forward.
```

- [x] `bash scripts/smoke.sh` exits 0 on clean main (verified locally,
~2.2s).
- [x] `bash scripts/smoke_test.sh` all assertions pass.
- [x] `bash scripts/doc-check.sh` reports `scripts/smoke.sh binds to
docs/getting-started.md (M6: every block exercised by smoke.sh)`.
- [x] Mutation test #1: `sed -i 's/"$BIN" validate --config=/"$BIN" XXX
--config=/' scripts/smoke.sh` → doc-check exits 1 naming "validate
--config= invocation (Walkthrough step 3)".
- [x] Mutation test #2: `sed -i 's/"$BIN" --config=/"$BIN" XXX=/'
scripts/smoke.sh` → doc-check exits 1 naming "run invocation
(Walkthrough step 4)".
- [x] Mutation test #3: `sed -i 's/Walkthrough/Section/'
scripts/smoke.sh` → doc-check exits 1 naming "extraction scope lost".
- [x] Mutation test #4: `sed -i
's|docs/getting-started.md|docs/SOMEWHERE-ELSE.md|' scripts/smoke.sh` →
doc-check exits 1 naming "binding source missing".
- [x] Mutation test #5: getting-started.md with no `## Walkthrough`
heredoc → smoke.sh exits 1 with named error message (covered by
`smoke_test.sh`).
- [x] `make lint` 0 issues; `make vet` clean; `make doc-check` clean
(all 18 gates pass).
- [x] `make smoke` end-to-end including `smoke_test.sh` passes.

## Related

- Refs `docs/MILESTONES.md` §M6 (Documentation scaffold).
- Sibling #460 (`fix(doc-check): drop unconditional exit 0`) made this
carry-forward visible — before #460, the new gate would have been
silently skipped by the line-99 short-circuit.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Adds a kubelet-probe ingress rule to the chart's `NetworkPolicy`
template, closing **M5b chart opportunistic #1**
(`docs/followups/M5b.md`).

**Root cause.** Kubelet liveness/readiness probes originate from the
node IP via the host-network namespace. NetworkPolicy v1 cannot match
host-network traffic with `namespaceSelector` or `podSelector` peers —
only with `ipBlock`. The existing chart `NetworkPolicy` (issue #301)
carved ingress for in-namespace pods (scrape-in) but had no rule the
kubelet matched. Result: a `networkPolicy.enabled: true` install would
flip every DaemonSet pod `NotReady` within one `failureThreshold` window
— the chart would render its own DaemonSet inoperable.

**Fix.** New `networkPolicy.kubeletProbes.{enabled,cidr,except}` block.
When enabled (default `true` when the policy is enabled), the template
renders an `ipBlock` ingress rule on the `health` port (chart default
`:13133`). Default `cidr: 0.0.0.0/0` is permissive on source IP but
L4-scoped to the healthcheckextension port, so the telemetry + OTLP
receiver ports stay locked down. Operators with a fixed node CIDR
tighten it in their overlay.
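
To make the `cidr` / `except` semantics concrete, here is a minimal Go sketch of how an `ipBlock` peer decides whether a source address matches. The `IPBlock` type and `Allows` method are hypothetical names for illustration. Real enforcement happens in the cluster's network plugin, and the scoping to the `health` port is applied separately from this source-IP check.

```go
package main

import (
	"fmt"
	"net/netip"
)

// IPBlock mirrors the cidr/except pair of a NetworkPolicy ipBlock peer.
type IPBlock struct {
	CIDR   string
	Except []string
}

// Allows reports whether src lies inside CIDR and outside every Except range.
func (b IPBlock) Allows(src string) (bool, error) {
	addr, err := netip.ParseAddr(src)
	if err != nil {
		return false, err
	}
	cidr, err := netip.ParsePrefix(b.CIDR)
	if err != nil {
		return false, err
	}
	if !cidr.Contains(addr) {
		return false, nil
	}
	for _, e := range b.Except {
		ex, err := netip.ParsePrefix(e)
		if err != nil {
			return false, err
		}
		if ex.Contains(addr) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Values borrowed from the tightened-CIDR mutation in the test plan.
	tight := IPBlock{CIDR: "10.0.0.0/16", Except: []string{"10.0.99.0/24"}}
	for _, ip := range []string{"10.0.1.5", "10.0.99.7", "192.168.1.1"} {
		ok, err := tight.Allows(ip)
		fmt.Println(ip, ok, err)
	}
	// The chart default: any IPv4 source matches, so only the port narrows it.
	open := IPBlock{CIDR: "0.0.0.0/0"}
	ok, err := open.Allows("192.168.1.1")
	fmt.Println(ok, err)
}
```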

Production preset (`values-production.yaml`) inherits the default-on
posture. Schema (`values.schema.json`) extended with
`additionalProperties: false` so typos fail at `helm install`.

```release-notes
chart: NetworkPolicy now carves a port-scoped `ipBlock` ingress rule for kubelet liveness/readiness probes (`networkPolicy.kubeletProbes.*`), so `networkPolicy.enabled: true` no longer breaks the DaemonSet's own readiness flow. Closes M5b chart opportunistic #1.
```

## Cross-references

- `docs/followups/M5b.md` — opportunistic-deferral list, item #1 ticked.
- `docs/threat-model.md` §6.G — network-surface audit scope this
template satisfies (listener inventory + default-deny verification).
- `install/kubernetes/tracecore/README.md` §security — operator-facing
values walkthrough updated.
- Builds on `#301` (initial scrape-in + OTLP-out scope).

## Files changed

- `install/kubernetes/tracecore/templates/networkpolicy.yaml` — new
`ipBlock` ingress rule + load-bearing comment block explaining why
`0.0.0.0/0` stays narrow.
- `install/kubernetes/tracecore/values.yaml` — new
`networkPolicy.kubeletProbes` defaults + comment.
- `install/kubernetes/tracecore/values-production.yaml` — inherits
defaults explicitly with production-context comment.
- `install/kubernetes/tracecore/values.schema.json` — schema for the new
block, `additionalProperties: false`.
- `install/kubernetes/tracecore/README.md` — three new values-table rows
+ updated NetworkPolicy section with threat-model cross-link.
- `docs/followups/M5b.md` — item #1 ticked with implementation pointer.

## Test plan

- [x] `helm lint install/kubernetes/tracecore` — exit 0.
- [x] `helm lint install/kubernetes/tracecore -f values-production.yaml`
— exit 0.
- [x] `helm template install/kubernetes/tracecore` — exit 0;
NetworkPolicy NOT rendered (default `enabled: false`).
- [x] `helm template install/kubernetes/tracecore -f
values-production.yaml` — exit 0; NetworkPolicy rendered with
kubelet-probe ingress rule.
- [x] **Mutation: enabled with empty `allowedEgressEndpoints`** —
renders correctly (no DNS / probe rule loss).
- [x] **Mutation: `kubeletProbes.enabled: false`** — probe rule omitted;
scrape-in rule unchanged.
- [x] **Mutation: tightened `cidr: 10.0.0.0/16` with `except:
[10.0.99.0/24]`** — renders `ipBlock.cidr` + `ipBlock.except` correctly.
- [x] `conftest test --policy policies/conftest/tracecore.rego` on
default render — 52/52 passed.
- [x] `conftest test --policy policies/conftest/tracecore.rego` on
production render — 91/91 passed.
- [x] `kubeconform -strict -ignore-missing-schemas -kubernetes-version
1.30.0` on default render — 4 valid, 0 invalid.
- [x] `kubeconform -strict -ignore-missing-schemas -kubernetes-version
1.30.0` on production render — 6 valid, 0 invalid, 1 skipped
(ServiceMonitor CRD).
- [x] commit-msg hook gates: golangci-lint clean, go vet clean, go mod
verify clean, attribute-namespace-check clean.

## Grade

**A+** — root-cause fix, mutation-verified, conftest + kubeconform +
helm-lint all clean, cross-linked to threat-model.md §6.G, explicit
`policyTypes: [Ingress, Egress]` deny-all baseline documented inline,
M5b checklist item ticked.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
M19 carry-forward #1 — ship the infrastructure that lets operators
contribute anonymized pod_evicted captures under
`module/pkg/replay/pod_evicted/_real_world/<anon-name>/`.

* `scripts/anonymize-pod-evicted-fixture.sh` — deterministic sha8
  rewrite of event_uid / regarding.{namespace,name,uid} /
  reporting_instance / node_{name,uid}; verifier flags surviving
  IPv4 / email / cloud-instance-node / image-ref shapes in note +
  message prose.
* `scripts/anonymize-pod-evicted-fixture_test.sh` — mutation tests:
  baseline-clean passes; IPv4 / email / EC2 / GKE / ECR shapes
  fail verify; `v1.28.4`-style version strings do NOT false-positive;
  rewrite is deterministic (two passes byte-identical) and strips
  every raw input string.
* `synthetic-2026-06-multi-rank-disk-pressure/` — synthetic-but-
  real-world-shaped fixture exercising multi-rank disk-pressure
  burst with mixed full+partial confidence (third eviction at T+35s
  falls outside the 30s join window, partial-remediation path
  inferring disk pressure from note).
* `TestPodEvictedReplay_RealWorldGroupLoaderSafe` — asserts the
  loader walks `_real_world/` identically to `_negative/`; the
  synthetic fixture is the load-bearing proof of the loader path.
* README polished with the explicit PII-field map + cross-link to
  `docs/threat-model.md`; threat-model row updated to reflect the
  partial-shipped enforcement.
* `make ci-full` + `make verify` gain
  `anonymize-pod-evicted-fixture-check` so a PR that drops raw PII
  into `_real_world/` fails before merge.
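
A minimal Go sketch of the verify idea: scan free-text prose for shapes that look like PII and report which ones matched. The real verifier is a shell script; the three regular expressions and the names `shape` and `findPII` below are illustrative assumptions, not its actual rule set.

```go
package main

import (
	"fmt"
	"regexp"
)

// shape pairs a label with a pattern. These patterns are assumptions made
// for this sketch, not the regex set used by the real script.
type shape struct {
	name string
	re   *regexp.Regexp
}

var shapes = []shape{
	// Four dot-separated numeric groups; a three-part version such as
	// v1.28.4 cannot satisfy it.
	{"ipv4", regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)},
	{"email", regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)},
	// EC2-style private hostnames such as ip-10-0-1-23.
	{"ec2-node", regexp.MustCompile(`\bip-\d{1,3}(?:-\d{1,3}){3}\b`)},
}

// findPII returns the labels of every shape found in prose, in a fixed order.
func findPII(prose string) []string {
	var hits []string
	for _, s := range shapes {
		if s.re.MatchString(prose) {
			hits = append(hits, s.name)
		}
	}
	return hits
}

func main() {
	for _, note := range []string{
		"kubelet v1.28.4 evicted pod: disk pressure",
		"node 10.0.3.17 low on ephemeral-storage",
		"evicted from ip-10-0-1-23.ec2.internal",
		"contact ops@example.com",
	} {
		fmt.Printf("%q -> %v\n", note, findPII(note))
	}
}
```

Returning labels instead of a boolean keeps the failure message specific, which matters when the check runs as a merge gate.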

```release-notes
feat: pod_evicted replay fixtures gain a deterministic PII anonymizer
(`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic
multi-rank disk-pressure fixture under
`module/pkg/replay/pod_evicted/_real_world/`, closing M19
carry-forward #1's infrastructure obligation. Operator-contributed
captures still welcome.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

- Replace the `ErrPending` stub at `tools/failure-inject/ncclhang/` with
a deterministic wrapper over `module/pkg/nccl/fr_parser.Synthesize`.
Output is one of the canonical M11 hang fixtures (`nccl-2.29.x-hang` /
`nccl-2.30.x-hang`), selected by `--seed mod 2`; bytes round-trip
through `frparser.Parse` and a re-synthesize is byte-identical — closes
**M4b carry-forward #1**.
- Pin the new SHA in `tools/failure-inject/testdata/golden.sha256` so
`chaos.yml`'s `harness-determinism` job (matrix `linux/amd64` +
`linux/arm64`) replays the same argv on both arches and enforces
cross-arch SHA equality — closes **M4b carry-forward #2**.
- Flip ⧗ → ☑ on the two M4b functional rubrics (round-trip,
safe-opcodes) and the M4b determinism non-functional rubric, plus the
M11 synthetic-fixture-generator rubric. Remove the `failure-inject
nccl-hang` follow-up from `docs/followups/M4b.md` and from M11's
carry-forward list.

## Root cause

M4b shipped at v0.1 with the `nccl-hang` subcommand stubbed
(`ErrPending`, exit 70) because `pkg/nccl/fr_parser/synthesize.go` was
still pending under M11. M11 landed the synthesizer plus the canonical
hang fixtures (`fixture229Hang`, `fixture230Hang`) in
`module/pkg/nccl/fr_parser/`. The CLI shim was carry-forward — this PR
is the wiring.

## What's in the diff

- `tools/failure-inject/ncclhang/ncclhang.go` — `Options{Seed uint64}`;
`Run` selects a hang variant by `Seed % len(hangVariants)`, calls
`FixtureSpec.Bytes()` (which delegates to `frparser.Synthesize`), writes
to `w`. `ErrPending` deleted; `ctx.Err()` honoured before any write.
- `tools/failure-inject/main.go` — pass `Options{Seed: *c.flagSeed}`
through to `ncclhang.Run`; drop the `errors.Is(err, ncclhang.ErrPending)
→ exit 70` branch.
- `tools/failure-inject/ncclhang/ncclhang_test.go` — RED → GREEN:
`TestRun_RoundTrip` (synthesize → parse → re-synthesize byte-identical),
`TestRun_SeedDeterminism` (same seed → same bytes, 4 seeds),
`TestRun_SafeOpcodesOnly` (delegates to `frparser.Parse` as the
safe-opcode oracle — a naive byte scan false-positives on opcode bytes
inside `SHORT_BINUNICODE` string literals), `TestRun_CtxCancelled`.
- `tools/failure-inject/main_test.go` — replace
`TestRun_NCCLHangReturnsNotImplemented` with `TestRun_NCCLHangRoundTrip`
+ `TestRun_NCCLHangSeedDeterminism` so the contract is pinned through
the actual argv path too.
- `tools/failure-inject/testdata/golden.sha256` — add `failure-inject
--seed=0 nccl-hang → e6f49920…`. The existing `TestRun_GoldenSHA` loop
in `main_test.go` and the `Golden SHA pin` step in `chaos.yml` pick it
up automatically.
- `docs/MILESTONES.md` — flip §M4b rubrics ⧗ → ☑ (round-trip,
safe-opcodes, cross-arch determinism) and §M11 synthetic-fixture rubric;
trim carry-forward list.
- `docs/followups/M4b.md` — mark the `nccl-hang` entry closed with the
wiring-PR pointer.
- `tools/failure-inject/README.md` — add a `nccl-hang` section; remove
`nccl-hang` from carve-outs (now only `pod-evict --allow-cluster-write`
carves).
- `module/receiver/ncclfrreceiver/README.md` — replace stale `tracecore
failure-inject` invocation with the actual `go run
./tools/failure-inject` path.
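
The seed-to-variant mapping and the check-context-before-write ordering can be sketched in Go as follows. The payloads are placeholders rather than real FlightRecorder bytes, and `digest` is a hypothetical helper; only `Options{Seed uint64}`, `Run`, `hangVariants`, and the modulo selection are taken from the description above.

```go
package main

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// Options carries the seed, as in the described Options{Seed uint64}.
type Options struct{ Seed uint64 }

// hangVariants stands in for the two canonical fixtures. The byte payloads
// are placeholders, not real FlightRecorder dumps.
var hangVariants = [][]byte{
	[]byte("placeholder: nccl-2.29.x-hang"),
	[]byte("placeholder: nccl-2.30.x-hang"),
}

// Run checks ctx before any write, then emits the variant picked by
// Seed % len(hangVariants).
func Run(ctx context.Context, w io.Writer, opts Options) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	variant := hangVariants[opts.Seed%uint64(len(hangVariants))]
	_, err := w.Write(variant)
	return err
}

// digest returns the hex SHA-256 of the bytes Run emits for a seed.
func digest(seed uint64) (string, error) {
	var buf bytes.Buffer
	if err := Run(context.Background(), &buf, Options{Seed: seed}); err != nil {
		return "", err
	}
	sum := sha256.Sum256(buf.Bytes())
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	for _, seed := range []uint64{0, 1, 42} {
		d, err := digest(seed)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("seed=%d sha256=%s\n", seed, d[:8])
	}
	// A cancelled context must produce an error and zero bytes written.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	var buf bytes.Buffer
	fmt.Println(Run(ctx, &buf, Options{Seed: 0}), buf.Len())
}
```

With two variants, seeds 0 and 42 map to the same index and so print the same digest, while seed 1 prints a different one.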

## Test plan

- [x] `go test -race -count=1 ./tools/failure-inject/...` — green (4
packages).
- [x] `(cd module && go test -race -count=1 ./pkg/nccl/fr_parser/...)` —
green (no semantic change here, gate against accidental drift).
- [x] `go build ./... && (cd module && go build ./...)` — clean.
- [x] Pre-commit gates: `golangci-lint`, `go vet`, `go mod verify`,
`attribute-namespace-check` — all 0 issues.
- [x] End-to-end determinism: `failure-inject --seed=0 nccl-hang |
sha256sum` reproduces the pinned SHA (`e6f49920…`) twice in a row.
- [x] Seed variance: `--seed=1` produces a distinct SHA (`2788a726…`);
`--seed=42` (42 mod 2 = 0) matches `--seed=0` per the documented modulo
mapping.
- [x] `failure-inject nccl-hang --help` documents `--seed` and `--out`
and the round-trip-through-`fr_parser` purpose.

## Self-grade

**A+**: round-trip green, determinism golden-SHA pinned, safe-opcode set
verified via parser oracle, cross-arch SHA equality wired into existing
`chaos.yml` matrix, MILESTONES.md flipped on four ⧗ rubrics, `M4b.md`
follow-up closed with a pointer, doc drift swept.

```release-notes
tools(failure-inject): `nccl-hang` subcommand now produces parseable byte-deterministic NCCL FlightRecorder bytes via `pkg/nccl/fr_parser` (was a stub returning `ErrPending`). `--seed` flag selects variant + deterministic synthesis; cross-arch SHA enforced in `chaos.yml` (linux/amd64 + linux/arm64). Closes M4b carry-forward #1 + #2.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
#484)

## Summary

Closes the M19 carry-forward #1 *infrastructure* obligation: real-world
`pod_evicted` replay captures can now be safely contributed.

- **Deterministic PII anonymizer**:
`scripts/anonymize-pod-evicted-fixture.sh` (`--rewrite` rewrites
`event_uid` / `regarding.{namespace,name,uid}` / `reporting_instance` /
`node_{name,uid}` to `<prefix>-<sha8(value)>` while preserving `-rank-N`
suffixes; `--verify` refuses any fixture still carrying IPv4, email,
EC2/GKE/AKS, or AWS-ECR/GCR-style image-ref shapes in prose).
- **Mutation tests**: `scripts/anonymize-pod-evicted-fixture_test.sh`
proves the verifier catches every PII shape it claims to catch, the
rewrite is byte-deterministic across two passes, and false-positives
stay quiet on innocent inputs (`v1.28.4`-style version strings).
- **Synthetic real-world-shaped fixture**:
`module/pkg/replay/pod_evicted/_real_world/synthetic-2026-06-multi-rank-disk-pressure/`
exercises a 3-pod disk-pressure burst with two full-confidence joins
(per-condition cache reuse) + one partial-remediation eviction at T+35s
(outside the default 30s `JoinWindow` → note-inferred pressure path).
- **Loader-symmetry test**:
`TestPodEvictedReplay_RealWorldGroupLoaderSafe` now asserts the loader
walks `_real_world/` exactly like `_negative/` and would catch a future
refactor that broke either group walk.
- **Threat-model + MILESTONES** updated: the §7 audit row references the
anonymizer; the M19 carry-forward bullet reflects what's shipped vs
still pending (operator captures).
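
The `<prefix>-<sha8(value)>` rewrite can be sketched in Go. The real anonymizer is a shell script, so everything here is illustrative: `sha8` is assumed to mean the first 8 hex characters of SHA-256, and hashing the stem without its `-rank-N` suffix is one possible reading of "preserving `-rank-N` suffixes".

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

// rankSuffix matches a trailing "-rank-N" that the rewrite keeps readable.
var rankSuffix = regexp.MustCompile(`-rank-\d+$`)

// sha8 returns the first 8 hex characters of SHA-256(value).
// Assumption: the script's "sha8" may be defined differently.
func sha8(value string) string {
	sum := sha256.Sum256([]byte(value))
	return hex.EncodeToString(sum[:])[:8]
}

// anonymize maps value to "<prefix>-<sha8(stem)>" and re-appends any
// "-rank-N" suffix. Hashing only the stem is a choice of this sketch.
func anonymize(prefix, value string) string {
	suffix := rankSuffix.FindString(value)
	stem := value[:len(value)-len(suffix)]
	return prefix + "-" + sha8(stem) + suffix
}

func main() {
	fmt.Println(anonymize("pod", "llama-train-rank-3"))
	fmt.Println(anonymize("pod", "llama-train-rank-3")) // identical on a second pass
	fmt.Println(anonymize("ns", "team-a"))
}
```

Because the mapping is a pure function of the input, an operator can rerun it locally and get the same output a CI gate would see, which is what makes the rewrite usable as a contribution workflow.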

## Root cause being fixed

M19 carry-forward #1 was "no captures contributed yet" — but the deeper
blocker was that **no operator could safely contribute** without (a) a
deterministic anonymizer they could rerun on their side, (b) a verifier
strong enough to use as a CI gate, and (c) loader proof that
`_real_world/` actually walks. This PR ships all three. Future captures
plug in without code changes.

## Test plan

- [x] `go test ./module/pkg/replay/... -count=1` → all green; new
`synthetic-2026-06-multi-rank-disk-pressure` subtest runs.
- [x] `bash scripts/anonymize-pod-evicted-fixture_test.sh` → 11
assertions pass (baseline clean, IPv4 / email / EC2 / GKE / ECR shapes
flagged, version-string false-positive guarded, deterministic-rewrite
byte-equal, every raw input string stripped, shipped fixture clean).
- [x] `make anonymize-pod-evicted-fixture-check` → wires verify +
mutation tests together; exits 0.
- [x] `bash scripts/doc-check.sh` → unaffected, still clean.
- [x] `shellcheck` clean on both new scripts.
- [x] `go vet ./module/...` clean.

## Follow-up

- `cuda_oom`, `nccl_hang`, `hbm_ecc` and the other pattern detectors
don't yet have `_real_world/` slots. The anonymizer is shaped to
generalize (the structured-field map is the only pattern-specific bit;
the prose-PII regex set is universal). Tracked as a follow-up issue once
a second operator capture justifies the rule-of-three lift.

```release-notes
feat: pod_evicted replay fixtures gain a deterministic PII anonymizer
(`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic
multi-rank disk-pressure fixture under
`module/pkg/replay/pod_evicted/_real_world/`, closing M19
carry-forward #1's infrastructure obligation. Operator-contributed
captures still welcome.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission
validation (PSA-restricted × Kyverno × Gatekeeper × default+production)
delivered negative ROI at rc1.

## Root cause

Four PRs were blocked by, or spent chasing, this workflow's flakes
(#475 introduction, #481, #498, #501). The workflow caught zero real
regressions, only its own infra bugs:
- ServiceMonitor CRD bootstrap race (#494)
- AppArmor host-capability mismatch (#481, #493)
- kubectl wait .status.conditions nil race (#500, #501)

## Coverage retained (without policy-matrix)

- `conftest` — offline PSS-baseline + restricted validation.
- `helm lint` — chart structural validation.
- `kubeconform` — K8s API conformance.
- `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs) —
API-level breakage on generic kind cluster.

## What stays in tree

- `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs —
cheap reactivation when GA triggers fire.
- `install/kubernetes/tracecore/policies/conftest/**` — offline policy
bundle (still active).

## Re-enable triggers (tracked in #502)

- GA criterion #1 (third-party audit) requests engine-specific compat
validation.
- First operator running under Kyverno/Gatekeeper reports admission rot.
- CRD-bootstrap pattern stabilises across other workflows.

## Test plan

- [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup
action.yml).
- [x] No remaining policy-matrix.yml references in repo (verified by
grep).
- [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace).
- [x] README + install-bench stale refs scrubbed (follow-up commit).

```release-notes
ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502.
```

Refs #502 #475 #494 #500.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>