ci(deps): bump the gh-actions group with 5 updates #1
Closed
dependabot[bot] wants to merge 1 commit into
Bumps the gh-actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
| [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) | `6` | `9` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [github/codeql-action](https://github.com/github/codeql-action) | `3` | `4` |

Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v5...v6)

Updates `golangci/golangci-lint-action` from 6 to 9
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v6...v9)

Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v7)

Updates `github/codeql-action` from 3 to 4
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: golangci/golangci-lint-action
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
- dependency-name: github/codeql-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: gh-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Labels
The following labels could not be found: Please fix the above issues or remove invalid values from |
Looks like these dependencies are updatable in another way, so this is no longer needed.
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Closes PR-13 review #1: assembly was in cmd/tracecore where it could only be exercised by spawning the binary. Now it lives in its own package + can be reused by anyone building pipelines (future plugin surface, `tracecore validate` as a library, etc.).

Why a sibling and not under internal/pipeline: internal/config already imports internal/pipeline (it returns pipeline.Signal / pipeline.NewType). Putting the builder INSIDE internal/pipeline would create a cycle (pipeline → config → pipeline). The pipelinebuilder sibling sidesteps it; both directions stay one-way.

Move scope:
- cmd/tracecore/build.go → internal/pipelinebuilder/builder.go
- cmd/tracecore/signalops.go → internal/pipelinebuilder/signalops.go
- cmd/tracecore/fuzz_test.go → internal/pipelinebuilder/fuzz_test.go
- buildPipelines (unexported) → BuildPipelines (exported entry point)
- helpers stay package-private inside pipelinebuilder
- cmd/tracecore/{collect,validate}.go call pipelinebuilder.BuildPipelines

cmd/tracecore main.go remains the place where kingpin wires CLI → runCollect/runValidate → pipelinebuilder + components(). Generated components() stays in cmd/tracecore because it's the binary's registry-of-choice.

Coverage tooling fixes that follow from the move:
- `make coverage` now uses -coverpkg=./cmd/...,./components/...,./internal/... so cross-package coverage is correctly attributed (cmd/tracecore tests exercise pipelinebuilder; coverage credits pipelinebuilder, not cmd/tracecore).
- tools/coverage-check now deduplicates repeated file:range entries in coverage.out (Go writes one row per test run per instrumented line when -coverpkg is active; raw sum would multiply the denominator by run-count).

Test coverage holds: pipelinebuilder 74%, pipeline 94.5%, fanout 100%, config 94.4%
- internal/pipelinebuilder/builder_test.go added: a processor-stage test using fake echoReceiver / noopProcessor / sinkExporter factories. No in-tree component exercises buildProcessors today; without this, those 80+ lines of code would be uncovered.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Reviewer-C (security + failure modes) returned 3 blockers + 13 strong;
Reviewer-D (docs + adoption) returned 4 strong. Dispositions in
docs/loops/m9-review-notes.md.
Blocker fixes:
- Attribute value sanitization (parser.go:289). Every attribute value
passes through sanitizeAttrValue: strip non-printable control
bytes (<0x20 except \t \n \r, plus 0x7f DEL); cap length at
4 KiB with `...` truncation. Defends against attacker-controlled
syslog/kmsg payloads breaking downstream JSON/Loki/Elastic
parsers (research § Pass-3 B5.2).
- kmsg `bufio.ErrTooLong` no longer crashes the source. Scanner
errors now distinguish `kmsg_oversized` (record exceeded the
1 MiB ceiling) from `kmsg_overflow` (EPIPE/ENOTRECOV from the
ring buffer) — operators alert on the right kind.
- decodeJournaldMessage range-checks byte-array values (0-255);
out-of-range now returns a parse error instead of silently
truncating high bytes to byte() — data integrity invariant.
Strong fixes:
- journalctl --version probe at supervise start; degrade once with
an actionable message when systemd<200 lacks --output=json
support.
- journald arg-building sorts map keys before emitting Matches
entries — argv is now deterministic (PRINCIPLES.md §12).
- JournalctlPath rejected at Validate if not absolute.
- parseKmsgRecord now errors on a malformed sequence number.
- Source goroutines wrap their hot loop in safeRun / safeSupervise
with defer/recover + telemetry.IncError("panic") + markDegraded.
- Removed dead `var _ = errors.Is; _ = io.EOF` block from kmsg.go.
- example_config.yaml default min_severity changed from `warning`
to `info` so NVLink-down notes (priority 6, the canonical Xid 79
signal for Pattern #1) are not silently filtered out.
Deferred (FOLLOWUPS / Carry-forward M9): subprocess env scrubbing,
journalctl stderr capture, facilityNumToName O(1) reverse map,
field map cap before attribute construction, maxRetries variable
rename, clean-exit-as-crash recovery, goroutine close-race
tightening.
Coverage stays >70% with new sanitization + truncation + bad-
sequence + rejected-byte-array + version-probe tests.
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
The previous AggregateSLOSource computed lifetime cumulative failure ratio. After a single failure on call #1, the gauge stayed > 0 forever, which is useless for SLO alerting that targets a recent window.

Replace with a sliding-window source: maintain a ring of timestamped (success, failure) snapshots; on each scrape, find the latest sample ≥ window-old as the anchor and compute (Δfailure / Δtotal) since. Returns 0 while warming up (no anchor yet) and on zero in-window calls. DefaultSLOWindow is 60s, which matches typical k8s probe cadence.

API change: AggregateSLOSource gains state and a constructor (NewAggregateSLOSource). cmd/tracecore updated; tests rewritten to exercise the windowing semantics:
- TestAggregateSLOSource_WindowedRate: signal in window appears as the expected rate; subsequent signal at the same in-window ratio stays at the same rate.
- TestAggregateSLOSource_WindowedRate_LifetimeRatioNotReflected: the bug-driving case, where a long-ago single failure doesn't pin the gauge above 0 once it rolls out of the window.

Ring buffer is pruned to 2× window of samples per scrape so memory stays bounded under fast scrape cadence.

Coverage: internal/telemetry up to 83.6%. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Address item #1: the 292-line internal/selftelemetry/impl.go mixed three concerns (receiver impl + exporter impl + init-error tracking), forcing readers to context-switch across responsibilities.

Split:
- receiver_impl.go (~200 lines): NewReceiver, receiverImpl, the five-method binding, degraded-seconds bookkeeping
- exporter_impl.go (~80 lines): NewExporter, exporterImpl, FailureRateReader satisfaction
- init_errors.go (~40 lines): RecordInitError

Common state hoisted to receiver_impl.go: ErrNilMeterProvider sentinel and `instrumentationScope` constant (the package-stable Meter scope name shared across all three call sites).

No API change; tests pass without modification. Each file is now a single coherent unit and a future maintainer reading "what does NewExporter do?" doesn't have to scroll past Receiver internals.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Closes 4 of 7 new A+ criteria from the recursive self-review:

#1: e2e-otelcontrib now verifies the collector PARSED the record, not just that it accepted bytes. Workflow rewritten to docker-run otelcol-contrib with a custom config (file + debug exporters, detailed verbosity). After the e2e POST, the bash step greps /tmp/otelout/logs.json for the canonical body, the kernelevents.xid attribute, and the gpu.id attribute. Empty file or missing attributes → workflow fails.

#2: TestIntegration_KmsgWriteReadBehavioral (//go:build linux) writes a synthetic <6>NVRM Xid 79 line to /dev/kmsg, uses a marker string in a regex_filter to isolate from ring-buffer noise, then asserts the receiver emits a plog.LogRecord with kernelevents.xid=79 + gpu.id=0000:65:00.0 within 3s. A regression in parse/build/emit fails this on Linux CI.

#3: prometheus_alerts_test.go validates the alert YAML structure (every group has interval, every rule has expr/severity/summary/description) AND cross-references the metric + label-filter names against the receiver's actual SelfTelemetry surface. A typo in the alert would silently never fire; this catches it before merge.

#5: runbook_test.go executes the RUNBOOK's "First 15 minutes" step 1 (`tracecore validate --config=...`) and step 2 (`tracecore debug dump`) as real commands. Documentation rot becomes a test failure, not a silent SRE-time discovery.

#4: sustained_test.go (`//go:build sustained`) feeds 1000 events/sec for 5 minutes (300k records), samples heap every 30s, asserts ≤10 MiB growth and p99 emit latency tail bounded. New `sustained-load` workflow job runs it on push-to-main + schedule (not PR, because 5 minutes is too slow for the inner loop).

The seventh criterion (two-week soak + external operator) requires elapsed time + a human; nothing in-session can close it.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Round-3 review (two passes) caught 5 strongs I shipped in the
round-2 fix wave. This commit closes them AND adds a test gate
per bug class so the same class can't re-ship silently.
N1 — CAS-pair memory-model claim was incorrect:
- Earlier RECEIVER-PATTERNS entry claimed Start's CAS publishes
the subsequent `r.cancel = cancel` write via the Go memory
model. It doesn't — the CAS HB edge only covers writes
sequenced-BEFORE the CAS. In practice this worked because the
OTel runtime serializes Start→Shutdown, but that's a runtime
contract, not memory-model coverage, and the pattern doc
would have taught M9/M11 authors the wrong invariant.
- Fix: `r.cancel` is now `atomic.Pointer[context.CancelFunc]`.
Store in Start, Load in Shutdown. This makes the publish
memory-model-correct in all contexts (not just OTel-runtime
ones). Pattern doc rewritten honestly: CAS pairs are for
*idempotence*; the cancel publish is its own atomic.
- Gate: `TestReceiver_CancelIsAtomicPointer` parses receiver.go
via go/ast and refuses any non-atomic.Pointer shape on the
cancel field. Future refactors that revert to bare
CancelFunc fail at CI.
N2 — Example contradicts its own header:
- `docs/agents/examples/non_blocking_start.go` used
`IncError(Kind("panic"))` casts even though the file's header
claims typos are caught at compile time. `Kind("typoo")`
compiles fine — defeating the entire point of the typed Kind.
- Fix: declared per-receiver `const KindConnect Kind = "connect"`
etc. in the example body; replaced all `Kind("…")` casts with
the constants.
- Gate: `TestExamples_NoUntypedKindCasts` walks
`docs/agents/examples/*.go` and refuses (a) bare string
literals to IncError AND (b) `Kind("literal")` casts. M9+
contributors can't accidentally copy the broken shape.
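A short illustration of why the casts in N2 defeat the typed Kind. The `counter` type and `KindPanic` are assumptions made for this sketch; only `const KindConnect Kind = "connect"` appears in the commit text.

```go
package main

import "fmt"

// Kind is a typed error-kind label.
type Kind string

const (
	KindConnect Kind = "connect" // from the commit message
	KindPanic   Kind = "panic"   // assumed name, for illustration
)

// counter is a hypothetical stand-in for the telemetry sink.
type counter struct{ counts map[Kind]int }

func newCounter() *counter { return &counter{counts: make(map[Kind]int)} }

// IncError bumps the count for one error kind.
func (c *counter) IncError(k Kind) { c.counts[k]++ }

func main() {
	c := newCounter()

	// Using a declared constant: a misspelling such as KindConect would be
	// an undefined identifier and fail to compile.
	c.IncError(KindConnect)

	// Both of these compile even though the value is a typo, because a cast
	// or an untyped string constant converts to Kind without any check.
	c.IncError(Kind("typoo"))
	c.IncError("typoo")

	fmt.Println(c.counts[KindConnect], c.counts["typoo"]) // 1 2
}
```

This is why the gate rejects both bare string literals and `Kind("literal")` casts: neither form gets compile-time protection.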
N3 — Alert #1 still had the for+increase pairing B5 fixed on
alert #2:
- `DCGMReceiverDegraded` had `for: 5m` paired with
`increase(...[5m])`, doubling its effective window to ~10m.
Same bug class as B5; I only fixed one of the two alerts.
- Fix: dropped `for: 5m` on DCGMReceiverDegraded with the same
comment explaining the rationale.
- Gate: `TestPrometheusAlerts_NoDwellDoubling` parses the
alerts YAML and asserts no rule pairs `increase(...[N])` with
`for: N` without an explicit allowlist label. The future
alert author proposing both must opt in deliberately.
N5 — `warnOnce` lost kind-transition breadcrumbs:
- The previous shape `if r.degraded { return }` suppressed
ALL warn-level logs after first failure, including a
different failure kind on the next tick (connect→watch
transition mid-degraded-cycle). Operators lose the
breadcrumb trail.
- Fix: `warnOnce(kind, msg, args...)` keys on
`(degraded, kind)` — log fresh when the kind changes, even
if still degraded. Threaded the kind through all 7 callers.
- Gate: `TestWarnOnce_RelogsOnKindTransition` exercises the
helper directly: first kind=K1 logs; repeat-K1 silenced;
kind=K2 logs fresh. The exact behavior an operator cares
about, pinned by a unit test.
N4 — K8s manifest in README was broken multiple ways:
- telemetry default-off → probes fail → CrashLoop on apply
- "DaemonSet + anti-affinity" was contradictory
- SYS_ADMIN/hostPID claimed required for standalone mode (not
needed; only embedded mode needs them)
- only `/dev/nvidia0` mounted (need nvidiactl + nvidia-uvm +
per-GPU device files)
- Fix: section now ships a paired ConfigMap that enables
telemetry and binds on 0.0.0.0; DaemonSet drops the
unnecessary privileges; the section is marked
"illustrative — not production-ready" and explicitly defers
workload-specific privilege layering to the Helm chart (M6).
- Gate: `TestReadme_K8sExampleParsesAndEnablesTelemetry`
extracts the YAML block, parses both docs (ConfigMap +
DaemonSet), asserts (a) `enabled: true` AND `0.0.0.0` in the
config, (b) both liveness + readiness probes exist pointing
at /healthz + /readyz. A future doc author can't ship a
manifest that would CrashLoop on apply.
Nits:
- N6: reverted `watchUpdateDivisor` / `watchKeepForMultiplier`
to untyped consts (the canonical Go shape for unitless
ratios; typing them as time.Duration was dimensionally
confused).
- N9: anchored regex `\b` on the metric-value match in the M2
wiring test — `} 1` was accidentally matching `} 12` /
`} 100`.
- N10: clarified `client_cgo.go` comment that Close() returns
nil (consistent with stub, but the previous comment misled
casual readers).
- Cgo placeholder operator-deception risk: variant string now
`cgo-placeholder` not `cgo` until the real binding lands.
`tracecore receivers list` shows `dcgm [cgo-placeholder]`
so operators on a real GPU host can't deploy a stub binary
thinking it's the real one. Legend in the receivers-list
output explains the three values.
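The N9 fix above is easy to reproduce in isolation. The sketch below uses a hypothetical metric name and assumes the test matches against Prometheus text-format lines; it only demonstrates the word-boundary behavior, not the project's actual test.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose is the unanchored shape: it also matches values that merely
	// start with 1.
	loose = regexp.MustCompile(`\} 1`)
	// anchored adds \b so the digit must end at a word boundary.
	anchored = regexp.MustCompile(`\} 1\b`)
)

func main() {
	// Hypothetical exposition lines; the metric name is made up.
	lines := []string{
		`example_up{gpu="0"} 1`,
		`example_up{gpu="0"} 12`,
		`example_up{gpu="0"} 100`,
	}
	for _, l := range lines {
		fmt.Printf("%-26s loose=%-5v anchored=%v\n", l, loose.MatchString(l), anchored.MatchString(l))
	}
}
```

Only the first line matches the anchored pattern; the loose pattern matches all three.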
S19 partial (wire build-tags into make ci):
- `make ci` now depends on `build-tags`. Every `make ci` run
(local + GitHub Actions) gates on the cgo vs default build
compiling cleanly. Pre-existing target now actually fires in
the standard CI surface.
FOLLOWUPS additions (deferred but tracked with trigger predicates):
- S18 `pkg/dcgm.Probe(…)` library helper — when a second
external consumer materializes.
- N7 AST walker resolve-map by reflection — when selftelemetry
adds a new canonical Kind.
- N8 AST walker globs *.go non-test — paired with the
receiver.go split FOLLOWUP.
- Promote `make build-tags` into the pr-validation shortcut
workflow — opportunistic next CI sweep.
`make ci` passes; dcgm coverage steady at 86.0%; the build-tag
matrix is now part of every CI run.
Assisted-by: Claude Opus 4.7
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
## Summary

Lands the M8 DCGM receiver scaffold: vendor-SDK isolation, the receiver itself, full operator surface, and a documented path to A+ on the universal receiver rubric.

The cgo Client (`pkg/dcgm/client_cgo.go`) and the hardware integration test runs are deferred to a follow-up PR on a Linux GPU runner (this PR was authored on a macOS host without `libdcgm-dev`). The full sequencing of follow-up work lives in [`docs/M8-NEXT.md`](docs/M8-NEXT.md).

## Release notes

```release-notes
[FEATURE] DCGM receiver (alpha). Ships in build-tag-isolated stub-only mode for safe cross-platform deployment; cgo path lands in a Linux+GPU follow-up. `tracecore receivers list` shows the deployed variant as `dcgm [stub]` / `dcgm [cgo]` so operators can verify the binary's hardware-binding without reading go.mod.
```

## What landed

**Core receiver:**
- `components/receivers/dcgm/`: config (19-case Validate), factory (mirrors clockreceiver), receiver lifecycle (non-blocking Start, reconnect loop, idempotent double-Start + double-Shutdown, panic-recovery on the scrape goroutine), Sample→pmetric emission for all 13 metric families with cardinality cap + deterministic drop order + NaN/Inf guard, kind-aware resource attribution (GPU vs MIG Instance vs NVSwitch).
- `pkg/dcgm/`: `Client` interface (no build tag), sentinel errors, pure-Go types (Entity, Sample, FieldGroup, EntityKind, Version), `client_stub.go` (`//go:build !dcgm`) returning `ErrDCGMUnavailable`.
- Centralised metric NAMES + attribute KEYS + well-known values as constants in `metric_names.go`, so a rename is one-file.
- Wired into the binary via `components.yaml` + `make generate`; `tracecore validate` accepts a dcgm config; `tracecore receivers list` prints `dcgm [stub]`; `make smoke` boots-degrades-shuts-down end-to-end.

**Operator docs:**
- `README.md` (configuration table, Configuration-errors table, what-it-emits with PROPOSED-extension flags, lifecycle state diagram, SLI/SLO targets marked "target, not measured", cardinality budget with the worst-case math, backend compatibility matrix incl. Prom `.→_` rewrite caveat, Quickstart, Privacy + data residency considerations, "Want to add a sibling receiver?")
- `example_config.yaml` (minimal) + `example_config_full.yaml` (every knob)
- `RUNBOOK.md` keyed by alert name + per-kind triage table
- `prometheus-alerts.example.yaml` (3 starter alerts; thresholds chosen so they're reachable at default 15s tick)
- `HARDWARE-TESTING.md` (Linux+GPU walkthrough)
- `.github/ISSUE_TEMPLATE/component-bug-dcgm.yml`
- `docs/patterns/`: four pattern walkthroughs (NVLink degradation, HBM ECC, thermal throttle, PCIe AER) with PromQL alerts and replay tests
- `docs/rfcs/0005-dcgm-receiver-scope.md`

**Process artifacts:**
- `docs/AGRADE-RECEIVER-RUBRIC.md`: universal A+ rubric (37 criteria, 6 lenses) for vendor-SDK receivers
- `docs/M8-AGRADE-GAP.md`: scoring vs the rubric + what's gated
- `docs/M8-NEXT.md`: consolidated index of all 30 deferred / follow-up items
- `docs/retros/M8-fourloop.md`: what the four-loop process caught (9 blockers) vs missed (3 self-review finds) with 5 concrete process changes for M9+
- `docs/proposals/semconv-hw-gpu-extensions.md`: staged upstream PR body for the four PROPOSED semconv extensions
- `docs/FOLLOWUPS.md`: opportunistic + skipped items with falsifiable trigger predicates (M8 section absorbed from the formerly-separate repo-root file)
- `MILESTONES.md` M8 row updated; STRATEGY.md gets four new divergence rows (one resolved via M2 self-telemetry landing)
- `docs/agents/RECEIVER-PATTERNS.md` gets six patterns + Pattern-selection table; `docs/agents/examples/` ships six runnable Go files (`//go:build ignore`) per pattern

**Tests:**
- Lifecycle: non-blocking Start, idempotent double-Start + double-Shutdown, panic recovery, recover-from-degraded, MIG re-enumerate, ConnectionLost reset, healthy-end-to-end, ConsumerGPU partial-field path.
- Per-sentinel fault injection: 4 error sentinels table-driven via `injectingClient`; `StatusStale` now surfaces as `IncError(KindRead)` (was a silent drop); `StatusNoData` stays silent (transient by spec); panic injection via `panicClient`.
- Metric emission: per-family pin, kind-aware resource decoration, NaN/Inf guard, group-by-metric-name, cardinality cap determinism, fuzz-based invariants over 200 random inputs.
- Stress: 100-cycle Start/Shutdown asserts no goroutine leak; 10-repeat Shutdown asserts idempotence.
- End-to-end: capturingConsumer asserts the full downstream-visible shape (Resource attrs, scope name, metric kinds, units, OTLP/JSON marshalling).
- Coexistence: `exporterPreemptedClient` proves the dcgm-exporter co-deployment constraint is test-pinned.
- Pattern replay: 4 tests reproducing each NORTHSTARS Appendix A pattern's signature.
- Docs parity: README references every shipped ancillary; example configs cover every README-documented knob; Validate's error substrings appear in README/RUNBOOK; alerts in YAML have RUNBOOK headings.
- **`TestRUNBOOK_KindsMatchEmitted`**: new structural test walks every emitted IncError/failedTick kind against the RUNBOOK per-kind triage table in both directions. Closes the drift bug class (`consume` vs `downstream`) at CI time.
- **`TestReceiver_M2WiringFromMeterProvider`**: new test pins the M2 canonical self-telemetry wiring; a future refactor that deletes the 6-line wiring block would not be caught without this (noop fallback hides regressions).
- Symmetric drop-order pins: every emitter group in dropOrder; every dropOrder entry has an emitter or an allowlisted placeholder.
- Performance budget: `TestEmit_StaysUnderBudget` fails the build if emit() regresses past 1ms (today: ~165µs under -race).

Coverage on `components/receivers/dcgm/`: ~86%.

## Carry-forward (must land before alpha → beta)

Single index: [`docs/M8-NEXT.md`](docs/M8-NEXT.md). High points:
- `pkg/dcgm/client_cgo.go` via `NVIDIA/go-dcgm`
- `//go:build dcgm,hardware` integration test running in CI
- Linux GPU runner provisioned; `.github/workflows/ci-hardware.yml.staged` renamed `.yml`
- Cardinality cap validated against three reference fleets
- Measured overhead numbers in the README's SLI/SLO table
- Upstream OTel semconv PR for the four PROPOSED extensions
- External operator pilots the receiver in production

## Loops

- Loop 1 (Research, 5 passes): 4 parallel research agents → citation-backed Findings + Candidate Designs A/B/C → Design C (mode-toggle, default standalone) chosen via scoring matrix.
- Loop 2 (Scrutinization, 3 passes): 18 questions; key revision reversed the cardinality drop order to preserve NVLink profiling for pattern-#1 diagnosis.
- Loop 3 (Coding): atomic commits, one per work item + fix-up commits.
- Loop 4 (Review): 6 reviewer subagents across 3 passes surfaced 9 blockers + 26 strong + nits. Every finding dispositioned in `docs/loops/m8-review-notes.md` (worktree-local).
- Post-merge passes after M2 landed: typed `selftelemetry.Kind` refactor catches the kind-rename bug class at compile time; external review findings (operator-drift fixes, double-Close bug, log-storm gating, StatusStale signalling) addressed in the final commits with a structural drift test.
- A+ rubric scored M8 at composite ~3.85 / 5 (A-). Real A+ requires hardware + future-milestone evidence.

## Test plan

- [x] `make ci` clean (cmd/tracecore integration tests flake under -race on macOS-arm64 in parallel; retry-once pattern logged in `docs/FLAKY-TESTS.md`)
- [x] `make smoke` runs in CI: validates example config, boots binary, asserts lifecycle log lines
- [x] `tracecore validate --config=example_config.yaml` accepts
- [x] `tracecore receivers list` shows `dcgm [stub]` (or `dcgm [cgo]` when built with `-tags dcgm`) + `clockreceiver`
- [x] Coverage ≥60% per components/ floor (actual: ~86%)
- [x] Goroutine-leak stress (100 cycles), cardinality fuzz (200 trials), end-to-end shape pinning
- [x] Every Go file carries the SPDX-License-Identifier header
- [ ] Hardware path: gated by the cgo follow-up PR on a Linux GPU runner

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Four R1 findings folded into one commit (docs/CI surface).

#1: README config table missed the top-level `enabled *bool` kill-switch. Added the row at the top of the table with its nil-means-active semantics so operators can grep the table for the field and find it (config.go:27 has been there since the initial M9 work; the README just didn't surface it).

#2: README forward-reference to "the container realities section above" pointed at nothing. Added the actual section ("Container realities") with four operator-actionable bullets: mount the host /dev/kmsg (not the empty pod-local one), CAP_SYSLOG instead of root, multi-tenant blast-radius warning, and the namespaced-kmsg 5.10+ posture. Section anchors a follow-on ready-to-paste DaemonSet manifest (see commit F). TOC updated; threat-model table now links by anchor instead of prose.

R1.S3: alert-check.sh regex too narrow. The previous regex required a suffix in {Receiver,Source,Pipeline,Exporter,Processor} and would miss future alerts named after a domain (e.g. `KernelEventsXidBurst`). Broadening to "any TitleCase identifier ≥12 chars" produced false positives (Go identifiers like `OTLPRoundTrip`, `AmbientCapabilities`). Final shape: drop direction-2 lexicon-based extraction entirely, keep only direction-1 (alerts-yaml is source of truth → MUST appear in the runbook). Direction-2 ("stale runbook reference to a deleted alert") is rare and self-revealing (the alert just doesn't fire), so the cost of false positives outweighs the benefit of catching it pre-merge.

#7: RUNBOOK preamble for receiver-local error kinds. The C commit already added the per-kind triage section; this commit ties it into the error-message index and explicitly states the "why no page alert" rationale so a reviewer doesn't ask the question again.

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
The previous gate exited at the sha256 mismatch, which left no diagnostic trail for triaging which bytes diverged between Build #1 and Build #2. Inverting the control flow: run diffoscope on a mismatch, capture its text report, then exit non-zero. On a match, run diffoscope --exit-code as the load-bearing assertion. Either way diffoscope output ends up in the job log.

Also upload both binaries as a "failed-build-pair" artifact when the job fails; this is needed for offline triage when the on-runner diff isn't enough (e.g. comparing across two failed runs).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Diffoscope on test tag v0.0.0-m3test-2 surfaced the actual delta: two runtime/debug.BuildInfo entries differed across builds. vcs.modified flipped from false to true, and the +dirty suffix appeared in the embedded module version. Cascading: that fed a different action-ID into the Go linker, which changed NT_GNU_BUILD_ID, which changed the file hash.

Root cause: Build #1 created build1/ inside the worktree and moved the binary into it. By the time Build #2 ran `go build`, the worktree contained untracked files (build1/tracecore_linux_amd64 + .sha256), so `git status --porcelain` was non-empty. `go build -buildvcs=true` (default) reads that and sets vcs.modified=true for Build #2.

Fix: build each iteration into `mktemp -d` outside the source tree. The worktree stays clean; Go's VCS probe sees identical state on both runs; build IDs match; binaries match. The canonical artifact is then staged from BUILD1_DIR into ./release/ for the rest of the workflow. Failure-triage upload still grabs both builds when the gate trips.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
…rage

Four parallel reviews landed seven actionable changes:
- Cold rebuild: both builds now use isolated $(mktemp -d) GOCACHE dirs so build #2 can't pass by replaying build #1's cached object files. The assertion we want is cold-vs-cold byte-equality, which is what a third party with a fresh checkout reproduces.
- Cosign cert-identity-regexp tightened to pin this exact workflow file on a tag-ref. The previous `^https://github.com/<repo>/` regex would have accepted a Sigstore bundle minted by any workflow on any branch in the same repo; the new pattern rejects sibling workflows.
- SBOM coverage gate now walks every `Indirect != true` entry in go.mod and asserts a matching `pkg:golang/<path>@…` purl exists in the CycloneDX components[]. M3's "covers every module" rubric and M21's "≥1 component per direct module" rubric now have a falsifiable check; the previous `components ≥ 1` gate was a placeholder.
- Recipe step 6 switched from `slsa-verifier verify-artifact` (legacy slsa-github-generator format) to `gh attestation verify` (the reference verifier for actions/attest-build-provenance's Sigstore bundle output). slsa-verifier ≥ 2.7.0 with `verify-github-attestation` is documented as the alternate path; earlier versions don't parse Bundle v0.3 and would have failed silently or noisily.
- Recipe step 4 dropped `--exit-code` to match the CI fix; step 5 inherits the tightened cert-identity-regexp; the diffoscope-failure diagnostic row points at Go-toolchain drift (the actual common cause) rather than "compiler upgrade or -trimpath regression".
- CHANGELOG entry added under [Unreleased] / Added; MILESTONES.md M3 flipped from ☐ to ⧗ with a flip-to-☑-on-merge note; top-level README.md routing table grew a row for auditors / supply-chain verifiers pointing at docs/reproducibility.md.
- Dropped two unused job-level outputs (source_date_epoch, build_date) that no downstream job consumed; removed a vestigial `make clean` between builds (does nothing when artifacts live in mktemp dirs).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Five P1 items from the second-pass parallel review:
- Recipe step 6 now reads the bundle from disk (--bundle "$ATTEST")
instead of pulling it from GitHub's attestation API, and pins
--signer-workflow + --predicate-type. Two practical wins: the
verification works offline / air-gapped, and a sibling workflow
elsewhere in the repo cannot mint an attestation that passes the
documented command — cosign step 5 and gh-attest step 6 now anchor
to the same workflow-on-tag-ref identity.
- "If a step fails" row 6 label switched from `slsa-verifier` to
`gh attestation verify` so the diagnostic table matches the verb
the walkthrough uses.
- Recipe prerequisites paragraph dropped its dangling `slsa-verifier
≥ 2.7.0` alternate-path promise. The walkthrough never showed the
alternate command; adding it would have doubled the recipe surface
for marginal benefit. `gh attestation verify` is the single
documented verifier.
- SBOM job's checkout now pins to ${{ github.sha }} (the commit that
triggered the workflow) instead of the tag. A force-push to the tag
between the build and sbom jobs cannot produce an SBOM for a
different tree than was signed.
- MILESTONES.md M3 status line dropped the m3test-4 reference (stale
after subsequent test tags landed); replaced with "across the
v0.0.0-m3test-* series" so future test-tag iterations don't restale
the line. docs/FOLLOWUPS.md gains an M21 release-asset-shape
reconciliation bullet (raw binary vs tar.gz, .cosign.bundle vs .sig,
the .intoto.jsonl extension on Sigstore bundle JSON).
- Build #1 comment trimmed from an essay block to two sentences;
rationale lives in the commit history.
Deferred from Pass-2 (P2/L, not M3-blocking):
- diffoscope local exit-status wrapping (verifier copy-pasting one
block at a time can miss a non-zero exit; recipe polish, not gate
break)
- Repo tag-protection ruleset (org-policy decision, not PR scope)
- `go mod verify` in the build job (cheap hardening; defer to
separate supply-chain PR)
- Rekor log-index URL in release notes (post-fact audit polish)
- Caching `diffoscope-minimal` apt install (~10s, marginal on a
tag-triggered workflow)
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
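To illustrate why anchoring to a workflow-on-tag-ref identity matters, here is a small Go sketch comparing a repo-wide identity regexp with a tightened one. The repository name, workflow path, and the exact tightened pattern are assumptions for illustration; the real pattern lives in the release workflow and docs/reproducibility.md.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical repository and workflow path, used only for illustration.
const (
	repo     = "example-org/example-repo"
	workflow = ".github/workflows/release.yml"
)

var (
	// loose accepts any workflow on any ref in the repository.
	loose = regexp.MustCompile(`^https://github\.com/` + regexp.QuoteMeta(repo) + `/`)
	// tight accepts one workflow file, and only when it ran on a tag ref.
	tight = regexp.MustCompile(`^https://github\.com/` + regexp.QuoteMeta(repo) + `/` +
		regexp.QuoteMeta(workflow) + `@refs/tags/[^/]+$`)
)

func main() {
	identities := []string{
		"https://github.com/example-org/example-repo/.github/workflows/release.yml@refs/tags/v1.2.3",
		"https://github.com/example-org/example-repo/.github/workflows/release.yml@refs/heads/main",
		"https://github.com/example-org/example-repo/.github/workflows/ci.yml@refs/heads/main",
	}
	for _, id := range identities {
		fmt.Printf("loose=%-5v tight=%-5v %s\n", loose.MatchString(id), tight.MatchString(id), id)
	}
}
```

Only the first identity satisfies the tight pattern; the loose pattern accepts all three, which is the sibling-workflow gap the commits close.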
trilamsr
added a commit
that referenced
this pull request
May 18, 2026
)
## What this PR does

Bundles 8 ready-now items from `docs/FOLLOWUPS.md` whose triggers were satisfied; closes 2 more as already-shipped or already-satisfied. Three themes: CI/release-pipeline hardening, code-quality sweeps, and FOLLOWUPS hygiene.

**CI / release-pipeline hardening:**
- SHA-pin every GitHub Actions ref across all workflows. Dependabot's `github-actions` group keeps these bumped weekly as one grouped PR.
- Reconcile `actions/upload-artifact` major-version drift: callsites were split between `@v5` and `@v7.0.1`. Unified on v7.0.1.
- Tighten `cosign verify-blob` smoke check with `--certificate-github-workflow-ref refs/tags/$TAG` and `--trigger push`. Strictly tighter than the prior `IDENTITY_REGEXP`-only check.
- Mirror tightened flags in `docs/reproducibility.md` step 5; add `--source-ref` / `--source-digest` to step 6's `gh attestation verify`.
- Emit Rekor `logIndex` URL into release notes so transparency-log audits don't require bundle archaeology.
- Wire `make mod-verify` into `make ci`.

**Code-quality sweeps:**
- Convert ~49 C-style `for i := 0; i < N; i++` loops to Go 1.22+ `for i := range N` (or `for range N` when the index is unused). 6 holdouts have non-convertible conditions (compound `&&`, `i += 2`, or non-`i` predicate).
- Backfill 18 raw `"Normal"`/`"Warning"` sites in k8sevents tests to use `EventTypeNormal` / `EventTypeWarning` constants.
- Export `k8sevents.ComponentType = "k8s_events"`; convert 8 test callsites.
- Lock the no-`Server`-header invariant in `internal/telemetry` with a test. (Audit finding: Go's `net/http` does not emit a default `Server` header in any path; the FOLLOWUPS row had nothing to strip, so the test only prevents future regression.)

**Closed-as-stale (no code change, FOLLOWUPS updated with rationale):**
- Next-up #1 `make doc-check`: already shipped (Makefile:192, in `make ci` chain).
- M8 opportunistic "promote build-tags to pr-validation.yml": `ci.yml:37` already runs `make build-tags` directly; no `pr-validation.yml` exists.

## Linked issue(s)

_No linked issue._

## Release notes

```release-notes
[SECURITY] All GitHub Actions are now SHA-pinned; cosign and gh attestation verification flags are tightened to bind to the exact release tag and `push` trigger.
[ENHANCEMENT] Release notes include a Rekor transparency-log entry URL for after-the-fact audit.
```

## Checklist

- [x] Tests added or updated (`TestServer_NoServerHeader`; existing tests continue to pass)
- [x] `make ci` passes on the worktree branch (exit 0)
- [x] Commits are signed off
- [x] No new components; existing component STYLE.md layout untouched

## Test plan

- [x] `make ci` exit 0 (coverage above floor, govulncheck clean, doc-check + alert-check pass, vet clean across default + `dcgm` build tags)
- [ ] CI green on this PR
- [ ] Release dry-run not exercised: the release workflow only fires on tag push, so flag-tightening + Rekor URL emission were verified by inspection rather than e2e. Worth a manual `workflow_dispatch` once merged or at next tag cut.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
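A lint for the SHA-pinning rule could look roughly like the Go sketch below. It is an assumption-laden illustration rather than the repo's actual tooling: the helper name `isSHAPinned` is hypothetical, the example SHA is made up, and local (`./path`) or `docker://` references are deliberately out of scope.

```go
package main

import (
	"fmt"
	"regexp"
)

// pinned matches `owner/repo[/path]@<40 lowercase hex characters>`.
var pinned = regexp.MustCompile(`^[^@\s]+@[0-9a-f]{40}$`)

// isSHAPinned (hypothetical helper) reports whether a `uses:` value is
// pinned to a full commit SHA rather than a tag or branch.
func isSHAPinned(uses string) bool {
	return pinned.MatchString(uses)
}

func main() {
	refs := []string{
		"actions/checkout@v6",
		"actions/upload-artifact@v7.0.1",
		"actions/checkout@0123456789abcdef0123456789abcdef01234567", // made-up SHA
	}
	for _, ref := range refs {
		fmt.Printf("pinned=%-5v %s\n", isSHAPinned(ref), ref)
	}
}
```

Tag refs such as `@v7.0.1` are rejected even though they look specific, because a tag can be moved after the fact while a commit SHA cannot.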
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…pe; reject framing of bench-correction as regression

Phase-3 adversarial deep review (2 fresh subagents, independent of the 8 lens reviews). The author's completion claim was treated as a hypothesis to falsify.

Adversarial #1: APPROVED, no falsifiable findings.
Adversarial #2: returned CONCERNS-REQUIRE-FIX with two findings.

After the validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P3.1 | adversarial-2 | repo-long-term | BLOCKER → DEFER | "k8sevents BenchmarkEmitOne allocs jumped 21→28; not gated by bench-check." | Read Makefile:40-44: bench-check is scoped to ./internal/telemetry/. Confirmed k8sevents has no baseline. | The 21→28 jump is the WHOLE POINT of group F: the previous bench reused one plog chain across iters and under-reported production cost. `git diff origin/main...HEAD -- components/receivers/k8sevents/receiver.go components/receivers/k8sevents/emit.go` shows production allocation paths in r.emit are unchanged from main; only the bench measurement shape changed. Reviewer conflated bench-output change with production regression. | n/a (no production change to test) | no; finding rejected as framed, but underlying observation kept | deferred FOLLOWUPS.md (Component-level benchmarks ungated by `make bench-check`) |
| P3.2 | adversarial-2 | repo-long-term | NIT | Missing explicit symlink-to-directory test for kubeconfig path. | A new TestConfig_RejectsSymlinkToDirectoryAsKubeconfigPath would pass without code change. | Reviewer themselves note "would pass with the current code." TestConfig_RejectsDirectoryAsKubeconfigPath already exercises the IsDir() path; symlinks go through the same code (os.Stat follows symlinks intentionally). No unique coverage added. | n/a | no | explicitly-skipped (taste-call; redundant coverage) |

Reproducibility:
  $ grep -n "components" Makefile | grep bench
  # only internal/telemetry covered
  $ git diff origin/main..HEAD -- components/receivers/k8sevents/receiver.go components/receivers/k8sevents/emit.go
  # zero production-allocation changes

Validation-cycle stats:
- Findings rejected during contradict (framing of BLOCKER as regression): 1
- Findings that survived as DEFERRED to FOLLOWUPS: 1
- Findings explicitly-skipped (taste-call): 1

Beneficiary: repo-long-term. The underlying gap (component benches ungated) is real and worth a follow-up; the immediate framing as a regression in this PR is not.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…l ordering rationale

Phase-4 A+ aspiration review (2 fresh subagents). Reviewer #1 graded B+ with 7 documentation-of-already-true-invariants criteria; reviewer #2 graded A with 3 falsifiable proposals. Two surviving load-bearing criteria after validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 | aplus-2 | repo-long-term | CONCERN | populateAttributes / attrPutter cap check (`attrs.Len() >= maxAttrs`) is exercised only at production maxAttrs floor (9). The exported BuildLogRecordForBench helper can be called with arbitrary values; a future refactor flipping `>=` to `>` would silently allow one attribute through at maxAttrs=0 and slip past every existing test. | TestBuildLogRecord_BoundaryMaxAttrs covers maxAttrs=0 and maxAttrs=-1; mutation-verified red→green: changing `>=` to `>` in attrPutter.putStr/putInt fails the maxAttrs=0 subtest, then restoration passes. | Production Validate floors maxAttrs at 9 (TestConfig_RejectsTooLowMaxAttributes pins this). But internal callers (bench, future refactor) can bypass Validate. | red (mutation) → green → mutation-verify recorded in this commit | yes: P4-aplus-2 in .claude/ralph-loop.local.md | applied this commit |
| P4.2 | aplus-2 | repo-long-term | NIT | validateKubeconfigPath ordering rationale lives only in the Phase-1 commit body and FOLLOWUPS closure; a future maintainer reordering Validate's pipeline would break TestConfig_AmbiguousAuth_* tests without warning at the call site. | Added the rationale to the validateKubeconfigPath docstring (source-level). | n/a (comment-only; existing tests catch a bad reorder regardless). | n/a | no | applied this commit (config.go) |

Rejected/deferred:
- P4.3 (aplus-1 #1): "Bench allocs/op ≤30 threshold gate." Already covered by Phase-3 deferred FOLLOWUPS entry on component-bench scope. DEFER (duplicate).
- P4.4 (aplus-2 #2): Cross-receiver SchemaURL pattern lint. Out of scope; trigger is third in-tree schema URL. DEFER to FOLLOWUPS.
- P4.5 (aplus-1 #2-7): Document already-met invariants. Per feedback_anti_bureaucracy, criteria that document truths without a falsifiable hook are bloat. REJECT.

Reproducibility:
  $ go test -run TestBuildLogRecord_BoundaryMaxAttrs -v ./components/receivers/k8sevents/
  # passes
  $ sed -i.bak 's/a.attrs.Len() >= a.maxAttrs/a.attrs.Len() > a.maxAttrs/g' components/receivers/k8sevents/emit.go && \
    go test -run TestBuildLogRecord_BoundaryMaxAttrs/maxAttrs=0 -v ./components/receivers/k8sevents/
  # fails
  $ mv components/receivers/k8sevents/emit.go.bak components/receivers/k8sevents/emit.go
  # restore

Letter-grade outcome:
- Reviewer #1 starting grade: B+ → target A+ via documentation
- Reviewer #2 starting grade: A → target A+ via P4.1 + P4.2
- After this commit: A+ on the falsifiable axis (every C1-C6 + F change has a mutation-catching test; the boundary cap is now explicitly pinned; ordering rationale lives at source).

Beneficiary: repo-long-term. Falsifiable tests survive refactors; documentation-of-truths does not.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
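The boundary that P4.1 pins can be shown with a simplified Go sketch. The `attrPutter` here is a hypothetical stand-in that uses a plain map instead of the receiver's real attribute map type, and `putStrMutant` exists only to model the `>=` to `>` mutation.

```go
package main

import "fmt"

// attrPutter is a simplified, hypothetical stand-in for the receiver's type.
type attrPutter struct {
	attrs    map[string]string
	maxAttrs int
}

// putStr stores the attribute unless the cap has already been reached.
func (a *attrPutter) putStr(k, v string) bool {
	if len(a.attrs) >= a.maxAttrs {
		return false
	}
	a.attrs[k] = v
	return true
}

// putStrMutant models the mutation that flips `>=` to `>`.
func (a *attrPutter) putStrMutant(k, v string) bool {
	if len(a.attrs) > a.maxAttrs {
		return false
	}
	a.attrs[k] = v
	return true
}

func main() {
	for _, maxAttrs := range []int{0, -1, 9} {
		good := &attrPutter{attrs: map[string]string{}, maxAttrs: maxAttrs}
		bad := &attrPutter{attrs: map[string]string{}, maxAttrs: maxAttrs}
		fmt.Printf("maxAttrs=%d correct accepted=%v mutant accepted=%v\n",
			maxAttrs, good.putStr("k", "v"), bad.putStrMutant("k", "v"))
	}
}
```

Only `maxAttrs=0` separates the two comparisons on the first insert: the mutant lets one attribute through while the correct check rejects it. At `-1` both reject and at `9` both accept, which is consistent with the commit's note that the mutation fails the `maxAttrs=0` subtest specifically.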
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…+ threat-root trace on go-mod-verify

Phase-4 A+ aspiration review (2 fresh subagents; both graded A, diverged on which gates to apply). Validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 (aplus-1 #2, also P2.6) | aplus | operator | CONCERN | A workflow_dispatch run with `inputs.tag` set but `github.ref` ≠ refs/tags/$INPUT_TAG passes Build and fails the OIDC smoke check 15-30 minutes later. Operator wastes runner time and sees the misuse late. | New "Verify dispatch ref matches tag (pre-flight)" step exit-1s within seconds with the documented workaround. | Reviewer noted the smoke check already enforces this, but at job-end, not at job-start. Fail-fast IS the load-bearing property. | n/a (workflow YAML, actionlint clean) | yes: P4-aplus-1 | applied this commit; closes P2.6 deferral. |
| P4.4 (aplus-2 #2) | aplus | repo-long-term | NIT | go-mod-verify comment says "defense in depth against a compromised GOPROXY mirror" but doesn't name the trust root or the orthogonal threat (a poisoned go.sum itself). | Comment now states "Trust root: the go.sum at this tag commit" and cross-references the tag-protection FOLLOWUPS entry. | A future maintainer might over-attribute the protection. | n/a | no | applied this commit |

Rejected/deferred:
- P4.2 (aplus-1 #4): Structured diff lint for release.yml ↔ docs/reproducibility.md. DEFER to FOLLOWUPS.md (real value, but manual review caught both drift directions in Phases 2 + 3; automate when next edit happens).
- P4.3 (aplus-1 #6): Release artifact manifest validation before upload. REJECT. Per anti-bureaucracy: reviewer concedes `needs:` dependency already gates malformed artifacts from reaching the release job. Adding defensive validation against a CI-bug scenario is bloat.
- P4.5 (aplus-1 #3): docs/SUPPLY-CHAIN-IDENTITY.md consolidated reference. DEFER to FOLLOWUPS.md; ~30-min write, scope creep beyond release.yml. M21 release-checklist is the natural trigger.
- P4.6 (aplus-1 #5, aplus-2 #3): Formal threat-model document + M21 alignment narrative. DEFER to M21.
- P4.7 (aplus-2 #5): Cross-link health lint. Duplicate of P4.2; same deferral.

Reproducibility:
  $ make actionlint zizmor
  # exit 0
  $ grep -A1 "workflow_dispatch with inputs.tag" .github/workflows/release.yml
  # pre-flight gate present

Letter-grade outcome:
- Reviewer #1 starting: A → A+ via criteria 2, 4, 6 (we applied 2 + threat-model comment)
- Reviewer #2 starting: A → APPROVED-AS-IS (already strong)
- After this commit: A on the falsifiable axis (one operator-UX gate + one comment clarification), with the broader doc/lint work scoped to follow-ups.

Beneficiary: operator. The pre-flight gate cites a specific operator-facing surface (15-30 minute waste on workflow_dispatch misuse) and turns it into a seconds-fast named error.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
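The pre-flight gate's logic is small enough to sketch in Go. The real step is shell inside the workflow; the function name `checkDispatchRef` and its error text are hypothetical, and the sketch does not reproduce the documented workaround message.

```go
package main

import "fmt"

// checkDispatchRef (hypothetical) mirrors the gate: on workflow_dispatch
// with inputs.tag set, github.ref must equal refs/tags/<tag>.
func checkDispatchRef(eventName, ref, inputTag string) error {
	if eventName != "workflow_dispatch" || inputTag == "" {
		return nil // the gate only applies to dispatch runs that set inputs.tag
	}
	want := "refs/tags/" + inputTag
	if ref != want {
		return fmt.Errorf("dispatch ref %q does not match %q", ref, want)
	}
	return nil
}

func main() {
	cases := []struct{ event, ref, tag string }{
		{"push", "refs/tags/v1.2.3", ""},
		{"workflow_dispatch", "refs/tags/v1.2.3", "v1.2.3"},
		{"workflow_dispatch", "refs/heads/main", "v1.2.3"},
	}
	for _, c := range cases {
		fmt.Printf("event=%s ref=%s tag=%q err=%v\n",
			c.event, c.ref, c.tag, checkDispatchRef(c.event, c.ref, c.tag))
	}
}
```

Running the check as the first step is what converts the late smoke-check failure into an immediate one; the comparison itself is the same in both places.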
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…ask + gh attestation verify (#69)

## Summary

Release-pipeline supply-chain hardening + a workflow_dispatch pre-flight gate. No operator-visible release-artifact shape change; the gates fail loudly at tag-push time before any artifact is signed and published.

**Hardening:**
- `go mod download && go mod verify` step before the reproducible-build pair. Catches a poisoned GOPROXY mirror returning module bytes that don't match `go.sum`. Trust root: the `go.sum` at the tag commit; a poisoned `go.sum` itself is tracked separately under M3 tag-protection.
- `LC_ALL=C` + `TZ=UTC` env + `umask 022` inside the run script of both Build #1 and Build #2. Canonical reproducible-builds.org stanza; today's `-trimpath`+`SOURCE_DATE_EPOCH` carry the load for Go output, but the stanza is cheap insurance against future cgo or non-Go release artifacts.
- New "Smoke-check `gh attestation verify`" step in the provenance job. Local-bundle mode (offline trust chain: cert + SCT + Rekor proof are embedded). Flag set matches `docs/reproducibility.md` step 6: `--signer-workflow` + `--predicate-type` + `--repo` + `--source-ref` + `--source-digest`. Pins the OIDC subject path so a different workflow in the repo with `attestations: write` cannot satisfy it; pins the source claims so an attestation from a non-tag dispatch is refused.
- `docs/reproducibility.md` step 6 tightened from `--owner` (org-wide) to `--repo` (org/repo). Adopters following the documented walkthrough now exercise the same scope CI enforces.
- New "Verify dispatch ref matches tag" pre-flight step. On `workflow_dispatch` with `inputs.tag` set, asserts `github.ref == refs/tags/$INPUT_TAG` and fails fast with the named workaround. Saves 15-30 minutes of runner time on misuse.

**FOLLOWUPS hygiene:**
Closed five rows: `go mod verify`, build-env sanitization, cosign+gh-attestation flag tightening (cosign half had already shipped), Rekor log-index URL (already shipped), and workflow_dispatch pre-flight gate. Opened three rows: flag-parity lint between release.yml and reproducibility.md; consolidated `docs/SUPPLY-CHAIN-IDENTITY.md` reference; component-bench gating scope (tracked from the parallel k8sevents review).

## Verification

- `make actionlint zizmor` clean on the head commit (zizmor: 0 findings).
- `gh attestation verify --bundle` + `--repo` + `--source-ref` + `--source-digest` combination verified end-to-end against a public sigstore bundle (`github/codeql-action v2.25.4`); gh CLI source maps the flags to Fulcio cert OIDs 1.3.6.1.4.1.57264.1.14 / .13, populated from OIDC `ref` / `sha` claims at sign time.
- Pre-flight gate is a stand-alone shell test; it exits 1 with a clear error and the named workaround when `github.ref` and `inputs.tag` disagree.

## Test plan

- [ ] PR CI green on the head commit.
- [ ] Next real release tag (M21) exercises all four new gates end-to-end against a real Sigstore bundle.
- [ ] If `gh attestation verify --bundle` rejects the flag combination at release time, the failure is loud (job fails) and the fix is a one-line follow-up.

```release-notes
Tightened release-workflow supply chain: defensive `go mod verify`, canonical LC_ALL / TZ / umask reproducible-build stanza, and a local-bundle `gh attestation verify` smoke check pinned to the source tag + commit SHA and the signing workflow. `docs/reproducibility.md` now uses `--repo` so adopter verification matches CI strictness. Workflow_dispatch with `inputs.tag` fails fast if the ref doesn't match. Operator-visible release shape unchanged.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
## Summary

Adds two load-bearing lessons to `AGENTS.md` from this session's CI work. Both prevent a future contributor from repeating the same trap.

**Aggregator bypass.** GitHub Actions short-circuits an aggregator job's `needs:` to SKIPPED on any sub-job failure, and treats SKIPPED required checks as satisfied. PR #73 silently merged past a failed `verify-test` because the aggregator from PR #72's verify split was SKIPPED rather than FAILURE. The fix shape (`if: always()` + `needs.*.result` check) shipped in PR #74; this lesson documents the trap and the fix so anyone splitting CI jobs in the future doesn't repeat it.

**Perf-budget regex flake class.** `require.Regexp` assertions with implicit upper bounds (e.g. `0\.0[0-9]+`) on values whose only invariant is `>0` flake on slow CI runners. Two of these hit in one session: `TestReceiver_SLIBudget` (emit-latency, observed 539ms) and `TestReceiver_SetDegraded` (degraded-seconds, observed 0.126s). The fix shape is the same in both: relax to any positive value (`\d+\.[0-9]*[1-9]`) or use baseline-relative comparisons.

File goes from 128 to 148 lines (cap is 150, with 2 lines of remaining headroom; the next addition should consider demoting an older entry to a topic note per the file's own promotion rule).

## Test plan

- [x] `wc -l AGENTS.md` reports 148, under the 150-line cap.
- [x] `make doc-check` clean (banned-phrase lint, 250 links resolve, `(unverified)` count = 7 baseline).
- [x] Capture-flow format check (`learn-from-mistakes` skill): banned vocabulary absent, no first-person AI phrasing, no AI attribution, both entries carry `Anchor:` citations.
- [ ] CI on this PR exercises the same gates plus the aggregator that's now itself an anchor of lesson #1.

```release-notes
NONE. Documentation only. Adds two load-bearing lessons to `AGENTS.md` covering GitHub Actions aggregator semantics and a recurring perf-budget regex flake class. No runtime behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
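The regex flake class can be reproduced in isolation with Go's `regexp` package. The two patterns are the ones quoted in the summary; the sample values are assumptions, chosen on the guess that the metrics are rendered as decimal seconds (so 539ms appears as `0.539`).

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// bounded carries an implicit upper bound: only values below 0.1 match.
	bounded = regexp.MustCompile(`0\.0[0-9]+`)
	// positive requires at least one non-zero digit after the decimal point.
	positive = regexp.MustCompile(`\d+\.[0-9]*[1-9]`)
)

func main() {
	// Assumed renderings: a fast run, the two observed slow runs, and zero.
	for _, v := range []string{"0.042", "0.126", "0.539", "0.000"} {
		fmt.Printf("value=%s bounded=%v positive=%v\n",
			v, bounded.MatchString(v), positive.MatchString(v))
	}
}
```

The bounded pattern passes on a fast runner and fails on `0.126` and `0.539`, even though both satisfy the real invariant. It also matches `0.000`, so it was never checking `>0` in the first place; the relaxed pattern rejects zero and accepts the slow-runner values.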
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
Two independent adversarial reviewers ran fresh against the post-
Phase-2 state. Convergent findings on three substantive issues plus
four smaller polish items. All findings survived contradiction.
Findings applied (7):
P3.F1 [CONCERN/applied] Version-skew "matrix" promised in evolved
rubric [P2-sre] was delivered only as a label, not a table; the
§Wire protocol paragraph ended on a dangling colon promising
content that never appeared. Now ships a four-row receiver-action
table covering (helper=receiver), (receiver_newer), (helper_newer),
(unknown), with the Phase 4 alert binding called out explicitly.
Beneficiary: operator (named: Phase 4 alert rule binds to a
sustained non-zero `helper_newer` rate after a chart rollout).
P3.F2 [CONCERN/applied] An orphaned "Length-prefix framing eliminates
any need for an in-band terminator" paragraph sat between the new
version-skew lead and the versioning detail, breaking the topic
flow. Hoisted into the framing-ceiling paragraph where it belongs.
Beneficiary: repo-long-term (cold-reader test).
P3.F3 [CONCERN/applied] Helper lifecycle reorder fixed the
filesystem-cleanup race but not the original "indeterminate UDS
state" race the section's lead sentence names: the accepted
connection fd was never closed during shutdown, only the listening
socket. Added explicit step 3 closing the accepted-conn fd via a
lock-protected `_active_conn` slot before join. Step ordering now
closes accepted-conn → unlink → chain-prior-handler → join, so
every operator-visible commitment survives a hung dump.
Beneficiary: operator (named: SIGTERM cleanup completes without
relying on the daemon-flag-forced-exit backstop; the indeterminate-
UDS-state hazard the section opens with is now actually closed).
P3.F4 [CONCERN/applied] §Wire protocol RSS reconciliation paragraph
enumerated three Phase-3 options without picking one. v0.1 alpha
now pins defaults at `max_threads_per_dump=64` and
`max_frames_per_stack=128` (worst-case ≈0.8 MiB, comfortable under
the 10 MB RSS budget). Operators with wider workloads raise the
caps explicitly with an acknowledged RSS waiver in site config.
The 32 MiB framing ceiling stands as a protocol-level upper bound
independent of the default caps. Beneficiary: repo-long-term
(the rubric "must reconcile" promise now matches what the RFC
delivers).
P3.F5 [CONCERN/applied] §Supply chain committed to four artifacts
with no Phase mapping. Phase 2 deliverable list now names PyPI
trusted-publisher OIDC config, PEP 740 attestation, and typosquat-
reservation stubs. §Supply chain itself opens with "all four
artifacts below are Phase 2 deliverables; none of the listed
registry actions, signing setup, or hash-pinning is in place
today" — making the forward-commitment status explicit for
reviewers who only read that section.
Beneficiary: repo-long-term (per NORTHSTARS O3).
P3.F6 [CONCERN/applied] §Design overview "Cadence pairing with M18"
paragraph self-described its own silent-breakage hazard ("if
either side moves, this derivation breaks silently") and deferred
to a Phase 3 fixture-test with no recorded deliverable. Phase 3
deliverable now explicitly carries "M13-cadence × M18-threshold
cross-link fixture asserting the 45s sustained-state derivation
holds at every build." The paragraph rewrite drops the hand-wave
framing. Beneficiary: repo-long-term.
P3.F7 [NIT/applied] Defensive "not by oversight" phrasing in the
`gen_ai.training.rank` attribute-table cell presupposed a prior
accusation a six-months-cold reader would not understand.
Rephrased to "Attributes carry this namespace per the NORTHSTARS
O4 shepherding commitment."
Findings explicitly-skipped (not deferred, judged):
P3.NIT.SR1-citation: Phase 2's P2.SR.1 cited "operator dashboard"
as a generic surface. Phase 3's new version-skew table now binds
the metric to a specific Phase 4 alert rule (sustained non-zero
`helper_newer` rate). Citation upgraded to specific via the
applied finding above.
P3.NIT.soft-triggers: Two of the new FOLLOWUPS rows ("first operator
report of X") have non-falsifiable triggers. Accepted as known
limitation for v0.1; tracecore has no inbound issue-label channel
to detect this in CI today. Revisit when the operator-feedback
channel exists.
Validation-cycle stats:
- Findings raised by adversarial #1: 6
- Findings raised by adversarial #2: 6
- Convergent (same defect, both reviewers): 1 (version-skew matrix)
- Findings rejected during contradict: 0
- Findings whose hard-proof did not reproduce: 0
- Findings applied: 7
- Findings explicitly-skipped: 2 (citation upgraded inline; soft-trigger known limitation)
- Findings deferred to FOLLOWUPS: 0 (everything load-bearing was applied)
TDD discipline stats:
- New code changes landed via failing-test-first: 0 (doc-only PR)
- Hard-proof commands executed during validation: 7
Rubric additions accepted in Phase 3 (.claude/ralph-loop.local.md):
- [P3] When promising a "matrix" or "must reconcile" in the evolved
rubric, deliver the artifact not the language.
- [P3] Lifecycle reorder fixes must close every named race in the
section's lead sentence, not just the named filesystem cleanup.
- [P3] Sections committing artifacts that span multiple Phases need
an explicit "all forward, none in place today" disclaimer at the
section head — the global scope-disambiguation paragraph at the
top of §Proposal only covers CI gates.
- [P3] Self-described silent-breakage hazards demand an explicit
enforcement deliverable in the same paragraph.
Adversarial verdicts:
- Adversarial #1: CONCERNS-REQUIRE-FIX (6 findings, 6 surviving)
- Adversarial #2: CONCERNS-REQUIRE-FIX (6 findings, 6 surviving)
Beneficiary tally (applied / skipped):
- Operator: 3 / 0
- Repo long-term: 4 / 0
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary

The `docs/adrs/` directory held one file (`0001-metrics-to-logs-pattern-input.md`). A single-file decision-record directory sitting parallel to `docs/rfcs/` (13 RFCs + README + template) is taxonomy drift: operators and contributors had to learn two near-identical conventions for "load-bearing architectural decision in tree." This PR collapses the split.

### What changed

- ADR-0001 → **RFC-0014** (`docs/rfcs/0014-metrics-to-logs-pattern-input.md`). Content reformatted to match the RFC template's section headings (Summary / Motivation / Proposal / Alternatives / Open questions / Migration / References). The substance (Option A vs Option B vs Option C analysis, the v0.130 contrib survey, PR-A landed + PR-B pending sequencing) is preserved verbatim; this is active design for issue #260 PR-B, not archaeology.
- `docs/adrs/` directory removed.
- 6 cross-references repointed at the new RFC path:
  - `docs/README.md` (subdirectories table: `adrs/` row removed)
  - `docs/ATTRIBUTES.md` (2 spots: `tracecore.alert.pcie_rate_collapse.*` row + "See also" link)
  - `docs/integrations/prometheus-scrape.md` (2 spots)
  - `docs/patterns/pattern-4-thermal-throttle.md`
  - `docs/patterns/pattern-5-pcie-aer.md`
- 5 source-code comments repointed (`module/processor/patterndetectorprocessor/{patterndetector.go,thermal_throttle_test.go}`, `module/pkg/patterns/{pcie_aer.go,thermal_throttle.go}`).
- `docs/rfcs/README.md` status-index gains an RFC-0014 row (`accepted`, 2026-05-31).

### Why convert (not delete)

ADR-0001 was evaluated against the "delete if RFC-0013 already covers it" bias. It is not covered:

- RFC-0013 §5 mentions a `metricthresholdconnector` as a contribution slot in one bullet. It does **not** evaluate Option A vs Option B vs Option C, cite the contrib v0.130 evidence, or sequence the PR-A/PR-B split for the metric-sourced detectors.
- ADR-0001 is the binding design contract for patterns #1 / #3 / #4 / #5 (4 of the next NORTHSTAR detectors). Source code (`pcie_aer.go`, `thermal_throttle.go`, `patterndetector.go`) cites it as the reason for the staged-but-quiet wire-up. Delete would orphan 5 source comments and break the audit trail for an active design decision.

Convert keeps the record load-bearing without preserving the parallel taxonomy.

### Verification

- `grep -rn "docs/adrs\|adrs/0001"` returns 0 hits.
- `grep -rn "ADR-0001\|ADR 0001"` returns 0 hits.
- `ls docs/ | grep -i adr` returns empty.
- pre-commit golangci-lint + go vet + go mod verify clean.

```release-notes
docs: collapse single-file `docs/adrs/` into `docs/rfcs/`. ADR-0001 (metrics-sourced pattern inputs) is promoted to RFC-0014 verbatim; cross-references across docs and module source repointed.
```

## Test plan

- [x] `grep -rn "docs/adrs"` returns 0 hits
- [x] `grep -rn "ADR-0001"` returns 0 hits
- [x] `docs/adrs/` directory removed
- [x] golangci-lint + go vet clean (pre-commit hook)
- [ ] CI green

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Extends prometheus-scrape.md with the bridge attribute contract for the four metrics-derived patterns:
- pattern #1 NVLink (#260) - the `hw.gpu.nvlink.io` OTTL transform already lands in commit 0baa557; this PR closes #260's recipe-half.
- pattern #3 HBM ECC (#273) - `hw.errors.delta` + error.{type,subtype,persistence} + gpu.id contract.
- pattern #4 thermal throttle (#282) - `hw.gpu.throttle.duration.delta` in integer seconds + reason=thermal + gpu.id contract.
- pattern #5 PCIe AER Layer 2 (#284) - the `tracecore.alert.pcie_rate_collapse.*` namespace contract.

OTTL metrics->logs emission stays upstream-blocked at OTel-contrib v0.130 (RFC-0014): no contrib processor or connector emits log records from a metrics pipeline. The bridge contract documented here is the load-bearing wire format any future emitter (an upstream metricthresholdconnector OR the WithMetrics extension to patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector projections at module/processor/patterndetectorprocessor/patterndetector.go gate on this contract today.

last-verified marker bumped to 2026-06-01.

Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed under #285 in a prior commit; Layer 2 contract documented here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md. Detector evaluation rule - per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow (default 2min, forward-only - fb.Timestamp <= oom.Timestamp) - if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) -> kind=fragmentation (raise max_split_size_mb, empty_cache) - if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard) - if no FB sample joins -> kind=unknown, confidence=partial Discriminator value - fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM - without DCGM cross-check the operator retries with same batch, hits same OOM, wastes a slot - partial-confidence verdict surfaces the OOM even when DCGM scrape lags, so the operator branches on concurrent pod_evicted / xid_correlation rather than silence Files - module/pkg/patterns/cuda_oom.go - detector + verdict + records - module/processor/patterndetectorprocessor/cuda_oom.go - projections, collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector - module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring tests + 2 Validate guards - module/processor/patterndetectorprocessor/example_config.yaml - cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold knobs - docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*, cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes, cuda_oom.fb_free_ratio, pattern.confidence. Window-edge fenced both sides per PR #255 lesson. Threshold-boundary fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors xid_correlation / pcie_aer / hbm_ecc. 
Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per ADR-0001 (PR-B)

Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor suites green with -race; make check + make build green

Refs #303

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request on Jun 1, 2026
Resolve 5 conflicts post-PR #310 / #312 / #313:
- factory.go deleted on main (merged into patterndetector.go); port wave's selftel wiring (#261) into the merged createLogs
- VerdictAttr* unexported per #310; rename 16 wave-added consts + all callers across cuda_oom + ib_link_flap + pcie_aer tests
- docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content reconcile after MILESTONES.md moved to docs/

Address reviewer findings before PR:
- docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md (Linux CI is case-sensitive)
- pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`, code emits "2"/"10"/.../"13"; rewrite spec attribute tables to match shipped customer-stable namespace
- pattern.confidence: 8 specs said `high|partial`, code emits `full|partial`; rewrite
- 02-ib-link-flap.md attribute drift: spec said tracecore.alert.ib_link_flap.{hca_device,port}, code emits hw.network.ib.{device,port.num}; align spec to shipped code
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival ("6 patterns shipped" -> "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true with empty allowedEgressEndpoints (silently kills OTLP exporter) + warning when ServiceMonitor scraper in different namespace
- File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector consumes but recipe gap means it ships dark)

Tests: ./module/processor/patterndetectorprocessor/... + ./module/pkg/patterns/... both ok.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request on Jun 1, 2026
…ts (#338) ## Summary 15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing horizon backlog. 31 commits, 81 files, +8650/-180. **Code (5 detectors / features):** - `feat(iblinkflap)` pattern #2 IB link flap detector — 13 tests, cross-rank helper extracted for reuse by patterns #7/#9 - `feat(cudaoom)` pattern #10 CUDA OOM detector + fragmentation-vs-true-OOM discriminator — 35 tests, 0/6 false-positive rate on fixture corpus (#303 wiring — recipe gap tracked at #337) - `feat(verdict)` deprecate EvictedPod, co-emit PodName + PodNamespace (#277) with regression-pinning test - `feat(chart)` opt-in default-deny NetworkPolicy + cert-manager mTLS reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt UX warnings for empty-egress / cross-ns scraper traps - `feat(bench)` per-detector allocs/event harness + soft ratchet gate, graduation criterion documented (#302) - `feat(patterndetector)` verdict counter metric for dashboard panels (#261) - `fix(slo-rules)` correct otelcol_* label set + drop silent-no-op `unless on (instance)` join (#298) **8 pattern design specs (`docs/patterns/{02,07-13}-*.md`):** - Per pattern: symptom, layers crossed, signal sources, detector evaluation rule, verdict attrs, edge cases, open questions. - 7 load-bearing spec gaps flagged for future TDD red-test work (multi-vendor SDC signal, cohort grouping, processor metrics path, etc). 
**9 v1.0-rc1 audit / knowledge-gap docs:** - `docs/v1-rc1-cut-criteria.md` — 12 falsifiable cut gates derived from O1-O7 - `docs/v1-rc1-operational-gaps.md` — SLSA L3 + air-gap + upgrade-rollback audit (8 issues filed #314-#321) - `docs/v1-rc1-governance-gaps.md` — CODEOWNERS 0%, lint-principles 4/16, retros, `make ci` 148s (5 issues #322-#325, #327) - `docs/v1-rc1-test-audit.md` — 82.9% coverage, fuzz harness inventory (5 issues #328-#332) - `docs/v1-rc1-simplification-audit.md` — top deletion candidates ~9.6K LOC (3 issues #333-#335) - `docs/threat-model.md` — STRIDE per trust boundary + audit RFP scope (#336) - `docs/reference-environments.md` — Tier 1 kind + Tier 2 32×H100 binding spec for O2 hero KPI - `docs/adoption-pipeline.md` — S0-S3 funnel + comms templates for O5 hero KPI - `docs/standards-roadmap.md` — 10 `gen_ai.training.*` attributes proposed upstream (#326) **Doc-drift cleanup:** 11 issues closed (#265, #268, #269, #276, #283, #287, #292-295, #299). **OTTL recipe wiring:** 6 issues closed (#260, #261, #273, #282, #284, #285); #272 deferred to standards-roadmap. **Multi-cluster auth:** bearer-token + mTLS examples (#297). 
**Merge resolution + reviewer fixes:** - Resolved 5 conflicts post-PR #310/#312/#313 (factory.go delete, VerdictAttr* unexport, MILESTONES.md → docs/, FOLLOWUPS, patterns README) - Adversarial reviewer found 1 BLOCKER + 6 MAJOR; all addressed before push: - Renamed 16 `VerdictAttr*` → `verdictAttr*` per #310 convention - Re-ported selftel wiring (#261) into main's merged `createLogs` - Fixed case-mismatch `docs/THREAT-MODEL.md` → `docs/threat-model.md` (Linux CI is case-sensitive) - 8 pattern specs schema drift: `pattern.id` slug → numeric (`"2"`, `"7"`...`"13"`), `pattern.confidence` `high` → `full` - `02-ib-link-flap.md` attribute drift: spec said `tracecore.alert.ib_link_flap.{hca_device,port}`, code emits `hw.network.ib.{device,port.num}` - `v1-rc1-cut-criteria` criterion #1 status stale-on-arrival ("6 patterns shipped" → "8 patterns shipped, 4 remaining") - NetPol UX trap: NOTES.txt warns when `enabled=true` with empty `allowedEgressEndpoints` (silently kills OTLP) or cross-ns Prometheus - Filed #337 for missing OTTL recipe projecting `DCGM_FI_DEV_FB_*` → `hw.gpu.memory.{free,total}` (CUDA OOM detector consumes but recipe gap) - Post-merge stale-relative-path sweep: 6 wave docs + NORTHSTARS.md + MILESTONES.md (`docs/`, `../`, `docs/docs/` drift after MILESTONES + NORTHSTARS moved to docs/) - Documented 5 newly-emitted attributes in ATTRIBUTES.md (drop_ratio + IB tier — `attribute-namespace-check` now 67/67) ## Test plan - [x] `go test ./module/processor/patterndetectorprocessor/... 
./module/pkg/patterns/...` — ok - [x] `make lint` (golangci-lint via goreleaser-style gate) — 0 issues - [x] `go vet ./...` — clean - [x] `make doc-check` — passes after stale-link sweep - [x] `scripts/attribute-namespace-check.sh` — 67/67 documented - [x] `helm lint install/kubernetes/tracecore` — 0 chart(s) failed - [x] `promtool check rules` on slo-rules.yaml — 13 rules / SUCCESS - [ ] CI compat-matrix (rc1 criterion #6) — gated on next wave - [ ] manual smoke install on real cluster — owner clearance pending ```release-notes Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM fragmentation-vs-true discriminator), 8 pattern design specs for the remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy + Prometheus Operator ServiceMonitor on the Helm chart, the EvictedPod → PodName/PodNamespace verdict-attribute deprecation co-emit, per-detector allocs/event bench harness, SLO-rules label fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps, governance gaps, test audit, simplification audit, threat model, reference envs, adoption pipeline, standards roadmap). ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr added a commit that referenced this pull request on Jun 1, 2026
## Summary Closes #337. The CUDA OOM detector (`projectFBMemoryRecord` at `module/processor/patterndetectorprocessor/cuda_oom.go:114`) gates on `hw.gpu.memory.{free,total}` log-record attributes, but nothing in the recipe layer produced them: `dcgm-exporter` emits `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` as Prometheus gauges and no OTTL transform projected them onto the customer-stable namespace. Detector compiled, never fired on a real install — sibling gap to #273 (pattern #3), #282 (#4), #284 (#5). This PR closes the gap on the metric-side projection and pins the load-bearing log-shape contract the bridge layer MUST honor. - `docs/integrations/examples/prometheus-scrape.yaml`: - `DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used` (Gauge, unit `By`) - `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free` (Gauge, unit `By`) - Identity-preserving rename only; `hw.gpu.memory.total = used+free` deferred to the bridge layer per the named upstream limit (see below). - `docs/integrations/prometheus-scrape.md`: - New `### Pattern #10 — CUDA OOM (framebuffer)` metric-side projection section with raw-series → semconv table. - New `#### Pattern #10 — hw.gpu.memory.{free,total}` bridge- contract subsection with full log-record schema (yaml-shaped) matching what `projectFBMemoryRecord` reads, plus MIG caveat and unit-test cross-link. - Intro + bridge-contract header bumped to include pattern #10. - `docs/patterns/10-cuda-oom-deceptive.md`: - Signal-source line links to the recipe sections. - Open Question #1 (`DCGM_FI_DEV_FB_*` OTTL extension) struck through; resolution recorded. - `docs/ATTRIBUTES.md`: - `hw.gpu.memory.free` / `.total` rows updated to distinguish metric vs log shape and to cross-link to the recipe section. - New `hw.gpu.memory.used` row (now projected on the metrics pipeline by this PR — dashboard evidence context). 
## Root cause + named upstream limit **Root cause (fixed in this PR):** the prometheus-scrape OTTL transform had no stanza projecting the DCGM FB series onto `hw.gpu.memory.*`. The detector's projection gate could not be satisfied on a real install. Fixed by adding the rename stanza in `transform/dcgm_to_hw_semconv` (same processor the #1/#3/#4/#5 projections already live in — no new processor surface). **Named upstream limit (NOT worked around — tracked):** OTel-contrib `transformprocessor` v0.130 `metric_statements` cannot perform cross-series arithmetic — there is no OTTL path to compute `hw.gpu.memory.total = hw.gpu.memory.used + hw.gpu.memory.free` on a metrics pipeline ([upstream README](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)). Per RFC-0014 §Alternatives, the metrics→logs primitive does not exist in contrib at v0.130 either. The total scalar lives at the bridge layer (RFC-0014 PR-B `WithMetrics` extension to `patterndetectorprocessor`, tracked under #260). The recipe pins the load-bearing wire format the bridge MUST honor so PR-B lands without a contract change. ## Adopt-over-build posture Every new OTTL statement uses upstream functions only (`set`, `==` equality). No new transformprocessor extension. Mirrors the existing #1/#3/#4/#5 stanzas. ## Test plan - [x] `make build` — clean - [x] `./scripts/validator-recipe.sh` — 9 validated, 3 skipped (non-linux), 0 fail; prometheus-scrape example passes `tracecore validate`. - [x] `./scripts/doc-check.sh` — 721 markdown links resolve, all new cross-links + anchor refs included; banned-phrase lint clean; recipe markers (`tested-against`, `last-verified`) present. - [x] `./scripts/attribute-namespace-check.sh` — 67/67 attribute literals documented (no new undocumented attrs introduced). - [x] golangci-lint, go vet, go mod verify via commit hook — clean. 
- [ ] CI linux runner exercises journald + k8sobjects skip-paths we couldn't run locally (validator-recipe ubuntu job). ```release-notes recipe(ottl): project DCGM `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` onto the customer-stable `hw.gpu.memory.{used,free}` namespace and pin the metrics-to-logs bridge log-shape spec the pattern #10 (CUDA OOM) detector consumes via `projectFBMemoryRecord`. `hw.gpu.memory.total = used + free` derivation is deferred to the RFC-0014 PR-B `WithMetrics` bridge layer because OTTL `transformprocessor` v0.130 has no cross-series arithmetic on metrics pipelines. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
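To illustrate the cross-series arithmetic that the PR defers to the bridge layer, here is a minimal Go sketch of deriving `hw.gpu.memory.total = used + free` by joining the two series on `gpu.id`. It is not the RFC-0014 PR-B implementation; `deriveTotals` and its map-based inputs are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// deriveTotals joins the used and free series on gpu.id and returns
// total = used + free for every GPU present in both maps.
// Hypothetical helper: the real bridge layer's shape is not shown in the PR.
func deriveTotals(used, free map[string]uint64) map[string]uint64 {
	totals := make(map[string]uint64)
	for gpu, u := range used {
		f, ok := free[gpu]
		if !ok {
			continue // no matching free sample: emit nothing rather than guess
		}
		totals[gpu] = u + f
	}
	return totals
}

func main() {
	used := map[string]uint64{"gpu-0": 72 << 30, "gpu-1": 10 << 30}
	free := map[string]uint64{"gpu-0": 8 << 30}

	totals := deriveTotals(used, free)
	ids := make([]string, 0, len(totals))
	for id := range totals {
		ids = append(ids, id)
	}
	sort.Strings(ids) // stable output order for the example
	for _, id := range ids {
		fmt.Printf("%s hw.gpu.memory.total=%d\n", id, totals[id])
	}
}
```

The join is the step a per-series rename cannot express: each output value needs two input series for the same GPU, which is why the recipe stops at the identity-preserving renames.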
trilamsr added a commit that referenced this pull request on Jun 1, 2026
## Summary Ships pattern-#9 (NCCL bootstrap timeout) detector end-to-end — the first job-start-time pattern in the library (sibling to pattern #8 which fires mid-run). A training-job cohort whose pods are Ready past `BootstrapDeadline` (default 5min) but where at least one rank never emitted any NCCL FlightRecorder record is stuck in NCCL bootstrap; a same-namespace K8s CNI / network-readiness event in the correlation window promotes the verdict to `confidence=full` and stamps `discriminator=cni_error`. Spec: [`docs/patterns/09-nccl-bootstrap-timeout.md`](docs/patterns/09-nccl-bootstrap-timeout.md). Status flipped from `planned` → `shipped`; `Implementation notes` section captures how each spec open-question resolved with the most-conservative reading. ## What landed - `module/pkg/patterns/nccl_bootstrap.go` — detector + `TrainingPodRecord` / `CNINetworkEventRecord` / `NCCLBootstrapTimeoutVerdict` types. Reuses `NCCLFRRecord` from `nccl_hang.go`. - `module/pkg/patterns/nccl_bootstrap_test.go` — 11 detector tests + schema-conformance + 10-falsifier drift battery. Covers: full-correlation fires, partial-when-no-CNI, normal-startup-no-fire, deadline-not-yet-reached, heterogeneous failure, multi-job cohorts don't merge, namespace-only fallback, cross-namespace CNI doesn't join, deadline-configurable, deterministic ordering, max(ReadyAt) drives age. - `module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json` — JSON Schema with `additionalProperties:false` and full enum guards. - `module/processor/patterndetectorprocessor/nccl_bootstrap.go` — projections (`projectTrainingPodRecord` gates on `k8s.pod.ready_time` + `gen_ai.training.rank`; `projectCNINetworkEventRecord` gates on `k8s.event.reason` ∈ `{FailedCreatePodSandBox, NetworkNotReady, CNIError}`), verdict writer with promoted scalars (issue #270 contract), and runner that consumes NCCL FR records from the existing cross-cutting `collectInputs` (no double-projection). 
- `module/processor/patterndetectorprocessor/nccl_bootstrap_test.go` — 6 wiring tests (full verdict, partial verdict, partial-suppressed-by-flag, normal-startup-no-fire, sub-1s deadline rejection, sub-1s window rejection). - `Config.NCCLBootstrapDeadline` + `Config.NCCLBootstrapCorrelationWindow` with Validate guards (≥1s) and `withDefaults` / `defaultConfig` wiring; `example_config.yaml` updated. - `docs/ATTRIBUTES.md` — 3 new `tracecore.alert.nccl_bootstrap_timeout.*` rows, new `k8s.pod.ready_time` row, updated `gen_ai.training.job_id` row (now consumed with fallback), new per-pattern matrix row for `nccl_bootstrap`. ## Design calls (load-bearing) - **Cohort key.** `(gen_ai.training.job_id, k8s.namespace.name)` when stamped; `(k8s.namespace.name)`-only fallback when job_id is absent (spec open question #1). Empty `gen_ai.training.job_id` on the verdict signals the fallback path to operators. - **Bootstrap-failed-rank index key.** `(node, rank)` not `(namespace, rank)` — avoids cross-cohort contamination when two jobs in the same namespace land on different nodes. FR records with empty Node are skipped from the index (a wiring gap should NOT cause false-negatives — i.e. mask real bootstrap failures — even at the cost of cross-job false-positives that are unlikely in practice). - **CNI vocab.** v0 ships the K8s-control-plane vocabulary only (`FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError`). Per-CNI raw-error parsing (Cilium / Calico / multus distinct strings) is the discriminator-branch follow-up that lights up `socket_ifname_mismatch` / `rendezvous_unreachable`. - **Cohort size.** Count of distinct ranks the detector observed pod-Ready signals for. Pods that never reached Ready (image-pull stuck) don't enter the cohort — they belong to pattern #15. Per the spec's edge case "slow image pull" no false-positive. 
- **`max(ReadyAt)` drives deadline.** A late-joining rank pushes the effective ready timestamp forward, preventing false-positives during rolling pod-Ready scenarios on cold-cache clusters. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/...` — clean - [x] `cd module && go vet ./...` — clean - [x] Pre-commit hook: `golangci-lint run ./...` — 0 issues; `attribute-namespace-check` — 72/72 documented - [x] TDD discipline: `test(nccl-boot): RED` → `feat(nccl-boot): GREEN` commits - [ ] CI green on PR (full matrix) ```release-notes feat(patterns): pattern-9 (NCCL bootstrap timeout) detector — fires when a training-job cohort has at least one rank with no NCCL FR record past `BootstrapDeadline` from pod-ready (default 5min); a same-namespace `FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError` event promotes to `confidence=full` with `discriminator=cni_error`. New YAML knobs: `nccl_bootstrap_deadline` (default 5m), `nccl_bootstrap_correlation_window` (default 10m). Verdict shape pinned by `nccl_bootstrap_verdict.schema.json`. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
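The design calls in this PR (cohort key, `(node, rank)` index, `max(ReadyAt)` driving the deadline) can be sketched in Go as below. This is a simplified illustration under stated assumptions, not the code in `nccl_bootstrap.go`: `TrainingPod`, `cohortKey`, `nodeRank`, `silentRanks`, and `ts` are hypothetical names, the CNI-event leg is left out, and treating an age exactly equal to the deadline as "past" is a guess.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// TrainingPod is a hypothetical, trimmed-down version of the
// TrainingPodRecord named in the PR; the field names are assumptions.
type TrainingPod struct {
	JobID, Namespace, Node string
	Rank                   int
	ReadyAt                time.Time
}

// cohortKey is (job_id, namespace); an empty JobID is the namespace-only fallback.
type cohortKey struct{ JobID, Namespace string }

// nodeRank indexes ranks that emitted at least one FlightRecorder record.
type nodeRank struct {
	Node string
	Rank int
}

// ts builds an example timestamp from Unix seconds.
func ts(sec int64) time.Time { return time.Unix(sec, 0) }

// silentRanks returns, per cohort, the sorted ranks with no FR record once
// the cohort's newest ReadyAt is at least `deadline` old.
func silentRanks(pods []TrainingPod, frSeen map[nodeRank]bool, now time.Time, deadline time.Duration) map[cohortKey][]int {
	cohorts := make(map[cohortKey][]TrainingPod)
	for _, p := range pods {
		k := cohortKey{p.JobID, p.Namespace}
		cohorts[k] = append(cohorts[k], p)
	}
	out := make(map[cohortKey][]int)
	for k, members := range cohorts {
		latest := members[0].ReadyAt
		for _, p := range members[1:] {
			if p.ReadyAt.After(latest) {
				latest = p.ReadyAt // a late-joining rank pushes the deadline forward
			}
		}
		if now.Sub(latest) < deadline {
			continue // deadline not yet reached
		}
		var missing []int
		for _, p := range members {
			if !frSeen[nodeRank{p.Node, p.Rank}] {
				missing = append(missing, p.Rank)
			}
		}
		if len(missing) > 0 {
			sort.Ints(missing) // deterministic ordering
			out[k] = missing
		}
	}
	return out
}

func main() {
	pods := []TrainingPod{
		{JobID: "job-a", Namespace: "train", Node: "n0", Rank: 0, ReadyAt: ts(0)},
		{JobID: "job-a", Namespace: "train", Node: "n1", Rank: 1, ReadyAt: ts(100)},
	}
	frSeen := map[nodeRank]bool{{"n0", 0}: true}
	fmt.Println(silentRanks(pods, frSeen, ts(350), 5*time.Minute)) // rank 1 became Ready only 250s ago
	fmt.Println(silentRanks(pods, frSeen, ts(400), 5*time.Minute)) // newest ReadyAt is now 300s old
}
```

Anchoring on the newest `ReadyAt` rather than the oldest is what keeps the first call quiet: measured from rank 0 alone, the cohort would already look 350s old.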
This was referenced Jun 1, 2026
trilamsr pushed a commit that referenced this pull request on Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but the join shape varies across patterns and the rationale lived only in code comments + PR review threads. Operators tuning windows had to read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the causal physics of each signal), not bugs:
- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing every detector to the asymmetric two-knob form would silently zero one leg for the one-sided detectors (footgun on clock skew) and would not apply to #13 at all.

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13 (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md covering ALL 11 detectors with the predicate, anchor, and shape rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
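As a rough illustration of the five shapes named in this commit message, each can be written as a small time predicate in Go. The function names and the inclusive window edges are assumptions for the sketch; the shipped predicates are documented in docs/patterns/README.md and may differ in detail.

```go
package main

import (
	"fmt"
	"time"
)

// ts builds an example timestamp from Unix seconds.
func ts(sec int64) time.Time { return time.Unix(sec, 0) }

// oneSidedLookback: the cause must fall in [effect-window, effect].
func oneSidedLookback(cause, effect time.Time, window time.Duration) bool {
	d := effect.Sub(cause)
	return d >= 0 && d <= window
}

// asymmetricTwoSided: other must fall in [anchor-pre, anchor+post].
func asymmetricTwoSided(anchor, other time.Time, pre, post time.Duration) bool {
	d := other.Sub(anchor)
	return d >= -pre && d <= post
}

// symmetricTwoSided is the pre == post special case.
func symmetricTwoSided(anchor, other time.Time, window time.Duration) bool {
	return asymmetricTwoSided(anchor, other, window, window)
}

// jobWindowBounded: the event must fall inside the owning job's [start, end];
// there is no operator-tunable window at all.
func jobWindowBounded(start, end, event time.Time) bool {
	return !event.Before(start) && !event.After(end)
}

// trailingWindow: the record must fall in [now-window, now].
func trailingWindow(now, record time.Time, window time.Duration) bool {
	d := now.Sub(record)
	return d >= 0 && d <= window
}

func main() {
	anchor := ts(1000)
	fmt.Println("lookback, cause 30s before:", oneSidedLookback(ts(970), anchor, time.Minute))
	fmt.Println("lookback, cause 5s after:  ", oneSidedLookback(ts(1005), anchor, time.Minute))
	fmt.Println("symmetric, event 5s after: ", symmetricTwoSided(anchor, ts(1005), time.Minute))
	fmt.Println("asymmetric, post leg = 0:  ", asymmetricTwoSided(anchor, ts(1005), time.Minute, 0))
	fmt.Println("job-bounded, inside job:   ", jobWindowBounded(ts(900), ts(1100), anchor))
	fmt.Println("trailing, record 90s old:  ", trailingWindow(anchor, ts(910), time.Minute))
}
```

The fourth call shows the footgun the commit message mentions: expressing a one-sided detector in the two-knob form leaves one leg at zero, so an event that lands a few seconds on the wrong side of the anchor (for example through clock skew) is silently dropped.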
This was referenced Jun 1, 2026
trilamsr added a commit that referenced this pull request on Jun 2, 2026
…ard) (#477) ## Summary Closes the `docs/MILESTONES.md` §M6 carry-forward: *"every fenced block in `docs/getting-started.md` is exercised by `scripts/smoke.sh`"*. The ≤5-count gate shipped with the M6 wave; the binding half was tracked carry-forward because `smoke.sh` ran a parallel hand-written hostmetrics→debug config rather than the doc's actual YAML. ## Root cause Two scripts owned the "first OTLP byte" config — `smoke.sh` rendered one inline, `docs/getting-started.md` carried another. They happened to agree, but nothing forced them to. The carry-forward existed because the binding was *correct by inspection*, not *correct by construction*. The fix is to make the doc the single source: `smoke.sh` extracts the YAML from `docs/getting-started.md`'s `## Walkthrough` heredoc at runtime. If the doc grows a typo, a renamed receiver, or a different scraper, `smoke.sh` exercises the change automatically. If the heredoc disappears, the extractor fails loud with a named error. ## Changes - `scripts/smoke.sh` — extracts the Walkthrough heredoc via a perl one-liner, writes it to a tempfile, then runs `tracecore validate --config=` + `tracecore --config=` against it (Walkthrough steps 3 + 4). Lifecycle-log assertions retained, with `"Shutdown complete"` now load-bearing against the doc's post-walkthrough prose. - `scripts/doc-check.sh` — new gate (right after the existing ≤5-count gate) asserts the smoke↔doc binding with four mutation-verified clauses: Walkthrough scope, `"$BIN" validate --config=` invocation, `"$BIN" --config=` run invocation, `docs/getting-started.md` path reference. - `scripts/smoke_test.sh` — new mutation-verify harness mirroring the gate at runtime, plus an inline mutant-doc test that proves the extractor exits 1 and the wrapper emits the named error when the heredoc is removed. - `Makefile` — `make smoke` now also runs `smoke_test.sh`; wired into `ci-full` alongside the existing `smoke-quickstart` target. 
- `docs/MILESTONES.md`: §M6 status `⧗ partial` → `☑ delivered`; getting-started rubric `⧗` → `☑`; carry-forward bullet rewritten (remaining work is operator-config branch-protection only).

## Runtime

End-to-end `bash scripts/smoke.sh` on darwin/arm64: **~2.2s** (extract + validate + 1.5s run window + lifecycle-log assertions). Well under the 120s ci-fast budget. No hardware required: it uses the `hostmetrics` load scraper, portable across linux/darwin/windows.

## Test plan

```release-notes
ci(smoke): scripts/smoke.sh now extracts its YAML config from docs/getting-started.md '## Walkthrough' instead of carrying a parallel hand-written config; doc-check.sh gates the doc↔smoke binding with four mutation-verified clauses. Closes the M6 carry-forward.
```

- [x] `bash scripts/smoke.sh` exits 0 on clean main (verified locally, ~2.2s).
- [x] `bash scripts/smoke_test.sh` all assertions pass.
- [x] `bash scripts/doc-check.sh` reports `scripts/smoke.sh binds to docs/getting-started.md (M6: every block exercised by smoke.sh)`.
- [x] Mutation test #1: `sed -i 's/"$BIN" validate --config=/"$BIN" XXX --config=/' scripts/smoke.sh` → doc-check exits 1 naming "validate --config= invocation (Walkthrough step 3)".
- [x] Mutation test #2: `sed -i 's/"$BIN" --config=/"$BIN" XXX=/' scripts/smoke.sh` → doc-check exits 1 naming "run invocation (Walkthrough step 4)".
- [x] Mutation test #3: `sed -i 's/Walkthrough/Section/' scripts/smoke.sh` → doc-check exits 1 naming "extraction scope lost".
- [x] Mutation test #4: `sed -i 's|docs/getting-started.md|docs/SOMEWHERE-ELSE.md|' scripts/smoke.sh` → doc-check exits 1 naming "binding source missing".
- [x] Mutation test #5: getting-started.md with no `## Walkthrough` heredoc → smoke.sh exits 1 with named error message (covered by `smoke_test.sh`).
- [x] `make lint` 0 issues; `make vet` clean; `make doc-check` clean (all 18 gates pass).
- [x] `make smoke` end-to-end including `smoke_test.sh` passes.
## Related - Refs `docs/MILESTONES.md` §M6 (Documentation scaffold). - Sibling #460 (`fix(doc-check): drop unconditional exit 0`) made this carry-forward visible — before #460, the new gate would have been silently skipped by the line-99 short-circuit. Signed-off-by: Tri Lam <tree@lumalabs.ai>
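The single-source idea in this PR (extract the config from the doc, fail loudly if it is gone) can be sketched in Go as follows. Note the swap: the PR says `smoke.sh` uses a perl one-liner, and this is not that one-liner. The `extractWalkthroughConfig` function, the `EOF` heredoc delimiter, and the error text are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNoWalkthroughHeredoc is a hypothetical named error for the fail-loud path.
var errNoWalkthroughHeredoc = errors.New("no heredoc found under '## Walkthrough'")

// extractWalkthroughConfig returns the body of the first heredoc that starts
// inside the "## Walkthrough" section. It assumes the heredoc is opened with
// <<'EOF' or <<EOF and closed by a line containing only EOF.
func extractWalkthroughConfig(doc string) (string, error) {
	inSection, inHeredoc := false, false
	var body []string
	for _, line := range strings.Split(doc, "\n") {
		if inHeredoc {
			if strings.TrimSpace(line) == "EOF" {
				return strings.Join(body, "\n") + "\n", nil
			}
			body = append(body, line)
			continue
		}
		if strings.HasPrefix(line, "## ") {
			// Entering a new level-2 section; only Walkthrough is in scope.
			inSection = strings.TrimSpace(line) == "## Walkthrough"
			continue
		}
		if inSection && (strings.Contains(line, "<<'EOF'") || strings.Contains(line, "<<EOF")) {
			inHeredoc = true
		}
	}
	// Covers both a missing heredoc and one that is never terminated.
	return "", errNoWalkthroughHeredoc
}

func main() {
	doc := strings.Join([]string{
		"# Getting started",
		"## Walkthrough",
		"cat > config.yaml <<'EOF'",
		"receivers:",
		"  hostmetrics: {}",
		"EOF",
		"## Next steps",
	}, "\n")

	cfg, err := extractWalkthroughConfig(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(cfg)
}
```

Scoping the search to one heading is what makes the "extraction scope lost" mutation meaningful: a heredoc in any other section is ignored, so renaming the heading breaks extraction instead of silently picking up a different block.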
trilamsr added a commit that referenced this pull request on Jun 2, 2026
## Summary Adds a kubelet-probe ingress rule to the chart's `NetworkPolicy` template, closing **M5b chart opportunistic #1** (`docs/followups/M5b.md`). **Root cause.** Kubelet liveness/readiness probes originate from the node IP via the host-network namespace. NetworkPolicy v1 cannot match host-network traffic with `namespaceSelector` or `podSelector` peers — only with `ipBlock`. The existing chart `NetworkPolicy` (issue #301) carved ingress for in-namespace pods (scrape-in) but had no rule the kubelet matched. Result: a `networkPolicy.enabled: true` install would flip every DaemonSet pod `NotReady` within one `failureThreshold` window — the chart would render its own DaemonSet inoperable. **Fix.** New `networkPolicy.kubeletProbes.{enabled,cidr,except}` block. When enabled (default `true` when the policy is enabled), the template renders an `ipBlock` ingress rule on the `health` port (chart default `:13133`). Default `cidr: 0.0.0.0/0` is permissive on source IP but L4-scoped to the healthcheckextension port, so the telemetry + OTLP receiver ports stay locked down. Operators with a fixed node CIDR tighten it in their overlay. Production preset (`values-production.yaml`) inherits the default-on posture. Schema (`values.schema.json`) extended with `additionalProperties: false` so typos fail at `helm install`. ```release-notes chart: NetworkPolicy now carves a port-scoped `ipBlock` ingress rule for kubelet liveness/readiness probes (`networkPolicy.kubeletProbes.*`), so `networkPolicy.enabled: true` no longer breaks the DaemonSet's own readiness flow. Closes M5b chart opportunistic #1. ``` ## Cross-references - `docs/followups/M5b.md` — opportunistic-deferral list, item #1 ticked. - `docs/threat-model.md` §6.G — network-surface audit scope this template satisfies (listener inventory + default-deny verification). - `install/kubernetes/tracecore/README.md` §security — operator-facing values walkthrough updated. - Builds on `#301` (initial scrape-in + OTLP-out scope). 
## Files changed - `install/kubernetes/tracecore/templates/networkpolicy.yaml` — new `ipBlock` ingress rule + load-bearing comment block explaining why `0.0.0.0/0` stays narrow. - `install/kubernetes/tracecore/values.yaml` — new `networkPolicy.kubeletProbes` defaults + comment. - `install/kubernetes/tracecore/values-production.yaml` — inherits defaults explicitly with production-context comment. - `install/kubernetes/tracecore/values.schema.json` — schema for the new block, `additionalProperties: false`. - `install/kubernetes/tracecore/README.md` — three new values-table rows + updated NetworkPolicy section with threat-model cross-link. - `docs/followups/M5b.md` — item #1 ticked with implementation pointer. ## Test plan - [x] `helm lint install/kubernetes/tracecore` — exit 0. - [x] `helm lint install/kubernetes/tracecore -f values-production.yaml` — exit 0. - [x] `helm template install/kubernetes/tracecore` — exit 0; NetworkPolicy NOT rendered (default `enabled: false`). - [x] `helm template install/kubernetes/tracecore -f values-production.yaml` — exit 0; NetworkPolicy rendered with kubelet-probe ingress rule. - [x] **Mutation: enabled with empty `allowedEgressEndpoints`** — renders correctly (no DNS / probe rule loss). - [x] **Mutation: `kubeletProbes.enabled: false`** — probe rule omitted; scrape-in rule unchanged. - [x] **Mutation: tightened `cidr: 10.0.0.0/16` with `except: [10.0.99.0/24]`** — renders `ipBlock.cidr` + `ipBlock.except` correctly. - [x] `conftest test --policy policies/conftest/tracecore.rego` on default render — 52/52 passed. - [x] `conftest test --policy policies/conftest/tracecore.rego` on production render — 91/91 passed. - [x] `kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.30.0` on default render — 4 valid, 0 invalid. - [x] `kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.30.0` on production render — 6 valid, 0 invalid, 1 skipped (ServiceMonitor CRD). 
- [x] commit-msg hook gates: golangci-lint clean, go vet clean, go mod verify clean, attribute-namespace-check clean. ## Grade **A+** — root-cause fix, mutation-verified, conftest + kubeconform + helm-lint all clean, cross-linked to threat-model.md §6.G, explicit `policyTypes: [Ingress, Egress]` deny-all baseline documented inline, M5b checklist item ticked. --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request on Jun 2, 2026
M19 carry-forward #1 - ship the infrastructure that lets operators contribute anonymized pod_evicted captures under `module/pkg/replay/pod_evicted/_real_world/<anon-name>/`.

* `scripts/anonymize-pod-evicted-fixture.sh` - deterministic sha8 rewrite of event_uid / regarding.{namespace,name,uid} / reporting_instance / node_{name,uid}; verifier flags surviving IPv4 / email / cloud-instance-node / image-ref shapes in note + message prose.
* `scripts/anonymize-pod-evicted-fixture_test.sh` - mutation tests: baseline-clean passes; IPv4 / email / EC2 / GKE / ECR shapes fail verify; `v1.28.4`-style version strings do NOT false-positive; rewrite is deterministic (two passes byte-identical) and strips every raw input string.
* `synthetic-2026-06-multi-rank-disk-pressure/` - synthetic-but-real-world-shaped fixture exercising multi-rank disk-pressure burst with mixed full+partial confidence (third eviction at T+35s falls outside the 30s join window, partial-remediation path inferring disk pressure from note).
* `TestPodEvictedReplay_RealWorldGroupLoaderSafe` - asserts the loader walks `_real_world/` identically to `_negative/`; the synthetic fixture is the load-bearing proof of the loader path.
* README polished with the explicit PII-field map + cross-link to `docs/threat-model.md`; threat-model row updated to reflect the partial-shipped enforcement.
* `make ci-full` + `make verify` gain `anonymize-pod-evicted-fixture-check` so a PR that drops raw PII into `_real_world/` fails before merge.

```release-notes
feat: pod_evicted replay fixtures gain a deterministic PII anonymizer (`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic multi-rank disk-pressure fixture under `module/pkg/replay/pod_evicted/_real_world/`, closing M19 carry-forward welcome.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
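A deterministic sha8 rewrite of this kind might look like the Go sketch below. The shipped tool is a shell script, so this is an illustration of the idea rather than its implementation. The output shape `<prefix>-<sha8>` with a preserved `-rank-N` suffix follows the PR description for this work; using the first 8 hex characters of SHA-256, and hashing the value with the rank suffix stripped, are assumptions, as are the names `sha8` and `anonymize`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

// rankSuffix matches a trailing "-rank-N" that should survive anonymization.
var rankSuffix = regexp.MustCompile(`-rank-\d+$`)

// sha8 returns the first 8 hex characters of SHA-256(s).
// Assumption: the real script's hash function is not stated in the commit.
func sha8(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])[:8]
}

// anonymize rewrites value to "<prefix>-<sha8>" and re-appends any
// "-rank-N" suffix. Hashing only the stem (an assumption) keeps pods of
// the same workload on a shared anonymized stem.
func anonymize(prefix, value string) string {
	suffix := rankSuffix.FindString(value)
	stem := value[:len(value)-len(suffix)]
	return prefix + "-" + sha8(stem) + suffix
}

func main() {
	fmt.Println(anonymize("pod", "llama-train-rank-3"))
	fmt.Println(anonymize("pod", "llama-train-rank-4"))
	fmt.Println(anonymize("ns", "team-a-prod"))
}
```

Because the mapping is a pure function of the input, two passes over the same fixture are byte-identical, which is the property the mutation tests above pin down.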
trilamsr added a commit that referenced this pull request on Jun 2, 2026
## Summary

- Replace the `ErrPending` stub at `tools/failure-inject/ncclhang/` with a deterministic wrapper over `module/pkg/nccl/fr_parser.Synthesize`. Output is one of the canonical M11 hang fixtures (`nccl-2.29.x-hang` / `nccl-2.30.x-hang`), selected by `--seed mod 2`; bytes round-trip through `frparser.Parse` and a re-synthesize is byte-identical. Closes **M4b carry-forward #1**.
- Pin the new SHA in `tools/failure-inject/testdata/golden.sha256` so `chaos.yml`'s `harness-determinism` job (matrix `linux/amd64` + `linux/arm64`) replays the same argv on both arches and enforces cross-arch SHA equality. Closes **M4b carry-forward #2**.
- Flip ⧗ → ☑ on the two M4b functional rubrics (round-trip, safe-opcodes) and the M4b determinism non-functional rubric, plus the M11 synthetic-fixture-generator rubric. Remove the `failure-inject nccl-hang` follow-up from `docs/followups/M4b.md` and from M11's carry-forward list.

## Root cause

M4b shipped at v0.1 with the `nccl-hang` subcommand stubbed (`ErrPending`, exit 70) because `pkg/nccl/fr_parser/synthesize.go` was still pending under M11. M11 landed the synthesizer plus the canonical hang fixtures (`fixture229Hang`, `fixture230Hang`) in `module/pkg/nccl/fr_parser/`. The CLI shim was carry-forward; this PR is the wiring.

## What's in the diff

- `tools/failure-inject/ncclhang/ncclhang.go`: `Options{Seed uint64}`; `Run` selects a hang variant by `Seed % len(hangVariants)`, calls `FixtureSpec.Bytes()` (which delegates to `frparser.Synthesize`), writes to `w`. `ErrPending` deleted; `ctx.Err()` honoured before any write.
- `tools/failure-inject/main.go`: pass `Options{Seed: *c.flagSeed}` through to `ncclhang.Run`; drop the `errors.Is(err, ncclhang.ErrPending) → exit 70` branch.
- `tools/failure-inject/ncclhang/ncclhang_test.go`: RED → GREEN: `TestRun_RoundTrip` (synthesize → parse → re-synthesize byte-identical), `TestRun_SeedDeterminism` (same seed → same bytes, 4 seeds), `TestRun_SafeOpcodesOnly` (delegates to `frparser.Parse` as the safe-opcode oracle; a naive byte scan false-positives on opcode bytes inside `SHORT_BINUNICODE` string literals), `TestRun_CtxCancelled`.
- `tools/failure-inject/main_test.go`: replace `TestRun_NCCLHangReturnsNotImplemented` with `TestRun_NCCLHangRoundTrip` + `TestRun_NCCLHangSeedDeterminism` so the contract is pinned through the actual argv path too.
- `tools/failure-inject/testdata/golden.sha256`: add `failure-inject --seed=0 nccl-hang → e6f49920…`. The existing `TestRun_GoldenSHA` loop in `main_test.go` and the `Golden SHA pin` step in `chaos.yml` pick it up automatically.
- `docs/MILESTONES.md`: flip §M4b rubrics ⧗ → ☑ (round-trip, safe-opcodes, cross-arch determinism) and §M11 synthetic-fixture rubric; trim carry-forward list.
- `docs/followups/M4b.md`: mark the `nccl-hang` entry closed with the wiring-PR pointer.
- `tools/failure-inject/README.md`: add a `nccl-hang` section; remove `nccl-hang` from carve-outs (now only `pod-evict --allow-cluster-write` carves).
- `module/receiver/ncclfrreceiver/README.md`: replace stale `tracecore failure-inject` invocation with the actual `go run ./tools/failure-inject` path.

## Test plan

- [x] `go test -race -count=1 ./tools/failure-inject/...`: green (4 packages).
- [x] `(cd module && go test -race -count=1 ./pkg/nccl/fr_parser/...)`: green (no semantic change here, gate against accidental drift).
- [x] `go build ./... && (cd module && go build ./...)`: clean.
- [x] Pre-commit gates: `golangci-lint`, `go vet`, `go mod verify`, `attribute-namespace-check`: all 0 issues.
- [x] End-to-end determinism: `failure-inject --seed=0 nccl-hang | sha256sum` reproduces the pinned SHA (`e6f49920…`) twice in a row.
- [x] Seed variance: `--seed=1` produces a distinct SHA (`2788a726…`); `--seed=42` (42 mod 2 = 0) matches `--seed=0` per the documented modulo mapping.
- [x] `failure-inject nccl-hang --help` documents `--seed` and `--out` and the round-trip-through-`fr_parser` purpose.

## Self-grade

**A+**: round-trip green, determinism golden-SHA pinned, safe-opcode set verified via parser oracle, cross-arch SHA equality wired into existing `chaos.yml` matrix, MILESTONES.md flipped on four ⧗ rubrics, `M4b.md` follow-up closed with a pointer, doc drift swept.

```release-notes
tools(failure-inject): `nccl-hang` subcommand now produces parseable byte-deterministic NCCL FlightRecorder bytes via `pkg/nccl/fr_parser` (was a stub returning `ErrPending`). `--seed` flag selects variant + deterministic synthesis; cross-arch SHA enforced in `chaos.yml` (linux/amd64 + linux/arm64). Closes M4b carry-forward #1 + #2.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
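
The seed-to-variant mapping and the golden-SHA check described in this commit message can be illustrated with a small Go sketch. It is not the code from `ncclhang.go`: `pickVariant`, `synthesize`, and `digest` are hypothetical names, and `synthesize` returns placeholder bytes where the real tool would call into `frparser.Synthesize`. Only the `Seed % len(hangVariants)` selection and the two fixture names are taken from the text above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hangVariants lists the two fixture names mentioned in the commit message.
var hangVariants = []string{"nccl-2.29.x-hang", "nccl-2.30.x-hang"}

// pickVariant maps a seed onto a variant by modulo, as the commit describes.
func pickVariant(seed uint64) string {
	return hangVariants[seed%uint64(len(hangVariants))]
}

// synthesize is a stand-in (assumption) for the real synthesizer: it returns
// placeholder bytes that depend only on the selected variant.
func synthesize(seed uint64) []byte {
	return []byte("fixture:" + pickVariant(seed))
}

// digest returns the hex SHA-256 of b, the same kind of value a golden file pins.
func digest(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	for _, seed := range []uint64{0, 1, 42} {
		fmt.Printf("seed=%d variant=%s sha256=%s\n",
			seed, pickVariant(seed), digest(synthesize(seed)))
	}
}
```

Because the output depends only on `seed mod 2`, seeds 0 and 42 yield the same digest while seed 1 yields a different one, which mirrors the seed-variance item in the test plan.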
trilamsr added a commit that referenced this pull request Jun 2, 2026
#484)

## Summary

Closes the M19 carry-forward #1 *infrastructure* obligation: real-world `pod_evicted` replay captures can now be safely contributed.

- **Deterministic PII anonymizer**: `scripts/anonymize-pod-evicted-fixture.sh` (`--rewrite` rewrites `event_uid` / `regarding.{namespace,name,uid}` / `reporting_instance` / `node_{name,uid}` to `<prefix>-<sha8(value)>` while preserving `-rank-N` suffixes; `--verify` refuses any fixture still carrying IPv4, email, EC2/GKE/AKS, or AWS-ECR/GCR-style image-ref shapes in prose).
- **Mutation tests**: `scripts/anonymize-pod-evicted-fixture_test.sh` proves the verifier catches every PII shape it claims to catch, the rewrite is byte-deterministic across two passes, and false-positives stay quiet on innocent inputs (`v1.28.4`-style version strings).
- **Synthetic real-world-shaped fixture**: `module/pkg/replay/pod_evicted/_real_world/synthetic-2026-06-multi-rank-disk-pressure/` exercises a 3-pod disk-pressure burst with two full-confidence joins (per-condition cache reuse) + one partial-remediation eviction at T+35s (outside the default 30s `JoinWindow` → note-inferred pressure path).
- **Loader-symmetry test**: `TestPodEvictedReplay_RealWorldGroupLoaderSafe` now asserts the loader walks `_real_world/` exactly like `_negative/` and would catch a future refactor that broke either group walk.
- **Threat-model + MILESTONES** updated: the §7 audit row references the anonymizer; the M19 carry-forward bullet reflects what's shipped vs still pending (operator captures).

## Root cause being fixed

M19 carry-forward #1 was "no captures contributed yet", but the deeper blocker was that **no operator could safely contribute** without (a) a deterministic anonymizer they could rerun on their side, (b) a verifier strong enough to use as a CI gate, and (c) loader proof that `_real_world/` actually walks. This PR ships all three. Future captures plug in without code changes.

## Test plan

- [x] `go test ./module/pkg/replay/... -count=1` → all green; new `synthetic-2026-06-multi-rank-disk-pressure` subtest runs.
- [x] `bash scripts/anonymize-pod-evicted-fixture_test.sh` → 11 assertions pass (baseline clean, IPv4 / email / EC2 / GKE / ECR shapes flagged, version-string false-positive guarded, deterministic-rewrite byte-equal, every raw input string stripped, shipped fixture clean).
- [x] `make anonymize-pod-evicted-fixture-check` → wires verify + mutation tests together; exits 0.
- [x] `bash scripts/doc-check.sh` → unaffected, still clean.
- [x] `shellcheck` clean on both new scripts.
- [x] `go vet ./module/...` clean.

## Follow-up

- `cuda_oom`, `nccl_hang`, `hbm_ecc` and the other pattern detectors don't yet have `_real_world/` slots. The anonymizer is shaped to generalize (the structured-field map is the only pattern-specific bit; the prose-PII regex set is universal). Tracked as a follow-up issue once a second operator capture justifies the rule-of-three lift.

```release-notes
feat: pod_evicted replay fixtures gain a deterministic PII anonymizer (`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic multi-rank disk-pressure fixture under `module/pkg/replay/pod_evicted/_real_world/`, closing M19 carry-forward #1's infrastructure obligation. Operator-contributed captures still welcome.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
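
The `<prefix>-<sha8(value)>` rewrite with preserved `-rank-N` suffixes can be sketched in Go as follows. The shipped anonymizer is a shell script whose internals are not shown here, so this is an illustration under stated assumptions: `anonymize` and `rankSuffix` are hypothetical names, `sha8` is taken to mean the first 8 hex characters of a SHA-256 digest, and the hash is computed over the value with its rank suffix stripped.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

// rankSuffix matches a trailing "-rank-N" so it can be carried over verbatim.
var rankSuffix = regexp.MustCompile(`-rank-[0-9]+$`)

// anonymize maps value to "<prefix>-<sha8>" plus any preserved "-rank-N" suffix.
// Assumption: sha8 is the first 8 hex characters of SHA-256 over the value
// with the rank suffix removed; the real script may hash differently.
func anonymize(prefix, value string) string {
	suffix := rankSuffix.FindString(value) // "" when there is no rank suffix
	base := value[:len(value)-len(suffix)]
	sum := sha256.Sum256([]byte(base))
	return prefix + "-" + hex.EncodeToString(sum[:])[:8] + suffix
}

func main() {
	for _, name := range []string{"trainer-rank-0", "trainer-rank-1", "standalone-pod"} {
		fmt.Println(anonymize("pod", name))
	}
}
```

Hashing is a pure function of the input, so running the rewrite twice gives byte-identical output, which is the property the mutation tests in the commit message check for the real script.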
This was referenced Jun 2, 2026
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission validation (PSA-restricted × Kyverno × Gatekeeper × default+production) delivered negative ROI at rc1.

## Root cause

4 PRs blocked or chasing this workflow's flakes (#475 introduction, #481, #498, #501). Caught zero real regressions; only its own infra bugs:

- ServiceMonitor CRD bootstrap race (#494)
- AppArmor host-capability mismatch (#481 → #493)
- kubectl wait .status.conditions nil race (#500 → #501)

## Coverage retained (without policy-matrix)

- `conftest`: offline PSS-baseline + restricted validation.
- `helm lint`: chart structural validation.
- `kubeconform`: K8s API conformance.
- `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs): API-level breakage on generic kind cluster.

## What stays in tree

- `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs: cheap reactivation when GA triggers fire.
- `install/kubernetes/tracecore/policies/conftest/**`: offline policy bundle (still active).

## Re-enable triggers (tracked in #502)

- GA criterion #1 (third-party audit) requests engine-specific compat validation.
- First operator running under Kyverno/Gatekeeper reports admission rot.
- CRD-bootstrap pattern stabilises across other workflows.

## Test plan

- [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup action.yml).
- [x] No remaining policy-matrix.yml references in repo (verified by grep).
- [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace).
- [x] README + install-bench stale refs scrubbed (follow-up commit).

```release-notes
ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502.
```

Refs #502 #475 #494 #500.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Bumps the gh-actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| actions/checkout | `4` | `6` |
| actions/setup-go | `5` | `6` |
| golangci/golangci-lint-action | `6` | `9` |
| actions/upload-artifact | `4` | `7` |
| github/codeql-action | `3` | `4` |

Updates `actions/checkout` from 4 to 6

Release notes

Sourced from actions/checkout's releases.
... (truncated)

Changelog

Sourced from actions/checkout's changelog.
... (truncated)

Commits

- de0fac2 Fix tag handling: preserve annotations and explicit fetch-tags (#2356)
- 064fe7f Add orchestration_id to git user-agent when ACTIONS_ORCHESTRATION_ID is set (...
- 8e8c483 Clarify v6 README (#2328)
- 033fa0d Add worktree support for persist-credentials includeIf (#2327)
- c2d88d3 Update all references from v5 and v4 to v6 (#2314)
- 1af3b93 update readme/changelog for v6 (#2311)
- 71cf226 v6-beta (#2298)
- 069c695 Persist creds to a separate file (#2286)
- ff7abcd Update README to include Node.js 24 support details and requirements (#2248)
- 08c6903 Prepare v5.0.0 release (#2238)

Updates `actions/setup-go` from 5 to 6

Release notes

Sourced from actions/setup-go's releases.
... (truncated)

Commits

- 4a36011 docs: fix Microsoft build of Go link (#734)
- 8f19afc feat: add go-download-base-url input for custom Go distributions (#721)
- 27fdb26 Bump minimatch from 3.1.2 to 3.1.5 (#727)
- def8c39 Rearrange README.md, add advanced-usage.md (#724)
- 4b73464 Fix golang download url to go.dev (#469)
- a5f9b05 Update default Go module caching to use go.mod (#705)
- 7a3fe6c Bump qs from 6.14.0 to 6.14.1 (#703)
- b9adafd Bump actions/checkout from 5 to 6 (#686)
- d73f6bc README.md: correct to actions/checkout@v6 (#683)
- ae252ee Bump `@actions/cache` to v5 (#695)

Updates `golangci/golangci-lint-action` from 6 to 9

Release notes

Sourced from golangci/golangci-lint-action's releases.
... (truncated)

Commits

- 1e7e51e build(deps): bump yaml from 2.8.1 to 2.8.2 in the dependencies group (#1324)
- 5256ff0 build(deps-dev): bump the dev-dependencies group with 3 updates (#1323)
- 13fed6f chore: update workflows
- 7afe8ff chore: update workflows
- 5a92899 chore: move samples into fixtures (#1321)
- aa6fad0 feat: add version-file option (#1320)
- a6071aa build(deps): bump actions/checkout from 5 to 6 (#1318)
- 6e36c84 build(deps-dev): bump the dev-dependencies group with 2 updates (#1317)
- e7fa5ac feat: automatic module directories (#1315)
- f3ae99f docs: organize options (#1314)

Updates `actions/upload-artifact` from 4 to 7

Release notes

Sourced from actions/upload-artifact's releases.
... (truncated)

Commits

- 043fb46 Merge pull request #797 from actions/yacaovsnc/update-dependency
- 634250c Include changes in typespec/ts-http-runtime 0.3.5
- e454baa Readme: bump all the example versions to v7 (#796)
- 74fad66 Update the readme with direct upload details (#795)
- bbbca2d Support direct file uploads (#764)
- 589182c Upgrade the module to ESM and bump dependencies (#762)
- 47309c9 Merge pull request #754 from actions/Link-/add-proxy-integration-tests
- 02a8460 Add proxy integration test
- b7c566a Merge pull request #745 from actions/upload-artifact-v6-release
- e516bc8 docs: correct description of Node.js 24 support in README

Updates `github/codeql-action` from 3 to 4

Release notes

Sourced from github/codeql-action's releases.
... (truncated)

Changelog

Sourced from github/codeql-action's changelog.
... (truncated)

Commits

- 68bde55 Merge pull request #3885 from github/update-v4.35.4-803d9e8c3
- 9739ad2 Update changelog for v4.35.4
- 803d9e8 Merge pull request #3883 from github/mbg/test/macro-wrapper
- 0fd9c7d Merge pull request #3882 from github/dependabot/github_actions/dot-github/wor...
- 922d6fb Use `makeMacro` instead of `test.macro`
- df77e87 Update test macro snippet
- 6e3f985 Add wrapper for `test.macro`
- e7a347d Merge pull request #3881 from github/update-bundle/codeql-bundle-v2.25.4
- 17eabb2 Rebuild
- aaef09c Bump ruby/setup-ruby

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore <dependency name> major version` will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
- `@dependabot ignore <dependency name> minor version` will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
- `@dependabot ignore <dependency name>` will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
- `@dependabot unignore <dependency name>` will remove all of the ignore conditions of the specified dependency
- `@dependabot unignore <dependency name> <ignore condition>` will remove the ignore condition of the specified dependency and ignore conditions