Skip to content

[ci] Sync branch-protection.yml with relaxed review rule#8

Merged
trilamsr merged 1 commit into
mainfrom
chore/relax-review-yml
May 13, 2026
Merged

[ci] Sync branch-protection.yml with relaxed review rule#8
trilamsr merged 1 commit into
mainfrom
chore/relax-review-yml

Conversation

@trilamsr

Copy link
Copy Markdown
Contributor

What this PR does

The Wave 1 spec set required_approving_review_count: 1 and require_codeowners_review: true. With a sole maintainer who authors every PR, that combination is unsatisfiable — GitHub doesn't allow authors to approve their own PRs, so every PR gets blocked waiting on an approval that can never come.

The live protection on main was relaxed via gh api PATCH immediately before this commit:

required_approving_review_count: 0
require_code_owner_reviews:       false

This PR brings .github/branch-protection.yml in sync so the next run of scripts/apply-branch-protection.sh doesn't bounce the live rule back to "1 review required."

All other gates remain in force: linear history, signed commits, required status checks (verify, builds, Analyze go, govulncheck, release-notes block, no AI co-author trailer, dco sign-off), no force-push, no deletions, conversation resolution, enforce_admins true.

Inline comment names the bump-back condition explicitly: restore to 1 / true when NORTHSTARS.md §O7's M18 target (≥1 non-employee maintainer active) is met.

PRINCIPLES.md §4 backs the relaxation: "don't police what you don't have."

Linked issue(s)

N/A — governance-spec follow-up to the live-protection PATCH.

Release notes

NONE

Checklist

  • Tests added or updated — N/A (governance doc only)
  • make ci passes locally — N/A (no Go change)
  • Commits are signed off (git commit -s)
  • Commits cryptographically signed (SSH)
  • For new components, follows the layout required by STYLE.md — N/A (no new components)

The Wave 1 spec set required_approving_review_count: 1 and
require_codeowners_review: true. With a sole maintainer who
authors every PR, that combination is unsatisfiable — GitHub
won't let authors approve their own PRs, so every PR gets
stuck waiting for an approval that can never come.

The live protection on `main` was relaxed via gh api PATCH
just prior to this commit:

  required_approving_review_count: 0
  require_code_owner_reviews:       false

This commit brings the YAML checklist in sync so the next run
of scripts/apply-branch-protection.sh doesn't bounce the live
rule back to "1 review required."

All other gates remain in force:
- linear history
- signed commits
- required status checks (verify, builds, Analyze go,
  govulncheck, release-notes, no AI co-author, dco sign-off)
- no force-push, no deletions, conversation resolution
- enforce_admins (admin held to same rules as anyone else)

The inline comment names the bump-back condition explicitly:
restore to `1` / `true` when NORTHSTARS.md §O7's M18 target
of ≥1 non-employee maintainer is met. PRINCIPLES §4 backs
the relaxation — don't police what you don't have.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr enabled auto-merge (squash) May 13, 2026 18:56
@trilamsr trilamsr merged commit 7fdaf9e into main May 13, 2026
9 checks passed
@trilamsr trilamsr deleted the chore/relax-review-yml branch May 13, 2026 19:04
trilamsr added a commit that referenced this pull request May 13, 2026
## What this PR does

Companion to PR #8. With `strict: true` (the prior live setting), GitHub
required every open PR's branch to be physically rebased onto current
`main` before the merge button worked. Every upstream merge invalidated
all open PRs and forced a manual "Update branch" click — which created
an extra merge commit on the PR (the same one that caused the DCO
false-fail on PR #7 before the `dco-check.sh` fix landed).

With `strict: false` and `require_linear_history: true` already in
force, GitHub rebases the PR onto current `main` at merge time. No
manual update-branch dance; no spurious merge commits in PR history.

The live protection has already been PATCHed via `gh api`; this PR keeps
the YAML checklist in sync so the next run of
`scripts/apply-branch-protection.sh` doesn't bounce it back to `true`.

**Flip back to `true`** when there are concurrent contributors who could
produce semantically-conflicting passes (each PR green individually, the
combination broken). Sequential solo-maintainer work doesn't have that
failure mode.

## Linked issue(s)

N/A — governance-spec follow-up to the live-protection PATCH.

## Release notes

```release-notes
NONE
```

## Checklist

- [ ] Tests added or updated — N/A (governance doc only)
- [ ] `make ci` passes locally — N/A (no Go change)
- [x] Commits are signed off (`git commit -s`)
- [x] Commits cryptographically signed (SSH)
- [ ] For new components, follows the layout required by `STYLE.md` —
N/A (no new components)

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Address item #8: operators setting `telemetry.enabled: true` and
nothing else couldn't easily see what port tracecore would actually
bind to. The full config is in their YAML mentally, but the
default-resolution rules live in code. Surface them.

New `tracecore validate --config=foo.yaml --explain`:

  config valid: 1 pipeline(s)
    metrics/primary: receivers=1 processors=0 exporters=1

  resolved configuration (after defaults applied):
    telemetry.enabled:       true
    telemetry.listen:        localhost:8888
    telemetry.paths.metrics: /metrics
    telemetry.paths.healthz: /healthz
    telemetry.paths.readyz:  /readyz
    pipelines:
      metrics/primary:
        receivers:  [clockreceiver]
        exporters:  [stdoutexporter]

When telemetry is disabled the block notes "HTTP surface OFF; no
port bound" — so operators see telemetry.enabled defaulting to
false even without the YAML key present.

Two integration tests:
- TestIntegration_ValidateExplain_DumpsResolvedConfig: defaults
  resolve correctly for an "enabled: true with nothing else" YAML.
- TestIntegration_ValidateExplain_TelemetryOff: disabled branch
  surfaces the off state explicitly.

Behavior unchanged when --explain is absent; exit codes identical.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Closes R2.N1, R2.S1, R2.S2, R2.S5, B6 gate-bypass, and documents
the R2.S3 warnOnce-key-as-string decision with rationale.

R2.N1 — doc.go:21 still referenced RFC-0005 after the Commit A
rename. Updated to RFC-0007 with the renumber-history note.
Stale-reference class: the Commit A's grep ran on .md files
only; doc.go was a *.go reference my regex missed.

R2.S1 — example-daemonset.yaml: M2 /healthz + /readyz surface
exists but the example couldn't probe it. Added a commented-out
`ports:` + `livenessProbe:` + `readinessProbe:` block paired
with a `service.telemetry.metrics.address` config-side
prerequisite. Operators uncomment both ends together. The
defaults-OFF stance is intentional: a probe pointing at an
unbound port would CrashLoop the pod on first apply, exactly
what the M8 first-attempt review caught.

R2.S2 — imagePullPolicy flipped IfNotPresent → Always for the
:latest placeholder. The worst-of-both combination (drift on
first deploy + no auto-pull after) is replaced with predictable
dev-rollout behavior. Comment now points operators at the
SHA-digest production-pinning path.

R2.S5 — TestPrometheusAlerts_ReferenceWiredCounters now uses
word-boundary regex (`\btracecore_receiver_errors_total\b`)
instead of `Contains`, so a future metric named with one of
these as a prefix (e.g. `_total_foo`) can't satisfy the gate.

R2.S4 B6 gate-bypass — same test now also pins:
- `clamp_min` is present in the HighParseErrorRate expr (the
  quiet-host divisor floor R1.B6 fixed)
- An absolute-rate floor of the shape `> 0.<digits>` is also
  present (the second half of the two-part gate)
A future "simplification" that removes either floor and
re-introduces the quiet-host flap now fails at PR time.

R2.S3 — disagreed with the reviewer's suggestion to type the
warnOnce `key` as `selftelemetry.Kind`. Documented the disagreement
in the godoc with rationale: warnOnce keys are per-source
log-rate-limit buckets, separate from the counter-label Kind.
Both kmsg and journald sources `IncError(KindParse)` but each
needs its OWN log-storm gate — collapsing under one Kind would
silence the journald-side first-error log because the kmsg-side
fired first. Honest review pushback: this is a real semantic
distinction the reviewer flagged in good faith.

Three new FOLLOWUPS rows added with falsifiable triggers:
- #7 warnOnce recovery re-arm (design call deferred)
- #8 CI PR-body AI-mention scan
- #9 live-cluster k8s manifest validation in CI

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Two AGRADE floor items closed.

Floor #8 — argv-injection defense-in-depth on journald config.
exec.CommandContext doesn't invoke a shell, so values flowing
into the journalctl argv (Units, Identifiers, Matches) are
already shell-safe today. But a future refactor that reintroduced
a shell-quoted code path would bypass that safety; rejecting
shell-metacharacter-adjacent bytes at config-load is
defense-in-depth.

New validateJournalctlArgvSafety rejects values containing
NUL/\n/\r bytes and values starting with `-` (which journalctl
may mis-parse as a flag). TestConfig_Validate_RejectsJournaldArgvSafetyBytes
covers NUL/newline/CR/dash-prefix across units, identifiers,
matches (both keys and values), plus a positive case asserting
legitimate config still passes.

Floor #18 — predeclared lint enabled in .golangci.yml. Caught two
immediate shadows in dcgm test code:
- cardinality_fuzz_test.go: `cap := ...` (Go built-in) → renamed
  to `capacity`
- docs_parity_test.go: `close := ...` (Go built-in) → renamed to
  `closeIdx`
Both are test-only and have no API impact; cross-receiver fix
limited to two identifier renames. Going forward, any new code
that shadows a builtin fails the lint at PR time.

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Reviewer 1 (R3) flagged calibration drift in M9-AGRADE-GAP.md
(self ≈ 4.35 vs reviewer-anchored ≈ 4.0) plus tractable items
that close all remaining lens-below-4.0 gaps. This commit
closes all of them.

R2.S4 — pr-validation.yml's no-AI-co-author job now ALSO scans
the PR body for AI-attribution patterns (🤖, Generated with
Claude/Copilot, Generated by GPT/Gemini/Claude). The commit-
trailer scan can't see body content; squash-merge folds the
body into the merge commit, so AGENTS.md's no-AI-mention rule
needs enforcement at both layers. FOLLOWUPS #8 closed.

R3.N1 — config.go:17 stale RFC-0005 reference. The Commit A
grep walked .md files only; this round greps .go too and
caught three more stale references (config.go:17 + config.go:56
+ kernelevents.go:258). All updated to RFC-0007; the deliberate
breadcrumb in doc.go:23 remains as renumber history.

P2 — TestSourceTemplate_ParsesAsGo. Parser-based gate on
source_template.go.example confirms the file stays valid Go
syntax and retains its scaffolding identifiers (yourSource,
YourConfig, sourceYour, Start/Shutdown/run signatures). Parse-
only, no type checking, because the template references
internal-package types a copy-paste author renames; full type
check would defeat the rename safety. The class of bug an
author would actually hit (syntax error, broken func signature,
deleted scaffolding identifier) is caught.

Self-score recalibration in M9-AGRADE-GAP.md: now reports both
self (4.35) and reviewer-anchored (4.05) composites, with
explicit note that A/A- is the honest read until C8
(independent reviewer) closes. Per the honest-review-pushback
rule, when the author scores above the reviewer, the reviewer's
number is the one a third party should size effort against.

FOLLOWUPS #11 added: AST walker `resolveIncErrorCall` handles
SelectorExpr + Ident but not CallExpr cast form
(`selftelemetry.Kind("foo")`). Zero live call sites today
(grep-verified); future-tracked.

R3 nit on TestRUNBOOK_KindsMatchEmitted_PinsExpectedSet
docstring: updated to direct future authors to update `want`
DELIBERATELY (i.e. only after writing a RUNBOOK row), not
reflexively to silence the gate.

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 22, 2026
…ha) (#158)

## Summary

Implements M15 (RFC-0010): Kubernetes container stdout tail receiver.
Tails `/var/log/pods/**/*.log` per node, parses CRI text format,
attributes records to training rank via Pod informer + body-match
fallback, joins with M19 pod-evicted via EvictionMatchWindow, emits to
consumer.Logs.

Alpha - opt-in via `receivers.containerstdout.enabled=true` (default off
in Helm values).

## Phases

- **Phase 1:** doc.go, kind.go (7 Kinds), record.go (typed Record +
SchemaURLv0 + wire attrs), pattern_consumer_test
- **Phase 2:** Config + Validate (RE2 + ceiling + rank_source whitelist)
- **Phase 3:** Factory + components.yaml registration
- **Phase 4:** CRI parser + Stitcher + path resolver
- **Phase 5:** AttributionCache LRU + InformerSource
- **Phase 5.3:** process_rank_regex body-match fallback
- **Phase 6:** DataloaderExtractor (data_time_s / iter_time_s named
captures)
- **Phase 7:** Egress rate-limit (token bucket, LRU, namespace budgets)
- **Phase 8:** Cursor persistence (atomic write + fsync)
- **Phase 10:** Stdlib file Tailer (rotation + truncation detection)
- **Phase 11:** Pod informer (node + namespace scoped via
filteredIndexer)
- **Phase 12:** Receiver lifecycle wiring (lc.Add + recomputeDegraded OR
aggregator)
- **Phase 13:** 15 failure-mode test identifier pins (AGENTS.md lesson
#8)
- **Phase 13.5:** 4 hot-path benchmarks (alloc/lookup/regex/rate-limit)
- **Phase 13.6:** 10 falsifying failure-mode tests backfilled
- **Phase 14:** End-to-end pipeline integration (emit, JSON splat,
tailer pool, lines/s flush)
- **Phase 15:** Helm chart (RBAC + DaemonSet + values) + RUNBOOK +
FAILURE-MODES + prometheus alerts + conftest policy

## Deferred to FOLLOWUP

- **Phase 17:** Vendored `pkg/stanza/fileconsumer` swap - currently uses
stdlib Tailer (RFC-0010 §FOLLOWUPS)
- **Bench targets:** HotPath 5 vs 4 allocs/op; RegexExtraction 698 vs
500 ns/op - flagged for Phase 17 optimization

## Test plan

- [ ] CI: `make check` passes (fmt + tidy + lint + race tests across
full repo)
- [ ] Unit: `go test -race -count=2
./components/receivers/containerstdout/...` - all green
- [ ] Integration: 8 `TestPipeline_E2E_*` end-to-end tests pass
- [ ] Failure-mode pins: 10 PASS, 5 SKIP-deferred (Phase 14 integration
not yet exercised end-to-end via the SKIP'd ones)
- [ ] Helm: `helm template --set receivers.containerstdout.enabled=true`
renders DaemonSet + RBAC + NODE_NAME env
- [ ] Helm: default render does NOT include containerstdout resources
(opt-in verified)
- [ ] Conftest: 3 operational-invariant policies (RBAC, volumes,
NODE_NAME env)
- [ ] Prometheus: 8 alerts cover every Kind + composite degraded
- [ ] RUNBOOK: 7 Kind sections + recovery/rollback procedures

## Rollback

`helm upgrade --set receivers.containerstdout.enabled=false` removes
DaemonSet, RBAC, and config - verified via template diff.

```release-notes
Add Kubernetes container-stdout tail receiver (M15, RFC-0010): tails /var/log/pods, parses CRI text format, attributes lines to training rank via Pod informer + body-match fallback, joins with M19 pod-evicted. Alpha — opt-in via receivers.containerstdout.enabled.
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Eight 1-page pattern-design specs covering #2 IB link flap, #7
dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout,
#10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss
spike NaN, #13 silent data corruption. Each carries the standard
detector-design shape (symptom, layers, signal sources, evaluation
rule, verdict attrs, edge cases, status, open questions) so the next
contributor can write a TDD red test directly off the spec.

Status: all 8 marked planned. #10 already has issue #303; the spec
frames the design alongside.

NORTHSTARS Appendix A gains a Spec column; docs/README + patterns
README link the new specs.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Reviewer cleanup pass on yellow-tier findings:

* Schema: tighten gen_ai.training.job_id to minLength:1 and document
  the fallback-grouping semantic explicitly — the field is OMITTED
  (not empty-string) on the namespace-only fallback path. Downstream
  consumers must treat ABSENCE as the explicit fallback signal, not
  silent exclusion. processor already uses putStrIfSet to suppress
  empty-string variants.

* Spec eval rule: clarify Pattern #8 (NCCL hang) vs Pattern #9 (NCCL
  bootstrap) trigger disjoint-ness — hang fires on PRESENCE of
  non-completed FR records mid-run; bootstrap fires on ABSENCE of
  any FR record past deadline. Both can fire on the same cohort
  during a heterogeneous bootstrap by design.

* Spec impl-notes: add note #5 documenting the min(ReadyAt) /
  max(ReadyAt) split with the late-joiner-masks-stuck-rank
  scenario as the rationale.

* Empty-Node skip comment in detector: previous comment claimed the
  skip biases toward false-negatives; in fact it STRENGTHENS the
  absence signal (rank stays "no FR seen" → counted as failed),
  biasing toward false-positives. Correct the directionality and
  call out the fallback-to-(namespace, rank) escape hatch.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a
matrix-strategy `pattern-detectors` job covering every detector named
in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`,
`pod_evicted`, `nccl_hang`, `xid_correlation`.

Each row runs the detector's positive + negative + schema `*_test.go`
prefix under `-race`:

- **Positive / Edge asserts** — verdict emitted with correct
  `pattern.id` within bounded time.
- **Negative / SchemaRejectsDrift asserts** — zero verdicts on
  baseline / confounder input (the false-positive gate the chaos
  matrix exists to enforce).

The `pod_evicted` row additionally runs the on-disk replay corpus under
`module/pkg/replay/pod_evicted/`; sibling rows skip that step until
their `replay/<pattern>/` fixtures land — tracked as Path A in
[`docs/v1-rc1-test-audit.md`](../blob/main/docs/v1-rc1-test-audit.md)
§4.

## Why this matrix shape, not new `failure-inject` modes

Audit §4 explicitly recommends Path A (matrix over existing
hermetic detector tests) for rc1 and defers Path B (per-pattern
`failure-inject` CLI modes) to v1.x. Path A reuses the
race-detector + golden-SHA shape already validated by
`pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of
five new CLI subcommands.

## Wall-time budget

Observed per-row runtime under `-race` (measured locally on this
worktree):

| Row | runtime |
|---|---|
| hbm_ecc | 1.35s |
| thermal_throttle | 1.33s |
| pcie_aer | 1.30s |
| pod_evicted (incl. replay corpus) | 1.35s + 1.34s |
| nccl_hang | 1.31s |
| xid_correlation | 1.32s |

Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism
keeps total wall-time well inside the chaos-gate budget.

## Test plan

- [x] `actionlint .github/workflows/chaos.yml` — clean.
- [x] `python3 -c "import yaml;
yaml.safe_load(open('.github/workflows/chaos.yml'))"` — parses, 3 jobs
(`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row
matrix.
- [x] For every matrix row, `cd module && go test -race -count=1 -run
'<regex>' ./pkg/patterns/...` — all green (counts: HBMECC=16,
ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12,
NCCLHang=10, XidCorrelation=14).
- [x] `cd module && go test -race -count=1 ./pkg/replay/...`
(pod_evicted corpus step) — green.
- [x] Pre-commit hook: golangci-lint + go vet + go mod verify +
attribute-namespace-check — all green.
- [ ] Workflow runs green on `main` after merge (verifiable only
post-merge — nightly cron + push-to-main).

```release-notes
ci: chaos.yml now matrix-tests all six shipped detectors (#14
pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5
pcie_aer, xid_correlation) instead of just pod_evicted. Closes the
chaos-coverage gap called out in v1-rc1-test-audit §4. No
operator-visible behavior change.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #314, #315.

- Adds `step-security/harden-runner@v2.19.4` (SHA-pinned) in `audit`
mode to the `package`, `ko-publish`, and `chart-publish` jobs. Every
release-tag run now uploads a network-trace artifact verifiers can
cross-check against the SLSA provenance bundle (issue #315 acceptance
criterion).
- Documents the SLSA L3 posture explicitly: workflow header,
`docs/threat-model.md`, and `docs/v1-rc1-operational-gaps.md` all
reflect the generic-generator-vs-Go-builder decision with the upstream
blocker named.
- Wires an SLSA L3 verification step into the
`docs/RELEASE-CHECKLIST.md` tag-cut flow (`gh attestation verify` +
`slsa-verifier` + harden-runner audit artifact inspection).

## Why not migrate to `builder_go_slsa3.yml` (#314 acceptance gap)

Per the user prompt's deliverable 1, the explicit fallback
`generator_generic_slsa3.yml` is permitted "if our build isn't a
straight `go build`". Tracecore's build is OCB-driven: `make build`
invokes the OpenTelemetry Collector Builder to regenerate
`./_build/{main.go,components.go,go.mod,go.sum}` per-arch. There is no
top-level Go module that produces the binary, so the Go builder reusable
workflow — which reads `.slsa-goreleaser.yml` and runs `go build` inside
its controlled env — has nothing to compile.

Root cause: upstream `slsa-framework/slsa-github-generator` has no
pre-build hook for ecosystem builders to run a generator-style step
(`ocb generate`) before `go build`. Tracked at
[slsa-framework/slsa-github-generator#3033](slsa-framework/slsa-github-generator#3033)
and now called out in:

- `.github/workflows/release.yml` header comment (lines 51-68)
- `docs/v1-rc1-operational-gaps.md` §1 step 1 (marked
deferred-upstream-blocked, not "to-do")
- `docs/threat-model.md` row 9 (build-provenance asset)

What we ship instead, all in this PR:

| Track | Status |
|---|---|
| `slsa-github-generator@v2.1.0` generic-generator path | already in
place; SLSA3+ provenance attestation (Sigstore-signed by Fulcio under
workflow OIDC identity, recorded in Rekor) |
| `actions/attest-build-provenance@v4.1.0` redundant L3-grade trail |
already in place |
| `step-security/harden-runner@v2.19.4` egress audit on every
release-tag run | **new in this PR** (covers #315 fully) |

The *attestation infrastructure* is L3-grade; the *build platform* is
what's L2 in spec-strict terms because the package job runs outside the
generator's controlled env. The harden-runner audit artifact gives
verifiers a network-trace they can cross-check against the predicate's
declared inputs — the next-best evidence short of a hermetic in-env
build.

## Files changed

- `.github/workflows/release.yml` — three
`step-security/harden-runner@9af89fc71515a100421586dfdb3dc9c984fbf411 #
v2.19.4` steps (one per job: `package`, `ko-publish`, `chart-publish`);
header-comment block expanded with the SLSA L3 rationale.
- `docs/RELEASE-CHECKLIST.md` — new tag-cut step #8 with three
verification commands (`gh attestation verify`, `slsa-verifier
verify-artifact`, harden-runner audit-artifact inspection).
- `docs/threat-model.md` — supply-chain actor row + build-provenance
asset row now name the actual control stack.
- `docs/v1-rc1-operational-gaps.md` — §1 remediation steps 1/2/3 updated
with done/deferred status linked to specific issue numbers.

## Validation

```
$ python3 -c "import yaml; yaml.safe_load(open('.github/workflows/release.yml'))"
# clean

$ actionlint .github/workflows/release.yml
# clean

$ zizmor .github/workflows/release.yml
9 findings (3 ignored, 2 suppressed, 4 unsafe fixes): 0 informational, 4 low, 0 medium, 0 high
```

Zizmor: same 9 findings as pre-PR baseline (verified via `git stash` +
re-run). The 4 low-confidence `artipacked` findings are on pre-existing
`actions/checkout` steps not touched by this PR.

Pre-commit hooks (`make check`: fmt, tidy, golangci-lint, vet,
mod-verify, attribute-namespace-check) green.

## Per repo memory `feedback_goreleaser_pro_features`

All workflow changes use OSS-syntax actions only:
- `step-security/harden-runner` is OSS Apache-2.0 (Tier-1 + Tier-2
dashboards are free for public repos)
- `slsa-github-generator` is OSS Apache-2.0
- No Pro-only `prebuilt:` / `include:` / `nightly:` / `changelog:`
config introduced

## Test plan

- [ ] Cut a `v0.x.y-rc` tag against this branch in a sandbox fork and
confirm:
- [ ] `step-security/harden-runner` artifact appears in the workflow run
summary for each of `package`, `ko-publish`, `chart-publish` jobs
- [ ] Audit artifact lists egress to the expected endpoints only
(`proxy.golang.org`, `sum.golang.org`, `ghcr.io`, `gcr.io`,
`fulcio.sigstore.dev`, `rekor.sigstore.dev`)
- [ ] `gh attestation verify --predicate-type
https://slsa.dev/provenance/v1 tracecore_<ver>_linux_amd64.tar.gz`
succeeds against the published archive
- [ ] `slsa-verifier verify-artifact --provenance-path
tracecore_<ver>_linux_amd64.intoto.jsonl --source-uri
github.com/TraceCoreAI/tracecore --source-tag v<ver>` succeeds

```release-notes
ci(release): SLSA L3 provenance verification + step-security/harden-runner egress audit added to release.yml. Every release-tag run now uploads a network-trace audit artifact alongside the SLSA provenance bundle. docs/RELEASE-CHECKLIST.md gains an explicit L3 verification step.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Ships pattern-#9 (NCCL bootstrap timeout) detector end-to-end — the
first job-start-time pattern in the library (sibling to pattern #8 which
fires mid-run). A training-job cohort whose pods are Ready past
`BootstrapDeadline` (default 5min) but where at least one rank never
emitted any NCCL FlightRecorder record is stuck in NCCL bootstrap; a
same-namespace K8s CNI / network-readiness event in the correlation
window promotes the verdict to `confidence=full` and stamps
`discriminator=cni_error`.

Spec:
[`docs/patterns/09-nccl-bootstrap-timeout.md`](docs/patterns/09-nccl-bootstrap-timeout.md).
Status flipped from `planned` → `shipped`; `Implementation notes`
section captures how each spec open-question resolved with the
most-conservative reading.

## What landed

- `module/pkg/patterns/nccl_bootstrap.go` — detector +
`TrainingPodRecord` / `CNINetworkEventRecord` /
`NCCLBootstrapTimeoutVerdict` types. Reuses `NCCLFRRecord` from
`nccl_hang.go`.
- `module/pkg/patterns/nccl_bootstrap_test.go` — 11 detector tests +
schema-conformance + 10-falsifier drift battery. Covers:
full-correlation fires, partial-when-no-CNI, normal-startup-no-fire,
deadline-not-yet-reached, heterogeneous failure, multi-job cohorts don't
merge, namespace-only fallback, cross-namespace CNI doesn't join,
deadline-configurable, deterministic ordering, max(ReadyAt) drives age.
- `module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json` —
JSON Schema with `additionalProperties:false` and full enum guards.
- `module/processor/patterndetectorprocessor/nccl_bootstrap.go` —
projections (`projectTrainingPodRecord` gates on `k8s.pod.ready_time` +
`gen_ai.training.rank`; `projectCNINetworkEventRecord` gates on
`k8s.event.reason` ∈ `{FailedCreatePodSandBox, NetworkNotReady,
CNIError}`), verdict writer with promoted scalars (issue #270 contract),
and runner that consumes NCCL FR records from the existing cross-cutting
`collectInputs` (no double-projection).
- `module/processor/patterndetectorprocessor/nccl_bootstrap_test.go` — 6
wiring tests (full verdict, partial verdict, partial-suppressed-by-flag,
normal-startup-no-fire, sub-1s deadline rejection, sub-1s window
rejection).
- `Config.NCCLBootstrapDeadline` +
`Config.NCCLBootstrapCorrelationWindow` with Validate guards (≥1s) and
`withDefaults` / `defaultConfig` wiring; `example_config.yaml` updated.
- `docs/ATTRIBUTES.md` — 3 new
`tracecore.alert.nccl_bootstrap_timeout.*` rows, new
`k8s.pod.ready_time` row, updated `gen_ai.training.job_id` row (now
consumed with fallback), new per-pattern matrix row for
`nccl_bootstrap`.

## Design calls (load-bearing)

- **Cohort key.** `(gen_ai.training.job_id, k8s.namespace.name)` when
stamped; `(k8s.namespace.name)`-only fallback when job_id is absent
(spec open question #1). Empty `gen_ai.training.job_id` on the verdict
signals the fallback path to operators.
- **Bootstrap-failed-rank index key.** `(node, rank)` not `(namespace,
rank)` — avoids cross-cohort contamination when two jobs in the same
namespace land on different nodes. FR records with empty Node are
skipped from the index (a wiring gap should NOT cause false-negatives —
i.e. mask real bootstrap failures — even at the cost of cross-job
false-positives that are unlikely in practice).
- **CNI vocab.** v0 ships the K8s-control-plane vocabulary only
(`FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError`). Per-CNI
raw-error parsing (Cilium / Calico / multus distinct strings) is the
discriminator-branch follow-up that lights up `socket_ifname_mismatch` /
`rendezvous_unreachable`.
- **Cohort size.** Count of distinct ranks the detector observed
pod-Ready signals for. Pods that never reached Ready (image-pull stuck)
don't enter the cohort — they belong to pattern #15. Per the spec's edge
case "slow image pull" no false-positive.
- **`max(ReadyAt)` drives deadline.** A late-joining rank pushes the
effective ready timestamp forward, preventing false-positives during
rolling pod-Ready scenarios on cold-cache clusters.

## Test plan

- [x] `cd module && go test ./pkg/patterns/...
./processor/patterndetectorprocessor/...` — clean
- [x] `cd module && go vet ./...` — clean
- [x] Pre-commit hook: `golangci-lint run ./...` — 0 issues;
`attribute-namespace-check` — 72/72 documented
- [x] TDD discipline: `test(nccl-boot): RED` → `feat(nccl-boot): GREEN`
commits
- [ ] CI green on PR (full matrix)

```release-notes
feat(patterns): pattern-9 (NCCL bootstrap timeout) detector — fires when a training-job cohort has at least one rank with no NCCL FR record past `BootstrapDeadline` from pod-ready (default 5min); a same-namespace `FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError` event promotes to `confidence=full` with `discriminator=cni_error`. New YAML knobs: `nccl_bootstrap_deadline` (default 5m), `nccl_bootstrap_correlation_window` (default 10m). Verdict shape pinned by `nccl_bootstrap_verdict.schema.json`.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Bundle three post-wave-audit docs-cleanup issues
(`docs/v1-rc1-post-wave-audit.md` findings #5, #8, #15). One commit per
issue for clean revert / history. Fourth audit cleanup (#380)
intentionally deferred — precondition unmet.

## Per-issue scope + delta

### `chore(module): trim stale PR-I genesis-tag rationale` (closes #381)
Audit finding #8. `module/doc.go` referenced PR-I.1a/1b/2 as "Contents
land in…" and explained the genesis-tag proxy rationale. All three PRs
landed long ago; the module is alive. Trimmed to package-doc only.
- **Before:** 14 lines (8-line historical-artifact block).
- **After:** 6 lines (`package module` declaration only, retained for
proxy compat).
- `go list -m github.com/tracecoreai/tracecore/module` still resolves.

### `docs(rc1): sweep strikethrough resolved-in-issue blocks` (closes
#383)
Audit finding #15. The two specific examples called out by the issue:
- `docs/patterns/10-cuda-oom-deceptive.md:74` — `~~DCGM_FI_DEV_FB_* OTTL
recipe extension.~~` referenced closed #337 (verified via `gh issue view
337` → CLOSED).
- `docs/v1-rc1-simplification-audit.md:185-186` —
`~~components/exporters/otlphttp/~~` +
`~~components/exporters/stdoutexporter/~~` referenced closed #333 / #334
(both verified CLOSED).

Resolution text retained as cross-link history; only the strike-through
visual noise removed.

**Explicitly NOT touched:** `docs/followups/M*.md`,
`docs/MILESTONES.md`, `docs/rfcs/0003-*.md` strikethroughs — these are
documented convention markers per `MILESTONES.md:65,69` ("A PR that
completes a follow-up strikes it through (`~~…~~`)"). Sweeping them
would delete a load-bearing convention.

### `chore(config): document pattern-prefixed knob naming convention`
(closes #379)
Audit finding #5. `config.go` (540 lines) carries three knob styles. The
post-wave prefixed shape is right; renaming bare→prefixed pre-RC1 would
break every existing `values.yaml`. Take option 1 from the issue:
document the convention at top-of-file so future detectors follow it.
- **Before:** No top-of-file knob-naming guidance.
- **After:** 12-line `// Knob-naming convention` block before `package
patterndetectorprocessor`, naming bare-name knobs explicitly and
pointing forward to v2.0 nesting.

## Why #380 is NOT in this bundle

#380 sweeps 12 "future PR-B" comments referencing the RFC-0014
`WithMetrics` bridge. Its own acceptance criteria say "Closes with
PR-B's merge commit."

Precondition check today:
- `git log --all --grep=WithMetrics` → only wave-3 port commits, no
bridge.
- `docs/rfcs/0014-metrics-to-logs-pattern-input.md:3` → status
`accepted, blocks issue #260 PR-B`.
- `docs/integrations/prometheus-scrape.md:381` → "PR-B has NOT yet
shipped."

All 12 comments are factually accurate today. Sweeping prematurely would
either delete load-bearing forward-looking documentation, or replace it
with stale post-PR-B prose. Issue stays open as tracker per its design;
commented on issue noting the deferral.

## Verification

- `go build ./...` (module) — clean.
- `go vet ./...` — clean (via pre-commit).
- `go list -m github.com/tracecoreai/tracecore/module` — resolves.
- `golangci-lint run ./...` — 0 issues (via pre-commit).
- `attribute-namespace-check` — 100/100 (via pre-commit).
- `gh issue view 337 333 334` — all CLOSED (justifies #383 sweep).

## Test plan

- [x] Module builds + vets.
- [x] Referenced closed-issue states verified.
- [x] Strikethrough convention markers in `docs/followups/*` /
`MILESTONES.md` / RFC-0003 left intact (deliberate scope-limit).
- [x] `#380` deferred with on-issue rationale; not in `Closes` trailer.
- [ ] CI green.

```release-notes
NONE
```

Closes #383
Closes #381
Closes #379

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but
the join shape varies across patterns and the rationale lived only
in code comments + PR review threads. Operators tuning windows had to
read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the
causal physics of each signal), not bugs:

- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start
  checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window
  could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the
  bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window
  anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing
every detector to the asymmetric two-knob form would silently
zero one leg for the one-sided detectors (footgun on clock skew)
and would not apply to #13 at all.

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13
  (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md
  covering ALL 11 detectors with the predicate, anchor, and shape
  rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant