[ci] Sync branch-protection.yml with relaxed review rule#8
Merged
Conversation
The Wave 1 spec set required_approving_review_count: 1 and require_codeowners_review: true. With a sole maintainer who authors every PR, that combination is unsatisfiable — GitHub won't let authors approve their own PRs, so every PR gets stuck waiting for an approval that can never come. The live protection on `main` was relaxed via gh api PATCH just prior to this commit: required_approving_review_count: 0 require_code_owner_reviews: false This commit brings the YAML checklist in sync so the next run of scripts/apply-branch-protection.sh doesn't bounce the live rule back to "1 review required." All other gates remain in force: - linear history - signed commits - required status checks (verify, builds, Analyze go, govulncheck, release-notes, no AI co-author, dco sign-off) - no force-push, no deletions, conversation resolution - enforce_admins (admin held to same rules as anyone else) The inline comment names the bump-back condition explicitly: restore to `1` / `true` when NORTHSTARS.md §O7's M18 target of ≥1 non-employee maintainer is met. PRINCIPLES §4 backs the relaxation — don't police what you don't have. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
May 13, 2026
## What this PR does Companion to PR #8. With `strict: true` (the prior live setting), GitHub required every open PR's branch to be physically rebased onto current `main` before the merge button worked. Every upstream merge invalidated all open PRs and forced a manual "Update branch" click — which created an extra merge commit on the PR (the same one that caused the DCO false-fail on PR #7 before the `dco-check.sh` fix landed). With `strict: false` and `require_linear_history: true` already in force, GitHub rebases the PR onto current `main` at merge time. No manual update-branch dance; no spurious merge commits in PR history. The live protection has already been PATCHed via `gh api`; this PR keeps the YAML checklist in sync so the next run of `scripts/apply-branch-protection.sh` doesn't bounce it back to `true`. **Flip back to `true`** when there are concurrent contributors who could produce semantically-conflicting passes (each PR green individually, the combination broken). Sequential solo-maintainer work doesn't have that failure mode. ## Linked issue(s) N/A — governance-spec follow-up to the live-protection PATCH. ## Release notes ```release-notes NONE ``` ## Checklist - [ ] Tests added or updated — N/A (governance doc only) - [ ] `make ci` passes locally — N/A (no Go change) - [x] Commits are signed off (`git commit -s`) - [x] Commits cryptographically signed (SSH) - [ ] For new components, follows the layout required by `STYLE.md` — N/A (no new components) Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Address item #8: operators setting `telemetry.enabled: true` and nothing else couldn't easily see what port tracecore would actually bind to. The full config is in their YAML mentally, but the default-resolution rules live in code. Surface them. New `tracecore validate --config=foo.yaml --explain`: config valid: 1 pipeline(s) metrics/primary: receivers=1 processors=0 exporters=1 resolved configuration (after defaults applied): telemetry.enabled: true telemetry.listen: localhost:8888 telemetry.paths.metrics: /metrics telemetry.paths.healthz: /healthz telemetry.paths.readyz: /readyz pipelines: metrics/primary: receivers: [clockreceiver] exporters: [stdoutexporter] When telemetry is disabled the block notes "HTTP surface OFF; no port bound" — so operators see telemetry.enabled defaulting to false even without the YAML key present. Two integration tests: - TestIntegration_ValidateExplain_DumpsResolvedConfig: defaults resolve correctly for an "enabled: true with nothing else" YAML. - TestIntegration_ValidateExplain_TelemetryOff: disabled branch surfaces the off state explicitly. Behavior unchanged when --explain is absent; exit codes identical. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Closes R2.N1, R2.S1, R2.S2, R2.S5, B6 gate-bypass, and documents the R2.S3 warnOnce-key-as-string decision with rationale. R2.N1 — doc.go:21 still referenced RFC-0005 after the Commit A rename. Updated to RFC-0007 with the renumber-history note. Stale-reference class: the Commit A's grep ran on .md files only; doc.go was a *.go reference my regex missed. R2.S1 — example-daemonset.yaml: M2 /healthz + /readyz surface exists but the example couldn't probe it. Added a commented-out `ports:` + `livenessProbe:` + `readinessProbe:` block paired with a `service.telemetry.metrics.address` config-side prerequisite. Operators uncomment both ends together. The defaults-OFF stance is intentional: a probe pointing at an unbound port would CrashLoop the pod on first apply, exactly what the M8 first-attempt review caught. R2.S2 — imagePullPolicy flipped IfNotPresent → Always for the :latest placeholder. The worst-of-both combination (drift on first deploy + no auto-pull after) is replaced with predictable dev-rollout behavior. Comment now points operators at the SHA-digest production-pinning path. R2.S5 — TestPrometheusAlerts_ReferenceWiredCounters now uses word-boundary regex (`\btracecore_receiver_errors_total\b`) instead of `Contains`, so a future metric named with one of these as a prefix (e.g. `_total_foo`) can't satisfy the gate. R2.S4 B6 gate-bypass — same test now also pins: - `clamp_min` is present in the HighParseErrorRate expr (the quiet-host divisor floor R1.B6 fixed) - An absolute-rate floor of the shape `> 0.<digits>` is also present (the second half of the two-part gate) A future "simplification" that removes either floor and re-introduces the quiet-host flap now fails at PR time. R2.S3 — disagreed with the reviewer's suggestion to type the warnOnce `key` as `selftelemetry.Kind`. Documented the disagreement in the godoc with rationale: warnOnce keys are per-source log-rate-limit buckets, separate from the counter-label Kind. Both kmsg and journald sources `IncError(KindParse)` but each needs its OWN log-storm gate — collapsing under one Kind would silence the journald-side first-error log because the kmsg-side fired first. Honest review pushback: this is a real semantic distinction the reviewer flagged in good faith. Three new FOLLOWUPS rows added with falsifiable triggers: - #7 warnOnce recovery re-arm (design call deferred) - #8 CI PR-body AI-mention scan - #9 live-cluster k8s manifest validation in CI Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Two AGRADE floor items closed. Floor #8 — argv-injection defense-in-depth on journald config. exec.CommandContext doesn't invoke a shell, so values flowing into the journalctl argv (Units, Identifiers, Matches) are already shell-safe today. But a future refactor that reintroduced a shell-quoted code path would bypass that safety; rejecting shell-metacharacter-adjacent bytes at config-load is defense-in-depth. New validateJournalctlArgvSafety rejects values containing NUL/\n/\r bytes and values starting with `-` (which journalctl may mis-parse as a flag). TestConfig_Validate_RejectsJournaldArgvSafetyBytes covers NUL/newline/CR/dash-prefix across units, identifiers, matches (both keys and values), plus a positive case asserting legitimate config still passes. Floor #18 — predeclared lint enabled in .golangci.yml. Caught two immediate shadows in dcgm test code: - cardinality_fuzz_test.go: `cap := ...` (Go built-in) → renamed to `capacity` - docs_parity_test.go: `close := ...` (Go built-in) → renamed to `closeIdx` Both are test-only and have no API impact; cross-receiver fix limited to two identifier renames. Going forward, any new code that shadows a builtin fails the lint at PR time. Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Reviewer 1 (R3) flagged calibration drift in M9-AGRADE-GAP.md (self ≈ 4.35 vs reviewer-anchored ≈ 4.0) plus tractable items that close all remaining lens-below-4.0 gaps. This commit closes all of them. R2.S4 — pr-validation.yml's no-AI-co-author job now ALSO scans the PR body for AI-attribution patterns (🤖, Generated with Claude/Copilot, Generated by GPT/Gemini/Claude). The commit- trailer scan can't see body content; squash-merge folds the body into the merge commit, so AGENTS.md's no-AI-mention rule needs enforcement at both layers. FOLLOWUPS #8 closed. R3.N1 — config.go:17 stale RFC-0005 reference. The Commit A grep walked .md files only; this round greps .go too and caught three more stale references (config.go:17 + config.go:56 + kernelevents.go:258). All updated to RFC-0007; the deliberate breadcrumb in doc.go:23 remains as renumber history. P2 — TestSourceTemplate_ParsesAsGo. Parser-based gate on source_template.go.example confirms the file stays valid Go syntax and retains its scaffolding identifiers (yourSource, YourConfig, sourceYour, Start/Shutdown/run signatures). Parse- only, no type checking, because the template references internal-package types a copy-paste author renames; full type check would defeat the rename safety. The class of bug an author would actually hit (syntax error, broken func signature, deleted scaffolding identifier) is caught. Self-score recalibration in M9-AGRADE-GAP.md: now reports both self (4.35) and reviewer-anchored (4.05) composites, with explicit note that A/A- is the honest read until C8 (independent reviewer) closes. Per the honest-review-pushback rule, when the author scores above the reviewer, the reviewer's number is the one a third party should size effort against. FOLLOWUPS #11 added: AST walker `resolveIncErrorCall` handles SelectorExpr + Ident but not CallExpr cast form (`selftelemetry.Kind("foo")`). Zero live call sites today (grep-verified); future-tracked. R3 nit on TestRUNBOOK_KindsMatchEmitted_PinsExpectedSet docstring: updated to direct future authors to update `want` DELIBERATELY (i.e. only after writing a RUNBOOK row), not reflexively to silence the gate. Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <trilamsr@gmail.com>
9 tasks
trilamsr
added a commit
that referenced
this pull request
May 22, 2026
…ha) (#158) ## Summary Implements M15 (RFC-0010): Kubernetes container stdout tail receiver. Tails `/var/log/pods/**/*.log` per node, parses CRI text format, attributes records to training rank via Pod informer + body-match fallback, joins with M19 pod-evicted via EvictionMatchWindow, emits to consumer.Logs. Alpha - opt-in via `receivers.containerstdout.enabled=true` (default off in Helm values). ## Phases - **Phase 1:** doc.go, kind.go (7 Kinds), record.go (typed Record + SchemaURLv0 + wire attrs), pattern_consumer_test - **Phase 2:** Config + Validate (RE2 + ceiling + rank_source whitelist) - **Phase 3:** Factory + components.yaml registration - **Phase 4:** CRI parser + Stitcher + path resolver - **Phase 5:** AttributionCache LRU + InformerSource - **Phase 5.3:** process_rank_regex body-match fallback - **Phase 6:** DataloaderExtractor (data_time_s / iter_time_s named captures) - **Phase 7:** Egress rate-limit (token bucket, LRU, namespace budgets) - **Phase 8:** Cursor persistence (atomic write + fsync) - **Phase 10:** Stdlib file Tailer (rotation + truncation detection) - **Phase 11:** Pod informer (node + namespace scoped via filteredIndexer) - **Phase 12:** Receiver lifecycle wiring (lc.Add + recomputeDegraded OR aggregator) - **Phase 13:** 15 failure-mode test identifier pins (AGENTS.md lesson #8) - **Phase 13.5:** 4 hot-path benchmarks (alloc/lookup/regex/rate-limit) - **Phase 13.6:** 10 falsifying failure-mode tests backfilled - **Phase 14:** End-to-end pipeline integration (emit, JSON splat, tailer pool, lines/s flush) - **Phase 15:** Helm chart (RBAC + DaemonSet + values) + RUNBOOK + FAILURE-MODES + prometheus alerts + conftest policy ## Deferred to FOLLOWUP - **Phase 17:** Vendored `pkg/stanza/fileconsumer` swap - currently uses stdlib Tailer (RFC-0010 §FOLLOWUPS) - **Bench targets:** HotPath 5 vs 4 allocs/op; RegexExtraction 698 vs 500 ns/op - flagged for Phase 17 optimization ## Test plan - [ ] CI: `make check` passes (fmt + tidy + lint + race tests across full repo) - [ ] Unit: `go test -race -count=2 ./components/receivers/containerstdout/...` - all green - [ ] Integration: 8 `TestPipeline_E2E_*` end-to-end tests pass - [ ] Failure-mode pins: 10 PASS, 5 SKIP-deferred (Phase 14 integration not yet exercised end-to-end via the SKIP'd ones) - [ ] Helm: `helm template --set receivers.containerstdout.enabled=true` renders DaemonSet + RBAC + NODE_NAME env - [ ] Helm: default render does NOT include containerstdout resources (opt-in verified) - [ ] Conftest: 3 operational-invariant policies (RBAC, volumes, NODE_NAME env) - [ ] Prometheus: 8 alerts cover every Kind + composite degraded - [ ] RUNBOOK: 7 Kind sections + recovery/rollback procedures ## Rollback `helm upgrade --set receivers.containerstdout.enabled=false` removes DaemonSet, RBAC, and config - verified via template diff. ```release-notes Add Kubernetes container-stdout tail receiver (M15, RFC-0010): tails /var/log/pods, parses CRI text format, attributes lines to training rank via Pod informer + body-match fallback, joins with M19 pod-evicted. Alpha — opt-in via receivers.containerstdout.enabled. ``` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced May 22, 2026
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Eight 1-page pattern-design specs covering #2 IB link flap, #7 dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout, #10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss spike NaN, #13 silent data corruption. Each carries the standard detector-design shape (symptom, layers, signal sources, evaluation rule, verdict attrs, edge cases, status, open questions) so the next contributor can write a TDD red test directly off the spec. Status: all 8 marked planned. #10 already has issue #303; the spec frames the design alongside. NORTHSTARS Appendix A gains a Spec column; docs/README + patterns README link the new specs. Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Reviewer cleanup pass on yellow-tier findings: * Schema: tighten gen_ai.training.job_id to minLength:1 and document the fallback-grouping semantic explicitly — the field is OMITTED (not empty-string) on the namespace-only fallback path. Downstream consumers must treat ABSENCE as the explicit fallback signal, not silent exclusion. processor already uses putStrIfSet to suppress empty-string variants. * Spec eval rule: clarify Pattern #8 (NCCL hang) vs Pattern #9 (NCCL bootstrap) trigger disjoint-ness — hang fires on PRESENCE of non-completed FR records mid-run; bootstrap fires on ABSENCE of any FR record past deadline. Both can fire on the same cohort during a heterogeneous bootstrap by design. * Spec impl-notes: add note #5 documenting the min(ReadyAt) / max(ReadyAt) split with the late-joiner-masks-stuck-rank scenario as the rationale. * Empty-Node skip comment in detector: previous comment claimed the skip biases toward false-negatives; in fact it STRENGTHENS the absence signal (rank stays "no FR seen" → counted as failed), biasing toward false-positives. Correct the directionality and call out the fallback-to-(namespace, rank) escape hatch. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a matrix-strategy `pattern-detectors` job covering every detector named in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`, `pod_evicted`, `nccl_hang`, `xid_correlation`. Each row runs the detector's positive + negative + schema `*_test.go` prefix under `-race`: - **Positive / Edge asserts** — verdict emitted with correct `pattern.id` within bounded time. - **Negative / SchemaRejectsDrift asserts** — zero verdicts on baseline / confounder input (the false-positive gate the chaos matrix exists to enforce). The `pod_evicted` row additionally runs the on-disk replay corpus under `module/pkg/replay/pod_evicted/`; sibling rows skip that step until their `replay/<pattern>/` fixtures land — tracked as Path A in [`docs/v1-rc1-test-audit.md`](../blob/main/docs/v1-rc1-test-audit.md) §4. ## Why this matrix shape, not new `failure-inject` modes Audit §4 explicitly recommends Path A (matrix over existing hermetic detector tests) for rc1 and defers Path B (per-pattern `failure-inject` CLI modes) to v1.x. Path A reuses the race-detector + golden-SHA shape already validated by `pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of five new CLI subcommands. ## Wall-time budget Observed per-row runtime under `-race` (measured locally on this worktree): | Row | runtime | |---|---| | hbm_ecc | 1.35s | | thermal_throttle | 1.33s | | pcie_aer | 1.30s | | pod_evicted (incl. replay corpus) | 1.35s + 1.34s | | nccl_hang | 1.31s | | xid_correlation | 1.32s | Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism keeps total wall-time well inside the chaos-gate budget. ## Test plan - [x] `actionlint .github/workflows/chaos.yml` — clean. - [x] `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/chaos.yml'))"` — parses, 3 jobs (`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row matrix. - [x] For every matrix row, `cd module && go test -race -count=1 -run '<regex>' ./pkg/patterns/...` — all green (counts: HBMECC=16, ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12, NCCLHang=10, XidCorrelation=14). - [x] `cd module && go test -race -count=1 ./pkg/replay/...` (pod_evicted corpus step) — green. - [x] Pre-commit hook: golangci-lint + go vet + go mod verify + attribute-namespace-check — all green. - [ ] Workflow runs green on `main` after merge (verifiable only post-merge — nightly cron + push-to-main). ```release-notes ci: chaos.yml now matrix-tests all six shipped detectors (#14 pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5 pcie_aer, xid_correlation) instead of just pod_evicted. Closes the chaos-coverage gap called out in v1-rc1-test-audit §4. No operator-visible behavior change. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #314, #315. - Adds `step-security/harden-runner@v2.19.4` (SHA-pinned) in `audit` mode to the `package`, `ko-publish`, and `chart-publish` jobs. Every release-tag run now uploads a network-trace artifact verifiers can cross-check against the SLSA provenance bundle (issue #315 acceptance criterion). - Documents the SLSA L3 posture explicitly: workflow header, `docs/threat-model.md`, and `docs/v1-rc1-operational-gaps.md` all reflect the generic-generator-vs-Go-builder decision with the upstream blocker named. - Wires an SLSA L3 verification step into the `docs/RELEASE-CHECKLIST.md` tag-cut flow (`gh attestation verify` + `slsa-verifier` + harden-runner audit artifact inspection). ## Why not migrate to `builder_go_slsa3.yml` (#314 acceptance gap) Per the user prompt's deliverable 1, the explicit fallback `generator_generic_slsa3.yml` is permitted "if our build isn't a straight `go build`". Tracecore's build is OCB-driven: `make build` invokes the OpenTelemetry Collector Builder to regenerate `./_build/{main.go,components.go,go.mod,go.sum}` per-arch. There is no top-level Go module that produces the binary, so the Go builder reusable workflow — which reads `.slsa-goreleaser.yml` and runs `go build` inside its controlled env — has nothing to compile. Root cause: upstream `slsa-framework/slsa-github-generator` has no pre-build hook for ecosystem builders to run a generator-style step (`ocb generate`) before `go build`. Tracked at [slsa-framework/slsa-github-generator#3033](slsa-framework/slsa-github-generator#3033) and now called out in: - `.github/workflows/release.yml` header comment (lines 51-68) - `docs/v1-rc1-operational-gaps.md` §1 step 1 (marked deferred-upstream-blocked, not "to-do") - `docs/threat-model.md` row 9 (build-provenance asset) What we ship instead, all in this PR: | Track | Status | |---|---| | `slsa-github-generator@v2.1.0` generic-generator path | already in place; SLSA3+ provenance attestation (Sigstore-signed by Fulcio under workflow OIDC identity, recorded in Rekor) | | `actions/attest-build-provenance@v4.1.0` redundant L3-grade trail | already in place | | `step-security/harden-runner@v2.19.4` egress audit on every release-tag run | **new in this PR** (covers #315 fully) | The *attestation infrastructure* is L3-grade; the *build platform* is what's L2 in spec-strict terms because the package job runs outside the generator's controlled env. The harden-runner audit artifact gives verifiers a network-trace they can cross-check against the predicate's declared inputs — the next-best evidence short of a hermetic in-env build. ## Files changed - `.github/workflows/release.yml` — three `step-security/harden-runner@9af89fc71515a100421586dfdb3dc9c984fbf411 # v2.19.4` steps (one per job: `package`, `ko-publish`, `chart-publish`); header-comment block expanded with the SLSA L3 rationale. - `docs/RELEASE-CHECKLIST.md` — new tag-cut step #8 with three verification commands (`gh attestation verify`, `slsa-verifier verify-artifact`, harden-runner audit-artifact inspection). - `docs/threat-model.md` — supply-chain actor row + build-provenance asset row now name the actual control stack. - `docs/v1-rc1-operational-gaps.md` — §1 remediation steps 1/2/3 updated with done/deferred status linked to specific issue numbers. ## Validation ``` $ python3 -c "import yaml; yaml.safe_load(open('.github/workflows/release.yml'))" # clean $ actionlint .github/workflows/release.yml # clean $ zizmor .github/workflows/release.yml 9 findings (3 ignored, 2 suppressed, 4 unsafe fixes): 0 informational, 4 low, 0 medium, 0 high ``` Zizmor: same 9 findings as pre-PR baseline (verified via `git stash` + re-run). The 4 low-confidence `artipacked` findings are on pre-existing `actions/checkout` steps not touched by this PR. Pre-commit hooks (`make check`: fmt, tidy, golangci-lint, vet, mod-verify, attribute-namespace-check) green. ## Per repo memory `feedback_goreleaser_pro_features` All workflow changes use OSS-syntax actions only: - `step-security/harden-runner` is OSS Apache-2.0 (Tier-1 + Tier-2 dashboards are free for public repos) - `slsa-github-generator` is OSS Apache-2.0 - No Pro-only `prebuilt:` / `include:` / `nightly:` / `changelog:` config introduced ## Test plan - [ ] Cut a `v0.x.y-rc` tag against this branch in a sandbox fork and confirm: - [ ] `step-security/harden-runner` artifact appears in the workflow run summary for each of `package`, `ko-publish`, `chart-publish` jobs - [ ] Audit artifact lists egress to the expected endpoints only (`proxy.golang.org`, `sum.golang.org`, `ghcr.io`, `gcr.io`, `fulcio.sigstore.dev`, `rekor.sigstore.dev`) - [ ] `gh attestation verify --predicate-type https://slsa.dev/provenance/v1 tracecore_<ver>_linux_amd64.tar.gz` succeeds against the published archive - [ ] `slsa-verifier verify-artifact --provenance-path tracecore_<ver>_linux_amd64.intoto.jsonl --source-uri github.com/TraceCoreAI/tracecore --source-tag v<ver>` succeeds ```release-notes ci(release): SLSA L3 provenance verification + step-security/harden-runner egress audit added to release.yml. Every release-tag run now uploads a network-trace audit artifact alongside the SLSA provenance bundle. docs/RELEASE-CHECKLIST.md gains an explicit L3 verification step. ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Ships pattern-#9 (NCCL bootstrap timeout) detector end-to-end — the first job-start-time pattern in the library (sibling to pattern #8 which fires mid-run). A training-job cohort whose pods are Ready past `BootstrapDeadline` (default 5min) but where at least one rank never emitted any NCCL FlightRecorder record is stuck in NCCL bootstrap; a same-namespace K8s CNI / network-readiness event in the correlation window promotes the verdict to `confidence=full` and stamps `discriminator=cni_error`. Spec: [`docs/patterns/09-nccl-bootstrap-timeout.md`](docs/patterns/09-nccl-bootstrap-timeout.md). Status flipped from `planned` → `shipped`; `Implementation notes` section captures how each spec open-question resolved with the most-conservative reading. ## What landed - `module/pkg/patterns/nccl_bootstrap.go` — detector + `TrainingPodRecord` / `CNINetworkEventRecord` / `NCCLBootstrapTimeoutVerdict` types. Reuses `NCCLFRRecord` from `nccl_hang.go`. - `module/pkg/patterns/nccl_bootstrap_test.go` — 11 detector tests + schema-conformance + 10-falsifier drift battery. Covers: full-correlation fires, partial-when-no-CNI, normal-startup-no-fire, deadline-not-yet-reached, heterogeneous failure, multi-job cohorts don't merge, namespace-only fallback, cross-namespace CNI doesn't join, deadline-configurable, deterministic ordering, max(ReadyAt) drives age. - `module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json` — JSON Schema with `additionalProperties:false` and full enum guards. - `module/processor/patterndetectorprocessor/nccl_bootstrap.go` — projections (`projectTrainingPodRecord` gates on `k8s.pod.ready_time` + `gen_ai.training.rank`; `projectCNINetworkEventRecord` gates on `k8s.event.reason` ∈ `{FailedCreatePodSandBox, NetworkNotReady, CNIError}`), verdict writer with promoted scalars (issue #270 contract), and runner that consumes NCCL FR records from the existing cross-cutting `collectInputs` (no double-projection). - `module/processor/patterndetectorprocessor/nccl_bootstrap_test.go` — 6 wiring tests (full verdict, partial verdict, partial-suppressed-by-flag, normal-startup-no-fire, sub-1s deadline rejection, sub-1s window rejection). - `Config.NCCLBootstrapDeadline` + `Config.NCCLBootstrapCorrelationWindow` with Validate guards (≥1s) and `withDefaults` / `defaultConfig` wiring; `example_config.yaml` updated. - `docs/ATTRIBUTES.md` — 3 new `tracecore.alert.nccl_bootstrap_timeout.*` rows, new `k8s.pod.ready_time` row, updated `gen_ai.training.job_id` row (now consumed with fallback), new per-pattern matrix row for `nccl_bootstrap`. ## Design calls (load-bearing) - **Cohort key.** `(gen_ai.training.job_id, k8s.namespace.name)` when stamped; `(k8s.namespace.name)`-only fallback when job_id is absent (spec open question #1). Empty `gen_ai.training.job_id` on the verdict signals the fallback path to operators. - **Bootstrap-failed-rank index key.** `(node, rank)` not `(namespace, rank)` — avoids cross-cohort contamination when two jobs in the same namespace land on different nodes. FR records with empty Node are skipped from the index (a wiring gap should NOT cause false-negatives — i.e. mask real bootstrap failures — even at the cost of cross-job false-positives that are unlikely in practice). - **CNI vocab.** v0 ships the K8s-control-plane vocabulary only (`FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError`). Per-CNI raw-error parsing (Cilium / Calico / multus distinct strings) is the discriminator-branch follow-up that lights up `socket_ifname_mismatch` / `rendezvous_unreachable`. - **Cohort size.** Count of distinct ranks the detector observed pod-Ready signals for. Pods that never reached Ready (image-pull stuck) don't enter the cohort — they belong to pattern #15. Per the spec's edge case "slow image pull" no false-positive. - **`max(ReadyAt)` drives deadline.** A late-joining rank pushes the effective ready timestamp forward, preventing false-positives during rolling pod-Ready scenarios on cold-cache clusters. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/...` — clean - [x] `cd module && go vet ./...` — clean - [x] Pre-commit hook: `golangci-lint run ./...` — 0 issues; `attribute-namespace-check` — 72/72 documented - [x] TDD discipline: `test(nccl-boot): RED` → `feat(nccl-boot): GREEN` commits - [ ] CI green on PR (full matrix) ```release-notes feat(patterns): pattern-9 (NCCL bootstrap timeout) detector — fires when a training-job cohort has at least one rank with no NCCL FR record past `BootstrapDeadline` from pod-ready (default 5min); a same-namespace `FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError` event promotes to `confidence=full` with `discriminator=cni_error`. New YAML knobs: `nccl_bootstrap_deadline` (default 5m), `nccl_bootstrap_correlation_window` (default 10m). Verdict shape pinned by `nccl_bootstrap_verdict.schema.json`. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Bundle three post-wave-audit docs-cleanup issues (`docs/v1-rc1-post-wave-audit.md` findings #5, #8, #15). One commit per issue for clean revert / history. Fourth audit cleanup (#380) intentionally deferred — precondition unmet. ## Per-issue scope + delta ### `chore(module): trim stale PR-I genesis-tag rationale` (closes #381) Audit finding #8. `module/doc.go` referenced PR-I.1a/1b/2 as "Contents land in…" and explained the genesis-tag proxy rationale. All three PRs landed long ago; the module is alive. Trimmed to package-doc only. - **Before:** 14 lines (8-line historical-artifact block). - **After:** 6 lines (`package module` declaration only, retained for proxy compat). - `go list -m github.com/tracecoreai/tracecore/module` still resolves. ### `docs(rc1): sweep strikethrough resolved-in-issue blocks` (closes #383) Audit finding #15. The two specific examples called out by the issue: - `docs/patterns/10-cuda-oom-deceptive.md:74` — `~~DCGM_FI_DEV_FB_* OTTL recipe extension.~~` referenced closed #337 (verified via `gh issue view 337` → CLOSED). - `docs/v1-rc1-simplification-audit.md:185-186` — `~~components/exporters/otlphttp/~~` + `~~components/exporters/stdoutexporter/~~` referenced closed #333 / #334 (both verified CLOSED). Resolution text retained as cross-link history; only the strike-through visual noise removed. **Explicitly NOT touched:** `docs/followups/M*.md`, `docs/MILESTONES.md`, `docs/rfcs/0003-*.md` strikethroughs — these are documented convention markers per `MILESTONES.md:65,69` ("A PR that completes a follow-up strikes it through (`~~…~~`)"). Sweeping them would delete a load-bearing convention. ### `chore(config): document pattern-prefixed knob naming convention` (closes #379) Audit finding #5. `config.go` (540 lines) carries three knob styles. The post-wave prefixed shape is right; renaming bare→prefixed pre-RC1 would break every existing `values.yaml`. Take option 1 from the issue: document the convention at top-of-file so future detectors follow it. - **Before:** No top-of-file knob-naming guidance. - **After:** 12-line `// Knob-naming convention` block before `package patterndetectorprocessor`, naming bare-name knobs explicitly and pointing forward to v2.0 nesting. ## Why #380 is NOT in this bundle #380 sweeps 12 "future PR-B" comments referencing the RFC-0014 `WithMetrics` bridge. Its own acceptance criteria say "Closes with PR-B's merge commit." Precondition check today: - `git log --all --grep=WithMetrics` → only wave-3 port commits, no bridge. - `docs/rfcs/0014-metrics-to-logs-pattern-input.md:3` → status `accepted, blocks issue #260 PR-B`. - `docs/integrations/prometheus-scrape.md:381` → "PR-B has NOT yet shipped." All 12 comments are factually accurate today. Sweeping prematurely would either delete load-bearing forward-looking documentation, or replace it with stale post-PR-B prose. Issue stays open as tracker per its design; commented on issue noting the deferral. ## Verification - `go build ./...` (module) — clean. - `go vet ./...` — clean (via pre-commit). - `go list -m github.com/tracecoreai/tracecore/module` — resolves. - `golangci-lint run ./...` — 0 issues (via pre-commit). - `attribute-namespace-check` — 100/100 (via pre-commit). - `gh issue view 337 333 334` — all CLOSED (justifies #383 sweep). ## Test plan - [x] Module builds + vets. - [x] Referenced closed-issue states verified. - [x] Strikethrough convention markers in `docs/followups/*` / `MILESTONES.md` / RFC-0003 left intact (deliberate scope-limit). - [x] `#380` deferred with on-issue rationale; not in `Closes` trailer. - [ ] CI green. ```release-notes NONE ``` Closes #383 Closes #381 Closes #379 --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but the join shape varies across patterns and the rationale lived only in code comments + PR review threads. Operators tuning windows had to read source per detector. Audit finding: five distinct shapes are load-bearing (chosen by the causal physics of each signal), not bugs: - One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect. - Asymmetric two-sided (#11): pre-stall covers concurrent-start checkpoints; post-stall covers OTTL-bridge logger latency. - Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window could be cause OR consequence. - Job-window bounded (#13): SDC counter rise must fall in the bounded eval-cycle's owning job; no operator knob is meaningful. - Trailing-window rate / freshness (#2 #4 #8): rolling window anchored at `now` or the most-recent record. Decision: document the existing reality, do not converge. Forcing every detector to the asymmetric two-knob form would silently zero one leg for the one-sided detectors (footgun on clock skew) and would not apply to #13 at all. Adds: - 'Why this correlation shape' section in docs/patterns/07, 11, 13 (the three shapes the issue called out by name). - 'Correlation-window semantics' table in docs/patterns/README.md covering ALL 11 detectors with the predicate, anchor, and shape rationale, plus cross-links to the per-pattern sections. No code changes; no detector behavior changes. Closes #367. Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
The Wave 1 spec set
required_approving_review_count: 1andrequire_codeowners_review: true. With a sole maintainer who authors every PR, that combination is unsatisfiable — GitHub doesn't allow authors to approve their own PRs, so every PR gets blocked waiting on an approval that can never come.The live protection on
mainwas relaxed viagh api PATCHimmediately before this commit:This PR brings
.github/branch-protection.ymlin sync so the next run ofscripts/apply-branch-protection.shdoesn't bounce the live rule back to "1 review required."All other gates remain in force: linear history, signed commits, required status checks (verify, builds, Analyze go, govulncheck, release-notes block, no AI co-author trailer, dco sign-off), no force-push, no deletions, conversation resolution,
enforce_adminstrue.Inline comment names the bump-back condition explicitly: restore to
1/truewhenNORTHSTARS.md§O7's M18 target (≥1 non-employee maintainer active) is met.PRINCIPLES.md §4 backs the relaxation: "don't police what you don't have."
Linked issue(s)
N/A — governance-spec follow-up to the live-protection PATCH.
Release notes
Checklist
make cipasses locally — N/A (no Go change)git commit -s)STYLE.md— N/A (no new components)