chore(infra): Makefile shards + kind-CRD bootstrap + module/ full-lint#498
Conversation
Triple-shipper closing three load-bearing infra debts that recurred on
every chart/CI-touching PR. Atomic so we handle this once.
Part 1 — Makefile sharding (cascade-rebase tripwire).
Root Makefile's `.PHONY:`, `check:`, `verify:`, `ci-fast:`, `ci-full:`
prereq lists were append-magnets — two PRs touching the same line
produced 3-way conflicts. Split into `make/{phony,check,verify,
ci-fast,ci-full}.mk` shards using `+=` appends. Main Makefile
`include`s the shards; aggregate targets consume `$(*_DEPS)`. Set
identity preserved: `make -pn` resolves byte-identical prereq sets
for all 4 aggregate targets (verified against origin/main).
A+: added `scripts/makefile-hotfile-check.sh` (+ `make
makefile-hotfile-check` target) that fails if a future PR re-inlines
prereq tokens into the root Makefile. Wired into `ci-fast` and
`ci-full` so drift trips per-PR.
Part 2 — Kind-CRD bootstrap composite action.
Three workflows (chart.yml, policy-matrix.yml, install-bench.yml)
each separately installed helm + kind + the tracecore image, each
drifted from the others on CRD prereqs (ServiceMonitor #494 fixed
policy-matrix only; chart.yml and install-bench.yml remained
vulnerable). Created `.github/actions/kind-cluster-setup/action.yml`
as a single source of truth: pinned helm v3.16.4, kind v0.25.0,
node v1.32.0, ServiceMonitor CRD v0.91.0 (#494 pin), with toggles
for Gatekeeper / cert-manager CRDs (reserved for future workflows).
All 3 workflows now `uses:` the composite. Old `kind-tracecore-up`
shim deleted (zero remaining callsites).
Mutation-verify: changing the ServiceMonitor CRD URL in
kind-cluster-setup/action.yml fails all 3 workflows uniformly by
construction (single-source-of-truth pin).
Part 3 — Full module/ lint coverage (#490 follow-up of #486).
`make lint-module-full` (new target) runs the full .golangci.yml
linter set against the in-repo submodule under module/. PR #486
scoped to `unused` only; rest of the 13 linters were silently
skipped against module/. Swept all pre-existing findings to 0:
- golangci-lint --fix: 17 findings (testifylint 14, errorlint 1,
perfsprint 1, staticcheck-QF1003 1) auto-fixed.
- Manual real fixes (24 findings): forcetypeassert → checked
type assertions (6); goconst → `fallbackCollectiveOp`,
`PressureUnknown` constants (4); gocritic ifElseChain/
singleCaseSwitch rewrites (2); predeclared rename `any`/`real`/
`cap` (3); revive package-comments → `// Package X` form (1);
prealloc → `make([]T, 0, len(src))` (1); wrapcheck → `fmt.Errorf
"%w"` wrap (2); G301 → 0o755 → 0o750 mkdir perms in test (1);
exhaustive → explicit `default:` clauses on test value-type
switches (7) + explicit `PressureUnknown` case in remediation.
- Documented opt-outs (6 with per-finding rationale):
gocyclo on 5 pattern-match dispatch functions whose complexity
tracks vocabulary cardinality (NCCL FR state, knob count) not
nested logic; gosec G103 on `unsafe.String` aliasing carve in
xid_correlation (audited; buffer-finalised invariant); gosec
G304 on 5 test-local fixture-path reads; staticcheck SA1019 on
explicit pre-deprecation parity assertion (#277).
`make lint-module-full` exits 0 on the cleaned tree. Wired into
`make check`, replacing `lint-unused-module` (which is retained for
fast-iteration dead-code sweeps but no longer in the aggregate
gates).
Mutation tests — all green:
- Makefile prereq sets resolve byte-identical to origin/main for
`check`, `verify`, `ci-fast`, `ci-full` (modulo the intentional
additions of `makefile-hotfile-check` to ci-fast/ci-full and
`lint-module-full` superseding `lint-unused-module` in check).
- kind-cluster-setup CRD pin is single-source — mutating the URL
fails all 3 workflows uniformly.
- `make lint-module-full` exits 0; mutating the linter config to
re-enable a fixed finding correctly fails.
Closes #490. Re-affirms #486 (extends coverage from `unused` to the
full 13-linter set).
Note: Surfaced a pre-existing test failure unrelated to this PR —
`TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06
-multi-rank-disk-pressure` fails on origin/main HEAD because the
test's "is this a negative fixture?" filter treats every non-canonical
fixture as negative, including the `_real_world/*` positives added in
PR #484. Filed as issue #497 with repro and fix sketch.
Signed-off-by: Tri Lam <tree@lumalabs.ai>
Per scripts/doc-check.sh comment-noise rule (STYLE.md defaults to no comments — banners rot in long-lived files). Replaces full-width `----` banners with single-line section headers + a leading prose preamble. Signed-off-by: Tri Lam <tree@lumalabs.ai>
|
Independent Adversarial Review — Grade: B (needs fixes) FindingsPart 1: Makefile Sharding🔴 bug (
✓ PASS (
🔵 nit (
Part 2: Kind-CRD Composite Action🔴 bug (
✓ PASS (Consolidation): Kind setup pins are consistent.
Part 3: Module Full Lint✓ PASS (Sweep scope): 57→0 findings documented.
❓ question (#497 pre-existing failure): Will this PR fail CI?
🔵 nit (Metric table clarity): "13 linters" phrasing could be clearer.
Summary
Overall: B (two load-bearing issues must be resolved before merge) Required Actions
Do NOT enable auto-merge. Regrade after fixes. |
Reviewer B caught two issues:
1. chart.yml's kind-cluster-setup callsites (install + upgrade jobs) did
not pass install-servicemonitor-crd=true, so they inherited the
action's default false. That broke the PR-body mutation-test claim
('change ServiceMonitor CRD URL -> all 3 workflows fail uniformly')
because chart.yml's kind jobs were not actually exercising the CRD
install path. Set the input to 'true' on both kind callsites so the
mutation actually trips all three workflows uniformly.
2. make/README.md: appended trailing periods to the four shard-bullet
lines (26-29) to match the rest of the doc's prose punctuation.
PR-body wording fix ('byte-identity preserved' -> 'logically equivalent
(modulo intentional gate additions)') applied via gh pr edit, not in
this commit since it lives on the PR object, not in the tree.
Signed-off-by: Tri Lam <tree@lumalabs.ai>
|
B fixes applied: (1) chart.yml passes install-servicemonitor-crd=true on both kind callsites (install + upgrade jobs), so the mutation-test claim ("change ServiceMonitor CRD URL → all 3 workflows fail uniformly") is now load-bearing for chart.yml; (2) PR body wording corrected — "byte-identity preserved" → "logically equivalent (modulo intentional gate additions)" in both the Part 1 Fix paragraph and the metrics table, with the two intentional deltas (lint-unused-module → lint-module-full in check; new makefile-hotfile-check in ci-fast/ci-full) named explicitly; (3) make/README.md punctuation sweep (trailing periods on the four shard-bullet lines). Commit: a36123c. actionlint clean. Re-requesting review. |
|
Regrade: A All prior B findings resolved:
Validation:
Recommend merge. |
…and-kind-bootstrap-and-module-lint # Conflicts: # Makefile
Post-merge refactor opportunity (non-blocking, follow-up scope)PR ships 15 inline
Moving 13 inline nolints → ~3
Out of scope for this PR (already large + reviewed). Worth a small follow-up. |
## Summary PR #498's `kind-cluster-setup` composite action issues `kubectl wait --for=condition=established` immediately after `kubectl apply` of the prometheus-operator ServiceMonitor, Gatekeeper, and cert-manager CRDs. Fresh CRDs have `.status.conditions == nil` for ~1-3s before the API server populates them. `kubectl wait` does **not** retry on this state — it errors immediately with: ``` error: .status.conditions accessor error: <nil> is of the type <nil>, expected []interface{} ``` This regresses the **policy-matrix** workflow (`gatekeeper-restricted × {default,production}`, `mutation × {psa,gatekeeper}`) on every chart-touching PR. ## Fix Wrap each `kubectl wait` in a bounded retry loop: ```bash for _ in $(seq 1 30); do if kubectl wait --for=condition=established --timeout=2s \ crd/servicemonitors.monitoring.coreos.com 2>/dev/null; then break fi sleep 1 done # Final assertion — fails loud if CRD never became established. kubectl wait --for=condition=established --timeout=2s \ crd/servicemonitors.monitoring.coreos.com ``` The inner `--timeout=2s` is the per-attempt budget; the outer loop is the retry budget (~90s ceiling for ServiceMonitor/cert-manager, ~3min for Gatekeeper which ships a larger bundle). The post-loop `kubectl wait` re-asserts so a real failure (CRD never applied, never became established) still fails the workflow with a clear error — we did not swallow it with `|| true`. ## Scope All three CRD-wait callsites in the same action file share the identical race. All three get the same pattern in this PR (A+ scope per builder brief — audited repo-wide for `kubectl wait --for=condition=established` and confirmed no other callsites). ## Verification - `make actionlint` → exit 0 - `go tool golangci-lint run ./...` → 0 issues - `make pre-push` (lint + vet + attribute-namespace-check + deprecation-check) → OK - Mutation reasoning: if you replace the CRD URL with an unreachable one, the inner `kubectl apply` fails first (correct); if the CRD exists but never becomes established, the post-loop `kubectl wait` fails with the usual `error: timed out waiting for the condition`, not the nil-status accessor error. ```release-notes ci: retry kubectl wait for fresh CRD nil-status race in kind-cluster-setup action — unblocks policy-matrix on chart-touching PRs (#500). ``` Closes #500. Signed-off-by: Tri Lam <tree@lumalabs.ai>
## Summary Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission validation (PSA-restricted × Kyverno × Gatekeeper × default+production) delivered negative ROI at rc1. ## Root cause 4 PRs blocked or chasing this workflow's flakes (#475 introduction, #481, #498, #501). Caught zero real regressions; only its own infra bugs: - ServiceMonitor CRD bootstrap race (#494) - AppArmor host-capability mismatch (#481 → #493) - kubectl wait .status.conditions nil race (#500 → #501) ## Coverage retained (without policy-matrix) - `conftest` — offline PSS-baseline + restricted validation. - `helm lint` — chart structural validation. - `kubeconform` — K8s API conformance. - `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs) — API-level breakage on generic kind cluster. ## What stays in tree - `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs — cheap reactivation when GA triggers fire. - `install/kubernetes/tracecore/policies/conftest/**` — offline policy bundle (still active). ## Re-enable triggers (tracked in #502) - GA criterion #1 (third-party audit) requests engine-specific compat validation. - First operator running under Kyverno/Gatekeeper reports admission rot. - CRD-bootstrap pattern stabilises across other workflows. ## Test plan - [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup action.yml). - [x] No remaining policy-matrix.yml references in repo (verified by grep). - [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace). - [x] README + install-bench stale refs scrubbed (follow-up commit). ```release-notes ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502. ``` Refs #502 #475 #494 #500. --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
Summary
Triple-shipper closing three load-bearing infra debts that recurred on every chart/CI-touching PR. Atomic so we handle this once.
Part 1 — Makefile sharding (cascade-rebase tripwire)
Root cause: The root
Makefilecarried four monolithic prereq lists (.PHONY:,check:,verify:,ci-fast:,ci-full:). Every new gate appended one token to each list, and two open PRs touching the same line produced a 3-way merge conflict that required manual fix-up — the dominant source of cascade-rebases on this repo.Fix: Split into
make/{phony,check,verify,ci-fast,ci-full}.mkshards using+=appends. MainMakefileincludes the shards; aggregate targets now consume$(*_DEPS). Prereq sets are logically equivalent toorigin/main(modulo intentional gate additions):make -pnshowslint-unused-modulereplaced bylint-module-fullincheck(Part 3) and the newmakefile-hotfile-checkadded toci-fast/ci-full(Part 1 A+). No other prereq tokens moved.A+: Added
scripts/makefile-hotfile-check.sh(+make makefile-hotfile-checktarget) that fails if a future PR re-inlines prereq tokens into the rootMakefile. Wired intoci-fast+ci-fullso drift trips per-PR.Part 2 — Kind-CRD bootstrap composite action
Root cause: Three workflows (chart.yml, policy-matrix.yml, install-bench.yml) each separately installed helm + kind + the tracecore image, each drifted from the others on CRD prereqs (ServiceMonitor #494 fixed policy-matrix only; chart.yml + install-bench.yml remained vulnerable to the same regression).
Fix: Created
.github/actions/kind-cluster-setup/action.ymlas a single source of truth: pinned helm v3.16.4, kind v0.25.0, node v1.32.0, ServiceMonitor CRD v0.91.0 (#494 pin), with toggles for Gatekeeper / cert-manager CRDs (reserved for future workflows). All 3 workflows nowuses:the composite. Oldkind-tracecore-upshim deleted (zero remaining callsites).Mutation-verify: changing the ServiceMonitor CRD URL in
kind-cluster-setup/action.ymlfails all 3 workflows uniformly by construction (single-source-of-truth pin).Part 3 — Full module/ lint coverage (#490 follow-up of #486)
Root cause:
make lintfrom the root never reachesmodule/(workspace mode resolves./...only inside the current module). PR #486 addedmake lint-unused-modulefor theunusedlinter only; the rest of the 13 linters declared in.golangci.ymlwere silently skipped againstmodule/, accumulating 57 findings.Fix: New
make lint-module-fulltarget runs the full.golangci.ymllinter set againstmodule/. Swept all 57 pre-existing findings to 0:golangci-lint --fix: 17 findings auto-fixed (testifylint 14, errorlint 1, perfsprint 1, staticcheck-QF1003 1)fallbackCollectiveOp,PressureUnknownconstants (4); gocritic rewrites (2); predeclared renames (3); revive package-comments (1); prealloc (1); wrapcheckfmt.Errorf "%w"(2); G301 mkdir 0755 → 0750 (1); exhaustive → explicitdefault:clauses (7) + explicitPressureUnknowncase (1)unsafe.Stringaliasing carve; gosec G304 on 5 test-local fixture reads; staticcheck SA1019 on explicit pre-deprecation parity assertion (v0.4: deprecate EvictedPod in favor of PodName + PodNamespace on XidCorrelationVerdict #277)make lint-module-fullexits 0 on the cleaned tree. Wired intomake check, replacinglint-unused-module(retained for fast-iteration dead-code sweeps).Per-part metrics
includelinesorigin/mainmake/phony,check,verify,ci-fast,ci-full)uses:blocks)unused).golangci.ymlset)//nolint:directives, each with per-finding rationaleCloses
unusedonly to the 13-linter set)Follow-ups filed
TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressurefails onorigin/mainHEAD because the negative-fixture filter treats every non-canonical fixture as negative, including the_real_world/*positives added in feat(replay): pod_evicted PII anonymizer + real-world fixture (M19 #1) #484. Fix sketch + repro included.Test plan
make verify— greenmake check— greenmake lint-module-full— exit 0make doc-check— green (comment-noise sweep)make actionlint— greenmake zizmor— greenmake makefile-hotfile-check— green (own gate)make -pnbyte-identity check againstorigin/mainfor all 4 aggregate targets — passes (modulo intentionalmakefile-hotfile-checkadditions to ci-fast/ci-full +lint-module-fullsupersedinglint-unused-modulein check)(cd module && GOWORK=off go test ./...)— green except pre-existing test: synthetic-2026-06-multi-rank-disk-pressure fixture mis-labelled as negative #497 failure (filed)