fix(ci): retry kubectl wait for fresh CRD nil-status race (#500) #501
Merged
PR #498's kind-cluster-setup action issues `kubectl wait --for=condition=established` immediately after `kubectl apply` of the prometheus-operator ServiceMonitor, Gatekeeper, and cert-manager CRDs. Fresh CRDs have `.status.conditions == nil` for ~1-3s before the API server populates them. `kubectl wait` errors immediately (does not retry) with:

    error: .status.conditions accessor error: <nil> is of the type <nil>, expected []interface{}

This regresses the policy-matrix workflow (gatekeeper-restricted x {default,production}, mutation x {psa,gatekeeper}) on every chart-touching PR.

Wrap each `kubectl wait` in a bounded retry loop (`--timeout=2s` per attempt, sleep 1s, 30 or 60 attempts), then re-run the wait outside the loop so a real failure (CRD never applied, never became established) still fails loud.

All three CRD-wait callsites in the action share the same race; all three get the same pattern (A+ scope per builder brief).

Verified: `make actionlint` exit 0.

Closes #500.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Adversarial Review: VERDICT A+

B/A/A+ Grading

Baseline (B):
Additional (A):
Exceptional (A+):
Findings

One informational nit (not blocking):

    # Retry-loop guards fresh-CRD nil-status race (#500): .status.conditions
    # is nil for ~1-3s after kubectl apply. Final assertion fails loud if
    # the CRD never became established.

Current version is clear and acceptable; this is style feedback only.

Verdict

Ship-ready. All three callsites correctly wrapped. Repo-wide audit clean. No failure-swallowing or deadlock vectors. Ready to merge.
trilamsr added a commit that referenced this pull request on Jun 2, 2026
## Summary

Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission validation (PSA-restricted × Kyverno × Gatekeeper × default+production) delivered negative ROI at rc1.

## Root cause

4 PRs blocked or chasing this workflow's flakes (#475 introduction, #481, #498, #501). Caught zero real regressions; only its own infra bugs:

- ServiceMonitor CRD bootstrap race (#494)
- AppArmor host-capability mismatch (#481 → #493)
- kubectl wait .status.conditions nil race (#500 → #501)

## Coverage retained (without policy-matrix)

- `conftest`: offline PSS-baseline + restricted validation.
- `helm lint`: chart structural validation.
- `kubeconform`: K8s API conformance.
- `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs): API-level breakage on generic kind cluster.

## What stays in tree

- `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs: cheap reactivation when GA triggers fire.
- `install/kubernetes/tracecore/policies/conftest/**`: offline policy bundle (still active).

## Re-enable triggers (tracked in #502)

- GA criterion #1 (third-party audit) requests engine-specific compat validation.
- First operator running under Kyverno/Gatekeeper reports admission rot.
- CRD-bootstrap pattern stabilises across other workflows.

## Test plan

- [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup action.yml).
- [x] No remaining policy-matrix.yml references in repo (verified by grep).
- [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace).
- [x] README + install-bench stale refs scrubbed (follow-up commit).

```release-notes
ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502.
```

Refs #502 #475 #494 #500.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
## Summary

PR #498's `kind-cluster-setup` composite action issues `kubectl wait --for=condition=established` immediately after `kubectl apply` of the prometheus-operator ServiceMonitor, Gatekeeper, and cert-manager CRDs.

Fresh CRDs have `.status.conditions == nil` for ~1-3s before the API server populates them. `kubectl wait` does not retry on this state; it errors immediately with:

    error: .status.conditions accessor error: <nil> is of the type <nil>, expected []interface{}

This regresses the policy-matrix workflow (`gatekeeper-restricted × {default,production}`, `mutation × {psa,gatekeeper}`) on every chart-touching PR.

## Fix

Wrap each `kubectl wait` in a bounded retry loop.

The inner `--timeout=2s` is the per-attempt budget; the outer loop is the retry budget (~90s ceiling for ServiceMonitor/cert-manager, ~3min for Gatekeeper, which ships a larger bundle). The post-loop `kubectl wait` re-asserts so a real failure (CRD never applied, never became established) still fails the workflow with a clear error; we did not swallow it with `|| true`.

## Scope

All three CRD-wait callsites in the same action file share the identical race. All three get the same pattern in this PR (A+ scope per builder brief: audited repo-wide for `kubectl wait --for=condition=established` and confirmed no other callsites).

## Verification

- `make actionlint` → exit 0
- `go tool golangci-lint run ./...` → 0 issues
- `make pre-push` (lint + vet + attribute-namespace-check + deprecation-check) → OK
- If the CRD is never applied, `kubectl apply` fails first (correct); if the CRD exists but never becomes established, the post-loop `kubectl wait` fails with the usual `error: timed out waiting for the condition`, not the nil-status accessor error.

Closes #500.
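For illustration, the bounded retry loop described under Fix might look like the sketch below. This is not the action's actual code: the helper name `wait_for_crd`, its argument order, and the `--timeout=2s` on the final assertion are assumptions made for the example. Only the per-attempt `--timeout=2s`, the 1s sleep, and the 30/60 attempt counts come from the description above.

```shell
#!/bin/sh
# Sketch only: wait_for_crd is a hypothetical helper, not the code in the action.
#
# Usage: wait_for_crd CRD_NAME MAX_ATTEMPTS
# Retries `kubectl wait` while a fresh CRD may still have nil
# .status.conditions, then re-asserts once so a real failure is not swallowed.
wait_for_crd() {
  crd="$1"
  attempts="$2"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Per-attempt budget of 2s. Output is silenced inside the loop only;
    # the nil-status accessor error is expected here on a fresh CRD.
    if kubectl wait --for=condition=established --timeout=2s "crd/${crd}" >/dev/null 2>&1; then
      break
    fi
    sleep 1
    i=$((i + 1))
  done
  # Final assertion outside the loop (timeout value is an assumption).
  # There is no `|| true`, so a CRD that never became established
  # makes the function, and therefore the step, fail.
  kubectl wait --for=condition=established --timeout=2s "crd/${crd}"
}
```

Under these assumptions, `wait_for_crd servicemonitors.monitoring.coreos.com 30` spends at most about 30 × (2s + 1s) = 90s in the loop, and 60 attempts gives about 3 minutes, which matches the ceilings quoted under Fix.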