Skip to content

fix(ci): retry kubectl wait for fresh CRD nil-status race (#500)#501

Merged
trilamsr merged 1 commit into
mainfrom
fix/500-kind-setup-crd-wait-race
Jun 2, 2026
Merged

fix(ci): retry kubectl wait for fresh CRD nil-status race (#500)#501
trilamsr merged 1 commit into
mainfrom
fix/500-kind-setup-crd-wait-race

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

PR #498's kind-cluster-setup composite action issues kubectl wait --for=condition=established immediately after kubectl apply of the prometheus-operator ServiceMonitor, Gatekeeper, and cert-manager CRDs.

Fresh CRDs have .status.conditions == nil for ~1-3s before the API server populates them. kubectl wait does not retry on this state — it errors immediately with:

error: .status.conditions accessor error: <nil> is of the type <nil>, expected []interface{}

This regresses the policy-matrix workflow (gatekeeper-restricted × {default,production}, mutation × {psa,gatekeeper}) on every chart-touching PR.

Fix

Wrap each kubectl wait in a bounded retry loop:

for _ in $(seq 1 30); do
  if kubectl wait --for=condition=established --timeout=2s \
       crd/servicemonitors.monitoring.coreos.com 2>/dev/null; then
    break
  fi
  sleep 1
done
# Final assertion — fails loud if CRD never became established.
kubectl wait --for=condition=established --timeout=2s \
  crd/servicemonitors.monitoring.coreos.com

The inner --timeout=2s is the per-attempt budget; the outer loop is the retry budget (~90s ceiling for ServiceMonitor/cert-manager, ~3min for Gatekeeper which ships a larger bundle). The post-loop kubectl wait re-asserts so a real failure (CRD never applied, never became established) still fails the workflow with a clear error — we did not swallow it with || true.

Scope

All three CRD-wait callsites in the same action file share the identical race. All three get the same pattern in this PR (A+ scope per builder brief — audited repo-wide for kubectl wait --for=condition=established and confirmed no other callsites).

Verification

  • make actionlint → exit 0
  • go tool golangci-lint run ./... → 0 issues
  • make pre-push (lint + vet + attribute-namespace-check + deprecation-check) → OK
  • Mutation reasoning: if you replace the CRD URL with an unreachable one, the inner kubectl apply fails first (correct); if the CRD exists but never becomes established, the post-loop kubectl wait fails with the usual error: timed out waiting for the condition, not the nil-status accessor error.
ci: retry kubectl wait for fresh CRD nil-status race in kind-cluster-setup action — unblocks policy-matrix on chart-touching PRs (#500).

Closes #500.

PR #498's kind-cluster-setup action issues `kubectl wait
--for=condition=established` immediately after `kubectl apply` of the
prometheus-operator ServiceMonitor, Gatekeeper, and cert-manager CRDs.

Fresh CRDs have `.status.conditions == nil` for ~1-3s before the API
server populates them. `kubectl wait` errors immediately (does not
retry) with:

  error: .status.conditions accessor error: <nil> is of the type
  <nil>, expected []interface{}

This regresses the policy-matrix workflow (gatekeeper-restricted x
{default,production}, mutation x {psa,gatekeeper}) on every
chart-touching PR.

Wrap each `kubectl wait` in a bounded retry loop (`--timeout=2s` per
attempt, sleep 1s, 30 or 60 attempts), then re-run the wait outside
the loop so a real failure (CRD never applied, never became
established) still fails loud.

All three CRD-wait callsites in the action share the same race; all
three get the same pattern (A+ scope per builder brief).

Verified: `make actionlint` exit 0.

Closes #500.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Adversarial Review: VERDICT A+

B/A/A+ Grading

Baseline (B):

Additional (A):

  • ✅ Proportional retry counts (Gatekeeper 60 justified by larger bundle)
  • ✅ Inner --timeout=2s + outer loop prevents hanging
  • 2>/dev/null suppresses transient accessor error spam
  • break on success avoids unnecessary retries
  • ✅ DRY: secondary comments reference primary explanation
  • ✅ Diff size (+38/-5) matches scope

Exceptional (A+):

  • ✅ Mutation testing accounted for (URL break fails at apply; real timeout caught by post-loop)
  • ✅ No || true swallowing failures
  • ✅ No deadlock or fd-exhaustion vectors
  • ✅ Actionlint + linting verified in body

Findings

One informational nit (not blocking):

.github/actions/kind-cluster-setup/action.yml:118–126 — ServiceMonitor comment (8 lines) could trim math detail ("30 attempts × (2s wait + 1s sleep) → ~90s ceiling") + example error message since those details live in #500 issue + PR body. Possible reduction to 3 lines:

# Retry-loop guards fresh-CRD nil-status race (#500): .status.conditions
# is nil for ~1-3s after kubectl apply. Final assertion fails loud if
# the CRD never became established.

Current version is clear and acceptable; this is style feedback only.

Verdict

Ship-ready. All three callsites correctly wrapped. Repo-wide audit clean. No failure-swallowing or deadlock vectors. Ready to merge.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 08:25
@trilamsr trilamsr merged commit 693244a into main Jun 2, 2026
12 checks passed
@trilamsr trilamsr deleted the fix/500-kind-setup-crd-wait-race branch June 2, 2026 08:32
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission
validation (PSA-restricted × Kyverno × Gatekeeper × default+production)
delivered negative ROI at rc1.

## Root cause

4 PRs blocked or chasing this workflow's flakes (#475 introduction,
#481, #498, #501). Caught zero real regressions; only its own infra
bugs:
- ServiceMonitor CRD bootstrap race (#494)
- AppArmor host-capability mismatch (#481#493)
- kubectl wait .status.conditions nil race (#500#501)

## Coverage retained (without policy-matrix)

- `conftest` — offline PSS-baseline + restricted validation.
- `helm lint` — chart structural validation.
- `kubeconform` — K8s API conformance.
- `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs) —
API-level breakage on generic kind cluster.

## What stays in tree

- `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs —
cheap reactivation when GA triggers fire.
- `install/kubernetes/tracecore/policies/conftest/**` — offline policy
bundle (still active).

## Re-enable triggers (tracked in #502)

- GA criterion #1 (third-party audit) requests engine-specific compat
validation.
- First operator running under Kyverno/Gatekeeper reports admission rot.
- CRD-bootstrap pattern stabilises across other workflows.

## Test plan

- [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup
action.yml).
- [x] No remaining policy-matrix.yml references in repo (verified by
grep).
- [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace).
- [x] README + install-bench stale refs scrubbed (follow-up commit).

```release-notes
ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502.
```

Refs #502 #475 #494 #500.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci(kind-cluster-setup): kubectl wait race on fresh CRD .status.conditions

1 participant