CNF-23436: Add health probes and richer status conditions#417
CNF-23436: Add health probes and richer status conditions#417sebrandon1 wants to merge 1 commit into
Conversation
|
@sebrandon1: This pull request references CNF-23436 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds HTTPS liveness ( ChangesHealth Probes Configuration
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested labels: Suggested reviewers:
🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sebrandon1 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
e2f1df3 to
cc2910e
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
config/manager/manager.yaml (1)
114-122: ⚡ Quick winTune readiness probe for faster drain on shutdown.
To better align with graceful termination, Line 120 and Line 122 are a bit slow (
10s * 3worst-case before NotReady). Consider faster readiness failure so endpoints stop routing sooner.Suggested tweak
readinessProbe: httpGet: path: /readyz port: https scheme: HTTPS initialDelaySeconds: 5 - periodSeconds: 10 + periodSeconds: 5 timeoutSeconds: 5 - failureThreshold: 3 + failureThreshold: 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@config/manager/manager.yaml` around lines 114 - 122, The readinessProbe for the manager (httpGet path "/readyz", scheme HTTPS) is too slow to mark Pod NotReady during shutdown; adjust readinessProbe settings to fail faster by lowering periodSeconds (e.g., from 10 to 2–3), reducing failureThreshold (e.g., from 3 to 1–2) and/or decreasing timeoutSeconds to ensure the probe transitions to NotReady quickly so endpoints are drained sooner; update the readinessProbe block (httpGet path /readyz, initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@config/manager/manager.yaml`:
- Around line 114-122: The readinessProbe for the manager (httpGet path
"/readyz", scheme HTTPS) is too slow to mark Pod NotReady during shutdown;
adjust readinessProbe settings to fail faster by lowering periodSeconds (e.g.,
from 10 to 2–3), reducing failureThreshold (e.g., from 3 to 1–2) and/or
decreasing timeoutSeconds to ensure the probe transitions to NotReady quickly so
endpoints are drained sooner; update the readinessProbe block (httpGet path
/readyz, initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold)
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f1db3899-53e4-40c3-a38c-c2fc93c4f11f
📒 Files selected for processing (2)
bundle/manifests/cert-manager-operator.clusterserviceversion.yamlconfig/manager/manager.yaml
cc2910e to
d9d40bd
Compare
d9d40bd to
7285ee6
Compare
7285ee6 to
53508d0
Compare
53508d0 to
19e3212
Compare
Add liveness and readiness probes to the operator deployment for improved health monitoring. Add a Progressing status condition and specific reason codes (Reconciling, WaitingForDependencies, ValidationFailed, MultipleInstancesFound) for IstioCSR and TrustManager CRs so users can diagnose issues from CR status without reading operator logs.
19e3212 to
4e2bb0d
Compare
|
/retest |
|
@sebrandon1: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Health Probes
The operator deployment currently has no health probes, so Kubernetes cannot detect if the operator process is stuck or not yet ready to serve. All cert-manager operands (controller, webhook, cainjector, trust-manager, istio-csr) already have probes configured — the operator itself is the only component missing them.
The library-go
controllercmdframework already serves/healthzand/readyzover HTTPS on port 8443 via its GenericAPIServer, so no Go code changes are needed./healthz(ping, log, post-start hooks)/readyz(same checks +shutdown, so the pod drains traffic during graceful termination)Richer Status Conditions
IstioCSR and TrustManager CRs currently report only two status conditions (
ReadyandDegraded) with generic reason constants (Failed,Ready,Progressing). Users can't tell why something is progressing or degraded without reading operator logs.This adds:
Progressingcondition type alongsideReadyandDegradedReconciling,WaitingForDependencies,ValidationFailed,MultipleInstancesFoundConditionReasonfield onReconcileErrorwithWithConditionReason()chainable setter so controllers can annotate errors with specific reasonsHandleReconcileResultto manage all three conditions and extract specific reasons from errorsTest plan
Health Probes
Status Conditions — Reconciling
IstioCSR CR during active reconciliation (retrying due to missing namespace):
[ { "lastTransitionTime": "2026-06-08T21:44:45Z", "message": "", "reason": "Ready", "status": "False", "type": "Degraded" }, { "lastTransitionTime": "2026-06-08T21:44:45Z", "message": "reconciliation failed, retrying: failed to create istio-system/cert-manager-istio-csr role resource: ...", "reason": "Progressing", "status": "False", "type": "Ready" }, { "lastTransitionTime": "2026-06-08T21:44:45Z", "message": "reconciliation in progress: failed to create istio-system/cert-manager-istio-csr role resource: ...", "reason": "Reconciling", "status": "True", "type": "Progressing" } ]Status Conditions — MultipleInstancesFound
Second IstioCSR CR rejected as a duplicate:
[ { "lastTransitionTime": "2026-06-08T21:45:06Z", "message": "", "reason": "MultipleInstancesFound", "status": "False", "type": "Degraded" }, { "lastTransitionTime": "2026-06-08T21:45:06Z", "message": "multiple instances of istiocsr exists, cert-manager-operator/default will not be processed", "reason": "MultipleInstancesFound", "status": "False", "type": "Ready" }, { "lastTransitionTime": "2026-06-08T21:45:06Z", "message": "multiple instances of istiocsr exists, cert-manager-operator/default will not be processed", "reason": "MultipleInstancesFound", "status": "False", "type": "Progressing" } ]