CNF-23436: Add health probes and richer status conditions by sebrandon1 · Pull Request #417 · openshift/cert-manager-operator

sebrandon1 · 2026-05-01T16:41:20Z

Summary

Health Probes

The operator deployment currently has no health probes, so Kubernetes cannot detect whether the operator process is stuck or not yet ready to serve. All cert-manager operands (controller, webhook, cainjector, trust-manager, istio-csr) already have probes configured; the operator itself is the only component missing them.

The library-go controllercmd framework already serves /healthz and /readyz over HTTPS on port 8443 via its GenericAPIServer, so no Go code changes are needed.

Liveness → /healthz (ping, log, post-start hooks)
Readiness → /readyz (same checks + shutdown, so the pod drains traffic during graceful termination)

Richer Status Conditions

IstioCSR and TrustManager CRs currently report only two status conditions (Ready and Degraded) with generic reason constants (Failed, Ready, Progressing). Users can't tell why something is progressing or degraded without reading operator logs.

This adds:

A dedicated Progressing condition type alongside Ready and Degraded
Specific reason constants: Reconciling, WaitingForDependencies, ValidationFailed, MultipleInstancesFound
A ConditionReason field on ReconcileError, with a chainable WithConditionReason() setter, so controllers can annotate errors with specific reasons
Updated HandleReconcileResult to manage all three conditions and extract specific reasons from errors

Test plan

Health Probes

$ curl -sk "https://localhost:8443/healthz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
healthz check passed

$ curl -sk "https://localhost:8443/readyz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]shutdown ok
readyz check passed
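
The two outputs above differ only in the shutdown check. As a rough sketch, the verbose output can be parsed and compared programmatically. The parseChecks and onlyIn helpers are hypothetical names, and the code assumes failing checks are prefixed with [-], as in Kubernetes API server style health output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseChecks maps each check name to whether it passed, reading lines of the
// form "[+]name ok" (passed) or "[-]name failed..." (assumed failure format).
func parseChecks(verbose string) map[string]bool {
	checks := make(map[string]bool)
	for _, line := range strings.Split(verbose, "\n") {
		line = strings.TrimSpace(line)
		passed := strings.HasPrefix(line, "[+]")
		if !passed && !strings.HasPrefix(line, "[-]") {
			continue // summary line such as "healthz check passed"
		}
		fields := strings.Fields(line[3:])
		if len(fields) > 0 {
			checks[fields[0]] = passed
		}
	}
	return checks
}

// onlyIn returns the sorted check names present in a but not in b.
func onlyIn(a, b map[string]bool) []string {
	var names []string
	for name := range a {
		if _, ok := b[name]; !ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	healthz := "[+]ping ok\n[+]log ok\n[+]poststarthook/max-in-flight-filter ok\n" +
		"[+]poststarthook/storage-object-count-tracker-hook ok\nhealthz check passed"
	readyz := "[+]ping ok\n[+]log ok\n[+]poststarthook/max-in-flight-filter ok\n" +
		"[+]poststarthook/storage-object-count-tracker-hook ok\n[+]shutdown ok\nreadyz check passed"

	fmt.Println(onlyIn(parseChecks(readyz), parseChecks(healthz))) // prints [shutdown]
}
```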

Status Conditions — Reconciling

IstioCSR CR during active reconciliation (retrying due to missing namespace):

[
    {
        "lastTransitionTime": "2026-06-08T21:44:45Z",
        "message": "",
        "reason": "Ready",
        "status": "False",
        "type": "Degraded"
    },
    {
        "lastTransitionTime": "2026-06-08T21:44:45Z",
        "message": "reconciliation failed, retrying: failed to create istio-system/cert-manager-istio-csr role resource: ...",
        "reason": "Progressing",
        "status": "False",
        "type": "Ready"
    },
    {
        "lastTransitionTime": "2026-06-08T21:44:45Z",
        "message": "reconciliation in progress: failed to create istio-system/cert-manager-istio-csr role resource: ...",
        "reason": "Reconciling",
        "status": "True",
        "type": "Progressing"
    }
]

Status Conditions — MultipleInstancesFound

Second IstioCSR CR rejected as a duplicate:

[
    {
        "lastTransitionTime": "2026-06-08T21:45:06Z",
        "message": "",
        "reason": "MultipleInstancesFound",
        "status": "False",
        "type": "Degraded"
    },
    {
        "lastTransitionTime": "2026-06-08T21:45:06Z",
        "message": "multiple instances of istiocsr exists, cert-manager-operator/default will not be processed",
        "reason": "MultipleInstancesFound",
        "status": "False",
        "type": "Ready"
    },
    {
        "lastTransitionTime": "2026-06-08T21:45:06Z",
        "message": "multiple instances of istiocsr exists, cert-manager-operator/default will not be processed",
        "reason": "MultipleInstancesFound",
        "status": "False",
        "type": "Progressing"
    }
]
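
As an illustration of how a client might consume these conditions, the sketch below decodes a condition list and derives a one-line diagnosis. The Condition struct, findCondition, and summarize are hypothetical and mirror only the fields shown in the outputs above; the messages are omitted from the sample input for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition mirrors only the fields used here from the status output above.
type Condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

// findCondition returns the condition with the given type, or nil if absent.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// summarize reduces the Ready and Progressing conditions to a short diagnosis.
func summarize(conds []Condition) string {
	ready := findCondition(conds, "Ready")
	if ready != nil && ready.Status == "True" {
		return "ready"
	}
	if p := findCondition(conds, "Progressing"); p != nil && p.Status == "True" {
		return "progressing: " + p.Reason
	}
	if ready != nil {
		return "not ready: " + ready.Reason
	}
	return "unknown"
}

func main() {
	reconciling := `[
		{"type": "Degraded", "status": "False", "reason": "Ready"},
		{"type": "Ready", "status": "False", "reason": "Progressing"},
		{"type": "Progressing", "status": "True", "reason": "Reconciling"}
	]`
	duplicate := `[
		{"type": "Degraded", "status": "False", "reason": "MultipleInstancesFound"},
		{"type": "Ready", "status": "False", "reason": "MultipleInstancesFound"},
		{"type": "Progressing", "status": "False", "reason": "MultipleInstancesFound"}
	]`

	for _, raw := range []string{reconciling, duplicate} {
		var conds []Condition
		if err := json.Unmarshal([]byte(raw), &conds); err != nil {
			panic(err)
		}
		fmt.Println(summarize(conds))
	}
	// prints:
	// progressing: Reconciling
	// not ready: MultipleInstancesFound
}
```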

All unit tests pass (123/123 Ginkgo specs + all Go packages)
No lint issues from changed files
Verified on OCP 4.22 cluster — all three conditions visible with correct reasons
E2E tests pass

openshift-ci-robot · 2026-05-01T16:41:24Z

@sebrandon1: This pull request references CNF-23436 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:


Test plan

Tested locally against an OCP 4.22 cluster:

$ curl -sk "https://localhost:8443/healthz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
healthz check passed

$ curl -sk "https://localhost:8443/readyz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]shutdown ok
readyz check passed

Operator deploys and reports ready

/healthz and /readyz return 200 when operator is healthy

Pod is restarted by kubelet when liveness probe fails

Pod is removed from service endpoints during graceful shutdown via readyz shutdown check


coderabbitai · 2026-05-01T16:41:32Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds HTTPS liveness (/healthz) and readiness (/readyz) probes to the cert-manager-operator container in two Kubernetes manifests, targeting the existing named https port and specifying probe timing parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold).

Changes

Health Probes Configuration

Layer / File(s)	Summary
Container probe spec (dev/run manager) `config/manager/manager.yaml`	Adds `livenessProbe` (HTTPS GET `/healthz`) and `readinessProbe` (HTTPS GET `/readyz`) to the `cert-manager-operator` container. Probes use port name `https` and include `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, and `failureThreshold`.
Container probe spec (CSV bundle) `bundle/manifests/cert-manager-operator.clusterserviceversion.yaml`	Same `livenessProbe` and `readinessProbe` blocks added to the CSV container spec, using HTTPS on the named `https` port (8443) with matching timing parameters.
CSV container field ordering `bundle/manifests/cert-manager-operator.clusterserviceversion.yaml`	Reorders container fields so `name` and `ports` are positioned adjacent to the newly inserted probe blocks.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels: lgtm, qe-approved, approved

Suggested reviewers:

swghosh
TrilokGeer
chiragkyal

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Single Node Openshift (Sno) Test Compatibility	⚠️ Warning	PR adds trustmanager_test.go with PodAntiAffinity test using hostname topology key, requiring multiple nodes. No SNO protection found.	Add [Skipped:SingleReplicaTopology] label to "should apply custom affinity to deployment" test or guard with SNO topology check that skips on SingleReplicaTopologyMode.

✅ Passed checks (14 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All Ginkgo test names use static, descriptive strings with no dynamic elements (pod names, timestamps, UUIDs, IPs, concatenation). Tests remain deterministic across runs.
Test Structure And Quality	✅ Passed	PR modifies only Kubernetes manifest files (YAML), not Ginkgo test code. Custom check for test structure and quality is not applicable to manifest-only changes.
Microshift Test Compatibility	✅ Passed	PR adds health probes to Kubernetes manifests only; no new Ginkgo e2e tests are added. Check for test MicroShift compatibility is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR adds only liveness and readiness probes (HTTPS health checks) to the operator deployment. These are observational and do not introduce scheduling constraints. No topology-aware issues found.
Ote Binary Stdout Contract	✅ Passed	PR modifies only YAML manifest files (no Go code), adding Kubernetes health probes. OTE Binary Stdout Contract check applies only to source code process-level stdout writes, not configuration files.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests added in this PR—only YAML manifest changes adding health probes to operator deployment. Custom check applies only to new tests.
No-Weak-Crypto	✅ Passed	PR only adds Kubernetes health probes (HTTPS GET requests) to operator deployment YAML manifests; no cryptographic algorithms, custom crypto, or secret comparison code is present.
Container-Privileges	✅ Passed	No privilege-escalating settings found. Both manifests set privileged:false, allowPrivilegeEscalation:false, runAsNonRoot:true.
No-Sensitive-Data-In-Logs	✅ Passed	PR adds only Kubernetes health check probe configurations with HTTPS GET requests to /healthz and /readyz endpoints. No new logging code added; no sensitive data patterns found in changes.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title mentions health probes (liveness/readiness) which aligns with the main changes, but omits key context about operator deployment and probe targets.

openshift-ci · 2026-05-01T16:41:41Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sebrandon1
Once this PR has been reviewed and has the lgtm label, please assign swghosh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

🧹 Nitpick comments (1)

config/manager/manager.yaml (1)

114-122: ⚡ Quick win

Tune readiness probe for faster drain on shutdown.

To better align with graceful termination, periodSeconds (Line 120) and failureThreshold (Line 122) are a bit slow: 10s × 3 = 30s worst case before the pod is marked NotReady. Consider a faster readiness failure so endpoints stop routing sooner.

Suggested tweak

           readinessProbe:
             httpGet:
               path: /readyz
               port: https
               scheme: HTTPS
             initialDelaySeconds: 5
-            periodSeconds: 10
+            periodSeconds: 5
             timeoutSeconds: 5
-            failureThreshold: 3
+            failureThreshold: 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/manager/manager.yaml` around lines 114 - 122, The readinessProbe for
the manager (httpGet path "/readyz", scheme HTTPS) is too slow to mark Pod
NotReady during shutdown; adjust readinessProbe settings to fail faster by
lowering periodSeconds (e.g., from 10 to 2–3), reducing failureThreshold (e.g.,
from 3 to 1–2) and/or decreasing timeoutSeconds to ensure the probe transitions
to NotReady quickly so endpoints are drained sooner; update the readinessProbe
block (httpGet path /readyz, initialDelaySeconds, periodSeconds, timeoutSeconds,
failureThreshold) accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f1db3899-53e4-40c3-a38c-c2fc93c4f11f

📥 Commits

Reviewing files that changed from the base of the PR and between e2f1df3 and cc2910e.

📒 Files selected for processing (2)

bundle/manifests/cert-manager-operator.clusterserviceversion.yaml
config/manager/manager.yaml

Add liveness and readiness probes to the operator deployment for improved health monitoring. Add a Progressing status condition and specific reason codes (Reconciling, WaitingForDependencies, ValidationFailed, MultipleInstancesFound) for IstioCSR and TrustManager CRs so users can diagnose issues from CR status without reading operator logs.

sebrandon1 · 2026-06-09T17:47:47Z

/retest

openshift-ci · 2026-06-09T20:03:42Z

@sebrandon1: all tests passed!


openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) May 1, 2026

openshift-ci Bot requested review from TrilokGeer and swghosh May 1, 2026 16:41

sebrandon1 force-pushed the add-operator-health-probes branch from e2f1df3 to cc2910e May 5, 2026 22:37

coderabbitai Bot reviewed May 5, 2026

sebrandon1 force-pushed the add-operator-health-probes branch from cc2910e to d9d40bd May 14, 2026 19:07

sebrandon1 force-pushed the add-operator-health-probes branch from d9d40bd to 7285ee6 May 29, 2026 15:51

sebrandon1 force-pushed the add-operator-health-probes branch from 7285ee6 to 53508d0 June 8, 2026 21:39

sebrandon1 changed the title from "CNF-23436: Add liveness and readiness probes to operator deployment" to "CNF-23436: Add health probes and richer status conditions" Jun 8, 2026

sebrandon1 force-pushed the add-operator-health-probes branch from 53508d0 to 19e3212 June 9, 2026 16:17

sebrandon1 force-pushed the add-operator-health-probes branch from 19e3212 to 4e2bb0d June 9, 2026 17:47

sebrandon1 wants to merge 1 commit into openshift:master from sebrandon1:add-operator-health-probes