CNF-23436: Add health probes and richer status conditions #417

Open
sebrandon1 wants to merge 1 commit into openshift:master from sebrandon1:add-operator-health-probes

Conversation

@sebrandon1 sebrandon1 commented May 1, 2026

Member

Summary

Health Probes

The operator deployment currently has no health probes, so Kubernetes cannot detect if the operator process is stuck or not yet ready to serve. All cert-manager operands (controller, webhook, cainjector, trust-manager, istio-csr) already have probes configured — the operator itself is the only component missing them.

The library-go controllercmd framework already serves /healthz and /readyz over HTTPS on port 8443 via its GenericAPIServer, so no Go code changes are needed.

  • Liveness: /healthz (ping, log, post-start hooks)
  • Readiness: /readyz (same checks + shutdown, so the pod drains traffic during graceful termination)

Richer Status Conditions

IstioCSR and TrustManager CRs currently report only two status conditions (Ready and Degraded) with generic reason constants (Failed, Ready, Progressing). Users can't tell why something is progressing or degraded without reading operator logs.

This adds:

  • A dedicated Progressing condition type alongside Ready and Degraded
  • Specific reason constants: Reconciling, WaitingForDependencies, ValidationFailed, MultipleInstancesFound
  • A ConditionReason field on ReconcileError, with a chainable WithConditionReason() setter, so controllers can annotate errors with specific reasons
  • Updated HandleReconcileResult to manage all three conditions and extract specific reasons from errors
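
To make the error-annotation idea in the list above concrete, here is a minimal Go sketch. It is not the code from this PR: the names ReconcileError, ConditionReason, WithConditionReason, and the four reason constants come from the description, while the struct layout, the constant identifiers, and the reasonFor helper are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Reason values listed in the PR description. The Go identifiers are assumed.
const (
	ReasonReconciling            = "Reconciling"
	ReasonWaitingForDependencies = "WaitingForDependencies"
	ReasonValidationFailed       = "ValidationFailed"
	ReasonMultipleInstancesFound = "MultipleInstancesFound"
)

// ReconcileError is a hypothetical stand-in for the operator's error type.
type ReconcileError struct {
	Err             error
	ConditionReason string
}

func (e *ReconcileError) Error() string { return e.Err.Error() }
func (e *ReconcileError) Unwrap() error { return e.Err }

// WithConditionReason records a specific reason and returns the same error,
// so the call can be chained at the point where the error is created.
func (e *ReconcileError) WithConditionReason(reason string) *ReconcileError {
	e.ConditionReason = reason
	return e
}

// reasonFor (hypothetical helper) extracts the annotated reason from err,
// falling back to def when err carries none.
func reasonFor(err error, def string) string {
	var re *ReconcileError
	if errors.As(err, &re) && re.ConditionReason != "" {
		return re.ConditionReason
	}
	return def
}

func main() {
	base := &ReconcileError{Err: errors.New("namespace istio-system not found")}
	err := fmt.Errorf("reconcile istiocsr: %w",
		base.WithConditionReason(ReasonWaitingForDependencies))

	fmt.Println(reasonFor(err, ReasonReconciling))
	fmt.Println(reasonFor(errors.New("plain error"), ReasonReconciling))
}
```

Because the lookup uses errors.As, the reason survives wrapping with %w, which is one plausible way a result handler could recover it from a deep call chain.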

Test plan

Health Probes

$ curl -sk "https://localhost:8443/healthz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
healthz check passed

$ curl -sk "https://localhost:8443/readyz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]shutdown ok
readyz check passed
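
The same check as the curl commands above can be scripted. The following Go sketch is an illustration only: it assumes the endpoints are reachable at https://localhost:8443 (for example through a port-forward), skips certificate verification just as `curl -k` does, and assumes failing checks are printed with a `[-]` prefix in the verbose output. The function names failedChecks and probe are hypothetical.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// failedChecks returns the names of checks that the verbose output marks
// with "[-]" (assumed failure marker); "[+]" lines are treated as passing.
func failedChecks(body string) []string {
	var failed []string
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "[-]") {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, "[-]"))
		if len(fields) > 0 {
			failed = append(failed, fields[0])
		}
	}
	return failed
}

// probe requests url with ?verbose and reports the HTTP status code and
// any failing checks found in the response body.
func probe(client *http.Client, url string) (int, []string, error) {
	resp, err := client.Get(url + "?verbose")
	if err != nil {
		return 0, nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, nil, err
	}
	return resp.StatusCode, failedChecks(string(body)), nil
}

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Equivalent to curl -k: acceptable for a local smoke test only.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	for _, path := range []string{"/healthz", "/readyz"} {
		code, failed, err := probe(client, "https://localhost:8443"+path)
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		fmt.Println(path, code, "failed checks:", failed)
	}
}
```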

Status Conditions — Reconciling

IstioCSR CR during active reconciliation (retrying due to missing namespace):

[
    {
        "lastTransitionTime": "2026-06-08T21:44:45Z",
        "message": "",
        "reason": "Ready",
        "status": "False",
        "type": "Degraded"
    },
    {
        "lastTransitionTime": "2026-06-08T21:44:45Z",
        "message": "reconciliation failed, retrying: failed to create istio-system/cert-manager-istio-csr role resource: ...",
        "reason": "Progressing",
        "status": "False",
        "type": "Ready"
    },
    {
        "lastTransitionTime": "2026-06-08T21:44:45Z",
        "message": "reconciliation in progress: failed to create istio-system/cert-manager-istio-csr role resource: ...",
        "reason": "Reconciling",
        "status": "True",
        "type": "Progressing"
    }
]

Status Conditions — MultipleInstancesFound

Second IstioCSR CR rejected as a duplicate:

[
    {
        "lastTransitionTime": "2026-06-08T21:45:06Z",
        "message": "",
        "reason": "MultipleInstancesFound",
        "status": "False",
        "type": "Degraded"
    },
    {
        "lastTransitionTime": "2026-06-08T21:45:06Z",
        "message": "multiple instances of istiocsr exists, cert-manager-operator/default will not be processed",
        "reason": "MultipleInstancesFound",
        "status": "False",
        "type": "Ready"
    },
    {
        "lastTransitionTime": "2026-06-08T21:45:06Z",
        "message": "multiple instances of istiocsr exists, cert-manager-operator/default will not be processed",
        "reason": "MultipleInstancesFound",
        "status": "False",
        "type": "Progressing"
    }
]
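
As a rough illustration of how a client might read the three conditions shown above, the Go sketch below decodes a condition list and reduces it to a one-line summary. The Condition struct mirrors only the JSON fields visible in the samples. The summarize function and its precedence (Degraded, then Progressing, then Ready) are assumptions, not behavior defined by this PR. The messages in the sample input are abbreviated.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition holds the fields visible in the status samples above.
type Condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// summarize (hypothetical) picks the most relevant condition and reports
// its reason. The precedence order is an assumption for illustration.
func summarize(conds []Condition) string {
	byType := map[string]Condition{}
	for _, c := range conds {
		byType[c.Type] = c
	}
	switch {
	case byType["Degraded"].Status == "True":
		return "Degraded: " + byType["Degraded"].Reason
	case byType["Progressing"].Status == "True":
		return "Progressing: " + byType["Progressing"].Reason
	case byType["Ready"].Status == "True":
		return "Ready"
	default:
		return "NotReady: " + byType["Ready"].Reason
	}
}

func main() {
	// Abbreviated version of the "Reconciling" sample.
	raw := `[
	  {"type": "Degraded", "status": "False", "reason": "Ready", "message": ""},
	  {"type": "Ready", "status": "False", "reason": "Progressing", "message": "reconciliation failed, retrying"},
	  {"type": "Progressing", "status": "True", "reason": "Reconciling", "message": "reconciliation in progress"}
	]`

	var conds []Condition
	if err := json.Unmarshal([]byte(raw), &conds); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(summarize(conds))
}
```

With the abbreviated Reconciling sample this reports the Progressing condition and its reason. With the MultipleInstancesFound sample, where all three conditions are False, it falls through to the Ready condition's reason.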
  • All unit tests pass (123/123 Ginkgo specs + all Go packages)
  • No lint issues from changed files
  • Verified on OCP 4.22 cluster — all three conditions visible with correct reasons
  • E2E tests pass

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 1, 2026
@openshift-ci-robot

openshift-ci-robot commented May 1, 2026


@sebrandon1: This pull request references CNF-23436 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

The operator deployment currently has no health probes, so Kubernetes cannot detect if the operator process is stuck or not yet ready to serve. All cert-manager operands (controller, webhook, cainjector, trust-manager, istio-csr) already have probes configured — the operator itself is the only component missing them.

The library-go controllercmd framework already serves /healthz and /readyz over HTTPS on port 8443 via its GenericAPIServer, so no Go code changes are needed.

  • Liveness/healthz (ping, log, post-start hooks)
  • Readiness/readyz (same checks + shutdown, so the pod drains traffic during graceful termination)

Test plan

Tested locally against an OCP 4.22 cluster:

$ curl -sk "https://localhost:8443/healthz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
healthz check passed

$ curl -sk "https://localhost:8443/readyz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]shutdown ok
readyz check passed
  • Operator deploys and reports ready
  • /healthz and /readyz return 200 when operator is healthy
  • Pod is restarted by kubelet when liveness probe fails
  • Pod is removed from service endpoints during graceful shutdown via readyz shutdown check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 1, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds HTTPS liveness (/healthz) and readiness (/readyz) probes to the cert-manager-operator container in two Kubernetes manifests, targeting the existing named https port and specifying probe timing parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold).

Changes

Health Probes Configuration

Layer / File(s) | Summary
Container probe spec (dev/run manager): config/manager/manager.yaml | Adds livenessProbe (HTTPS GET /healthz) and readinessProbe (HTTPS GET /readyz) to the cert-manager-operator container. Probes use port name https and include initialDelaySeconds, periodSeconds, timeoutSeconds, and failureThreshold.
Container probe spec (CSV bundle): bundle/manifests/cert-manager-operator.clusterserviceversion.yaml | Same livenessProbe and readinessProbe blocks added to the CSV container spec, using HTTPS on the named https port (8443) with matching timing parameters.
CSV container field ordering: bundle/manifests/cert-manager-operator.clusterserviceversion.yaml | Reorders container fields so name and ports are positioned adjacent to the newly inserted probe blocks.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels: lgtm, qe-approved, approved

Suggested reviewers:

  • swghosh
  • TrilokGeer
  • chiragkyal
🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | PR adds trustmanager_test.go with PodAntiAffinity test using hostname topology key, requiring multiple nodes. No SNO protection found. | Add [Skipped:SingleReplicaTopology] label to "should apply custom affinity to deployment" test or guard with SNO topology check that skips on SingleReplicaTopologyMode.

✅ Passed checks (14 passed)

Check name | Status | Explanation
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | All Ginkgo test names use static, descriptive strings with no dynamic elements (pod names, timestamps, UUIDs, IPs, concatenation). Tests remain deterministic across runs.
Test Structure And Quality | ✅ Passed | PR modifies only Kubernetes manifest files (YAML), not Ginkgo test code. Custom check for test structure and quality is not applicable to manifest-only changes.
Microshift Test Compatibility | ✅ Passed | PR adds health probes to Kubernetes manifests only; no new Ginkgo e2e tests are added. Check for test MicroShift compatibility is not applicable.
Topology-Aware Scheduling Compatibility | ✅ Passed | The PR adds only liveness and readiness probes (HTTPS health checks) to the operator deployment. These are observational and do not introduce scheduling constraints. No topology-aware issues found.
Ote Binary Stdout Contract | ✅ Passed | PR modifies only YAML manifest files (no Go code), adding Kubernetes health probes. OTE Binary Stdout Contract check applies only to source code process-level stdout writes, not configuration files.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests added in this PR; only YAML manifest changes adding health probes to operator deployment. Custom check applies only to new tests.
No-Weak-Crypto | ✅ Passed | PR only adds Kubernetes health probes (HTTPS GET requests) to operator deployment YAML manifests; no cryptographic algorithms, custom crypto, or secret comparison code is present.
Container-Privileges | ✅ Passed | No privilege-escalating settings found. Both manifests set privileged:false, allowPrivilegeEscalation:false, runAsNonRoot:true.
No-Sensitive-Data-In-Logs | ✅ Passed | PR adds only Kubernetes health check probe configurations with HTTPS GET requests to /healthz and /readyz endpoints. No new logging code added; no sensitive data patterns found in changes.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title mentions health probes (liveness/readiness) which aligns with the main changes, but omits key context about operator deployment and probe targets.

@openshift-ci openshift-ci Bot requested review from TrilokGeer and swghosh May 1, 2026 16:41
@openshift-ci

openshift-ci Bot commented May 1, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sebrandon1
Once this PR has been reviewed and has the lgtm label, please assign swghosh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sebrandon1 sebrandon1 force-pushed the add-operator-health-probes branch from e2f1df3 to cc2910e Compare May 5, 2026 22:37

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
config/manager/manager.yaml (1)

114-122: ⚡ Quick win

Tune readiness probe for faster drain on shutdown.

To better align with graceful termination, Line 120 and Line 122 are a bit slow (10s * 3 worst-case before NotReady). Consider faster readiness failure so endpoints stop routing sooner.

Suggested tweak
           readinessProbe:
             httpGet:
               path: /readyz
               port: https
               scheme: HTTPS
             initialDelaySeconds: 5
-            periodSeconds: 10
+            periodSeconds: 5
             timeoutSeconds: 5
-            failureThreshold: 3
+            failureThreshold: 1
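
The "10s * 3" figure in the comment above can be reproduced with a small calculation. This Go sketch uses the review's own estimate (periodSeconds multiplied by failureThreshold) and compares the current and suggested settings. It is an approximation under that assumption: a probe that hangs until timeoutSeconds would add to the delay, and the function name worstCaseNotReady is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseNotReady approximates how long a failing readiness probe can take
// to mark the pod NotReady, using the estimate from the review comment:
// periodSeconds * failureThreshold. Probe timeouts are not included.
func worstCaseNotReady(periodSeconds, failureThreshold int) time.Duration {
	return time.Duration(periodSeconds*failureThreshold) * time.Second
}

func main() {
	fmt.Println("current settings:  ", worstCaseNotReady(10, 3))
	fmt.Println("suggested settings:", worstCaseNotReady(5, 1))
}
```

Under this estimate the current settings allow up to 30 seconds before the pod is marked NotReady, and the suggested settings reduce that to 5 seconds. The trade-off is that a failureThreshold of 1 makes readiness sensitive to a single slow or dropped probe.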
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/manager/manager.yaml` around lines 114 - 122, The readinessProbe for
the manager (httpGet path "/readyz", scheme HTTPS) is too slow to mark Pod
NotReady during shutdown; adjust readinessProbe settings to fail faster by
lowering periodSeconds (e.g., from 10 to 2–3), reducing failureThreshold (e.g.,
from 3 to 1–2) and/or decreasing timeoutSeconds to ensure the probe transitions
to NotReady quickly so endpoints are drained sooner; update the readinessProbe
block (httpGet path /readyz, initialDelaySeconds, periodSeconds, timeoutSeconds,
failureThreshold) accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f1db3899-53e4-40c3-a38c-c2fc93c4f11f

📥 Commits

Reviewing files that changed from the base of the PR and between e2f1df3 and cc2910e.

📒 Files selected for processing (2)
  • bundle/manifests/cert-manager-operator.clusterserviceversion.yaml
  • config/manager/manager.yaml

@sebrandon1 sebrandon1 force-pushed the add-operator-health-probes branch from cc2910e to d9d40bd Compare May 14, 2026 19:07
@sebrandon1 sebrandon1 force-pushed the add-operator-health-probes branch from d9d40bd to 7285ee6 Compare May 29, 2026 15:51
@sebrandon1 sebrandon1 force-pushed the add-operator-health-probes branch from 7285ee6 to 53508d0 Compare June 8, 2026 21:39
@sebrandon1 sebrandon1 changed the title from "CNF-23436: Add liveness and readiness probes to operator deployment" to "CNF-23436: Add health probes and richer status conditions" Jun 8, 2026
@sebrandon1 sebrandon1 force-pushed the add-operator-health-probes branch from 53508d0 to 19e3212 Compare June 9, 2026 16:17
Add liveness and readiness probes to the operator deployment for
improved health monitoring.

Add a Progressing status condition and specific reason codes
(Reconciling, WaitingForDependencies, ValidationFailed,
MultipleInstancesFound) for IstioCSR and TrustManager CRs so users
can diagnose issues from CR status without reading operator logs.
@sebrandon1 sebrandon1 force-pushed the add-operator-health-probes branch from 19e3212 to 4e2bb0d Compare June 9, 2026 17:47
@sebrandon1

Member Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Contributor

@sebrandon1: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
