OCPBUGS-78617: Fix PinnedImages test should respect node taints#30913
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughA new helper function filters nodes to identify ready, schedulable worker nodes by excluding control-plane and master nodes. The test code is refactored to use this function and changed to skip tests when insufficient worker nodes exist rather than returning errors. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| // Filter for worker nodes only | ||
| var workerNodes []corev1.Node | ||
| for _, node := range nodes.Items { | ||
| if _, hasWorkerLabel := node.Labels["node-role.kubernetes.io/worker"]; hasWorkerLabel { | ||
| workerNodes = append(workerNodes, node) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
@bshaw7 I think we could create a exported function in the package test/extended/node, maybe in node_utils.go, to wrap this logic with e2enode.GetReadySchedulableNodes so that other tests can reuse that logic (retrieve all scheduable worker nodes).
WDYT?
There was a problem hiding this comment.
@mtulio That's good idea to make it re-usable. So far i updated OCPBUGS-78617 with test result without exported function. Next I'll create exporter function and add test result of that too.
|
@bshaw7: This pull request references Jira Issue OCPBUGS-78617, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.19-opct-external-aws-ccm |
|
@mtulio: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/cb1fbc90-247e-11f1-8e80-2eb60f5762cd-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.19-opct-external-aws-ccm |
|
@mtulio: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/30cc6790-2782-11f1-810c-f7710a795c2b-0 |
af1538d to
9b9fb62
Compare
mtulio
left a comment
There was a problem hiding this comment.
Overall looks good to me. We need to make sure the payload job will run as expected, then trigger the required jobs for the pipeline (/pipeline required)
|
/payload-job periodic-ci-openshift-release-main-nightly-4.19-opct-external-aws-ccm |
|
@bshaw7: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d2dfc640-279f-11f1-8d01-b0e33c149076-0 |
|
|
/pipeline required |
|
Scheduling required tests: |
|
/payload-job periodic-ci-redhat-openshift-ecosystem-opct-main-4.19-platform-external-vsphere |
|
@mtulio: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.19-e2e-aws-ovn-seria |
|
@mtulio: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.19-e2e-aws-ovn-serial |
| // Return an error if there are less schedulable worker nodes than the desired number of nodes to add to the custom MCP | ||
| if len(workerNodes) < numberOfNodes { | ||
| return nil, fmt.Errorf("Schedulable nodes in Worker MCP %d < Number of nodes needed in %s MCP %d", len(workerNodes), customMCP, numberOfNodes) |
There was a problem hiding this comment.
Is this really an error? I would more-so consider this an environment where this test is not applicable to run and, thus, should result in a skip.
There was a problem hiding this comment.
Hello @isabella-janssen
Skip test instead of error when insufficient worker nodes
- Changed addWorkerNodesToCustomPool() to call g.Skip() instead of returning error
- Insufficient worker nodes is an environmental constraint, not a test failure
- Provides clearer test results for SNO/compact cluster scenarios
- Tests properly skipped with message: "Not enough schedulable worker nodes (0) for custom MCP (need 1)"
[root@vm-234-63 ~]# cat /tmp/pinnedimages-run.log| grep -i "All Nodes in a custom Pool should have the PinnedImages even after Garbage Collection"
started: 0/2/13 "[Suite:openshift/machine-config-operator/disruptive][sig-mco][OCPFeatureGate:PinnedImages][Disruptive] [Slow]All Nodes in a custom Pool should have the PinnedImages even after Garbage Collection [apigroup:machineconfiguration.openshift.io] [Serial]"
skipped: (4.2s) 2026-03-26T21:22:19 "[Suite:openshift/machine-config-operator/disruptive][sig-mco][OCPFeatureGate:PinnedImages][Disruptive] [Slow]All Nodes in a custom Pool should have the PinnedImages even after Garbage Collection [apigroup:machineconfiguration.openshift.io] [Serial]"
[root@vm-234-63 ~]# oc get node,mcp
NAME STATUS ROLES AGE VERSION
node/ip-10-0-14-92.ec2.internal Ready control-plane,master,worker 86m v1.32.12
node/ip-10-0-5-6.ec2.internal Ready control-plane,master,worker 86m v1.32.12
node/ip-10-0-68-72.ec2.internal Ready control-plane,master,worker 86m v1.32.12
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
machineconfigpool.machineconfiguration.openshift.io/master rendered-master-1930b9457ae617fdbba7667a6044cc4c True False False 3 3 3 0 81m
machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-40946b8718ecc0535ed7778cad6f1e28 True False False 0 0 0 0 81m
[root@vm-234-63 ~]#
The PinnedImages conformance test was selecting worker nodes by label only, ignoring node taints. This caused issues in environments with dedicated/tainted nodes (e.g., OPCT test infrastructure, edge zones). Problem: - Test created custom MachineConfigPool targeting any worker node - Could select nodes with NoSchedule/NoExecute taints - In OPCT, selecting the dedicated node caused pod eviction and test failure Solution: - Use e2enode.GetReadySchedulableNodes() to filter nodes - This function excludes nodes with NoSchedule/NoExecute taints - Follows the same pattern used by other OpenShift conformance tests Changes: - Added import: e2enode "k8s.io/kubernetes/test/e2e/framework/node" - Modified addWorkerNodesToCustomPool() to: 1. Get schedulable nodes using e2enode.GetReadySchedulableNodes() 2. Filter for worker nodes only 3. Select from schedulable workers (excluding tainted nodes) Testing: - Verified node filtering logic correctly excludes tainted nodes - Tested on cluster with dedicated OPCT node (node-role.kubernetes.io/tests:NoSchedule) - Dedicated node correctly filtered out, test selects from 3 schedulable workers Fixes: https://github.com/redhat-openshift-ecosystem/opct/issues/[TBD] Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…function As suggested in code review, created a shared utility function GetReadySchedulableWorkerNodes() in test/extended/node/node_utils.go to encapsulate the logic of retrieving schedulable worker nodes (excluding nodes with NoSchedule/NoExecute taints). This makes the node selection logic reusable across multiple tests and improves code maintainability. Changes: - Added GetReadySchedulableWorkerNodes() in test/extended/node/node_utils.go - Refactored PinnedImages test to use the new utility function - Simplified addWorkerNodesToCustomPool() by removing manual filtering logic
This commit addresses two issues raised in PR review:
1. Exclude nodes with master/control-plane role from custom MCPs
- Created GetReadySchedulableWorkerNodes() in test/extended/util package
- Filters out nodes that have both worker and master/control-plane labels
- In SNO and compact clusters, nodes belong to master MCP and cannot be moved
to custom MCPs
- Prevents test failures when trying to assign these nodes to custom pools
2. Skip test instead of error when insufficient worker nodes
- Changed addWorkerNodesToCustomPool() to call g.Skip() instead of returning error
- Insufficient worker nodes is an environmental constraint, not a test failure
- Provides clearer test results for SNO/compact cluster scenarios
3. Fix import cycle
- Moved GetReadySchedulableWorkerNodes to util package to avoid import cycle
- machine_config package can now safely import from util without creating cycles
Tested on compact cluster (3 nodes with worker+master+control-plane roles):
- GetReadySchedulableWorkerNodes() correctly returned 0 nodes
- Tests properly skipped with message: "Not enough schedulable worker nodes (0) for custom MCP (need 1)"
Co-Authored-By: Bikash Shaw <bshaw@redhat.com>
a1174a7 to
a06beca
Compare
|
Scheduling required tests: |
|
/verified by CI and local tests #30913 (comment) / #30913 (comment) |
|
@bshaw7: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
/approve |
|
Wake up bot for approved labels. 😄 /jira-refresh |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bshaw7, isabella-janssen, neisw The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@bshaw7: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@bshaw7: Jira Issue Verification Checks: Jira Issue OCPBUGS-78617 Jira Issue OCPBUGS-78617 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-4.21 release-4.20 release-4.19 |
|
@mtulio: #30913 failed to apply on top of branch "release-4.21": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Fix included in accepted release 4.22.0-0.nightly-2026-03-29-055437 |
Backport of OCPBUGS-78617 fix from main to release-4.21. The PinnedImages conformance test was selecting worker nodes without considering node taints, causing test failures in environments with dedicated/tainted nodes (like OPCT test infrastructure). Changes: - Added GetReadySchedulableWorkerNodes() utility function in test/extended/util/nodes.go to filter out nodes with NoSchedule/NoExecute taints and nodes with control-plane/master roles - Updated addWorkerNodesToCustomPool() in pinnedimages.go to use the new utility function - Tests now skip gracefully when insufficient schedulable worker nodes exist (handles SNO and compact cluster scenarios) This ensures the test works correctly across cluster topologies. Original PR: openshift#30913
Backport of OCPBUGS-78617 fix from main to release-4.20. The PinnedImages conformance test was selecting worker nodes without considering node taints, causing test failures in environments with dedicated/tainted nodes (like OPCT test infrastructure). Changes: - Added GetReadySchedulableWorkerNodes() utility function in test/extended/util/nodes.go to filter out nodes with NoSchedule/NoExecute taints and nodes with control-plane/master roles - Updated addWorkerNodesToCustomPool() in pinnedimages.go to use the new utility function - Tests now skip gracefully when insufficient schedulable worker nodes exist (handles SNO and compact cluster scenarios) This ensures the test works correctly across cluster topologies. Original PR: openshift#30913
Backport of OCPBUGS-78617 fix from main to release-4.19. The PinnedImages conformance test was selecting worker nodes without considering node taints, causing test failures in environments with dedicated/tainted nodes (like OPCT test infrastructure). Changes: - Added GetReadySchedulableWorkerNodes() utility function in test/extended/util/nodes.go to filter out nodes with NoSchedule/NoExecute taints and nodes with control-plane/master roles - Updated addWorkerNodesToCustomPool() in pinnedimages.go to use the new utility function - Tests now skip gracefully when insufficient schedulable worker nodes exist (handles SNO and compact cluster scenarios) This ensures the test works correctly across cluster topologies. Original PR: openshift#30913
The PinnedImages conformance test was selecting worker nodes by label only, ignoring node taints. This caused issues in environments with dedicated/tainted nodes (e.g., OPCT test infrastructure, edge zones).
Problem:
Solution:
Changes:
Testing:
Fixes: https://redhat.atlassian.net/browse/OCPBUGS-78617