OCPNODE-3751: Add automated tests for kubelet LimitedSwap drop-in configuration for CNV#30795
Conversation

/ok-to-test

/test all

@BhargaviGudi: This pull request references OCPNODE-3751, which is a valid Jira issue. Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed 2c7a4a3 to 5283c24

/approve

Force-pushed 5283c24 to efa4d61

/lgtm

Scheduling required tests:

/verified by @BhargaviGudi

@BhargaviGudi: This PR has been marked as verified by @BhargaviGudi.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/test e2e-aws-csi

/label acknowledge-critical-fixes-only

Force-pushed efa4d61 to a6328e9
Force-pushed a6328e9 to dd589ca
Force-pushed dd589ca to 5fb0a13
Force-pushed 5fb0a13 to ab38179

/payload-job periodic-ci-openshift-release-master-ci-4.22-e2e-aws-upgrade-ovn-single-node

@BhargaviGudi: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
Actionable comments posted: 2
🧹 Nitpick comments (4)
test/extended/node/node_swap_cnv.go (3)
41-42: Misplaced `defer g.GinkgoRecover()` in `g.Describe` block.

`defer g.GinkgoRecover()` at line 42 has no effect here because it's in the `g.Describe` function body, not in a goroutine or a function that will panic. `GinkgoRecover` is intended for use in goroutines spawned within tests to recover panics. This line can be removed as it serves no purpose in this context.

Proposed fix:

```diff
 var _ = g.Describe("[Jira:Node/Kubelet][sig-node][Feature:NodeSwap][Serial][Disruptive][Suite:openshift/disruptive-longrunning] Kubelet LimitedSwap Drop-in Configuration for CNV", g.Ordered, func() {
-	defer g.GinkgoRecover()
 	var oc = exutil.NewCLI("cnv-swap")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` around lines 41 - 42, Remove the no-op defer g.GinkgoRecover() call from inside the g.Describe block in test/extended/node/node_swap_cnv.go; locate the Describe declaration (the var _ = g.Describe(...) anonymous block) and delete the line `defer g.GinkgoRecover()`—if GinkgoRecover is needed, only add it inside goroutines spawned within tests (e.g., inside any go-routine bodies used in It/BeforeEach/AfterEach), not at the top-level Describe.
931-942: Incomplete swap restoration in TC12 cleanup.

The cleanup logs "may need manual re-enable" for swap (line 937) but doesn't actually restore it. While this may be acceptable since the test is disruptive, it leaves the node in a potentially different state. Consider storing the original swap configuration and restoring it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` around lines 931 - 942, The deferred cleanup currently only logs that swap "may need manual re-enable" but does not restore the original swap state; update the test to capture the node's initial swap configuration (e.g., whether swap was enabled and any related fstab or swapfile path) before disabling it and then restore that exact state in the defer block: when initialHasSwap is true re-create or restore the original fstab entry or swapfile and run swapon on the node (use the same oc/node exec helpers you already use around removeDropInFile, restartKubeletOnNode, waitForNodeToBeReady), ensuring these steps occur before restarting kubelet so the node returns to its prior state; reference initialHasSwap, cnvWorkerNode, cnvDropInFilePath, removeDropInFile, restartKubeletOnNode, and waitForNodeToBeReady when implementing the capture-and-restore logic.
1342-1352: Bearer token included in curl command string may appear in logs.

The bearer token is embedded in the curl command at line 1342, and the result is logged at line 1349. While this is a test environment with short-lived tokens, consider using environment variables or a different approach to avoid token exposure in logs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` around lines 1342 - 1352, The curl command currently embeds the bearer token in curlCmd (token variable) which risks exposing it; change the oc.AsAdmin().WithoutNamespace().Run("exec") invocation to avoid putting token directly into the command string by passing the token via stdin and sourcing it inside the remote shell (e.g., pipe the token into oc exec and use sh -c 'read TOKEN; curl -sk -H "Authorization: Bearer $TOKEN" "https://.../api/v1/query?query=..."'), stop logging the raw command (curlCmd) and ensure any logs (e.g., framework.Logf around swapTotalResult) never print the token variable or the composed command; update references in this block (curlCmd, token, swapTotalResult, the oc.AsAdmin().WithoutNamespace().Run("exec") call) accordingly.

test/extended/node/node_utils.go (1)
139-155: `rand.Intn` used without seeding produces deterministic results.

`rand.Intn` at lines 145 and 154 uses the default global random source, which in Go versions prior to 1.20 starts with seed 1, producing deterministic sequences. While Go 1.20+ auto-seeds, for consistent behavior consider using `rand.New(rand.NewSource(time.Now().UnixNano()))` or document the Go version requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_utils.go` around lines 139 - 155, The function getCNVWorkerNodeName uses rand.Intn (calls to rand.Intn in this function) which can produce deterministic results if the global source isn't seeded; seed the PRNG at function or package init by creating a rand.New(rand.NewSource(time.Now().UnixNano())) and use that instance for the two rand.Intn calls (or seed the global source in init) so node selection is non-deterministic; update imports (time and math/rand) and replace the rand.Intn usages in getCNVWorkerNodeName accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/node_swap_cnv.go`:
- Around line 150-154: The current check uses
o.ContainSubstring("error.*kubelet.conf.d") which is a literal substring match
and will never treat ".*" as a regex; update the assertion to use a regex
matcher or separate substring checks: replace the ContainSubstring call with
o.MatchRegexp("error.*kubelet.conf.d") (or split into two literal checks for
"error" and "kubelet.conf.d") so the lowerOutput, o.Expect and the failing
assertion correctly detect patterns (use the existing lowerOutput variable and
the o.Expect(...) wrapper when changing ContainSubstring to MatchRegexp).
- Around line 341-380: The loop that creates drop-in files on control plane
nodes can exit early on assertions and skip cleanup; after calling
createDropInFile (and before any assertion that can fail) arrange guaranteed
cleanup by deferring removal: capture cpNodeName in a new local variable (e.g.,
curNode := cpNodeName) and immediately defer removeDropInFile(oc, curNode,
cnvDropInFilePath) and a DebugNodeWithChroot(oc, curNode, "rmdir", cnvDropInDir)
pair so the drop-in file and directory are removed even if
getKubeletConfigFromNode, restartKubeletOnNode, or the o.Expect assertions fail;
ensure the deferred calls use the local copy to avoid closure-of-loop-variable
bugs.
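The first inline comment (the `ContainSubstring` vs `MatchRegexp` distinction) can be demonstrated in plain Go: `strings.Contains` mirrors what Gomega's `ContainSubstring` checks, while `regexp` mirrors `MatchRegexp`. The journal line below is invented for the demo.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchKubeletError contrasts the two assertion styles: a literal substring
// check treats ".*" as two ordinary characters and never matches, while a
// regex match interprets ".*" as "any characters", which is what the test intended.
func matchKubeletError(lowerOutput string) (literal, viaRegex bool) {
	literal = strings.Contains(lowerOutput, "error.*kubelet.conf.d")
	viaRegex = regexp.MustCompile(`error.*kubelet\.conf\.d`).MatchString(lowerOutput)
	return
}

func main() {
	line := "error: failed to parse drop-in in /etc/kubernetes/kubelet.conf.d/99-swap.conf"
	literal, viaRegex := matchKubeletError(line)
	fmt.Println(literal, viaRegex) // false true
}
```

Note the regex form also escapes the dots, so `kubelet.conf.d` cannot accidentally match `kubeletXconfYd`.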
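The closure-of-loop-variable point in the second inline comment can be shown with a small self-contained sketch; `cleanupAll` is illustrative, not the test's code. Capturing the loop variable in a fresh local before `defer` keeps each deferred cleanup bound to its own node (before Go 1.22, deferring a closure over the loop variable directly would run every cleanup against the last node), and the defers run LIFO when the enclosing function returns.

```go
package main

import "fmt"

// cleanupAll simulates per-node cleanup registered inside a loop.
// Each iteration copies the loop variable into curNode before deferring,
// so every deferred call sees its own node name.
func cleanupAll(nodes []string) []string {
	cleaned := []string{}
	func() {
		for _, node := range nodes {
			curNode := node // fresh copy per iteration
			defer func() {
				cleaned = append(cleaned, curNode)
			}()
		}
	}() // all defers fire here, in LIFO order
	return cleaned
}

func main() {
	fmt.Println(cleanupAll([]string{"cp-0", "cp-1", "cp-2"})) // [cp-2 cp-1 cp-0]
}
```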
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 17d696af-1352-4db5-9690-c00b6ba41039
📒 Files selected for processing (11)
pkg/testsuites/standard_suites.go
test/extended/node/node_swap_cnv.go
test/extended/node/node_utils.go
test/extended/testdata/bindata.go
test/extended/testdata/node/cnv-swap/README.md
test/extended/testdata/node/cnv-swap/cnv-hyperconverged.yaml
test/extended/testdata/node/cnv-swap/cnv-namespace.yaml
test/extended/testdata/node/cnv-swap/cnv-operatorgroup.yaml
test/extended/testdata/node/cnv-swap/cnv-subscription.yaml
test/extended/testdata/node/cnv-swap/kubelet-limitedswap-dropin.yaml
test/extended/testdata/node/cnv-swap/kubelet-noswap-dropin.yaml
Scheduling required tests:

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning

@BhargaviGudi: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/fd1673c0-2293-11f1-86fc-d9415f12224b-0

/test e2e-metal-ipi-ovn-ipv6

All the tests passed

Force-pushed e178648 to 8a3bfd6
…figuration for CNV. Updated go.mod by running go mod tidy and make verify-deps; verify-deps in the PR was failing due to this issue. Resolved conflicts. Removed extra spaces. Resolved gofmt issue. Resolved rebase issues. Added skip for HyperShift and SNO clusters as needed. Resolved regex issue.
Force-pushed 8a3bfd6 to d37aed4
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/testsuites/standard_suites.go`:
- Around line 481-494: The suite entry with Name "openshift/nodes/cnv" has a
Qualifiers slice containing `name.contains("[Suite:openshift/nodes/cnv")` that
no longer matches the new CNV specs tagged
`[Suite:openshift/disruptive-longrunning]`; update the Qualifiers for that suite
(the Qualifiers slice for the "openshift/nodes/cnv" entry) to include the
correct selector `name.contains("[Suite:openshift/disruptive-longrunning]")` (or
replace the existing selector) so the new specs (e.g.,
test/extended/node/node_swap_cnv.go) are picked up by this suite.
In `@test/extended/node/node_swap_cnv.go`:
- Around line 901-907: The test disables host swap with swapoff -a but never
restores pre-existing swap; update the Setup/teardown around the blocks that
check initialHasSwap to first capture the current active swap entries (e.g., run
"swapon --show=NAME --noheadings" via DebugNodeWithNsenter using oc and
cnvWorkerNode) before calling swapoff, and add a defer/cleanup that re-enables
only those captured swap devices (using "swapon <device>" or the original swapon
entries) when initialHasSwap is true; apply the same change to the other
occurrences referenced (around the blocks identified by initialHasSwap at the
other ranges) so tests leave the node in its original state.
- Around line 1321-1434: The test currently treats Prometheus route/token/query
and kubelet metrics failures as warnings which lets TC14 pass without proving
metric exposure; change the logic so that if prometheusRoute is found then token
retrieval and both Prometheus queries (swapTotalResult, swapFreeResult) must
succeed and be non-empty (use o.Expect or framework.Failf on errors/empty
results) and if prometheusRoute is not found decide to fail the test or
explicitly Skip with a clear message; similarly, require kubeletMetrics to be
retrievable and contain swap-related lines (fail if the raw
/api/v1/nodes/.../proxy/metrics call errors or no swap metrics found) by
updating the checks around prometheusRoute, token, swapTotalResult,
swapFreeResult, kubeletMetrics and the final assertions that now assert metric
query success in addition to /proc/meminfo.
- Around line 75-79: The global setup currently calls
ensureDropInDirectoryExists (which does a mkdir -p across all workers) in
node_swap_cnv.go; remove that call from the global setup so test-created state
is not pre-populated and regressions in CNV/MCO directory creation won't be
masked. If some individual tests actually require the directory to exist for
their own preconditions, move the ensureDropInDirectoryExists() invocation into
those specific test setup blocks (not the global setup), or replace it with
explicit per-test setup that validates CNV/MCO behavior instead of pre-creating
the directory.
In `@test/extended/node/node_utils.go`:
- Around line 139-155: The helper getCNVWorkerNodeName currently falls back to
any worker when no node has the "kubevirt.io/schedulable=true" label; change it
to fail fast instead: remove the fallback call to getNodesByLabel(...,
"node-role.kubernetes.io/worker") and if getNodesByLabel(ctx, oc,
"kubevirt.io/schedulable=true") returns no nodes or an error, surface that
failure (e.g., change getCNVWorkerNodeName signature to return (string, error)
or call the test framework's fail helper like framework.Failf/GinkgoFail) so
callers cannot silently proceed; update all callers of getCNVWorkerNodeName to
handle the new error/failed-case accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f18679f5-6cda-4352-b1cc-607f9509d340
📒 Files selected for processing (11)
pkg/testsuites/standard_suites.go
test/extended/node/node_swap_cnv.go
test/extended/node/node_utils.go
test/extended/testdata/bindata.go
test/extended/testdata/node/cnv-swap/README.md
test/extended/testdata/node/cnv-swap/cnv-hyperconverged.yaml
test/extended/testdata/node/cnv-swap/cnv-namespace.yaml
test/extended/testdata/node/cnv-swap/cnv-operatorgroup.yaml
test/extended/testdata/node/cnv-swap/cnv-subscription.yaml
test/extended/testdata/node/cnv-swap/kubelet-limitedswap-dropin.yaml
test/extended/testdata/node/cnv-swap/kubelet-noswap-dropin.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- test/extended/testdata/node/cnv-swap/README.md
- test/extended/testdata/node/cnv-swap/kubelet-noswap-dropin.yaml
- test/extended/testdata/node/cnv-swap/kubelet-limitedswap-dropin.yaml
- test/extended/testdata/node/cnv-swap/cnv-hyperconverged.yaml
```go
// Ensure drop-in directory exists on all worker nodes
err = ensureDropInDirectoryExists(ctx, oc, cnvDropInDir)
if err != nil {
	framework.Logf("Warning: failed to ensure drop-in directory exists: %v", err)
}
```
Don't create the drop-in directory in global setup.
ensureDropInDirectoryExists() is a mkdir -p across all workers (test/extended/node/node_utils.go:644-657). That makes TC1/TC2/TC6 validate test-created state instead of CNV/MCO behavior, so regressions in directory creation or auto-recreation will be masked.
```go
// If swap is already enabled, disable it for this test
if initialHasSwap {
	g.By("Disabling existing OS-level swap for test")
	framework.Logf("Running: swapoff -a")
	_, _ = DebugNodeWithNsenter(oc, cnvWorkerNode, "swapoff", "-a")
	framework.Logf("OS-level swap disabled")
}
```
Restore any pre-existing swap instead of leaving the node altered.
These specs disable all swap with swapoff -a, but cleanup only removes the temporary file and even notes that manual re-enable may be needed. On nodes that started with swap enabled, the suite permanently changes host state and can skew the rest of the ordered run. Capture the active swap entries before disabling them and re-enable them in the defer.
Also applies to: 931-941, 1068-1080, 1101-1104
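Capturing the pre-test baseline comes down to parsing the output of `swapon --show=NAME --noheadings` into a device list that a deferred cleanup can feed back to `swapon`. A sketch of just that parsing step (the sample output is invented; `parseSwapDevices` is a hypothetical helper, not the repo's code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseSwapDevices turns `swapon --show=NAME --noheadings` output into a
// list of device paths. Blank lines are skipped so an empty output (no
// active swap) yields an empty list and the cleanup becomes a no-op.
func parseSwapDevices(out string) []string {
	devices := []string{}
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		if dev := strings.TrimSpace(line); dev != "" {
			devices = append(devices, dev)
		}
	}
	return devices
}

func main() {
	sample := "/dev/dm-1\n/var/tmp/swapfile\n"
	fmt.Println(parseSwapDevices(sample)) // [/dev/dm-1 /var/tmp/swapfile]
}
```

In the test, the deferred block would then run `swapon <device>` for each captured entry before restarting the kubelet, so the node returns to its original state.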
```go
prometheusRoute, err := oc.AsAdmin().WithoutNamespace().Run("get").Args(
	"route", "prometheus-k8s", "-n", "openshift-monitoring",
	"-o", "jsonpath={.spec.host}").Output()
if err != nil || prometheusRoute == "" {
	framework.Logf("Warning: Could not get Prometheus route: %v", err)
	framework.Logf("Skipping Prometheus metrics validation")
} else {
	framework.Logf("Prometheus route: %s", prometheusRoute)

	// Get bearer token for Prometheus access
	token, err := oc.AsAdmin().WithoutNamespace().Run("whoami").Args("-t").Output()
	if err != nil {
		framework.Logf("Warning: Could not get token: %v", err)
	} else {
		framework.Logf("Got authentication token for Prometheus access")

		g.By("Querying node_memory_SwapTotal_bytes metric")
		// Query for swap total metric - URL encode the query
		swapTotalQuery := fmt.Sprintf("node_memory_SwapTotal_bytes{instance=~\"%s.*\"}", cnvWorkerNode)
		framework.Logf("Prometheus query: %s", swapTotalQuery)
		encodedSwapTotalQuery := url.QueryEscape(swapTotalQuery)
		curlCmd := fmt.Sprintf("curl -sk -H 'Authorization: Bearer %s' 'https://%s/api/v1/query?query=%s'",
			strings.TrimSpace(token), prometheusRoute, encodedSwapTotalQuery)
		swapTotalResult, err := oc.AsAdmin().WithoutNamespace().Run("exec").Args(
			"-n", "openshift-monitoring",
			"prometheus-k8s-0", "-c", "prometheus",
			"--", "sh", "-c", curlCmd).Output()
		if err == nil {
			framework.Logf("Prometheus SwapTotal query result: %s", swapTotalResult)
		} else {
			framework.Logf("Warning: Prometheus query failed: %v", err)
		}

		g.By("Querying node_memory_SwapFree_bytes metric")
		swapFreeQuery := fmt.Sprintf("node_memory_SwapFree_bytes{instance=~\"%s.*\"}", cnvWorkerNode)
		framework.Logf("Prometheus query: %s", swapFreeQuery)
		encodedSwapFreeQuery := url.QueryEscape(swapFreeQuery)
		curlCmd = fmt.Sprintf("curl -sk -H 'Authorization: Bearer %s' 'https://%s/api/v1/query?query=%s'",
			strings.TrimSpace(token), prometheusRoute, encodedSwapFreeQuery)
		swapFreeResult, err := oc.AsAdmin().WithoutNamespace().Run("exec").Args(
			"-n", "openshift-monitoring",
			"prometheus-k8s-0", "-c", "prometheus",
			"--", "sh", "-c", curlCmd).Output()
		if err == nil {
			framework.Logf("Prometheus SwapFree query result: %s", swapFreeResult)
		} else {
			framework.Logf("Warning: Prometheus query failed: %v", err)
		}
	}
}

g.By("Querying kubelet metrics endpoint for swap data")
framework.Logf("Running: oc get --raw \"/api/v1/nodes/%s/proxy/metrics\" | grep -i swap", cnvWorkerNode)
kubeletMetrics, err := oc.AsAdmin().WithoutNamespace().Run("get").Args(
	"--raw", fmt.Sprintf("/api/v1/nodes/%s/proxy/metrics", cnvWorkerNode)).Output()
if err == nil {
	// Filter for swap-related metrics
	swapMetrics := []string{}
	for _, line := range strings.Split(kubeletMetrics, "\n") {
		lowerLine := strings.ToLower(line)
		if strings.Contains(lowerLine, "swap") && !strings.HasPrefix(line, "#") {
			swapMetrics = append(swapMetrics, line)
		}
	}
	if len(swapMetrics) > 0 {
		framework.Logf("Kubelet swap-related metrics found:")
		for _, metric := range swapMetrics {
			framework.Logf("  %s", metric)
		}
	} else {
		framework.Logf("No swap-specific metrics found in kubelet metrics (may be exposed via node-exporter)")
	}
} else {
	framework.Logf("Warning: Could not query kubelet metrics: %v", err)
}

g.By("Validating metrics are present and accurate")
// Verify /proc/meminfo shows swap info (SwapTotal and SwapFree fields should exist)
o.Expect(meminfoOutput).To(o.ContainSubstring("SwapTotal"))
o.Expect(meminfoOutput).To(o.ContainSubstring("SwapFree"))
framework.Logf("Validation passed: /proc/meminfo contains SwapTotal and SwapFree fields")

// If we created swap, verify non-zero values
if swapCreated || hasOSSwap {
	o.Expect(swapTotalBytes).To(o.BeNumerically(">", 0), "SwapTotal should be > 0 when swap is enabled")
	framework.Logf("Validation passed: SwapTotal=%d bytes (swap is enabled)", swapTotalBytes)
	// Expected swap size ~512MB = 536870912 bytes (allow some variance)
	if swapCreated {
		expectedMinBytes := int64(swapSizeMB*1024*1024) - 10*1024*1024 // Allow 10MB variance
		o.Expect(swapTotalBytes).To(o.BeNumerically(">=", expectedMinBytes),
			fmt.Sprintf("SwapTotal should be approximately %dMB", swapSizeMB))
	}
}

g.By("Verifying node remains Ready with metrics collection active")
node, err := oc.AdminKubeClient().CoreV1().Nodes().Get(ctx, cnvWorkerNode, metav1.GetOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(isNodeInReadyState(node)).To(o.BeTrue(), "Node should remain Ready")
framework.Logf("Node %s is Ready", cnvWorkerNode)

osSwapStatus := "enabled"
if swapCreated {
	osSwapStatus = fmt.Sprintf("enabled (created %dMB by test)", swapSizeMB)
} else if hasOSSwap {
	osSwapStatus = "enabled (pre-existing)"
}
framework.Logf("=== TC14 PASSED ===")
framework.Logf("Swap metrics and observability verification:")
framework.Logf("- Node: %s", cnvWorkerNode)
framework.Logf("- OS swap: %s", osSwapStatus)
framework.Logf("- Kubelet swapBehavior: %s", config.MemorySwap.SwapBehavior)
framework.Logf("- /proc/meminfo: SwapTotal=%d KB, SwapFree=%d KB", swapTotalKB, swapFreeKB)
framework.Logf("- Prometheus metrics: queried successfully with non-zero values")
framework.Logf("- Kubelet metrics endpoint: queried")
```
TC14 passes without actually proving metric exposure.
Every Prometheus/kubelet metrics failure here is reduced to a warning, and the hard assertions at the end only check /proc/meminfo. This spec can therefore succeed even when route lookup, auth, pod exec, or metric queries all fail.
```go
// getCNVWorkerNodeName returns the name of a worker node with CNV label (kubevirt.io/schedulable=true)
func getCNVWorkerNodeName(ctx context.Context, oc *exutil.CLI) string {
	// First try to get nodes with CNV schedulable label
	nodes, err := getNodesByLabel(ctx, oc, "kubevirt.io/schedulable=true")
	if err == nil && len(nodes) > 0 {
		// Randomly select a node from the available CNV nodes
		return nodes[rand.Intn(len(nodes))].Name
	}

	// Fallback to any worker node
	nodes, err = getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
	if err != nil || len(nodes) == 0 {
		return ""
	}
	// Randomly select a node from available worker nodes
	return nodes[rand.Intn(len(nodes))].Name
}
```
Fail fast when no CNV-schedulable worker exists.
This helper is used by CNV-only specs, but if kubevirt.io/schedulable=true is absent it silently falls back to any worker. That masks missing CNV coverage and can produce false passes/failures on nodes that were never configured for CNV.
Possible minimal fix
```diff
 func getCNVWorkerNodeName(ctx context.Context, oc *exutil.CLI) string {
 	// First try to get nodes with CNV schedulable label
 	nodes, err := getNodesByLabel(ctx, oc, "kubevirt.io/schedulable=true")
 	if err == nil && len(nodes) > 0 {
 		// Randomly select a node from the available CNV nodes
 		return nodes[rand.Intn(len(nodes))].Name
 	}
-
-	// Fallback to any worker node
-	nodes, err = getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
-	if err != nil || len(nodes) == 0 {
-		return ""
-	}
-	// Randomly select a node from available worker nodes
-	return nodes[rand.Intn(len(nodes))].Name
+	return ""
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// getCNVWorkerNodeName returns the name of a worker node with CNV label (kubevirt.io/schedulable=true)
func getCNVWorkerNodeName(ctx context.Context, oc *exutil.CLI) string {
	// First try to get nodes with CNV schedulable label
	nodes, err := getNodesByLabel(ctx, oc, "kubevirt.io/schedulable=true")
	if err == nil && len(nodes) > 0 {
		// Randomly select a node from the available CNV nodes
		return nodes[rand.Intn(len(nodes))].Name
	}
	return ""
}
```
Actionable comments posted: 3
♻️ Duplicate comments (5)
test/extended/node/node_swap_cnv.go (5)
901-907: ⚠️ Potential issue | 🟠 Major

Swap baseline is not restored after `swapoff -a`. When nodes start with swap enabled, these tests disable host swap and do not re-enable original entries, leaving persistent state drift.

Proposed fix pattern:

```diff
+	// capture active swap devices before any swapoff
+	preSwapOutput, err := DebugNodeWithNsenter(oc, cnvWorkerNode, "sh", "-c", "swapon --show=NAME --noheadings")
+	o.Expect(err).NotTo(o.HaveOccurred())
+	preSwapDevices := []string{}
+	for _, line := range strings.Split(strings.TrimSpace(preSwapOutput), "\n") {
+		if strings.TrimSpace(line) != "" {
+			preSwapDevices = append(preSwapDevices, strings.TrimSpace(line))
+		}
+	}
@@
-	// Re-enable swap if it was initially present
-	if initialHasSwap {
-		framework.Logf("Note: OS swap was initially enabled, may need manual re-enable")
-	}
+	for _, dev := range preSwapDevices {
+		_, _ = DebugNodeWithNsenter(oc, cnvWorkerNode, "swapon", dev)
+	}
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 931-938, 1068-1080, 1101-1104
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` around lines 901 - 907, The test disables host swap with DebugNodeWithNsenter("swapoff", "-a") when initialHasSwap is true but never restores it; update the test to record initialHasSwap usage and, in the test teardown/cleanup (or immediately after test steps), call DebugNodeWithNsenter(oc, cnvWorkerNode, "swapon", "-a") when initialHasSwap is true (and log/handle errors via framework.Logf or process logger) so the node's original swap state is restored; reference the existing DebugNodeWithNsenter call, the initialHasSwap variable, and the g.By("Disabling existing OS-level swap for test") block to locate where to add the corresponding re-enable logic.
341-380: ⚠️ Potential issue | 🟠 Major — TC5 control-plane cleanup is not guaranteed on early failure.

If an expectation fails before lines 375-379, modified drop-ins/directories can remain on control-plane nodes and pollute subsequent tests.

Proposed fix

```diff
 for i, cpNode := range controlPlaneNodes {
 	cpNodeName := cpNode.Name
@@
 	err = createDropInFile(oc, cpNodeName, cnvDropInFilePath, loadConfigFromFile(cnvLimitedSwapConfigPath))
 	o.Expect(err).NotTo(o.HaveOccurred())
 	framework.Logf("Created drop-in file: %s on %s", cnvDropInFilePath, cpNodeName)
+
+	curNode := cpNodeName
+	defer func() {
+		removeDropInFile(oc, curNode, cnvDropInFilePath)
+		_, _ = DebugNodeWithChroot(oc, curNode, "rmdir", cnvDropInDir)
+	}()
@@
-	g.By(fmt.Sprintf("Cleaning up %s", cpNodeName))
-	removeDropInFile(oc, cpNodeName, cnvDropInFilePath)
-	// Also remove the drop-in directory we created on control plane
-	_, _ = DebugNodeWithChroot(oc, cpNodeName, "rmdir", cnvDropInDir)
-	framework.Logf("Removed drop-in directory from control plane node %s", cpNodeName)
 }
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` around lines 341 - 380, The test can leave created drop-ins/directories on control-plane nodes if an expectation fails before the explicit cleanup; after creating the drop-in directory and file (calls to DebugNodeWithChroot(..., "mkdir", "-p", cnvDropInDir) and createDropInFile(..., cnvDropInFilePath)), ensure cleanup always runs by introducing a per-iteration deferred cleanup (use an inner anonymous func so defer runs each loop) that calls removeDropInFile(oc, cpNodeName, cnvDropInFilePath) and DebugNodeWithChroot(oc, cpNodeName, "rmdir", cnvDropInDir); place the defer immediately after successful creation to guarantee removal even on early failures.
1324-1368: ⚠️ Potential issue | 🟠 Major — TC14 can pass without validating Prometheus/kubelet swap metrics.

Route lookup, token retrieval, Prometheus query failures, and kubelet metrics errors are warning-only, yet the final output reports success.

Proposed fix

```diff
 if err != nil || prometheusRoute == "" {
-	framework.Logf("Warning: Could not get Prometheus route: %v", err)
-	framework.Logf("Skipping Prometheus metrics validation")
+	g.Skip(fmt.Sprintf("Prometheus route unavailable for TC14 validation: %v", err))
 } else {
@@
-	if err != nil {
-		framework.Logf("Warning: Could not get token: %v", err)
-	} else {
+	o.Expect(err).NotTo(o.HaveOccurred(), "Prometheus token retrieval must succeed")
+	{
@@
-		if err == nil {
-			framework.Logf("Prometheus SwapTotal query result: %s", swapTotalResult)
-		} else {
-			framework.Logf("Warning: Prometheus query failed: %v", err)
-		}
+		o.Expect(err).NotTo(o.HaveOccurred(), "Prometheus SwapTotal query must succeed")
+		o.Expect(swapTotalResult).To(o.ContainSubstring(`"status":"success"`))
+		o.Expect(swapTotalResult).NotTo(o.ContainSubstring(`"result":[]`))
@@
-		if err == nil {
-			framework.Logf("Prometheus SwapFree query result: %s", swapFreeResult)
-		} else {
-			framework.Logf("Warning: Prometheus query failed: %v", err)
-		}
+		o.Expect(err).NotTo(o.HaveOccurred(), "Prometheus SwapFree query must succeed")
+		o.Expect(swapFreeResult).To(o.ContainSubstring(`"status":"success"`))
+		o.Expect(swapFreeResult).NotTo(o.ContainSubstring(`"result":[]`))
 	}
 }
@@
-	if err == nil {
+	o.Expect(err).NotTo(o.HaveOccurred(), "kubelet metrics endpoint must be reachable")
+	if err == nil {
@@
-	if len(swapMetrics) > 0 {
+	o.Expect(len(swapMetrics)).To(o.BeNumerically(">", 0), "Expected swap-related metrics from kubelet endpoint")
+	if len(swapMetrics) > 0 {
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1374-1395, 1427-1434
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` around lines 1324 - 1368, The test currently treats failures to find the Prometheus route, get the bearer token, and execute Prometheus queries (variables/prometheusRoute, token from oc.AsAdmin().WithoutNamespace().Run("whoami"), and the exec curl commands producing swapTotalResult/swapFreeResult) as warnings, allowing TC14 to pass incorrectly; change these warning-only branches to fail the test instead (use framework.Failf or framework.ExpectNoError as appropriate) so that if prometheusRoute is empty or err != nil, token retrieval returns an error, or either Prometheus query exec returns an error, the test aborts with a clear failure message; apply the same change pattern to the other similar blocks referenced (lines handling swapTotalResult/swapFreeResult and the other occurrences noted).
41-41: ⚠️ Potential issue | 🟠 Major — CNV suite selector won't pick this spec due to tag mismatch.

This spec is tagged only with `[Suite:openshift/disruptive-longrunning]`, so `openshift/nodes/cnv` selection won't include it.

Proposed fix

```diff
-var _ = g.Describe("[Jira:Node/Kubelet][sig-node][Feature:NodeSwap][Serial][Disruptive][Suite:openshift/disruptive-longrunning] Kubelet LimitedSwap Drop-in Configuration for CNV", g.Ordered, func() {
+var _ = g.Describe("[Jira:Node/Kubelet][sig-node][Feature:NodeSwap][Serial][Disruptive][Suite:openshift/nodes/cnv][Suite:openshift/disruptive-longrunning] Kubelet LimitedSwap Drop-in Configuration for CNV", g.Ordered, func() {
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` at line 41, The test's Describe registration string (the var _ = g.Describe(...) line) is missing the CNV suite tag so the openshift/nodes/cnv selector won't pick it; update the Describe tags to include the CNV suite (e.g., add "[Suite:openshift/nodes/cnv]" alongside the existing "[Suite:openshift/disruptive-longrunning]") in the var _ = g.Describe(...) declaration so the test is included by the CNV suite selector.
75-79: ⚠️ Potential issue | 🟠 Major — Global pre-creation of the drop-in directory invalidates creation/recreation checks.

Pre-populating `/etc/openshift/kubelet.conf.d` in global setup can make TC1/TC2/TC6 pass even when CNV/MCO behavior regresses.

Proposed fix

```diff
-	// Ensure drop-in directory exists on all worker nodes
-	err = ensureDropInDirectoryExists(ctx, oc, cnvDropInDir)
-	if err != nil {
-		framework.Logf("Warning: failed to ensure drop-in directory exists: %v", err)
-	}
+	// Do not pre-create drop-in directory globally.
+	// Keep directory setup local to testcases that explicitly need it.
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_swap_cnv.go` around lines 75 - 79, The global pre-creation of the drop-in directory (cnvDropInDir) via ensureDropInDirectoryExists in the global setup is invalidating per-test creation/recreation checks; remove or guard that call so the directory is not created globally. Update the global setup in node_swap_cnv.go to stop calling ensureDropInDirectoryExists (or change that helper to only check existence without creating), and ensure individual tests (the TC1/TC2/TC6 test cases) explicitly create or verify cnvDropInDir as part of their own setup/assertions so they exercise CNV/MCO behavior correctly. Ensure references to ensureDropInDirectoryExists and cnvDropInDir are updated accordingly.
🧹 Nitpick comments (4)
test/extended/node/node_utils.go (4)
563-641: Uninstall always returns `nil`, obscuring cleanup failures.

The function logs warnings but always returns `nil`, so callers cannot distinguish successful cleanup from partial failure. If namespace deletion times out, subsequent test runs may fail unexpectedly. Consider tracking and returning errors for critical failures:

Suggested improvement

```diff
 func uninstallCNVOperator(ctx context.Context, oc *exutil.CLI) error {
 	framework.Logf("Uninstalling CNV operator...")
+	var errs []error
 	dynamicClient := oc.AdminDynamicClient()
 	// ... deletion steps ...
 	// Wait for namespace to be deleted
 	framework.Logf("Waiting for namespace to be deleted...")
-	_ = wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, func(ctx context.Context) (bool, error) {
+	if err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, func(ctx context.Context) (bool, error) {
 		_, err := oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, cnvNamespace, metav1.GetOptions{})
 		if apierrors.IsNotFound(err) {
 			return true, nil
 		}
 		framework.Logf("Waiting for namespace %s to be deleted...", cnvNamespace)
 		return false, nil
-	})
+	}); err != nil {
+		errs = append(errs, fmt.Errorf("namespace deletion timed out: %w", err))
+	}
 	// ... remaining steps ...
 	framework.Logf("CNV operator uninstalled successfully")
+	if len(errs) > 0 {
+		return fmt.Errorf("uninstall completed with errors: %v", errs)
+	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_utils.go` around lines 563 - 641, uninstallCNVOperator currently swallows all errors and always returns nil; change it to collect and return failures from critical steps (deleting HyperConverged CR, Subscription, CSVs, OperatorGroup, namespace deletion/wait, unlabelWorkerNodesForCNV, and waitForMCPRolloutComplete) so callers can detect partial failures. Modify uninstallCNVOperator to accumulate errors (e.g., []error or hashicorp/go-multierror) when dynamicClient.Resource(...).Delete calls or wait.PollUntilContextTimeout loops return non-notFound errors or timeouts, and return a consolidated error (or the multierror) at the end instead of nil; ensure you still log warnings with framework.Logf but propagate the underlying error from failures in HyperConverged deletion, namespace deletion/wait, and waitForMCPRolloutComplete so tests can react appropriately.
228-236: Cleanup errors are silently ignored.

Both `removeDropInFile` and `restartKubeletOnNode` errors are discarded. If cleanup fails, subsequent tests may run against a misconfigured node. Consider logging warnings at minimum.

Suggested improvement

```diff
 func cleanupDropInAndRestartKubelet(ctx context.Context, oc *exutil.CLI, nodeName, filePath string) {
 	framework.Logf("Removing drop-in file: %s", filePath)
-	removeDropInFile(oc, nodeName, filePath)
+	if err := removeDropInFile(oc, nodeName, filePath); err != nil {
+		framework.Logf("Warning: failed to remove drop-in file: %v", err)
+	}
 	framework.Logf("Restarting kubelet on node: %s", nodeName)
-	restartKubeletOnNode(oc, nodeName)
+	if err := restartKubeletOnNode(oc, nodeName); err != nil {
+		framework.Logf("Warning: failed to restart kubelet: %v", err)
+	}
 	framework.Logf("Waiting for node to be ready...")
 	waitForNodeToBeReady(ctx, oc, nodeName)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_utils.go` around lines 228 - 236, The cleanup function currently discards errors from removeDropInFile and restartKubeletOnNode; update error handling so failures are logged and propagated: change removeDropInFile and restartKubeletOnNode to return error (if they don't already), call them from cleanupDropInAndRestartKubelet and capture their returned errors, emit framework.Logf or framework.Errorf warnings that include the nodeName, filePath and the error value, and if either fails consider returning the error from cleanupDropInAndRestartKubelet (or at minimum skip waitForNodeToBeReady when restartKubeletOnNode failed) so callers can react. Ensure you reference the functions cleanupDropInAndRestartKubelet, removeDropInFile, restartKubeletOnNode and waitForNodeToBeReady when making changes.
177-184: Shell escaping may be insufficient for content with `$`, backticks, or backslashes.

The current escaping only handles single quotes. If `content` contains `$VAR`, backticks, or backslashes, they may be interpreted by the shell. For kubelet YAML configs this is unlikely to be a problem, but consider using a `cat <<'EOF'` heredoc or base64 encoding for robustness.

More robust alternative using heredoc

```diff
 func createDropInFile(oc *exutil.CLI, nodeName, filePath, content string) error {
-	// Escape content for shell
-	escapedContent := strings.ReplaceAll(content, "'", "'\\''")
-	cmd := fmt.Sprintf("echo '%s' > %s && chmod 644 %s", escapedContent, filePath, filePath)
-	_, err := DebugNodeWithChroot(oc, nodeName, "sh", "-c", cmd)
+	// Use heredoc with quoted delimiter to prevent variable expansion
+	cmd := fmt.Sprintf("cat <<'EOF' > %s && chmod 644 %s\n%s\nEOF", filePath, filePath, content)
+	_, err := DebugNodeWithChroot(oc, nodeName, "sh", "-c", cmd)
 	return err
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_utils.go` around lines 177 - 184, The createDropInFile function's current echo-based escaping is insufficient for content containing $, backticks, or backslashes; change it to write the content via a single-quoted heredoc (e.g., use cat <<'EOF' > filePath ... EOF) or by base64-encoding content and decoding on the node, so the shell won't interpret special characters. Update the command built for DebugNodeWithChroot to use the heredoc or base64 decode approach, ensure file permissions are set afterwards (chmod 644), and keep references to createDropInFile and DebugNodeWithChroot so reviewers can find and verify the change.
643-658: `ensure` function returns `nil` even when directory creation fails.

The function name suggests it guarantees the directory exists, but it silently continues on failure. Callers relying on this guarantee may encounter unexpected errors later.

Suggested fix

```diff
 func ensureDropInDirectoryExists(ctx context.Context, oc *exutil.CLI, dirPath string) error {
 	nodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
 	if err != nil {
 		return fmt.Errorf("failed to get worker nodes: %w", err)
 	}
+	var lastErr error
 	for _, node := range nodes {
 		_, err := DebugNodeWithChroot(oc, node.Name, "mkdir", "-p", dirPath)
 		if err != nil {
 			framework.Logf("Warning: failed to create directory on node %s: %v", node.Name, err)
+			lastErr = err
 		}
 	}
-	return nil
+	return lastErr
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/node_utils.go` around lines 643 - 658, The ensureDropInDirectoryExists function currently logs mkdir failures and returns nil, violating its "ensure" contract; update it to collect and return an error if any DebugNodeWithChroot call fails (e.g., aggregate errors or return the first error) so callers know the directory creation did not complete on all nodes; modify ensureDropInDirectoryExists to use the existing getNodesByLabel and DebugNodeWithChroot calls but when err != nil record that error (including node.Name and the original err) and after the loop return a combined error (or the first non-nil) instead of nil.
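The per-node aggregation this nitpick asks for can also use Go 1.20's `errors.Join`, which keeps every node's failure inspectable instead of only the last one. A minimal sketch with a hypothetical `mkdirOnNode` standing in for the `DebugNodeWithChroot` mkdir call:

```go
package main

import (
	"errors"
	"fmt"
)

// mkdirOnNode is a stand-in for DebugNodeWithChroot(..., "mkdir", "-p", dir);
// it fails for one hypothetical node to demonstrate aggregation.
func mkdirOnNode(node, dir string) error {
	if node == "worker-1" {
		return fmt.Errorf("node %s: permission denied creating %s", node, dir)
	}
	return nil
}

// ensureDirOnAllNodes attempts every node and returns one joined error,
// rather than returning nil and hiding partial failures.
func ensureDirOnAllNodes(nodes []string, dir string) error {
	var errs []error
	for _, n := range nodes {
		if err := mkdirOnNode(n, dir); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	err := ensureDirOnAllNodes([]string{"worker-0", "worker-1"}, "/etc/openshift/kubelet.conf.d")
	// Joined error reports the failing node while still covering all nodes.
	fmt.Println(err != nil)
}
```

`errors.Join` preserves each wrapped error for `errors.Is`/`errors.As`, so callers can still react to specific failure kinds.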
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ceac0a87-6391-4da1-9ff1-38ae17cad17d
📒 Files selected for processing (11)
- pkg/testsuites/standard_suites.go
- test/extended/node/node_swap_cnv.go
- test/extended/node/node_utils.go
- test/extended/testdata/bindata.go
- test/extended/testdata/node/cnv-swap/README.md
- test/extended/testdata/node/cnv-swap/cnv-hyperconverged.yaml
- test/extended/testdata/node/cnv-swap/cnv-namespace.yaml
- test/extended/testdata/node/cnv-swap/cnv-operatorgroup.yaml
- test/extended/testdata/node/cnv-swap/cnv-subscription.yaml
- test/extended/testdata/node/cnv-swap/kubelet-limitedswap-dropin.yaml
- test/extended/testdata/node/cnv-swap/kubelet-noswap-dropin.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- test/extended/testdata/node/cnv-swap/cnv-namespace.yaml
- test/extended/testdata/node/cnv-swap/cnv-hyperconverged.yaml
- test/extended/testdata/node/cnv-swap/kubelet-limitedswap-dropin.yaml
```go
g.By("Verifying config file ownership")
framework.Logf("Running: stat -c %%U:%%G %s", testFile)
output, err = DebugNodeWithChroot(oc, cnvWorkerNode, "stat", "-c", "%U:%G", testFile)
o.Expect(err).NotTo(o.HaveOccurred())
fileOwnership := strings.TrimSpace(output)
framework.Logf("File ownership: %s", fileOwnership)
```
TC7 reads file ownership but never asserts it.
This allows ownership regressions to pass silently.
Proposed fix

```diff
 fileOwnership := strings.TrimSpace(output)
 framework.Logf("File ownership: %s", fileOwnership)
+o.Expect(fileOwnership).To(o.Equal("root:root"),
+	"Drop-in file ownership should be root:root")
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/extended/node/node_swap_cnv.go` around lines 485 - 491, The test
collects fileOwnership via DebugNodeWithChroot into the local variable
fileOwnership but never asserts it; add an expectation asserting the ownership
equals the intended value (e.g. "root:root" or the appropriate owner for this
test) by using the test framework's matcher (reference the fileOwnership
variable and the DebugNodeWithChroot call), so replace the silent read with an
assertion like o.Expect(fileOwnership).To(Equal("<expected-owner>")) to fail the
test on ownership regressions.
```go
framework.Logf("Running: oc get node %s --show-labels | grep kubevirt", nonCNVWorkerNode)
output, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("node", nonCNVWorkerNode, "-o", "jsonpath={.metadata.labels}").Output()
if strings.Contains(output, "kubevirt.io/schedulable") {
	framework.Logf("Warning: Node still has CNV label in labels: %s", output)
} else {
	framework.Logf("Confirmed: Node %s no longer has kubevirt.io/schedulable label", nonCNVWorkerNode)
}
```
TC9 does not enforce the non-CNV precondition.

If the label is still present, the test only logs a warning and continues against a CNV node.

Proposed fix

```diff
-output, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("node", nonCNVWorkerNode, "-o", "jsonpath={.metadata.labels}").Output()
-if strings.Contains(output, "kubevirt.io/schedulable") {
-	framework.Logf("Warning: Node still has CNV label in labels: %s", output)
-} else {
-	framework.Logf("Confirmed: Node %s no longer has kubevirt.io/schedulable label", nonCNVWorkerNode)
-}
+output, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("node", nonCNVWorkerNode, "-o", "jsonpath={.metadata.labels}").Output()
+o.Expect(err).NotTo(o.HaveOccurred())
+o.Expect(output).NotTo(o.ContainSubstring("kubevirt.io/schedulable"),
+	"Node must not have CNV schedulable label for this testcase")
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/extended/node/node_swap_cnv.go` around lines 629 - 635, The check that
inspects node labels (using
oc.AsAdmin().WithoutNamespace().Run("get").Args("node", nonCNVWorkerNode, "-o",
"jsonpath={.metadata.labels}").Output()) only logs a warning when the
kubevirt.io/schedulable label is present; change this to fail the test instead
so the non-CNV precondition is enforced—replace the warning branch (where
strings.Contains(output, "kubevirt.io/schedulable")) with a call that aborts the
test (e.g., framework.Failf or equivalent test-fail helper) with a clear message
referencing nonCNVWorkerNode and the label output; keep the else branch’s
confirmed log as-is.
```go
if err != nil {
	framework.Logf("Warning: Failed to create swap file: %v", err)
	result.success = false
	results = append(results, result)
	continue
}
```
TC13 can pass even when swap-size setup fails.

dd/mkswap/swapon failures are downgraded to warnings plus `continue`, so a scenario may go untested without failing the testcase.

Proposed fix

```diff
+var setupFailures []string
@@
 if err != nil {
-	framework.Logf("Warning: Failed to create swap file: %v", err)
+	setupFailures = append(setupFailures, fmt.Sprintf("%s: dd failed: %v", swapSize.name, err))
 	result.success = false
 	results = append(results, result)
 	continue
 }
@@
 if err != nil {
-	framework.Logf("Warning: Failed to mkswap: %v", err)
+	setupFailures = append(setupFailures, fmt.Sprintf("%s: mkswap failed: %v", swapSize.name, err))
 	result.success = false
 	results = append(results, result)
 	continue
 }
@@
 if err != nil {
-	framework.Logf("Warning: Failed to enable swap: %v", err)
+	setupFailures = append(setupFailures, fmt.Sprintf("%s: swapon failed: %v", swapSize.name, err))
 	result.success = false
 	results = append(results, result)
 	continue
 }
@@
+o.Expect(setupFailures).To(o.BeEmpty(), "All swap-size scenarios must be exercised successfully")
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1122-1127, 1131-1136
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/extended/node/node_swap_cnv.go` around lines 1110-1115, the error handlers that log a warning and `continue` when dd/mkswap/swapon fail (inside the swap-size setup blocks) allow TC13 to pass despite setup failure. Change these handlers so the run fails immediately: when `err != nil` after a swap command, set `result.success = false`, append `result` to `results`, and then stop the test flow (e.g., return the error or break out to the caller so the test case is marked failed) instead of continuing. Apply the same fix to the three identical blocks that handle dd, mkswap, and swapon failures (the places that update the `result` variable and append to `results`).
Scheduling required tests:

/test e2e-gcp-csi

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: BhargaviGudi, cpmeadors, smg247
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/test e2e-metal-ipi-ovn-ipv6

/verified by CI-Job
@BhargaviGudi: This PR has been marked as verified by CI-Job.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/unhold
@BhargaviGudi: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This PR adds automated tests for the LimitedSwap feature using the /etc/openshift/kubelet.conf.d/ drop-in directory on CNV-enabled clusters.
Jira: https://issues.redhat.com/browse/OCPNODE-3751
Test Cases:
TC1: Verify silent creation and ownership of drop-in directory on CNV nodes
TC2: Verify kubelet starts normally with empty directory
TC3: Apply LimitedSwap configuration from drop-in file
TC4: Revert to NoSwap when drop-in file is removed
TC5: Verify control plane kubelets ignore drop-in config
TC6: Verify drop-in directory is auto-recreated after deletion
TC7: Validate security and permissions of drop-in directory
TC8: Verify cluster stability with LimitedSwap enabled
TC9: Verify non-CNV workers have no swap configuration
TC10: Apply correct precedence with multiple drop-in files
TC11: Maintain consistent configuration with checksum verification across CNV nodes
TC12: Handle LimitedSwap config gracefully when OS swap is disabled
TC13: Work correctly with various swap sizes
TC14: Expose swap metrics correctly via Prometheus
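For reference, the LimitedSwap drop-in that TC3 exercises would be a KubeletConfiguration fragment along these lines (the filename is a hypothetical example; only the swap-related fields shown here are relevant to the tests):

```yaml
# /etc/openshift/kubelet.conf.d/99-limited-swap.conf (example name)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap
```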
Changes
test/extended/node/node_swap_cnv.go - New test file with CNV swap tests
test/extended/node/node_utils.go - Helper functions for node operations and CNV lifecycle
test/extended/testdata/node/cnv-swap/ - Test data YAML files and README
pkg/testsuites/standard_suites.go - Added openshift/nodes/cnv suite definition
Running Tests
Run individual test:

```shell
./openshift-tests run-test "[Jira:Node/Kubelet][sig-node][Feature:NodeSwap][Serial][Disruptive][Suite:openshift/nodes/cnv] Kubelet LimitedSwap Drop-in Configuration for CNV TC1: should verify silent creation and ownership of drop-in directory on CNV nodes"
```

Run entire suite:

```shell
./openshift-tests run openshift/nodes/cnv
```

All test cases pass with 4.22.0-0.nightly-2026-01-19-085729.
I ran each test case individually and captured the outputs here.
I ran the test suite and kept the output here.
CI job:
https://github.com/openshift/release/pull/73670 WIP