fix: skip vfio groups already held by another process during discovery by artback · Pull Request #2 · HiveNetCode/kubevirt-gpu-device-plugin

artback · 2026-05-18T15:02:04Z

Summary

After the late-binding-discovery rollout to production-supply (hive-compute-ops#841), customer VM 1d35a850-... could not start on cptcan02 and crashlooped every retry with:

qemu-kvm: -device {"driver":"vfio-pci","host":"0000:23:00.0",...}:
  vfio 0000:23:00.0: Could not open '/dev/vfio/51': Device or resource busy

Root cause: createIommuDeviceMap adds every NVIDIA GPU whose driver is vfio-pci to the advertised pool. On nodes where other tenant VMs already have GPUs passed through (their GPUs are bound to vfio-pci because of the passthrough), those GPUs end up in our pool too — but they are not actually free. The Allocate() handler then hands their PCI IDs to a new pod, and qemu fails to open /dev/vfio/<group> because the VFIO kernel driver returns EBUSY on the second open of a group whose opened refcount is already 1 (drivers/vfio/group.c:vfio_group_fops_open).
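To make the root cause concrete, here is a minimal sketch of a discovery walk of the shape described above. It is not the plugin's actual createIommuDeviceMap: the names discoverNaive, linkBase and isNvidiaVfioDevice are hypothetical, and the only assumptions are the standard sysfs layout (a vendor file plus driver and iommu_group symlinks per PCI device) and NVIDIA's PCI vendor ID 0x10de. Nothing in the walk asks whether the group is already open elsewhere, which is how GPUs held by other tenants end up in the pool.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

const nvidiaVendorID = "0x10de" // PCI vendor ID for NVIDIA

// isNvidiaVfioDevice reports whether a device with the given sysfs vendor
// string and bound driver name would be picked up by the naive walk.
func isNvidiaVfioDevice(vendor, driver string) bool {
	return strings.TrimSpace(vendor) == nvidiaVendorID && driver == "vfio-pci"
}

// linkBase resolves a sysfs symlink such as <dev>/driver or <dev>/iommu_group
// and returns the last path element of its target.
func linkBase(devDir, name string) (string, error) {
	target, err := os.Readlink(filepath.Join(devDir, name))
	if err != nil {
		return "", err
	}
	return filepath.Base(target), nil
}

// discoverNaive maps IOMMU group number -> PCI addresses for every NVIDIA
// device bound to vfio-pci. It never checks whether the group is in use.
func discoverNaive(pciDevicesDir string) (map[string][]string, error) {
	entries, err := os.ReadDir(pciDevicesDir)
	if err != nil {
		return nil, err
	}
	groups := make(map[string][]string)
	for _, e := range entries {
		devDir := filepath.Join(pciDevicesDir, e.Name())
		vendor, err := os.ReadFile(filepath.Join(devDir, "vendor"))
		if err != nil {
			continue // not a readable PCI device entry
		}
		driver, err := linkBase(devDir, "driver")
		if err != nil {
			continue // no driver bound
		}
		if !isNvidiaVfioDevice(string(vendor), driver) {
			continue
		}
		group, err := linkBase(devDir, "iommu_group")
		if err != nil {
			continue // device has no IOMMU group
		}
		groups[group] = append(groups[group], e.Name())
	}
	return groups, nil
}

func main() {
	groups, err := discoverNaive("/sys/bus/pci/devices")
	if err != nil {
		fmt.Println("discovery failed:", err)
		return
	}
	for group, addrs := range groups {
		fmt.Printf("/dev/vfio/%s <- %v\n", group, addrs)
	}
}
```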

Fix

Cheap, accurate filter in discovery: try to open /dev/vfio/<group> with O_RDWR | O_NONBLOCK | O_CLOEXEC. EBUSY means the group is held by another process — skip the device. Any other error (ENOENT, EACCES, ...) is treated as not-busy so a transient sysfs glitch does not silently shrink the advertised pool.
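The probe described above can be sketched as follows. This is an illustrative version rather than the code in the PR: the signature (a device directory plus a group name) is an assumption made so the function can be pointed at a test directory, and only EBUSY is treated as "held", as the description specifies.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"syscall"
)

// isVfioGroupBusy tries to open <devDir>/<group> and reports true only when
// the kernel answers EBUSY. Every other outcome, including ENOENT and EACCES,
// is reported as not busy so a transient failure does not shrink the pool.
// The (devDir, group) signature is an assumption for this sketch.
func isVfioGroupBusy(devDir, group string) bool {
	path := filepath.Join(devDir, group)
	fd, err := syscall.Open(path, syscall.O_RDWR|syscall.O_NONBLOCK|syscall.O_CLOEXEC, 0)
	if err != nil {
		return errors.Is(err, syscall.EBUSY)
	}
	// The open succeeded, so nobody else holds the group. Close immediately
	// so the probe itself does not keep the group busy.
	_ = syscall.Close(fd)
	return false
}

func main() {
	fmt.Println("group 51 busy:", isVfioGroupBusy("/dev/vfio", "51"))
}
```

A successful probe holds the group for the short window between open and close, and the answer can be stale by the time qemu opens the group, so the check narrows the failure window rather than eliminating it.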

Tradeoff (V1 scope)

A GPU that frees up (tenant VM stops) stays bound to vfio-pci, and the existing vfio_watcher only fires on new vfio-pci bindings, not on releases. A manual plugin restart is therefore currently required to return a freed GPU to the pool. That is strictly preferable to the current crashloop, and a follow-up will address it by adding a periodic isVfioGroupBusy poll in healthCheck.
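The follow-up is not part of this PR, but the core of such a poll could look roughly like the sketch below. Everything here is hypothetical: diffBusy and busyProbe are illustrative names, and the probe is injected as a function so the comparison logic can be exercised without real /dev/vfio nodes.

```go
package main

import (
	"fmt"
	"sort"
)

// busyProbe reports whether a VFIO group is currently held by a process.
// In the plugin this role would be played by the isVfioGroupBusy check.
type busyProbe func(group string) bool

// diffBusy re-probes every known group, updates state in place, and returns
// the groups that were released (freed) and newly held (taken) since the
// previous poll. Both slices are sorted for stable output.
func diffBusy(state map[string]bool, probe busyProbe) (freed, taken []string) {
	for group, wasBusy := range state {
		nowBusy := probe(group)
		switch {
		case wasBusy && !nowBusy:
			freed = append(freed, group)
		case !wasBusy && nowBusy:
			taken = append(taken, group)
		}
		state[group] = nowBusy
	}
	sort.Strings(freed)
	sort.Strings(taken)
	return freed, taken
}

func main() {
	// Last known state: group 51 held by a tenant VM, group 52 free.
	state := map[string]bool{"51": true, "52": false}
	// Fake probe: the tenant VM on group 51 has since stopped.
	probe := func(group string) bool { return false }
	freed, taken := diffBusy(state, probe)
	fmt.Println("freed:", freed, "taken:", taken)
}
```

A health-check loop would call something like diffBusy on a timer and report the groups in freed as available again, removing the need for a manual restart.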

Tests

Three new ginkgo specs in pkg/device_plugin/vfio_busy_test.go:

- missing /dev/vfio/<group> reports not busy (transient ENOENT must not shrink the pool)
- opening a free group reports not busy
- createIommuDeviceMap skips busy groups and still picks up free ones

$ go test ./pkg/device_plugin/ -run TestDevicePlugin -ginkgo.focus 'vfio-busy'
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 68 Skipped

Full suite still green.

Rollout

1. Build new image via Build Hive Image workflow_dispatch on this branch.
2. Bump the image tag in hive-compute-ops/clusters/dev-supply/infrastructure.yaml, validate; then staging-supply; then production-supply.
3. On production-supply rollout, the daemonset on cptcan02 will restart and re-discover, this time advertising only the GPUs whose /dev/vfio/<group> is currently free.
4. Un-halt customer VM 1d35a850-... (it should now start cleanly on cptcan02).

Tracks COMP-2335.

After the late-binding-discovery rollout to production, customer VM 1d35a850-... crashed on every retry with:

  vfio 0000:23:00.0: Could not open '/dev/vfio/51': Device or resource busy

The discovery walk in createIommuDeviceMap added every NVIDIA GPU whose driver was vfio-pci to the advertised pool, including GPUs that were already passed through to other tenant VMs on the same node. The Allocate() handler then handed those PCI IDs out to the new pod, whose qemu failed to open /dev/vfio/<group> because another qemu held the group exclusively (vfio_group_fops_open returns EBUSY when group->opened is already 1).

Cheap, accurate filter: try a non-blocking open(/dev/vfio/<group>, O_RDWR|O_NONBLOCK). EBUSY means a tenant holds it -- skip the device. Any other error (ENOENT, EACCES, ...) is treated as not-busy so a transient sysfs glitch does not silently shrink the advertised pool.

Side effect: a GPU that frees up (tenant VM stops) stays bound to vfio-pci and the existing vfio_watcher does not fire on stop events, so a manual plugin restart is currently needed for it to re-enter the pool. That tradeoff is preferable to the current crashloop and will be followed up with a healthCheck poll in a separate change.

Tested with ginkgo focus 'vfio-busy' -- three new specs cover:

- missing /dev/vfio/<group> reports not busy (transient ENOENT)
- opening a free group reports not busy
- discovery skips busy groups and still picks up free ones

V1 of this patch skipped busy GPUs at discovery, which kept Allocate() from handing them out but also shrank kubelet's advertised capacity below the count of pre-existing allocations. With six GPUs already checkpointed to running tenant pods and capacity now reported as two, kubelet computed "free = capacity - allocated = -4" and refused to schedule any new GPU pod on the node even though there genuinely were two free GPUs.

Switch to advertising every vfio-pci-bound NVIDIA GPU and setting Health=Unhealthy on those whose /dev/vfio/<group> is currently held. Kubelet's contract for Unhealthy devices is exactly what we need: they count toward capacity, existing pod allocations on them are preserved across plugin restarts, but Allocate() will not give them to new pods. The new pod gets one of the truly free, Healthy devices and qemu opens /dev/vfio/<group> cleanly.

Test updated: createDevicePlugins now keeps the busy device in the plugin's pool with Unhealthy, and the free one with Healthy.
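The difference between the two versions can be illustrated with a small sketch. The device type and the health strings below are local stand-ins for the device plugin API's device record, and markHealth and countHealthy are hypothetical names. The eight-GPU node is inferred from the numbers in the commit message (six allocated plus two free).

```go
package main

import "fmt"

// Local stand-ins for the device plugin API's health values.
const (
	healthy   = "Healthy"
	unhealthy = "Unhealthy"
)

// device is a simplified stand-in for the advertised device record.
type device struct {
	ID     string
	Health string
}

// markHealth advertises every group and marks the held ones Unhealthy
// instead of dropping them, which is the V2 behaviour described above.
func markHealth(groups []string, busy func(group string) bool) []device {
	devs := make([]device, 0, len(groups))
	for _, g := range groups {
		h := healthy
		if busy(g) {
			h = unhealthy
		}
		devs = append(devs, device{ID: g, Health: h})
	}
	return devs
}

// countHealthy returns how many advertised devices can go to new pods.
func countHealthy(devs []device) int {
	n := 0
	for _, d := range devs {
		if d.Health == healthy {
			n++
		}
	}
	return n
}

func main() {
	groups := []string{"50", "51", "52", "53", "54", "55", "56", "57"}
	held := map[string]bool{"50": true, "51": true, "52": true, "53": true, "54": true, "55": true}
	busy := func(g string) bool { return held[g] }

	allocated := len(held)

	// V1: busy groups are skipped, so only the free ones are advertised.
	v1Capacity := len(groups) - allocated
	fmt.Println("V1 free = capacity - allocated =", v1Capacity-allocated)

	// V2: everything is advertised, busy groups are marked Unhealthy.
	devs := markHealth(groups, busy)
	fmt.Println("V2 advertised:", len(devs), "healthy:", countHealthy(devs))
}
```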

artback temporarily deployed to dev May 18, 2026 15:02 — with GitHub Actions Inactive

artback had a problem deploying to dev May 18, 2026 15:49 — with GitHub Actions Failure

artback temporarily deployed to dev May 18, 2026 15:52 — with GitHub Actions Inactive

artback merged commit 857969c into fix/vfio-late-binding-discovery May 22, 2026
1 of 2 checks passed

artback merged 2 commits into fix/vfio-late-binding-discovery from fix/skip-busy-vfio-groups
