fix: skip vfio groups already held by another process during discovery #2
Merged
artback merged 2 commits into fix/vfio-late-binding-discovery on May 22, 2026
After the late-binding-discovery rollout to production, customer VM
1d35a850-... crashed on every retry with:
    vfio 0000:23:00.0: Could not open '/dev/vfio/51': Device or resource busy
The discovery walk in createIommuDeviceMap added every NVIDIA GPU
whose driver was vfio-pci to the advertised pool, including GPUs that
were already passed through to other tenant VMs on the same node. The
Allocate() handler then handed those PCI IDs out to the new pod, whose
qemu failed to open /dev/vfio/<group> because another qemu held the
group exclusively (vfio_group_fops_open returns EBUSY when
group->opened is already 1).
Cheap, accurate filter: try to open /dev/vfio/<group> with
O_RDWR|O_NONBLOCK. EBUSY means a tenant holds it -- skip the device. Any
other error (ENOENT, EACCES, ...) is treated as not-busy so a
transient sysfs glitch does not silently shrink the advertised pool.
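The probe described above could be sketched roughly as follows. This is an illustration and not the code from this PR: the name isVfioGroupBusy appears in the PR description, but its signature here (a group path in, a bool out) is an assumption, and the temp-directory paths in main only stand in for /dev/vfio/<group>. Go's os.OpenFile already sets O_CLOEXEC on Unix, so only O_NONBLOCK is passed explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// isVfioGroupBusy reports whether the VFIO group device at groupPath is
// held by another process. Only EBUSY counts as busy; every other open
// error (ENOENT, EACCES, ...) is reported as not busy so that a transient
// failure does not shrink the advertised pool.
// Hypothetical signature: the real helper in the plugin may differ.
func isVfioGroupBusy(groupPath string) bool {
	f, err := os.OpenFile(groupPath, os.O_RDWR|syscall.O_NONBLOCK, 0)
	if err != nil {
		return errors.Is(err, syscall.EBUSY)
	}
	// The open succeeded, so nobody held the group. Close immediately so
	// the probe itself does not keep the group busy.
	f.Close()
	return false
}

func main() {
	// Stand-in paths: a missing file (ENOENT) and a plain file that opens
	// cleanly. Neither is a real VFIO group.
	dir, err := os.MkdirTemp("", "vfio-probe")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	free := filepath.Join(dir, "51")
	if err := os.WriteFile(free, nil, 0o600); err != nil {
		panic(err)
	}

	fmt.Println("missing group busy:", isVfioGroupBusy(filepath.Join(dir, "missing")))
	fmt.Println("free group busy:", isVfioGroupBusy(free))
}
```

While the probe holds a free group open, a concurrent open by qemu would itself see EBUSY, which is why the sketch closes the descriptor as soon as the open returns.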
Side effect: a GPU that frees up (tenant VM stops) stays bound to
vfio-pci and the existing vfio_watcher does not fire on stop events,
so a manual plugin restart is currently needed to re-enter the pool.
That tradeoff is preferable to the current crashloop and will be
followed up with a healthCheck poll in a separate change.
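One possible shape for that follow-up poll is sketched below. Nothing here comes from the plugin: diffHealth, pollVfioGroups, healthUpdate, and the channel-based reporting are hypothetical names and design choices, and the busy probe is injected as a function so the sketch runs without a real /dev/vfio.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// healthUpdate describes one device whose health changed between polls.
// Hypothetical type, for illustration only.
type healthUpdate struct {
	DeviceID string
	Healthy  bool
}

// diffHealth probes every group once, records the result in state, and
// returns only the devices whose health differs from the previous poll.
// groups maps a device ID to its /dev/vfio/<group> path.
func diffHealth(groups map[string]string, state map[string]bool, busy func(path string) bool) []healthUpdate {
	var updates []healthUpdate
	for id, path := range groups {
		healthyNow := !busy(path)
		if prev, seen := state[id]; !seen || prev != healthyNow {
			state[id] = healthyNow
			updates = append(updates, healthUpdate{DeviceID: id, Healthy: healthyNow})
		}
	}
	return updates
}

// pollVfioGroups runs diffHealth on every tick and forwards the changes
// on out until ctx is cancelled.
func pollVfioGroups(ctx context.Context, interval time.Duration, groups map[string]string, busy func(string) bool, out chan<- healthUpdate) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	state := map[string]bool{}
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, u := range diffHealth(groups, state, busy) {
				select {
				case out <- u:
				case <-ctx.Done():
					return
				}
			}
		}
	}
}

func main() {
	// Simulate a tenant VM that holds the group and then stops.
	held := true
	busy := func(string) bool { return held }
	groups := map[string]string{"0000:23:00.0": "/dev/vfio/51"}
	state := map[string]bool{}

	fmt.Println(diffHealth(groups, state, busy)) // first poll: device is busy
	fmt.Println(diffHealth(groups, state, busy)) // nothing changed
	held = false
	fmt.Println(diffHealth(groups, state, busy)) // tenant stopped: device is free again
}
```

Reporting only transitions keeps the update stream quiet on a steady node, and a freed GPU would re-enter the pool on the next tick instead of waiting for a plugin restart.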
Tested with ginkgo focus 'vfio-busy' -- three new specs cover:
- missing /dev/vfio/<group> reports not busy (transient ENOENT)
- opening a free group reports not busy
- discovery skips busy groups and still picks up free ones
V1 of this patch skipped busy GPUs at discovery, which kept Allocate()
from handing them out but also shrank kubelet's advertised capacity
below the count of pre-existing allocations. With six GPUs already
checkpointed to running tenant pods and capacity now reported as two,
kubelet computed "free = capacity - allocated = -4" and refused to
schedule any new GPU pod on the node even though there genuinely were
two free GPUs.

Switch to advertising every vfio-pci-bound NVIDIA GPU and setting
Health=Unhealthy on those whose /dev/vfio/<group> is currently held.
Kubelet's contract for Unhealthy devices is exactly what we need: they
count toward capacity, existing pod allocations on them are preserved
across plugin restarts, but Allocate() will not give them to new pods.
The new pod gets one of the truly free, Healthy devices and qemu opens
/dev/vfio/<group> cleanly.

Test updated: createDevicePlugins now keeps the busy device in the
plugin's pool with Unhealthy, and the free one with Healthy.
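The difference between the two versions can be sketched with the numbers from this commit message (eight GPUs, six of them held). The device struct below is a local stand-in for the device plugin API's Device message so that the example runs on its own. buildDevices, countHealthy, and the PCI IDs are hypothetical; only the "Healthy" and "Unhealthy" strings match the device plugin API.

```go
package main

import "fmt"

// Health values as used by the Kubernetes device plugin API.
const (
	healthy   = "Healthy"
	unhealthy = "Unhealthy"
)

// device is a local stand-in for the device plugin API's Device message.
type device struct {
	ID     string
	Health string
}

// buildDevices advertises every discovered GPU and marks the ones whose
// VFIO group is held as Unhealthy, instead of dropping them (V1 behaviour).
func buildDevices(pciIDs []string, busy func(pciID string) bool) []device {
	devs := make([]device, 0, len(pciIDs))
	for _, id := range pciIDs {
		h := healthy
		if busy(id) {
			h = unhealthy
		}
		devs = append(devs, device{ID: id, Health: h})
	}
	return devs
}

// countHealthy returns how many advertised devices can go to new pods.
func countHealthy(devs []device) int {
	n := 0
	for _, d := range devs {
		if d.Health == healthy {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical PCI IDs: six held by tenant VMs, two free.
	ids := []string{"gpu0", "gpu1", "gpu2", "gpu3", "gpu4", "gpu5", "gpu6", "gpu7"}
	held := map[string]bool{"gpu0": true, "gpu1": true, "gpu2": true, "gpu3": true, "gpu4": true, "gpu5": true}
	busy := func(id string) bool { return held[id] }

	allocated := len(held)

	// V1: busy GPUs were skipped, so only the free ones were advertised.
	v1Capacity := len(ids) - len(held)
	fmt.Println("V1 free = capacity - allocated =", v1Capacity-allocated)

	// V2: everything is advertised, busy GPUs are marked Unhealthy.
	devs := buildDevices(ids, busy)
	fmt.Println("V2 advertised:", len(devs), "healthy:", countHealthy(devs))
}
```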
857969c
into
fix/vfio-late-binding-discovery
1 of 2 checks passed
Summary
After the late-binding-discovery rollout to production-supply (hive-compute-ops#841), customer VM 1d35a850-... could not start on cptcan02 and crashlooped on every retry with:
vfio 0000:23:00.0: Could not open '/dev/vfio/51': Device or resource busy
Root cause:
createIommuDeviceMap adds every NVIDIA GPU whose driver is vfio-pci to the advertised pool. On nodes where other tenant VMs already have GPUs passed through (their GPUs are bound to vfio-pci because of the passthrough), those GPUs end up in our pool too, but they are not actually free. The Allocate() handler then hands their PCI IDs to a new pod, and qemu fails to open /dev/vfio/<group> because the VFIO kernel driver returns EBUSY on the second open of a group whose opened refcount is already 1 (drivers/vfio/group.c: vfio_group_fops_open).
Fix
Cheap, accurate filter in discovery: try to open /dev/vfio/<group> with O_RDWR | O_NONBLOCK | O_CLOEXEC. EBUSY means the group is held by another process, so the device is skipped. Any other error (ENOENT, EACCES, ...) is treated as not-busy so a transient sysfs glitch does not silently shrink the advertised pool.
Tradeoff (V1 scope)
A GPU that frees up (tenant VM stops) stays bound to vfio-pci, and the existing vfio_watcher only fires on new vfio-pci bindings, not on releases. So a manual plugin restart is currently required to re-enter a freed GPU into the pool. That is strictly preferable to the current crashloop and will be addressed by adding a periodic isVfioGroupBusy poll in healthCheck in a follow-up.
Tests
Three new ginkgo specs in pkg/device_plugin/vfio_busy_test.go:
- missing /dev/vfio/<group> reports not busy (transient ENOENT must not shrink the pool)
- opening a free group reports not busy
- createIommuDeviceMap skips busy groups and still picks up free ones
Full suite still green.
Rollout
- Build Hive Image workflow_dispatch on this branch.
- hive-compute-ops/clusters/dev-supply/infrastructure.yaml, validate; then staging-supply; then production-supply.
- cptcan02 will restart and re-discover, this time advertising only the GPUs whose /dev/vfio/<group> is currently free.
- 1d35a850-...: it should now start cleanly on cptcan02.
Tracks COMP-2335.