fix: skip vfio groups already held by another process during discovery #2

Merged
artback merged 2 commits into fix/vfio-late-binding-discovery from fix/skip-busy-vfio-groups
May 22, 2026

artback commented May 18, 2026

Summary

After the late-binding-discovery rollout to production-supply (hive-compute-ops#841), customer VM 1d35a850-... could not start on cptcan02 and crashlooped every retry with:

qemu-kvm: -device {"driver":"vfio-pci","host":"0000:23:00.0",...}:
  vfio 0000:23:00.0: Could not open '/dev/vfio/51': Device or resource busy

Root cause: createIommuDeviceMap adds every NVIDIA GPU whose driver is vfio-pci to the advertised pool. On nodes where other tenant VMs already have GPUs passed through (their GPUs are bound to vfio-pci because of the passthrough), those GPUs end up in our pool too — but they are not actually free. The Allocate() handler then hands their PCI IDs to a new pod, and qemu fails to open /dev/vfio/<group> because the VFIO kernel driver returns EBUSY on the second open of a group whose opened refcount is already 1 (drivers/vfio/group.c:vfio_group_fops_open).

Fix

Cheap, accurate filter in discovery: try to open /dev/vfio/<group> with O_RDWR | O_NONBLOCK | O_CLOEXEC. EBUSY means the group is held by another process — skip the device. Any other error (ENOENT, EACCES, ...) is treated as not-busy so a transient sysfs glitch does not silently shrink the advertised pool.

Tradeoff (V1 scope)

A GPU that frees up (tenant VM stops) stays bound to vfio-pci, and the existing vfio_watcher fires only on new vfio-pci bindings, not on releases. So a manual plugin restart is currently required to return a freed GPU to the pool. That is strictly preferable to the current crashloop and will be addressed in a follow-up by adding a periodic isVfioGroupBusy poll to healthCheck.
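
The follow-up poll is not part of this PR. As a rough, hypothetical sketch of what it might compute on each tick, the helper below re-probes every known group and reports which ones went from busy to free. The name `refreshBusy` and the injected `probe` callback are assumptions, not existing plugin code.

```go
package main

import "sort"

// refreshBusy re-probes every group in prev and returns the new busy state
// together with the sorted list of groups that were busy on the previous
// poll and are free now. probe stands in for a busy check such as
// isVfioGroupBusy.
func refreshBusy(prev map[string]bool, probe func(group string) bool) (next map[string]bool, freed []string) {
	next = make(map[string]bool, len(prev))
	for group, wasBusy := range prev {
		busy := probe(group)
		next[group] = busy
		if wasBusy && !busy {
			freed = append(freed, group)
		}
	}
	sort.Strings(freed) // map iteration order is random; sort for stable output
	return next, freed
}
```

In a health-check loop this would presumably be driven by a `time.Ticker`, with each entry in `freed` triggering a device-list update to kubelet.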

Tests

Three new ginkgo specs in pkg/device_plugin/vfio_busy_test.go:

  • missing /dev/vfio/<group> reports not busy (transient ENOENT must not shrink the pool)
  • opening a free group reports not busy
  • createIommuDeviceMap skips busy groups and still picks up free ones

$ go test ./pkg/device_plugin/ -run TestDevicePlugin -ginkgo.focus 'vfio-busy'
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 68 Skipped

Full suite still green.

Rollout

  1. Build new image via Build Hive Image workflow_dispatch on this branch.
  2. Bump the image tag in hive-compute-ops/clusters/dev-supply/infrastructure.yaml, validate; then staging-supply; then production-supply.
  3. On production-supply rollout, the daemonset on cptcan02 will restart and re-discover, this time advertising only the GPUs whose /dev/vfio/<group> is currently free.
  4. Un-halt customer VM 1d35a850-... — it should now start cleanly on cptcan02.

Tracks COMP-2335.

After the late-binding-discovery rollout to production, customer VM
1d35a850-... crashed every retry with

    vfio 0000:23:00.0: Could not open '/dev/vfio/51': Device or resource busy

The discovery walk in createIommuDeviceMap added every NVIDIA GPU
whose driver was vfio-pci to the advertised pool, including GPUs that
were already passed through to other tenant VMs on the same node. The
Allocate() handler then handed those PCI IDs out to the new pod, whose
qemu failed to open /dev/vfio/<group> because another qemu held the
group exclusively (vfio_group_fops_open returns EBUSY when
group->opened is already 1).

Cheap, accurate filter: try open(/dev/vfio/<group>, O_RDWR|O_NONBLOCK).
EBUSY means a tenant holds it -- skip the device. Any
other error (ENOENT, EACCES, ...) is treated as not-busy so a
transient sysfs glitch does not silently shrink the advertised pool.

Side effect: a GPU that frees up (tenant VM stops) stays bound to
vfio-pci and the existing vfio_watcher does not fire on stop events,
so a manual plugin restart is currently needed to re-enter the pool.
That tradeoff is preferable to the current crashloop and will be
followed up with a healthCheck poll in a separate change.

Tested with ginkgo focus 'vfio-busy' -- three new specs cover:
 - missing /dev/vfio/<group> reports not busy (transient ENOENT)
 - opening a free group reports not busy
 - discovery skips busy groups and still picks up free ones

V1 of this patch skipped busy GPUs at discovery, which kept Allocate()
from handing them out but also shrank kubelet's advertised capacity
below the count of pre-existing allocations. With six GPUs already
checkpointed to running tenant pods and capacity now reported as two,
kubelet computed "free = capacity - allocated = -4" and refused to
schedule any new GPU pod on the node even though there genuinely were
two free GPUs.

Switch to advertising every vfio-pci-bound NVIDIA GPU and setting
Health=Unhealthy on those whose /dev/vfio/<group> is currently held.
Kubelet's handling of Unhealthy devices is exactly what we need:
they count toward capacity but not allocatable, existing pod
allocations on them are preserved across plugin restarts, and kubelet
will not select them for new pods. The new pod gets one of the truly
free, Healthy devices and qemu opens /dev/vfio/<group> cleanly.

Test updated: createDevicePlugins now keeps the busy device in the
plugin's pool with Unhealthy, and the free one with Healthy.
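
The capacity arithmetic behind the switch from V1 to V2 can be illustrated with a small sketch. `Device`, `Healthy`, and `Unhealthy` below are local stand-ins shaped like the device plugin API's device type and health strings. `advertise` and `counts` are hypothetical helpers, not functions from this repository.

```go
package main

// Local stand-ins for the device plugin API's health strings and Device
// type, defined here only so the sketch compiles without that dependency.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

type Device struct {
	ID     string
	Health string
}

// advertise lists every discovered device and marks the ones whose VFIO
// group is currently held as Unhealthy instead of dropping them (V2).
func advertise(ids []string, busy func(id string) bool) []Device {
	devs := make([]Device, 0, len(ids))
	for _, id := range ids {
		health := Healthy
		if busy(id) {
			health = Unhealthy
		}
		devs = append(devs, Device{ID: id, Health: health})
	}
	return devs
}

// counts mirrors kubelet's accounting: capacity counts every advertised
// device, allocatable counts only the Healthy ones.
func counts(devs []Device) (capacity, allocatable int) {
	for _, d := range devs {
		capacity++
		if d.Health == Healthy {
			allocatable++
		}
	}
	return capacity, allocatable
}
```

With eight devices of which six are held, this yields capacity 8 and allocatable 2, so the six existing allocations fit within capacity. Under V1 the same node advertised capacity 2 against six allocations, which is where the -4 came from.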
artback merged commit 857969c into fix/vfio-late-binding-discovery May 22, 2026
1 of 2 checks passed