fix: incremental late-binding discovery — no process restart by artback · Pull Request #3 · HiveNetCode/kubevirt-gpu-device-plugin

artback · 2026-05-18T16:24:56Z

What

Replaces the "exit the process on new vfio-pci binding so kubelet restarts us" mechanism with an in-process incremental update — the watcher calls a callback that registers the BDF in the in-memory maps and pushes it into the live ListAndWatch stream of the relevant plugin. Process stays up. Kubelet never sees a disconnect.

Why

The exit-and-restart approach fixed the original symptom (advertising GPUs bound after startup), but on a multi-tenant GPU node it broke kubelet's device-manager accounting. During the disconnect/reconnect window kubelet must reconcile its pre-restart device set against the post-restart ListAndWatch frames, and in-flight allocations to busy GPUs could be either dropped or handed out twice. Symptom in production: qemu failed to open /dev/vfio/<group> with EBUSY, a customer VM was stuck for three days, and multiple production nodes were left with negative free counts once the attempted "skip busy" / "mark Unhealthy" fixes piled up. Full incident in COMP-2335.

Changes

Extract registerVfioBdf(bdf) from createIommuDeviceMap so a single PCI address can be inspected and added to the in-memory maps outside the initial walk.
Guard iommuMap / deviceMap / bdfToIommuMap with mapsMu; expose snapshot getters so callers iterate copies, not the live maps.
Add pluginRegistry mapping device-id → live GenericDevicePlugin so the watcher's callback can reach the right plugin.
Add GenericDevicePlugin.AddDevice + an update channel; ListAndWatch listens for update events and re-publishes its devs to kubelet.
Refactor watchVfioBindings: drop close(stop) in favour of an injected onNewVfioBdf callback (default: onLateVfioBinding → registerVfioBdf → plugin.AddDevice).
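
A minimal sketch of how the registry, AddDevice and the update channel listed above could fit together. This is not the PR's code: the `Device` type, the `send` function standing in for the gRPC stream, the `lookup` helper and the device id `2330` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Device is a simplified stand-in for the entry a plugin reports to kubelet.
type Device struct{ ID, Health string }

// GenericDevicePlugin is a hypothetical, stripped-down plugin: a device list
// plus an update channel that wakes the ListAndWatch loop.
type GenericDevicePlugin struct {
	mu     sync.Mutex
	devs   []Device
	update chan struct{}
}

func NewGenericDevicePlugin() *GenericDevicePlugin {
	// Capacity 1 lets a burst of additions coalesce into one pending wake-up.
	return &GenericDevicePlugin{update: make(chan struct{}, 1)}
}

// AddDevice records bdf once and wakes the stream. It reports whether the
// device was new, so repeated calls for the same BDF are harmless.
func (p *GenericDevicePlugin) AddDevice(bdf string) bool {
	p.mu.Lock()
	for _, d := range p.devs {
		if d.ID == bdf {
			p.mu.Unlock()
			return false
		}
	}
	p.devs = append(p.devs, Device{ID: bdf, Health: "Healthy"})
	p.mu.Unlock()
	select {
	case p.update <- struct{}{}:
	default: // a wake-up is already pending
	}
	return true
}

// Snapshot returns a copy so callers never iterate the live slice.
func (p *GenericDevicePlugin) Snapshot() []Device {
	p.mu.Lock()
	defer p.mu.Unlock()
	return append([]Device(nil), p.devs...)
}

// ListAndWatch sends the current list, then one full frame per update, until
// stop is closed. send stands in for the gRPC stream's Send method.
func (p *GenericDevicePlugin) ListAndWatch(stop <-chan struct{}, send func([]Device) error) error {
	if err := send(p.Snapshot()); err != nil {
		return err
	}
	for {
		select {
		case <-stop:
			return nil
		case <-p.update:
			if err := send(p.Snapshot()); err != nil {
				return err
			}
		}
	}
}

// pluginRegistry maps a PCI device id to the live plugin serving that model.
type pluginRegistry struct {
	mu      sync.RWMutex
	plugins map[string]*GenericDevicePlugin
}

func (r *pluginRegistry) lookup(deviceID string) (*GenericDevicePlugin, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.plugins[deviceID]
	return p, ok
}

func main() {
	p := NewGenericDevicePlugin()
	reg := &pluginRegistry{plugins: map[string]*GenericDevicePlugin{"2330": p}}
	frames := make(chan []Device, 4)
	stop, done := make(chan struct{}), make(chan error, 1)
	go func() {
		done <- p.ListAndWatch(stop, func(d []Device) error { frames <- d; return nil })
	}()
	fmt.Println("initial frame:", len(<-frames), "devices")
	if plugin, ok := reg.lookup("2330"); ok {
		plugin.AddDevice("0000:3b:00.0") // late binding reaches the right plugin
	}
	fmt.Println("next frame:", len(<-frames), "devices")
	close(stop)
	<-done
}
```

Each frame carries the full device list rather than a delta, which matches how kubelet treats ListAndWatch responses. The non-blocking send in AddDevice means a slow stream never stalls the watcher.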

Result

Plugin process stays up across late bindings.
Kubelet sees a strictly-growing device list (incremental ListAndWatch frames).
No checkpoint reconciliation. No collision risk for new pod allocations on nodes with running GPU passthrough tenants.

Trade-offs / known limits

A never-before-seen device id (e.g. a new GPU model hot-plugged into a node that didn't have any of that model) still needs a daemonset restart to spin up a new GenericDevicePlugin for it — the watcher currently logs and waits. Acceptable for v3; can be addressed later by dynamically starting a new plugin from the callback.
A GPU released by a tenant (vfio-pci stays bound, but the device becomes free) is likewise not re-published as Healthy without a restart. Same story: acceptable for v3, can be addressed later.

Tests

vfio_watcher_test.go: the two specs that previously asserted stop is closed now assert the callback receives the BDF; "ignores control entries" / "ignores non-NVIDIA vendors" assert the callback is NOT invoked.
incremental_test.go (new): covers AddDevice idempotence + the full onLateVfioBinding → plugin flow with mocked sysfs accessors.

$ go test -count=1 -timeout 120s ./pkg/device_plugin/
ok      kubevirt-gpu-device-plugin/pkg/device_plugin    14.699s
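
The watcher specs described above could be exercised against a filter shaped roughly like the following. This is a sketch under assumptions, not the plugin's code: `handleDriverEntry` and the injected `readVendor` accessor are hypothetical names. The only outside facts used are the PCI address layout and NVIDIA's vendor id `0x10de`, which sysfs prints with a trailing newline.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bdfPattern matches a full PCI address: domain:bus:device.function.
var bdfPattern = regexp.MustCompile(`^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$`)

const nvidiaVendorID = "0x10de" // NVIDIA's PCI vendor id as sysfs prints it

// handleDriverEntry is a hypothetical helper: given a name that appeared in
// the vfio-pci driver directory, it invokes onNewVfioBdf only for NVIDIA
// devices. readVendor is an injected accessor so tests can mock sysfs.
func handleDriverEntry(name string, readVendor func(bdf string) (string, error), onNewVfioBdf func(bdf string)) {
	if !bdfPattern.MatchString(name) {
		return // control entries such as bind, unbind, new_id
	}
	vendor, err := readVendor(name)
	if err != nil || !strings.EqualFold(strings.TrimSpace(vendor), nvidiaVendorID) {
		return // unreadable or non-NVIDIA device
	}
	onNewVfioBdf(name)
}

func main() {
	vendors := map[string]string{
		"0000:3b:00.0": "0x10de\n", // NVIDIA
		"0000:00:1f.6": "0x8086\n", // Intel
	}
	readVendor := func(bdf string) (string, error) {
		v, ok := vendors[bdf]
		if !ok {
			return "", fmt.Errorf("no vendor file for %s", bdf)
		}
		return v, nil
	}
	var got []string
	record := func(bdf string) { got = append(got, bdf) }
	for _, name := range []string{"bind", "unbind", "0000:00:1f.6", "0000:3b:00.0"} {
		handleDriverEntry(name, readVendor, record)
	}
	fmt.Println(got)
}
```

Because the callback is a plain function argument, a spec can assert both that it received the BDF and that it was never invoked, without observing a stop channel.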

Rollout plan

Staging only first. Don't touch production until we've watched cpttel008 (2 tenants × 4 GPUs each = 8/8 in use) recover correctly:

Build image via Build Hive Image workflow_dispatch on this branch.
Bump tag in clusters/staging-supply/infrastructure.yaml only.
Watch daemonset roll on cpttel008. Expected: capacity=8, allocatable=8, both tenants stay Running.
Trigger a synthetic late-binding event (or wait for one naturally) and confirm new BDFs appear in the next ListAndWatch frame without a pod restart.
If happy, separately propose prod rollout in a scheduled PR — not a hot-fix.

Earlier versions of the late-binding-discovery feature triggered a process exit on every new vfio-pci binding so kubelet would restart the daemonset and re-run the initial discovery walk. That worked for the original symptom (advertise GPUs bound after startup) but on a multi-tenant GPU node it broke kubelet's device-manager accounting: during the plugin disconnect/reconnect window kubelet must reconcile its pre-restart device set against the post-restart ListAndWatch frames, and in-flight allocations to busy GPUs could be either dropped or double-handed-out. Symptom: qemu fails to open /dev/vfio/<group> with EBUSY on cptcan02, customer VM stuck for three days, three production nodes left with negative free counts once attempted fixes piled up. Detail in COMP-2335.

v3 keeps the watcher but makes it incrementally publish the new device through the live ListAndWatch stream instead of exiting the process. Concretely:

- Extract registerVfioBdf(bdf) from createIommuDeviceMap so a single PCI address can be inspected and added to the in-memory maps outside the initial walk.
- Guard iommuMap / deviceMap / bdfToIommuMap with mapsMu; expose snapshot getters so callers iterate copies, not the live maps.
- Add pluginRegistry mapping device-id -> live GenericDevicePlugin so the watcher's callback can reach the right plugin.
- Add GenericDevicePlugin.AddDevice + an update channel; ListAndWatch listens for update events and re-publishes its devs to kubelet.
- Refactor watchVfioBindings: drop close(stop) in favour of an injected onNewVfioBdf callback (default: onLateVfioBinding -> registerVfioBdf -> plugin.AddDevice).

Result: the plugin process stays up across late bindings, kubelet sees a strictly-growing device list, no checkpoint reconciliation, no collision risk for new pod allocations on nodes with running GPU passthrough tenants.

Tests: existing watcher behaviour suite is updated to assert on the callback instead of the stop channel; new incremental_test.go covers AddDevice idempotence and the full onLateVfioBinding -> plugin flow. Full suite green.

- Drop the package-level var onNewVfioBdf callback indirection: pass the callback explicitly to watchVfioBindings. Tests inject their own callback rather than mutating a package var.
- Split registerVfioBdf into inspectVfioPciDevice (pure sysfs read) and recordVfioDevice (mutex-guarded map mutation). Single-responsibility helpers; the outer registerVfioBdf wires them.
- Use done-channel waits in vfio_watcher_test so the previous test's watcher goroutine fully exits before BeforeEach mutates package globals. Fixes a pre-existing inter-test race surfaced by -race.

go test -race -count=1 ./pkg/device_plugin/ now passes for every spec this change touches; remaining races are in the upstream vgpu test harness (generic_vgpu_device_plugin_test.go's fakeServer.Send vs the test goroutine's direct read of devs) and are not introduced or made worse by this PR.
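
The inspect/record split described above might look roughly like this. It is a sketch under assumptions: the `sysfsReader` interface, the `deviceMaps` struct, the `fakeSysfs` mock and the reduced signatures are hypothetical, and only one of the three maps is modelled. The sysfs paths follow the standard `/sys/bus/pci/devices/<bdf>/` layout, where `device` holds the device id and `iommu_group` is a symlink ending in the group number.

```go
package main

import (
	"fmt"
	"path"
	"strings"
	"sync"
)

// sysfsReader abstracts the sysfs reads so tests can mock them. The interface
// and its method names are hypothetical, not the plugin's real accessors.
type sysfsReader interface {
	ReadFile(p string) (string, error)
	Readlink(p string) (string, error)
}

type vfioDevice struct{ BDF, DeviceID, IommuGroup string }

// inspectVfioPciDevice only reads sysfs; it touches no shared state.
func inspectVfioPciDevice(fs sysfsReader, bdf string) (vfioDevice, error) {
	base := path.Join("/sys/bus/pci/devices", bdf)
	id, err := fs.ReadFile(path.Join(base, "device"))
	if err != nil {
		return vfioDevice{}, fmt.Errorf("read device id of %s: %w", bdf, err)
	}
	link, err := fs.Readlink(path.Join(base, "iommu_group"))
	if err != nil {
		return vfioDevice{}, fmt.Errorf("read iommu group of %s: %w", bdf, err)
	}
	return vfioDevice{
		BDF:        bdf,
		DeviceID:   strings.TrimPrefix(strings.TrimSpace(id), "0x"),
		IommuGroup: path.Base(link), // link target ends in the group number
	}, nil
}

// deviceMaps holds the shared state; every access goes through mu.
type deviceMaps struct {
	mu         sync.Mutex
	bdfToIommu map[string]string
}

// recordVfioDevice is the only writer. It reports whether bdf was new.
func (m *deviceMaps) recordVfioDevice(d vfioDevice) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, seen := m.bdfToIommu[d.BDF]; seen {
		return false
	}
	m.bdfToIommu[d.BDF] = d.IommuGroup
	return true
}

// snapshot returns a copy so callers iterate without holding the lock.
func (m *deviceMaps) snapshot() map[string]string {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string]string, len(m.bdfToIommu))
	for k, v := range m.bdfToIommu {
		out[k] = v
	}
	return out
}

// registerVfioBdf wires the pure read to the guarded write.
func registerVfioBdf(fs sysfsReader, m *deviceMaps, bdf string) (bool, error) {
	d, err := inspectVfioPciDevice(fs, bdf)
	if err != nil {
		return false, err
	}
	return m.recordVfioDevice(d), nil
}

// fakeSysfs is an in-memory sysfsReader for tests and the demo below.
type fakeSysfs struct{ files, links map[string]string }

func (f fakeSysfs) ReadFile(p string) (string, error) {
	if v, ok := f.files[p]; ok {
		return v, nil
	}
	return "", fmt.Errorf("%s: no such file", p)
}

func (f fakeSysfs) Readlink(p string) (string, error) {
	if v, ok := f.links[p]; ok {
		return v, nil
	}
	return "", fmt.Errorf("%s: no such link", p)
}

func main() {
	fs := fakeSysfs{
		files: map[string]string{"/sys/bus/pci/devices/0000:3b:00.0/device": "0x2330\n"},
		links: map[string]string{"/sys/bus/pci/devices/0000:3b:00.0/iommu_group": "../../../../kernel/iommu_groups/42"},
	}
	maps := &deviceMaps{bdfToIommu: map[string]string{}}
	added, err := registerVfioBdf(fs, maps, "0000:3b:00.0")
	fmt.Println(added, err, maps.snapshot())
}
```

Keeping the sysfs read outside the lock means a slow or failing read never blocks other users of the maps, and the read can be tested with a mock alone.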

The same image artefact is promoted across dev → staging → production via GitOps tag bumps. Embedding an env_name prefix in the image tag implied that production was running a build labelled 'dev-...', which was confusing during the v3 rollout discussion and made the tag harder to defend in PR review. Drop the env_name workflow_dispatch input and the prefix. Tags now look like v1.5.0-hive-15c40906-20260518163126 regardless of which cluster they end up on. The GitHub Environment selection is pinned to 'dev' (the only one with the OVH credentials set up) so secret access keeps working. Floating 'latest' tag renamed dev-latest -> hive-latest for the same clarity reason; no GitOps file references the floating tag.

artback temporarily deployed to dev May 18, 2026 16:25 — with GitHub Actions Inactive

artback temporarily deployed to dev May 18, 2026 16:31 — with GitHub Actions Inactive

artback deployed to dev May 18, 2026 19:58 — with GitHub Actions Active

artback merged commit 9ef5131 into fix/vfio-late-binding-discovery May 22, 2026
1 check passed

artback merged 3 commits into fix/vfio-late-binding-discovery from fix/incremental-vfio-discovery
