Skip to content

fix: incremental late-binding discovery — no process restart#3

Merged
artback merged 3 commits into
fix/vfio-late-binding-discoveryfrom
fix/incremental-vfio-discovery
May 22, 2026
Merged

fix: incremental late-binding discovery — no process restart#3
artback merged 3 commits into
fix/vfio-late-binding-discoveryfrom
fix/incremental-vfio-discovery

Conversation

@artback
Copy link
Copy Markdown
Collaborator

@artback artback commented May 18, 2026

What

Replaces the "exit the process on new vfio-pci binding so kubelet restarts us" mechanism with an in-process incremental update — the watcher calls a callback that registers the BDF in the in-memory maps and pushes it into the live ListAndWatch stream of the relevant plugin. Process stays up. Kubelet never sees a disconnect.

Why

The exit-and-restart approach worked for the original symptom (advertise GPUs bound after startup) but on a multi-tenant GPU node it broke kubelet's device-manager accounting. During the disconnect/reconnect window kubelet must reconcile its pre-restart device set against the post-restart ListAndWatch frames, and in-flight allocations to busy GPUs could be either dropped or double-handed-out. Symptom in production: qemu fails to open /dev/vfio/<group> with EBUSY, customer VM stuck for three days, multiple production nodes left with negative-free counts once attempted "skip busy" / "mark Unhealthy" fixes piled up. Full incident in COMP-2335.

Changes

  • Extract registerVfioBdf(bdf) from createIommuDeviceMap so a single PCI address can be inspected and added to the in-memory maps outside the initial walk.
  • Guard iommuMap / deviceMap / bdfToIommuMap with mapsMu; expose snapshot getters so callers iterate copies, not the live maps.
  • Add pluginRegistry mapping device-id → live GenericDevicePlugin so the watcher's callback can reach the right plugin.
  • Add GenericDevicePlugin.AddDevice + an update channel; ListAndWatch listens for update events and re-publishes its devs to kubelet.
  • Refactor watchVfioBindings: drop close(stop) in favour of an injected onNewVfioBdf callback (default: onLateVfioBindingregisterVfioBdfplugin.AddDevice).

Result

  • Plugin process stays up across late bindings.
  • Kubelet sees a strictly-growing device list (incremental ListAndWatch frames).
  • No checkpoint reconciliation. No collision risk for new pod allocations on nodes with running GPU passthrough tenants.

Trade-offs / known limits

  • A never-before-seen device id (e.g. a new GPU model hot-plugged into a node that didn't have any of that model) still needs a daemonset restart to spin up a new GenericDevicePlugin for it — the watcher currently logs and waits. Acceptable for v3; can be addressed later by dynamically starting a new plugin from the callback.
  • GPU released by a tenant (vfio-pci stays bound, but the device becomes free) isn't re-published as Healthy without restart either. Same story.

Tests

  • vfio_watcher_test.go: the two specs that previously asserted stop is closed now assert the callback receives the BDF; "ignores control entries" / "ignores non-NVIDIA vendors" assert the callback is NOT invoked.
  • incremental_test.go (new): covers AddDevice idempotence + the full onLateVfioBinding → plugin flow with mocked sysfs accessors.
$ go test -count=1 -timeout 120s ./pkg/device_plugin/
ok      kubevirt-gpu-device-plugin/pkg/device_plugin    14.699s

Rollout plan

Staging only first. Don't touch production until we've watched cpttel008 (2 tenants × 4 GPUs each = 8/8 in use) recover correctly:

  1. Build image via Build Hive Image workflow_dispatch on this branch.
  2. Bump tag in clusters/staging-supply/infrastructure.yaml only.
  3. Watch daemonset roll on cpttel008. Expected: capacity=8, allocatable=8, both tenants stay Running.
  4. Trigger a synthetic late-binding event (or wait for one naturally) and confirm new BDFs appear in the next ListAndWatch frame without a pod restart.
  5. If happy, separately propose prod rollout in a scheduled PR — not a hot-fix.

Earlier versions of the late-binding-discovery feature triggered a
process exit on every new vfio-pci binding so kubelet would restart
the daemonset and re-run the initial discovery walk. That worked for
the original symptom (advertise GPUs bound after startup) but on a
multi-tenant GPU node it broke kubelet's device-manager accounting:
during the plugin disconnect/reconnect window kubelet must reconcile
its pre-restart device set against the post-restart ListAndWatch
frames, and in-flight allocations to busy GPUs could be either
dropped or double-handed-out. Symptom: qemu fails to open
/dev/vfio/<group> with EBUSY on cptcan02, customer VM stuck for
three days, three production nodes left with negative free counts
once attempted fixes piled up. Detail in COMP-2335.

v3 keeps the watcher but makes it incrementally publish the new
device through the live ListAndWatch stream instead of exiting the
process. Concretely:

- Extract registerVfioBdf(bdf) from createIommuDeviceMap so a single
  PCI address can be inspected and added to the in-memory maps
  outside the initial walk.
- Guard iommuMap / deviceMap / bdfToIommuMap with mapsMu; expose
  snapshot getters so callers iterate copies, not the live maps.
- Add pluginRegistry mapping device-id -> live GenericDevicePlugin
  so the watcher's callback can reach the right plugin.
- Add GenericDevicePlugin.AddDevice + an update channel; ListAndWatch
  listens for update events and re-publishes its devs to kubelet.
- Refactor watchVfioBindings: drop close(stop) in favour of an
  injected onNewVfioBdf callback (default: onLateVfioBinding ->
  registerVfioBdf -> plugin.AddDevice).

Result: the plugin process stays up across late bindings, kubelet
sees a strictly-growing device list, no checkpoint reconciliation,
no collision risk for new pod allocations on nodes with running GPU
passthrough tenants.

Tests: existing watcher behaviour suite is updated to assert on the
callback instead of the stop channel; new incremental_test.go covers
AddDevice idempotence and the full onLateVfioBinding -> plugin flow.
Full suite green.
- Drop the package-level var onNewVfioBdf callback indirection — pass
  the callback explicitly to watchVfioBindings. Tests inject their own
  callback rather than mutating a package var.
- Split registerVfioBdf into inspectVfioPciDevice (pure sysfs read)
  and recordVfioDevice (mutex-guarded map mutation). Single-
  responsibility helpers; the outer registerVfioBdf wires them.
- Use done-channel waits in vfio_watcher_test so the previous test's
  watcher goroutine fully exits before BeforeEach mutates package
  globals. Fixes a pre-existing inter-test race surfaced by -race.

go test -race -count=1 ./pkg/device_plugin/ now passes for every spec
this change touches; remaining races are in the upstream vgpu test
harness (generic_vgpu_device_plugin_test.go's fakeServer.Send vs the
test goroutine's direct read of devs) and are not introduced or made
worse by this PR.
The same image artefact is promoted across dev → staging → production
via GitOps tag bumps. Embedding an env_name prefix in the image tag
implied that production was running a build labelled 'dev-...', which
was confusing during the v3 rollout discussion and made the tag harder
to defend in PR review.

Drop the env_name workflow_dispatch input and the prefix. Tags now
look like

    v1.5.0-hive-15c40906-20260518163126

regardless of which cluster they end up on. The GitHub Environment
selection is pinned to 'dev' (the only one with the OVH credentials
set up) so secret access keeps working.

Floating 'latest' tag renamed dev-latest -> hive-latest for the same
clarity reason; no GitOps file references the floating tag.
@artback artback merged commit 9ef5131 into fix/vfio-late-binding-discovery May 22, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant