fix: incremental late-binding discovery — no process restart#3
Merged
artback merged 3 commits intoMay 22, 2026
Merged
Conversation
Earlier versions of the late-binding-discovery feature triggered a process exit on every new vfio-pci binding so kubelet would restart the daemonset and re-run the initial discovery walk. That worked for the original symptom (advertise GPUs bound after startup) but on a multi-tenant GPU node it broke kubelet's device-manager accounting: during the plugin disconnect/reconnect window kubelet must reconcile its pre-restart device set against the post-restart ListAndWatch frames, and in-flight allocations to busy GPUs could be either dropped or double-handed-out. Symptom: qemu fails to open /dev/vfio/<group> with EBUSY on cptcan02, customer VM stuck for three days, three production nodes left with negative free counts once attempted fixes piled up. Detail in COMP-2335. v3 keeps the watcher but makes it incrementally publish the new device through the live ListAndWatch stream instead of exiting the process. Concretely: - Extract registerVfioBdf(bdf) from createIommuDeviceMap so a single PCI address can be inspected and added to the in-memory maps outside the initial walk. - Guard iommuMap / deviceMap / bdfToIommuMap with mapsMu; expose snapshot getters so callers iterate copies, not the live maps. - Add pluginRegistry mapping device-id -> live GenericDevicePlugin so the watcher's callback can reach the right plugin. - Add GenericDevicePlugin.AddDevice + an update channel; ListAndWatch listens for update events and re-publishes its devs to kubelet. - Refactor watchVfioBindings: drop close(stop) in favour of an injected onNewVfioBdf callback (default: onLateVfioBinding -> registerVfioBdf -> plugin.AddDevice). Result: the plugin process stays up across late bindings, kubelet sees a strictly-growing device list, no checkpoint reconciliation, no collision risk for new pod allocations on nodes with running GPU passthrough tenants. Tests: existing watcher behaviour suite is updated to assert on the callback instead of the stop channel; new incremental_test.go covers AddDevice idempotence and the full onLateVfioBinding -> plugin flow. Full suite green.
- Drop the package-level var onNewVfioBdf callback indirection — pass the callback explicitly to watchVfioBindings. Tests inject their own callback rather than mutating a package var. - Split registerVfioBdf into inspectVfioPciDevice (pure sysfs read) and recordVfioDevice (mutex-guarded map mutation). Single- responsibility helpers; the outer registerVfioBdf wires them. - Use done-channel waits in vfio_watcher_test so the previous test's watcher goroutine fully exits before BeforeEach mutates package globals. Fixes a pre-existing inter-test race surfaced by -race. go test -race -count=1 ./pkg/device_plugin/ now passes for every spec this change touches; remaining races are in the upstream vgpu test harness (generic_vgpu_device_plugin_test.go's fakeServer.Send vs the test goroutine's direct read of devs) and are not introduced or made worse by this PR.
The same image artefact is promoted across dev → staging → production
via GitOps tag bumps. Embedding an env_name prefix in the image tag
implied that production was running a build labelled 'dev-...', which
was confusing during the v3 rollout discussion and made the tag harder
to defend in PR review.
Drop the env_name workflow_dispatch input and the prefix. Tags now
look like
v1.5.0-hive-15c40906-20260518163126
regardless of which cluster they end up on. The GitHub Environment
selection is pinned to 'dev' (the only one with the OVH credentials
set up) so secret access keeps working.
Floating 'latest' tag renamed dev-latest -> hive-latest for the same
clarity reason; no GitOps file references the floating tag.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Replaces the "exit the process on new vfio-pci binding so kubelet restarts us" mechanism with an in-process incremental update — the watcher calls a callback that registers the BDF in the in-memory maps and pushes it into the live
ListAndWatchstream of the relevant plugin. Process stays up. Kubelet never sees a disconnect.Why
The exit-and-restart approach worked for the original symptom (advertise GPUs bound after startup) but on a multi-tenant GPU node it broke kubelet's device-manager accounting. During the disconnect/reconnect window kubelet must reconcile its pre-restart device set against the post-restart ListAndWatch frames, and in-flight allocations to busy GPUs could be either dropped or double-handed-out. Symptom in production: qemu fails to open
/dev/vfio/<group>withEBUSY, customer VM stuck for three days, multiple production nodes left with negative-free counts once attempted "skip busy" / "mark Unhealthy" fixes piled up. Full incident in COMP-2335.Changes
registerVfioBdf(bdf)fromcreateIommuDeviceMapso a single PCI address can be inspected and added to the in-memory maps outside the initial walk.iommuMap/deviceMap/bdfToIommuMapwithmapsMu; expose snapshot getters so callers iterate copies, not the live maps.pluginRegistrymapping device-id → liveGenericDevicePluginso the watcher's callback can reach the right plugin.GenericDevicePlugin.AddDevice+ anupdatechannel;ListAndWatchlistens for update events and re-publishes itsdevsto kubelet.watchVfioBindings: dropclose(stop)in favour of an injectedonNewVfioBdfcallback (default:onLateVfioBinding→registerVfioBdf→plugin.AddDevice).Result
Trade-offs / known limits
GenericDevicePluginfor it — the watcher currently logs and waits. Acceptable for v3; can be addressed later by dynamically starting a new plugin from the callback.Tests
vfio_watcher_test.go: the two specs that previously assertedstopis closed now assert the callback receives the BDF; "ignores control entries" / "ignores non-NVIDIA vendors" assert the callback is NOT invoked.incremental_test.go(new): coversAddDeviceidempotence + the fullonLateVfioBinding→ plugin flow with mocked sysfs accessors.Rollout plan
Staging only first. Don't touch production until we've watched cpttel008 (2 tenants × 4 GPUs each = 8/8 in use) recover correctly:
Build Hive Imageworkflow_dispatch on this branch.clusters/staging-supply/infrastructure.yamlonly.