Fix tc class collision with retry and class ID persistence #179

Merged
sjmiller609 merged 3 commits into main from
hypeship/fix-tc-class-exists
Mar 31, 2026

@sjmiller609 commented Mar 31, 2026

Summary

Replaces tc class replace with tc class add + retry loop to properly detect and handle class ID hash collisions.

Problem: Two different TAP names can hash to the same 16-bit class ID via deriveClassID. Using tc class replace silently overwrites the existing class, breaking rate limiting for the original VM.

Solution:

  • addVMClass now uses tc class add and retries with linear probing on "File exists" errors (up to 5 attempts)
  • The actual class ID assigned is persisted to a classid file in the instance directory
  • removeVMClass and deleteTAPDevice use the stored class ID (falling back to deriveClassID for backwards compatibility with old allocations)
  • CleanupOrphanedClasses considers both derived and stored class IDs when determining which classes are valid
  • New Prometheus counter hypeman_network_tc_class_collisions_total with attempt label (initial / retry) tracks collision frequency

Changes

  • lib/network/bridge_linux.go: retry loop in addVMClass, classID threading through createTAPDevice/deleteTAPDevice/removeVMClass/CleanupOrphanedClasses
  • lib/network/bridge_darwin.go: updated stubs to match new signatures
  • lib/network/types.go: ClassID field on Allocation
  • lib/network/allocate.go: persist/load classID, pass stored ID on release
  • lib/network/derive.go: load stored classID in deriveAllocation
  • lib/network/metrics.go: hypeman_network_tc_class_collisions_total counter

Backwards compatibility

Old allocations without a classid file continue to work — removeVMClass and CleanupOrphanedClasses fall back to deriveClassID.


Note

Medium Risk
Touches Linux traffic-control setup/teardown and cleanup logic; mistakes could leak or delete the wrong tc classes and impact VM rate limiting. Changes are localized but affect host networking behavior across restarts.

Overview
Fixes a Linux upload-shaping edge case in which two TAPs could hash to the same HTB class ID and silently clobber each other. addVMClass now uses tc class add with a small collision-retry probe loop instead of tc class replace.

Threads the actual assigned class ID through TAP creation and persists it to an instance classid file, adds Allocation.ClassID, uses it during release/teardown, and updates orphaned class cleanup to treat stored (probed) IDs as valid. Adds a new metric counter hypeman_network_tc_class_collisions_total (labeled initial/retry) to track collision frequency, and updates macOS stubs for the new function signatures.

Written by Cursor Bugbot for commit 9719ac1.

@sjmiller609 force-pushed the hypeship/fix-tc-class-exists branch from f69eed2 to b942731, March 31, 2026 14:07
@sjmiller609 changed the title from "Fix RTNETLINK 'File exists' error in addVMClass" to "Retry with linear probing on tc class ID collision in addVMClass", Mar 31, 2026
@sjmiller609 marked this pull request as ready for review, March 31, 2026 14:16
@sjmiller609 changed the title from "Retry with linear probing on tc class ID collision in addVMClass" to "fix: tc class ID collision in addVMClass", Mar 31, 2026
@sjmiller609 force-pushed the hypeship/fix-tc-class-exists branch from 4af990a to ae99868, March 31, 2026 14:51
@sjmiller609 changed the title from "fix: tc class ID collision in addVMClass" to "Use tc class replace to fix RTNETLINK File exists error", Mar 31, 2026
@sjmiller609 requested a review from hiroTamada, March 31, 2026 14:53
@sjmiller609 force-pushed the hypeship/fix-tc-class-exists branch from ae99868 to 06e4b98, March 31, 2026 15:08
@sjmiller609 changed the title from "Use tc class replace to fix RTNETLINK File exists error" to "Fix tc class collision with retry and class ID persistence", Mar 31, 2026
Replace tc class replace with tc class add + retry loop to properly
handle class ID collisions. On File exists error, probe the next
class ID (up to 5 attempts). Persist the actual class ID assigned
to each allocation so removal and cleanup use the correct ID.

Changes:
- addVMClass: retry loop with linear probing on collision
- Allocation.ClassID: persisted to disk, loaded in deriveAllocation
- removeVMClass/deleteTAPDevice: use stored class ID with fallback
- CleanupOrphanedClasses: considers stored class IDs from allocations
- New metric: hypeman_network_tc_class_collisions_total (attempt label)
@sjmiller609 force-pushed the hypeship/fix-tc-class-exists branch from 06e4b98 to ce541ca, March 31, 2026 15:14
deriveClassIDVal and the probing loop now skip both 0 (invalid) and
1 (root class 1:1). The wrap-around guard checks for 0/1 after
uint16 increment instead of the unreachable > 0xFFFF comparison.
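The wrap-around guard in this commit message can be illustrated with a small sketch. `nextClassID` is a hypothetical helper name, not necessarily how the project structures it. A `uint16` can never exceed 0xFFFF, so a `> 0xFFFF` comparison is always false; the overflow instead shows up as the value wrapping to 0.

```go
package main

import "fmt"

// nextClassID returns the probe candidate that follows id, skipping
// 0 (invalid) and 1 (the root class 1:1).
func nextClassID(id uint16) uint16 {
	id++ // a uint16 wraps from 0xFFFF to 0 rather than exceeding 0xFFFF
	if id == 0 || id == 1 {
		id = 2
	}
	return id
}

func main() {
	for _, id := range []uint16{0x002a, 0xFFFE, 0xFFFF} {
		fmt.Printf("0x%04x -> 0x%04x\n", id, nextClassID(id))
	}
}
```

In this sketch 0xFFFF maps to 0x0002, so a probe sequence that starts near the top of the range continues at the first usable ID instead of landing on the root class.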

@hiroTamada left a comment

Nice fix — clean linear probing with good backwards compat and observability. Two minor nits inline.


- // Build set of class IDs that belong to existing TAP devices
+ // Build set of class IDs that belong to existing TAP devices.
+ // Include both derived class IDs and stored class IDs from allocations

nit: swallowing the error here means a transient ListAllocations failure would return an empty map, making every class look orphaned and get deleted. might be worth logging a warning and returning early on error.

output, err := cmd.CombinedOutput()
if err != nil {
    // Check for "File exists" collision (exit status 2).
    var exitErr *exec.ExitError

nit: the goto classAdded is fine and clear, but this loop could also be a small extracted function that returns (string, error) — matter of taste.

Avoids deleting valid probed classes when a transient ListAllocations
error returns an empty set of stored class IDs.
@sjmiller609 merged commit ea5e61e into main, Mar 31, 2026
6 checks passed
@sjmiller609 deleted the hypeship/fix-tc-class-exists branch, March 31, 2026 16:01

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

// Persist assigned tc class ID so removal uses the correct ID after collisions.
if classID != "" {
    m.saveClassID(req.InstanceID, classID)
}

Stale classID file not cleared when rate limiting removed

Medium Severity

Both CreateAllocation and RecreateAllocation only call saveClassID when classID != "". If an instance previously had upload rate limiting (classID file saved to disk) and is later recreated without it (e.g., RecreateAllocation with uploadBps = 0), the old classid file persists with a stale value. deriveAllocation then loads this stale classID into alloc.ClassID, causing ReleaseAllocation to call deleteTAPDevice with the wrong classID — potentially deleting another VM's HTB class.
