Skip to content

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251

Open
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/bolt-per-slot-probe-timeout
Open

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/bolt-per-slot-probe-timeout

Conversation

@buliwyf42

Copy link
Copy Markdown
Contributor

Context

Live re-testing #218 on the repro rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on one Bolt receiver) showed that #222 (NodeLedger) and #237 (one-shot retry) fix the short transient misses, but a sustained failure still drops the device list to "No devices" — details in #218. This PR fixes that remaining case.

Root cause

A single paired online device's deep feature walk (probe_features / Device::new) hangs on every cycle, burning the whole 5 s PROBE_BUDGET. Each failing cycle still reads pairing_count=Some(3) and all three slot codenames — only the deep walk hangs — but because probe_one is the unit wrapped in timeout(PROBE_BUDGET), when it fires the entire node is discarded, including the pairing-register reads that succeeded. The receiver yields nothing → GUI inventory refreshed count=0.

The tell: the Unifying slot probe already guards exactly this with UNIFYING_SLOT_PROBE (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — probe_or_reuse was awaited bare in probe_bolt_slot, so one hung Bolt device blew the whole-receiver budget.

Fix

Mirror the Unifying guard onto the Bolt path: a new BOLT_SLOT_PROBE (1.5 s) wraps the Bolt slot's probe_or_reuse; on timeout the slot falls back to its cached / identity-only data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of letting the whole probe_one time out. A hung device keeps showing with its last-known capabilities while the rest of the receiver still enumerates.

1.5 s is well above a healthy BTLE walk (sub-second) yet small enough that two simultaneously-hung slots still fit PROBE_BUDGET after the 1.5 s arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (#222).

Verification

cargo fmt --all -- --check                                   # clean
cargo clippy -p openlogi-hid --all-targets -- -D warnings     # clean
cargo test  -p openlogi-hid                                   # 61 pass

I couldn't add a unit test (the per-slot timeout needs a live/mock HID++ channel, same as the existing Unifying guard, which also has none). The change mirrors that proven pattern exactly.

Manual hardware verification still recommended: I have the repro rig but the wedge cleared on a receiver replug and I can't trigger the sustained hang on demand. The expected behaviour with this patch: when a device's deep probe hangs, the agent logs Bolt slot probe timed out; using cached data if available and the device list stays populated (the hung device keeps its last-known card) instead of dropping to "No devices".

Refs #218, #222.

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown

Greptile Summary

Mirrors the Unifying per-slot probe timeout guard onto the Bolt path. A new BOLT_SLOT_PROBE constant (1 s) wraps probe_or_reuse in probe_bolt_slot; on timeout the slot falls back to its cached or default ProbedFeatures instead of letting the whole probe_one budget expire, which previously cleared the entire device list.

  • Adds BOLT_SLOT_PROBE = Duration::from_secs(1) with a thorough comment explaining the 3-slot worst-case math (1.5 + 3×1 = 4.5 s < 5 s PROBE_BUDGET).
  • id.clone() is correctly threaded through so both probe_or_reuse and the timeout fallback seen(id) have their own ownership of the key; cache eviction is unaffected.
  • When all slots time out but all return a device entry, complete = truehealthy = true, so the one-shot enumerate retries are not triggered — an intentional design trade-off documented in the PR: devices show up with cached capabilities rather than missing entirely.

Confidence Score: 5/5

Safe to merge; the change is a targeted, well-bounded fallback that mirrors an already-proven pattern on the Unifying path and does not alter any shared data structures or success paths.

The fix wraps one await in a per-slot timeout and handles the Err branch with a cache fallback. id.clone() is correctly threaded so ownership is valid in both branches, and the seen(id) outcome prevents premature cache eviction. The 3-slot worst-case timing math (4.5 s) fits within PROBE_BUDGET (5 s), and the previous reviewer concern about 3-device all-hung timing has been addressed by reducing the constant from 1.5 s to 1 s.

No files require special attention

Important Files Changed

Filename Overview
crates/openlogi-hid/src/inventory.rs Adds per-slot BOLT_SLOT_PROBE (1 s) timeout around probe_or_reuse in probe_bolt_slot, mirroring the existing Unifying guard; on timeout falls back to cached/default probe so timed-out devices still appear in the list

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[probe_one] -->|Bolt receiver| B[probe_bolt_receiver]
    B --> C[drain_device_arrival]
    C --> D{for slot in 1..=MAX_BOLT_SLOTS}
    D --> E[probe_bolt_slot]
    E --> F[get_device_pairing_information]
    F -->|fail| G[return None]
    F -->|ok| H{timeout BOLT_SLOT_PROBE 1s}
    H -->|Ok| I[probe_or_reuse feature walk]
    H -->|Timeout| J[use cached or default probe]
    I --> K[PairedDevice]
    J --> K
    K --> D
    D -->|done| L{paired.len == pairing_count?}
    L -->|yes| M[healthy=true]
    L -->|no| N[healthy=false]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[probe_one] -->|Bolt receiver| B[probe_bolt_receiver]
    B --> C[drain_device_arrival]
    C --> D{for slot in 1..=MAX_BOLT_SLOTS}
    D --> E[probe_bolt_slot]
    E --> F[get_device_pairing_information]
    F -->|fail| G[return None]
    F -->|ok| H{timeout BOLT_SLOT_PROBE 1s}
    H -->|Ok| I[probe_or_reuse feature walk]
    H -->|Timeout| J[use cached or default probe]
    I --> K[PairedDevice]
    J --> K
    K --> D
    D -->|done| L{paired.len == pairing_count?}
    L -->|yes| M[healthy=true]
    L -->|no| N[healthy=false]
Loading

Reviews (3): Last reviewed commit: "fix(hid): bound the Bolt per-slot probe ..." | Re-trigger Greptile

Comment thread crates/openlogi-hid/src/inventory.rs Outdated
@buliwyf42 buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch 2 times, most recently from 1d42ddf to 34314ba Compare June 13, 2026 16:59
@buliwyf42

Copy link
Copy Markdown
Contributor Author

Live-tested this on the #218 rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on a Bolt receiver) and tightened it as a result.

The per-slot cap works: when a slot's deep walk hung, Bolt slot probe timed out … slot=N fired and that device kept its cached card instead of the whole receiver vanishing — confirmed live for 1–2 hung slots.

But I hit a budget bug at my first value (1.5 s). When the MX Master 3S woke so all three slots were online and hung at once, 1.5 s arrival drain + 3 × 1.5 s = 6 s overran the 5 s PROBE_BUDGET, so the outer timeout(PROBE_BUDGET, probe_one) still fired and dropped the whole node. Tightened to BOLT_SLOT_PROBE = 1 s so all three online slots fit (1.5 + 3×1 = 4.5 s); pushed. (The all-three-at-1s case is budget-derived — the wedge is non-deterministic, so I couldn't re-trigger it on demand without more churn.)

One honest caveat: this keeps the device list populated (via the cached fallback), but it doesn't cure the underlying wedge — during it the device genuinely stops answering HID++ (writes like DPI/SmartShift would still time out), and only a physical receiver replug cleared it on my rig. So #251 is graceful degradation for the symptom; the root macOS-27 IOHID wedge stays the separate #218 track.

@buliwyf42

Copy link
Copy Markdown
Contributor Author

Thanks — already addressed. After live-testing on the rig I hit exactly this: with the MX Master 3S awake, all three slots online and hung at once overran the budget at 1.5 s/slot. The current head (34314ba) tightens BOLT_SLOT_PROBE to Duration::from_secs(1), so 1.5 s drain + 3×1 s = 4.5 s fits PROBE_BUDGET — your review picked up the earlier 1.5 s push. (Empty slots fail fast on UnknownDevice without a deep probe, so the worst realistic case is the 3 paired devices, not 6 slots.)

… the whole receiver (AprilNEA#218)

On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.

The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.

Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).

Refs AprilNEA#218, AprilNEA#222.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@buliwyf42 buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch from 34314ba to 35f2993 Compare June 19, 2026 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant