fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218) #251
Greptile Summary
Mirrors the Unifying per-slot probe timeout guard onto the Bolt path. A new
Confidence Score: 5/5
Safe to merge; the change is a targeted, well-bounded fallback that mirrors an already-proven pattern on the Unifying path and does not alter any shared data structures or success paths. The fix wraps one await in a per-slot timeout and handles the Err branch with a cache fallback. id.clone() is correctly threaded so ownership is valid in both branches, and the seen(id) outcome prevents premature cache eviction. The 3-slot worst-case timing math (4.5 s) fits within PROBE_BUDGET (5 s), and the previous reviewer concern about 3-device all-hung timing has been addressed by reducing the constant from 1.5 s to 1 s. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[probe_one] -->|Bolt receiver| B[probe_bolt_receiver]
B --> C[drain_device_arrival]
C --> D{for slot in 1..=MAX_BOLT_SLOTS}
D --> E[probe_bolt_slot]
E --> F[get_device_pairing_information]
F -->|fail| G[return None]
F -->|ok| H{timeout BOLT_SLOT_PROBE 1s}
H -->|Ok| I[probe_or_reuse feature walk]
H -->|Timeout| J[use cached or default probe]
I --> K[PairedDevice]
J --> K
K --> D
D -->|done| L{paired.len == pairing_count?}
L -->|yes| M[healthy=true]
L -->|no| N[healthy=false]
Reviews (3): Last reviewed commit: "fix(hid): bound the Bolt per-slot probe ..."
1d42ddf to 34314ba
Live-tested this on the #218 rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on a Bolt receiver) and tightened it as a result.

The per-slot cap works: when a slot's deep walk hung,

But I hit a budget bug at my first value (1.5 s). When the MX Master 3S woke so all three slots were online and hung at once,

One honest caveat: this keeps the device list populated (via the cached fallback), but it doesn't cure the underlying wedge. During it the device genuinely stops answering HID++ (writes like DPI/SmartShift would still time out), and only a physical receiver replug cleared it on my rig. So #251 is graceful degradation for the symptom; the root macOS-27 IOHID wedge stays the separate #218 track.
Thanks, already addressed. After live-testing on the rig I hit exactly this: with the MX Master 3S awake, all three slots online and hung at once overran the budget at 1.5 s/slot. The current head (
… the whole receiver (AprilNEA#218)

On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node, including the pairing-register reads that succeeded, so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it.

The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent: `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget.

Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates.

1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222).

Refs AprilNEA#218, AprilNEA#222.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
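The budget reasoning in this commit message can be checked with a little arithmetic. The sketch below is a simplification, not code from the PR: it assumes the 1.5 s arrival drain quoted in the PR description, counts only the drain plus the hung slots' caps (healthy slot walks and pairing-register reads are ignored), and `worst_case` / `fits_budget` are hypothetical helper names.

```rust
use std::time::Duration;

/// Whole-receiver budget named in the PR (5 s).
const PROBE_BUDGET: Duration = Duration::from_millis(5_000);
/// Arrival drain duration, assumed from the PR description (1.5 s).
const ARRIVAL_DRAIN: Duration = Duration::from_millis(1_500);

/// Hypothetical helper: time spent when `hung_slots` slots each run
/// into the per-slot cap, after the arrival drain has completed.
fn worst_case(per_slot_cap: Duration, hung_slots: u32) -> Duration {
    ARRIVAL_DRAIN + per_slot_cap * hung_slots
}

/// Hypothetical helper: does that worst case still fit the budget?
fn fits_budget(per_slot_cap: Duration, hung_slots: u32) -> bool {
    worst_case(per_slot_cap, hung_slots) <= PROBE_BUDGET
}
```

Under these assumptions a 1.5 s cap fits two hung slots (4.5 s) but not three (6 s), while a 1 s cap fits three (4.5 s), which matches the overrun reported in the comments and the later reduction of the constant.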
34314ba to 35f2993
Context
Live re-testing #218 on the repro rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on one Bolt receiver) showed that #222 (`NodeLedger`) and #237 (one-shot retry) fix the short transient misses, but a sustained failure still drops the device list to "No devices"; details are in #218. This PR fixes that remaining case.

Root cause
A single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs on every cycle, burning the whole 5 s `PROBE_BUDGET`. Each failing cycle still reads `pairing_count=Some(3)` and all three slot codenames (only the deep walk hangs), but because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, when it fires the entire node is discarded, including the pairing-register reads that succeeded. The receiver yields nothing → GUI `inventory refreshed count=0`.

The tell: the Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent: `probe_or_reuse` was awaited bare in `probe_bolt_slot`, so one hung Bolt device blew the whole-receiver budget.

Fix
Mirror the Unifying guard onto the Bolt path: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity-only data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of letting the whole `probe_one` time out. A hung device keeps showing with its last-known capabilities while the rest of the receiver still enumerates.

1.5 s is well above a healthy BTLE walk (sub-second) yet small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the 1.5 s arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (#222).

Verification
I couldn't add a unit test (the per-slot timeout needs a live/mock HID++ channel, same as the existing Unifying guard, which also has none). The change mirrors that proven pattern exactly.

Manual hardware verification still recommended: I have the repro rig but the wedge cleared on a receiver replug and I can't trigger the sustained hang on demand. The expected behaviour with this patch: when a device's deep probe hangs, the agent logs `Bolt slot probe timed out; using cached data if available` and the device list stays populated (the hung device keeps its last-known card) instead of dropping to "No devices".

Refs #218, #222.
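For readers who want the shape of the per-slot guard described under "Fix", here is a minimal std-only sketch. It is not the PR's code: the PR wraps an async `probe_or_reuse` in `timeout(...)`, whereas this illustration uses a helper thread and `recv_timeout` so it runs without an async runtime. `Probe` and `probe_slot_bounded` are hypothetical names invented for the example.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the capability data a deep feature walk returns.
#[derive(Clone, Debug, PartialEq)]
struct Probe {
    features: Vec<String>,
}

/// Run `walk` on a helper thread and wait at most `cap` for its result.
/// If the walk does not finish in time (or panics), fall back to `cached`,
/// or to an empty default, so the slot still yields a device entry.
fn probe_slot_bounded<F>(walk: F, cap: Duration, cached: Option<Probe>) -> Probe
where
    F: FnOnce() -> Probe + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    // The helper thread is detached: a hung walk keeps running in the
    // background, but the caller stops waiting for it after `cap`.
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(walk());
    });
    match rx.recv_timeout(cap) {
        Ok(probe) => probe,
        Err(_) => cached.unwrap_or(Probe { features: Vec::new() }),
    }
}
```

The point of the pattern is that the timeout sits around one slot's walk rather than the whole receiver probe, so a hung slot costs at most `cap` and the remaining slots still get their turn.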