fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218) by buliwyf42 · Pull Request #251 · AprilNEA/OpenLogi

buliwyf42 · 2026-06-13T16:47:36Z

Context

Live re-testing #218 on the repro rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on one Bolt receiver) showed that #222 (NodeLedger) and #237 (one-shot retry) fix the short transient misses, but a sustained failure still drops the device list to "No devices" — details in #218. This PR fixes that remaining case.

Root cause

A single paired online device's deep feature walk (probe_features / Device::new) hangs on every cycle, burning the whole 5 s PROBE_BUDGET. Each failing cycle still reads pairing_count=Some(3) and all three slot codenames — only the deep walk hangs — but because probe_one is the unit wrapped in timeout(PROBE_BUDGET), when it fires the entire node is discarded, including the pairing-register reads that succeeded. The receiver yields nothing → GUI inventory refreshed count=0.

The tell: the Unifying slot probe already guards exactly this with UNIFYING_SLOT_PROBE (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — probe_or_reuse was awaited bare in probe_bolt_slot, so one hung Bolt device blew the whole-receiver budget.

Fix

Mirror the Unifying guard onto the Bolt path: a new BOLT_SLOT_PROBE (1.5 s) wraps the Bolt slot's probe_or_reuse; on timeout the slot falls back to its cached / identity-only data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of letting the whole probe_one time out. A hung device keeps showing with its last-known capabilities while the rest of the receiver still enumerates.

1.5 s is well above a healthy BTLE walk (sub-second) yet small enough that two simultaneously-hung slots still fit PROBE_BUDGET after the 1.5 s arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (#222).

Verification

cargo fmt --all -- --check                                   # clean
cargo clippy -p openlogi-hid --all-targets -- -D warnings     # clean
cargo test  -p openlogi-hid                                   # 61 pass

I couldn't add a unit test (the per-slot timeout needs a live/mock HID++ channel, same as the existing Unifying guard, which also has none). The change mirrors that proven pattern exactly.

Manual hardware verification still recommended: I have the repro rig but the wedge cleared on a receiver replug and I can't trigger the sustained hang on demand. The expected behaviour with this patch: when a device's deep probe hangs, the agent logs Bolt slot probe timed out; using cached data if available and the device list stays populated (the hung device keeps its last-known card) instead of dropping to "No devices".

Refs #218, #222.

greptile-apps · 2026-06-13T16:51:20Z

Greptile Summary

Mirrors the Unifying per-slot probe timeout guard onto the Bolt path. A new BOLT_SLOT_PROBE constant (1 s) wraps probe_or_reuse in probe_bolt_slot; on timeout the slot falls back to its cached or default ProbedFeatures instead of letting the whole probe_one budget expire, which previously cleared the entire device list.

Adds BOLT_SLOT_PROBE = Duration::from_secs(1) with a thorough comment explaining the 3-slot worst-case math (1.5 + 3×1 = 4.5 s < 5 s PROBE_BUDGET).
id.clone() is correctly threaded through so both probe_or_reuse and the timeout fallback seen(id) have their own ownership of the key; cache eviction is unaffected.
When all slots time out but all return a device entry, complete = true → healthy = true, so the one-shot enumerate retries are not triggered — an intentional design trade-off documented in the PR: devices show up with cached capabilities rather than missing entirely.

Confidence Score: 5/5

Safe to merge; the change is a targeted, well-bounded fallback that mirrors an already-proven pattern on the Unifying path and does not alter any shared data structures or success paths.

The fix wraps one await in a per-slot timeout and handles the Err branch with a cache fallback. id.clone() is correctly threaded so ownership is valid in both branches, and the seen(id) outcome prevents premature cache eviction. The 3-slot worst-case timing math (4.5 s) fits within PROBE_BUDGET (5 s), and the previous reviewer concern about 3-device all-hung timing has been addressed by reducing the constant from 1.5 s to 1 s.

No files require special attention

Important Files Changed

Filename	Overview
crates/openlogi-hid/src/inventory.rs	Adds per-slot BOLT_SLOT_PROBE (1 s) timeout around probe_or_reuse in probe_bolt_slot, mirroring the existing Unifying guard; on timeout falls back to cached/default probe so timed-out devices still appear in the list

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[probe_one] -->|Bolt receiver| B[probe_bolt_receiver]
    B --> C[drain_device_arrival]
    C --> D{for slot in 1..=MAX_BOLT_SLOTS}
    D --> E[probe_bolt_slot]
    E --> F[get_device_pairing_information]
    F -->|fail| G[return None]
    F -->|ok| H{timeout BOLT_SLOT_PROBE 1s}
    H -->|Ok| I[probe_or_reuse feature walk]
    H -->|Timeout| J[use cached or default probe]
    I --> K[PairedDevice]
    J --> K
    K --> D
    D -->|done| L{paired.len == pairing_count?}
    L -->|yes| M[healthy=true]
    L -->|no| N[healthy=false]

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[probe_one] -->|Bolt receiver| B[probe_bolt_receiver]
    B --> C[drain_device_arrival]
    C --> D{for slot in 1..=MAX_BOLT_SLOTS}
    D --> E[probe_bolt_slot]
    E --> F[get_device_pairing_information]
    F -->|fail| G[return None]
    F -->|ok| H{timeout BOLT_SLOT_PROBE 1s}
    H -->|Ok| I[probe_or_reuse feature walk]
    H -->|Timeout| J[use cached or default probe]
    I --> K[PairedDevice]
    J --> K
    K --> D
    D -->|done| L{paired.len == pairing_count?}
    L -->|yes| M[healthy=true]
    L -->|no| N[healthy=false]

_{Reviews (3): Last reviewed commit: "fix(hid): bound the Bolt per-slot probe ..." | Re-trigger Greptile}

buliwyf42 · 2026-06-13T17:01:09Z

Live-tested this on the #218 rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on a Bolt receiver) and tightened it as a result.

The per-slot cap works: when a slot's deep walk hung, Bolt slot probe timed out … slot=N fired and that device kept its cached card instead of the whole receiver vanishing — confirmed live for 1–2 hung slots.

But I hit a budget bug at my first value (1.5 s). When the MX Master 3S woke so all three slots were online and hung at once, 1.5 s arrival drain + 3 × 1.5 s = 6 s overran the 5 s PROBE_BUDGET, so the outer timeout(PROBE_BUDGET, probe_one) still fired and dropped the whole node. Tightened to BOLT_SLOT_PROBE = 1 s so all three online slots fit (1.5 + 3×1 = 4.5 s); pushed. (The all-three-at-1s case is budget-derived — the wedge is non-deterministic, so I couldn't re-trigger it on demand without more churn.)

One honest caveat: this keeps the device list populated (via the cached fallback), but it doesn't cure the underlying wedge — during it the device genuinely stops answering HID++ (writes like DPI/SmartShift would still time out), and only a physical receiver replug cleared it on my rig. So #251 is graceful degradation for the symptom; the root macOS-27 IOHID wedge stays the separate #218 track.

buliwyf42 · 2026-06-14T06:38:54Z

Thanks — already addressed. After live-testing on the rig I hit exactly this: with the MX Master 3S awake, all three slots online and hung at once overran the budget at 1.5 s/slot. The current head (34314ba) tightens BOLT_SLOT_PROBE to Duration::from_secs(1), so 1.5 s drain + 3×1 s = 4.5 s fits PROBE_BUDGET — your review picked up the earlier 1.5 s push. (Empty slots fail fast on UnknownDevice without a deep probe, so the worst realistic case is the 3 paired devices, not 6 slots.)

… the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

buliwyf42 mentioned this pull request Jun 13, 2026

fix(hid): replay a node's last inventory through transient probe failures #222

Merged

greptile-apps Bot reviewed Jun 13, 2026

View reviewed changes

Comment thread crates/openlogi-hid/src/inventory.rs Outdated

buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch 2 times, most recently from 1d42ddf to 34314ba Compare June 13, 2026 16:59

AprilNEA mentioned this pull request Jun 16, 2026

Bolt partial-walk recovery does a cold re-probe: ledger replay grace and probe-cache grace expire on the same tick #278

Open

buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch from 34314ba to 35f2993 Compare June 19, 2026 13:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/bolt-per-slot-probe-timeout

buliwyf42 commented Jun 13, 2026

Uh oh!

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

buliwyf42 commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

buliwyf42 commented Jun 13, 2026

Context

Root cause

Fix

Verification

Uh oh!

greptile-apps Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

buliwyf42 commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading