
fix(spv): fix shutdown deadlock and ordering#440

Merged
xdustinface merged 25 commits into v0.42-dev from fix/block-header-error-retry-loop
Feb 18, 2026
Conversation

@lklimek
Contributor

lklimek commented Feb 16, 2026

Summary

Fix shutdown-related deadlocks and ordering issues in the SPV client:

  • Shutdown ordering: SyncCoordinator::shutdown() is awaited in stop() before network.disconnect() so manager tasks drain gracefully instead of hitting channel-closed errors.
  • Shutdown deadlock fix: connect_to_peer() now acquires the task JoinSet lock via tokio::select! with the shutdown token, preventing a deadlock where PeerNetworkManager::shutdown() would block draining the JoinSet while the maintenance loop held the same lock.
  • Prompt shutdown cancellation: Peer::connect calls are wrapped with a shutdown token check so they abort promptly on shutdown, with proper cleanup of the connecting state.
  • Early exit in maintenance tick: maintenance_tick() returns early when the shutdown token is cancelled, avoiding unnecessary work after shutdown is signalled.

Non-shutdown changes (network error recovery, progress refactoring, etc.) have been moved to the fix/spv-network-error-recovery branch.

Used by dashpay/dash-evo-tool#577

Test plan

  • test_spv_client_start_stop — 10/10 runs pass (previously hung intermittently due to deadlock)
  • Manual: start dash-evo-tool, verify clean shutdown without hanging
  • CI green (all workflows pass)

🤖 Generated with Claude Code

lklimek and others added 2 commits February 16, 2026 12:18
…n before disconnect

The tick branch in SyncManager::run() now breaks on SyncError::Network
instead of logging indefinitely, preventing ~26s of channel-closed spam
during shutdown. SyncCoordinator gains a signal_shutdown() method that
cancels the shared token without awaiting joins, and DashSpvClient::stop()
calls it before network.disconnect() so managers observe cancellation
promptly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lukasz Klimek <842586+lklimek@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Awaits the sync coordinator during client shutdown; converts logging to tracing and makes peer-connection flows cancellation-aware; introduces network-error recovery and cooldown logic in the sync manager trait and run loop; updates related docs and small docstrings.

Changes

  • Lifecycle (dash-spv/src/client/lifecycle.rs): stop() now awaits the sync coordinator's shutdown future and logs shutdown errors before network disconnect.
  • Network Manager (dash-spv/src/network/manager.rs): replaced log calls with tracing macros; made connect_to_peer and Peer::connect cancellation-aware via tokio::select! against shutdown; added tracing instrumentation around the peer lifecycle, maintenance-loop early-exit guards, and safer task draining.
  • Sync Manager (dash-spv/src/sync/sync_manager.rs): added recover_from_network_error(...) to the SyncManager trait and a default update_target_height(u32) no-op; integrated network-error handling and a cooldown into the main run loop, resetting state to WaitingForConnections and emitting ManagerError; added tests/mocks exercising network-error recovery.
  • Error & Events (dash-spv/src/error.rs, dash-spv/src/sync/events.rs): expanded the doc comment for SyncError::Network and adjusted the SyncEvent::ManagerError doc text; no signature or variant changes.
  • Docs (dash-spv/CLAUDE.md): added a logging convention recommending the tracing framework and its macros for new code.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Coordinator
    participant SyncManager
    participant NetworkManager
    participant Peers

    Client->>Coordinator: stop()
    Coordinator->>SyncManager: signal shutdown / await
    rect rgba(255, 0, 0, 0.5)
    Note over SyncManager: run loop may detect network error
    SyncManager->>SyncManager: recover_from_network_error() -> set WaitingForConnections\nemit ManagerError, enable cooldown
    end
    Coordinator->>NetworkManager: begin shutdown
    rect rgba(0, 0, 255, 0.5)
    NetworkManager->>NetworkManager: drain tasks (mem::take) and check shutdown guards
    NetworkManager->>Peers: disconnect/remove peers
    end
    Coordinator->>Client: stop() completes
sequenceDiagram
    participant NetworkManager
    participant Peer
    participant ShutdownToken

    rect rgba(0, 255, 0, 0.5)
    Note over NetworkManager: connect_to_peer uses tokio::select!
    par Connection Path
        NetworkManager->>Peer: Peer::connect() (async)
    and Cancellation Path
        ShutdownToken->>NetworkManager: cancelled()
    end
    end
    alt Connect succeeds
        Peer->>NetworkManager: connected
        NetworkManager->>NetworkManager: proceed with handshake & add to pool
    else Shutdown wins
        ShutdownToken->>NetworkManager: cancellation observed
        NetworkManager->>NetworkManager: log cancellation, return early
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled at logs and stitched the trace,
Awaited the coordinator—no frantic race,
When network hiccups tumble the plan,
I reset, cool down, and hop where I can,
Clean exits, gentle turns—hip-hop, rabbit grace.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Merge Conflict Detection ⚠️ Warning: merge conflicts detected (30 files):

⚔️ dash-spv-ffi/include/dash_spv_ffi.h (content)
⚔️ dash-spv-ffi/src/bin/ffi_cli.rs (content)
⚔️ dash-spv-ffi/src/types.rs (content)
⚔️ dash-spv-ffi/tests/test_types.rs (content)
⚔️ dash-spv-ffi/tests/unit/test_client_lifecycle.rs (content)
⚔️ dash-spv/CLAUDE.md (content)
⚔️ dash-spv/src/client/lifecycle.rs (content)
⚔️ dash-spv/src/error.rs (content)
⚔️ dash-spv/src/network/constants.rs (content)
⚔️ dash-spv/src/network/manager.rs (content)
⚔️ dash-spv/src/network/peer.rs (content)
⚔️ dash-spv/src/network/pool.rs (content)
⚔️ dash-spv/src/network/tests.rs (content)
⚔️ dash-spv/src/sync/block_headers/manager.rs (content)
⚔️ dash-spv/src/sync/block_headers/pipeline.rs (content)
⚔️ dash-spv/src/sync/block_headers/progress.rs (content)
⚔️ dash-spv/src/sync/block_headers/segment_state.rs (content)
⚔️ dash-spv/src/sync/block_headers/sync_manager.rs (content)
⚔️ dash-spv/src/sync/events.rs (content)
⚔️ dash-spv/src/sync/filter_headers/manager.rs (content)
⚔️ dash-spv/src/sync/filter_headers/progress.rs (content)
⚔️ dash-spv/src/sync/filter_headers/sync_manager.rs (content)
⚔️ dash-spv/src/sync/filters/manager.rs (content)
⚔️ dash-spv/src/sync/filters/progress.rs (content)
⚔️ dash-spv/src/sync/filters/sync_manager.rs (content)
⚔️ dash-spv/src/sync/mod.rs (content)
⚔️ dash-spv/src/sync/progress.rs (content)
⚔️ dash-spv/src/sync/sync_coordinator.rs (content)
⚔️ dash-spv/src/sync/sync_manager.rs (content)
⚔️ dash-spv/tests/peer_test.rs (content)

These conflicts must be resolved before merging into v0.42-dev.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)

  • Docstring Coverage ✅ Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'fix(spv): fix shutdown deadlock and ordering' accurately summarizes the main changes: addressing the shutdown deadlock and ensuring proper shutdown ordering before disconnect.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
dash-spv/src/client/lifecycle.rs (1)

196-198: The intended shutdown flow calls sync_coordinator.shutdown().await through monitor_network() (line 77 in sync_coordinator.rs), which properly drains spawned manager tasks with timeout. The stop() method is designed to be called after monitor_network() exits, not as a standalone shutdown path.

However, there is a real robustness concern: if stop() is called directly without completing monitor_network(), manager tasks will only be signaled to stop via signal_shutdown() but never awaited. While tasks will exit via shutdown.cancelled() listeners, adding self.sync_coordinator.shutdown().await to stop() would provide defensive cleanup for this edge case and align with the coding guideline to drain tasks explicitly.

lklimek and others added 2 commits February 16, 2026 13:03
- Emit SyncEvent::ManagerError before breaking on network tick errors
  so the coordinator has visibility into why a manager stopped
- Clear pool connecting state when shutdown cancels a peer connection
- Import SyncError and use short form instead of fully-qualified path
- Document SyncError::Network variant's tick() exit semantics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Await sync_coordinator.shutdown() in stop() to drain manager tasks
before tearing down network and storage. Remove the now-unused
signal_shutdown() method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek changed the title from "fix(spv): exit manager task loop on network errors and signal shutdown before disconnect" to "fix(spv): exit manager task loop on network errors and gracefully shutdown before disconnect" on Feb 16, 2026
lklimek and others added 3 commits February 16, 2026 15:14
PeerNetworkManager::shutdown() held self.tasks lock while draining the
JoinSet. The maintenance loop task, being drained, called connect_to_peer()
which also needed self.tasks lock — classic deadlock.

Fix with two changes:
- shutdown(): use std::mem::take to extract JoinSet from mutex before
  draining, so the lock is released immediately
- connect_to_peer(): wrap lock acquisition with tokio::select! against
  shutdown token so it returns immediately during shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SyncEvent::ManagerExited variant and set SyncState::Error on fatal
network exits so consumers can detect when a manager stops permanently.
Extract handle_fatal_exit helper to deduplicate the 4 FatalNetwork paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (4)
dash-spv/src/error.rs (1)

340-372: Missing test coverage for the new FatalNetwork category.

The existing test_sync_error_category test covers all other variants but does not verify SyncError::FatalNetwork. Consider adding an assertion to maintain completeness.

🧪 Proposed test addition
         assert_eq!(SyncError::Network("test".to_string()).category(), "network");
+        assert_eq!(SyncError::FatalNetwork("test".to_string()).category(), "network");
         assert_eq!(SyncError::Validation("test".to_string()).category(), "validation");
dash-spv/src/sync/events.rs (1)

237-243: Unnecessary .. in pattern destructure.

ManagerExited only has manager and error fields — both are already bound. The .. is harmless and consistent with the ManagerError arm (line 233), so this is purely cosmetic.

✨ Optional cleanup
             SyncEvent::ManagerExited {
                 manager,
                 error,
-                ..
             } => {
dash-spv/src/network/manager.rs (1)

251-282: Inconsistent use of tracing:: vs log:: macros in new code.

New lines (e.g., 252, 257, 270, 274, 282) use tracing:: while nearby unchanged code (e.g., lines 320, 336) continues using log::. This is a pre-existing inconsistency in the file but is being extended. Consider aligning to one framework in future cleanup.

dash-spv/src/sync/sync_manager.rs (1)

71-76: Consider enforcing the set_state(Error) precondition.

The doc at line 69 states the caller "must call set_state(SyncState::Error) before this method", but this isn't enforced. If a caller forgets, progress will not reflect the error state. You could either accept &mut dyn SyncManager and call set_state internally, or add a debug assertion.

💡 Option: assert the precondition in debug builds
     pub(super) fn handle_fatal_exit(
         &self,
         manager: ManagerIdentifier,
         source: &str,
         progress: SyncManagerProgress,
         msg: &str,
     ) {
+        debug_assert_eq!(progress.state(), SyncState::Error, "caller must set_state(Error) before handle_fatal_exit");
         tracing::warn!("{} {} fatal network error, exiting: {}", manager, source, msg);

Managers now reset to WaitingForConnections and continue their loop on
FatalNetwork errors, letting PeersUpdated events re-trigger sync. This
eliminates the need for consumers to detect ManagerExited and restart
the entire client. Managers only exit when the shutdown token fires.

- Rename handle_fatal_exit → handle_fatal_network_error
- Change all 4 FatalNetwork paths from break to continue
- Remove SyncEvent::ManagerExited variant (no longer needed)
- Update FatalNetwork doc to reflect reset behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@dash-spv/src/sync/sync_manager.rs`:
- Around line 585-592: The test's timing is fragile: with error_after set to 3
and a 100ms tick interval, the 500ms sleep can yield a flaky tick_count
assertion; update the test in sync_manager (references: progress_rx,
SyncState::WaitingForConnections, tick_count, error_after) to make it
tolerant—either increase the sleep Duration.from_millis to ~800 (to allow ≥5
ticks reliably) or relax the assertion to
assert!(tick_count.load(Ordering::Relaxed) > 3) so the test is robust on slow CI
while still ensuring the manager keeps ticking after the FatalNetwork error.
🧹 Nitpick comments (1)
dash-spv/src/sync/sync_manager.rs (1)

257-261: Consider extracting the repeated FatalNetwork handling into a small closure or macro.

All four tokio::select! arms repeat the same three-line sequence: set_state → handle_fatal_network_error → continue. The only varying parameter is the source: &str. A local closure (or a small macro) could reduce duplication and prevent future drift (e.g., someone forgetting the continue in one arm).

♻️ Optional refactor

Example sketch: a closure won't work directly due to mutable borrow conflicts inside tokio::select!, but a macro_rules! would:

macro_rules! handle_fatal {
    ($self:ident, $ctx:ident, $identifier:ident, $source:expr, $msg:ident) => {{
        $self.set_state(SyncState::WaitingForConnections);
        $ctx.handle_fatal_network_error($identifier, $source, $self.progress(), &$msg);
        continue;
    }};
}

Then each arm becomes:

Err(SyncError::FatalNetwork(ref msg)) => handle_fatal!(self, context, identifier, "tick", msg),

Also applies to: 288-292, 320-324, 346-350

lklimek and others added 3 commits February 16, 2026 16:40
The JoinSet was moved out of the mutex via std::mem::take before
draining. This created a race where in-flight tasks that had not yet
observed the cancellation token could spawn new tasks into the original
(now empty) JoinSet, leaving them orphaned.

Since connect_to_peer() already uses select! with the cancellation
token when acquiring the tasks lock, no deadlock can occur. Drain the
JoinSet while holding the lock instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
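The orphaned-task race this commit reverts can be demonstrated with plain std types: once mem::take swaps the guarded collection out, anything registered afterwards lands in the original, already-emptied set and is never drained. A toy model, with a Vec standing in for the JoinSet and illustrative names throughout:

```rust
use std::mem;
use std::sync::Mutex;

// Toy model of the reverted shutdown path: extract the task set from
// the mutex via mem::take, releasing the lock before draining.
fn take_then_late_spawn(tasks: &Mutex<Vec<&'static str>>) -> (Vec<&'static str>, usize) {
    let drained = {
        let mut guard = tasks.lock().unwrap();
        mem::take(&mut *guard) // swaps in an empty Vec; lock drops here
    };
    // Race window: an in-flight task that has not yet observed the
    // cancellation token registers itself into the original set...
    tasks.lock().unwrap().push("late-connection");
    // ...and is orphaned: the drain only ever saw `drained`.
    let orphaned = tasks.lock().unwrap().len();
    (drained, orphaned)
}

fn main() {
    let tasks = Mutex::new(vec!["conn-1", "conn-2"]);
    let (drained, orphaned) = take_then_late_spawn(&tasks);
    assert_eq!(drained, vec!["conn-1", "conn-2"]); // only pre-take tasks drain
    assert_eq!(orphaned, 1); // the late task is left behind
    println!("drained {} tasks, {} orphaned", drained.len(), orphaned);
}
```

Draining while holding the lock closes this window; the select!-guarded acquisition in connect_to_peer() already prevents the deadlock that mem::take was originally introduced to avoid.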
Add shutdown_token check before connect_to_peer in the DNS discovery
peer loop, and add an early-exit check before the ping/save section
so the maintenance loop exits promptly during shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove FatalNetwork variant (never produced in production) and match on
Network instead. Add 2-second cooldown after network-error recovery to
prevent log/event flooding from the 100ms tick loop. Replace the
error-prone handle_fatal_network_error (which required callers to call
set_state first) with a recover_from_network_error default trait method
that enforces correct ordering. Migrate all log:: calls to tracing:: in
manager.rs to match project convention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek
Contributor Author

lklimek commented Feb 16, 2026

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 16, 2026

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
dash-spv/src/sync/sync_manager.rs (1)

239-242: Consider extracting the 2-second cooldown as a named constant.

The Duration::from_secs(2) at line 351 is a magic number. Extracting it as a const NETWORK_ERROR_COOLDOWN: Duration = Duration::from_secs(2); at the module level would improve readability and make it easier to tune later.

Proposed constant extraction
+/// Cooldown period after a network-error recovery to avoid log/event flooding.
+const NETWORK_ERROR_COOLDOWN: Duration = Duration::from_secs(2);
+
 // ... inside run():
 
-                    if since.elapsed() < Duration::from_secs(2) {
+                    if since.elapsed() < NETWORK_ERROR_COOLDOWN {

Also applies to: 346-355

@lklimek lklimek requested a review from xdustinface February 16, 2026 16:31
@lklimek lklimek changed the title from "fix(spv): exit manager task loop on network errors and gracefully shutdown before disconnect" to "fix(spv): self-recover on network errors and fix shutdown deadlock" on Feb 16, 2026
@github-actions github-actions bot added the merge-conflict label (the PR conflicts with the target branch) on Feb 17, 2026
@github-actions

This PR has merge conflicts with the base branch. Please rebase or merge the base branch into your branch to resolve them.

lklimek and others added 2 commits February 17, 2026 08:34
…-error-retry-loop

# Conflicts:
#	dash-spv/src/network/manager.rs
Replace inline `Duration::from_secs(2)` with `NETWORK_ERROR_COOLDOWN`
constant for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot removed the merge-conflict label on Feb 17, 2026
lklimek and others added 3 commits February 17, 2026 08:50
The upstream refactor (#430) extracted maintenance loop logic into
maintenance_tick() and dns_fallback_tick() methods but dropped the
shutdown_token checks that this branch had added inside the loops.
Restore them so shutdown is not delayed by in-progress peer connections
or pings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@xdustinface xdustinface left a comment (Collaborator)

Left a few comments. There is a lot of unrelated stuff mixed into here; I didn't review the rest properly yet. I also think the PR is mixing different topics, e.g. the coordinator shutdown could probably be in a separate PR too.

@lklimek lklimek requested a review from xdustinface February 17, 2026 11:18
thepastaclaw added a commit to thepastaclaw/dash-evo-tool that referenced this pull request Feb 17, 2026
…y#440)

- Cargo.toml: use rust-dashcore branch fix/block-header-error-retry-loop (PR dashpay/rust-dashcore#440)
- Cargo.lock: regenerated via cargo update
- Keep both lint sections from each side
@xdustinface xdustinface left a comment (Collaborator)

@lklimek I think all the recover_from_network stuff is kind of pointless? I'm not quite sure what you want to achieve with it. If you want to get the shutdown/deadlock fixes in, separate them into a targeted PR (I think they are reasonable) or drop the other stuff from here. Then look further into whatever you wanted to do with the network error stuff in a separate PR, if really needed.

lklimek and others added 2 commits February 18, 2026 12:27
Revert network error recovery, doc comment changes, and sync_manager
additions so this branch contains only shutdown/deadlock fixes.
Non-shutdown improvements are moved to fix/spv-network-error-recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek changed the title fix(spv): self-recover on network errors and fix shutdown deadlock fix(spv): fix shutdown deadlock and ordering Feb 18, 2026
@lklimek lklimek requested a review from xdustinface February 18, 2026 12:39
@xdustinface xdustinface merged commit 542a617 into v0.42-dev Feb 18, 2026
53 checks passed
@xdustinface xdustinface deleted the fix/block-header-error-retry-loop branch February 18, 2026 14:21