fix(spv): fix shutdown deadlock and ordering #440
Conversation
…n before disconnect

The tick branch in SyncManager::run() now breaks on SyncError::Network instead of logging indefinitely, preventing ~26s of channel-closed spam during shutdown. SyncCoordinator gains a signal_shutdown() method that cancels the shared token without awaiting joins, and DashSpvClient::stop() calls it before network.disconnect() so managers observe cancellation promptly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lukasz Klimek <842586+lklimek@users.noreply.github.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds awaiting of the sync coordinator during client shutdown; converts logging to tracing and makes peer connection flows cancellation-aware; introduces network-error recovery and cooldown logic in the sync manager trait and run loop; updates related docs and small docstrings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Coordinator
    participant SyncManager
    participant NetworkManager
    participant Peers
    Client->>Coordinator: stop()
    Coordinator->>SyncManager: signal shutdown / await
    rect rgba(255, 0, 0, 0.5)
        Note over SyncManager: run loop may detect network error
        SyncManager->>SyncManager: recover_from_network_error() -> set WaitingForConnections\nemit ManagerError, enable cooldown
    end
    Coordinator->>NetworkManager: begin shutdown
    rect rgba(0, 0, 255, 0.5)
        NetworkManager->>NetworkManager: drain tasks (mem::take) and check shutdown guards
        NetworkManager->>Peers: disconnect/remove peers
    end
    Coordinator->>Client: stop() completes
```

```mermaid
sequenceDiagram
    participant NetworkManager
    participant Peer
    participant ShutdownToken
    rect rgba(0, 255, 0, 0.5)
        Note over NetworkManager: connect_to_peer uses tokio::select!
        par Connection Path
            NetworkManager->>Peer: Peer::connect() (async)
        and Cancellation Path
            ShutdownToken->>NetworkManager: cancelled()
        end
    end
    alt Connect succeeds
        Peer->>NetworkManager: connected
        NetworkManager->>NetworkManager: proceed with handshake & add to pool
    else Shutdown wins
        ShutdownToken->>NetworkManager: cancellation observed
        NetworkManager->>NetworkManager: log cancellation, return early
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (warning)
🧹 Nitpick comments (1)
dash-spv/src/client/lifecycle.rs (1)
196-198: The intended shutdown flow calls `sync_coordinator.shutdown().await` through `monitor_network()` (line 77 in sync_coordinator.rs), which properly drains spawned manager tasks with timeout. The `stop()` method is designed to be called after `monitor_network()` exits, not as a standalone shutdown path.

However, there is a real robustness concern: if `stop()` is called directly without completing `monitor_network()`, manager tasks will only be signaled to stop via `signal_shutdown()` but never awaited. While tasks will exit via `shutdown.cancelled()` listeners, adding `self.sync_coordinator.shutdown().await` to `stop()` would provide defensive cleanup for this edge case and align with the coding guideline to drain tasks explicitly.
- Emit SyncEvent::ManagerError before breaking on network tick errors so the coordinator has visibility into why a manager stopped
- Clear pool connecting state when shutdown cancels a peer connection
- Import SyncError and use short form instead of fully-qualified path
- Document SyncError::Network variant's tick() exit semantics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Await sync_coordinator.shutdown() in stop() to drain manager tasks before tearing down network and storage. Remove the now-unused signal_shutdown() method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PeerNetworkManager::shutdown() held the self.tasks lock while draining the JoinSet. The maintenance loop task, being drained, called connect_to_peer(), which also needed the self.tasks lock: a classic deadlock. Fix with two changes:

- shutdown(): use std::mem::take to extract the JoinSet from the mutex before draining, so the lock is released immediately
- connect_to_peer(): wrap lock acquisition in tokio::select! against the shutdown token so it returns immediately during shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SyncEvent::ManagerExited variant and set SyncState::Error on fatal network exits so consumers can detect when a manager stops permanently. Extract handle_fatal_exit helper to deduplicate the 4 FatalNetwork paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (4)
dash-spv/src/error.rs (1)
340-372: Missing test coverage for the new `FatalNetwork` category.

The existing `test_sync_error_category` test covers all other variants but does not verify `SyncError::FatalNetwork`. Consider adding an assertion to maintain completeness.

🧪 Proposed test addition

```diff
 assert_eq!(SyncError::Network("test".to_string()).category(), "network");
+assert_eq!(SyncError::FatalNetwork("test".to_string()).category(), "network");
 assert_eq!(SyncError::Validation("test".to_string()).category(), "validation");
```

dash-spv/src/sync/events.rs (1)
237-243: Unnecessary `..` in pattern destructure.

`ManagerExited` only has `manager` and `error` fields, both already bound. The `..` is harmless and consistent with the `ManagerError` arm (line 233), so this is purely cosmetic.

✨ Optional cleanup

```diff
 SyncEvent::ManagerExited {
     manager,
     error,
-    ..
 } => {
```

dash-spv/src/network/manager.rs (1)
251-282: Inconsistent use of `tracing::` vs `log::` macros in new code.

New lines (e.g., 252, 257, 270, 274, 282) use `tracing::` while nearby unchanged code (e.g., lines 320, 336) continues using `log::`. This is a pre-existing inconsistency in the file but is being extended. Consider aligning to one framework in future cleanup.

dash-spv/src/sync/sync_manager.rs (1)
71-76: Consider enforcing the `set_state(Error)` precondition.

The doc at line 69 states the caller "must call `set_state(SyncState::Error)` before this method", but this isn't enforced. If a caller forgets, `progress` will not reflect the error state. You could either accept `&mut dyn SyncManager` and call `set_state` internally, or add a debug assertion.

💡 Option: assert the precondition in debug builds

```diff
 pub(super) fn handle_fatal_exit(
     &self,
     manager: ManagerIdentifier,
     source: &str,
     progress: SyncManagerProgress,
     msg: &str,
 ) {
+    debug_assert_eq!(
+        progress.state(),
+        SyncState::Error,
+        "caller must set_state(Error) before handle_fatal_exit"
+    );
     tracing::warn!("{} {} fatal network error, exiting: {}", manager, source, msg);
```
Managers now reset to WaitingForConnections and continue their loop on FatalNetwork errors, letting PeersUpdated events re-trigger sync. This eliminates the need for consumers to detect ManagerExited and restart the entire client. Managers only exit when the shutdown token fires.

- Rename handle_fatal_exit → handle_fatal_network_error
- Change all 4 FatalNetwork paths from break to continue
- Remove SyncEvent::ManagerExited variant (no longer needed)
- Update FatalNetwork doc to reflect reset behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@dash-spv/src/sync/sync_manager.rs`:
- Around line 585-592: The test's timing is fragile: with error_after set to 3
and a 100ms tick interval, the 500ms sleep can yield a flaky tick_count
assertion; update the test in sync_manager (references: progress_rx,
SyncState::WaitingForConnections, tick_count, error_after) to make it
tolerant—either increase the sleep Duration::from_millis to ~800 (to allow ≥5
ticks reliably) or relax the assertion to
assert!(tick_count.load(Ordering::Relaxed) > 3) so the test is robust on slow CI
while still ensuring the manager keeps ticking after the FatalNetwork error.
🧹 Nitpick comments (1)
dash-spv/src/sync/sync_manager.rs (1)
257-261: Consider extracting the repeated FatalNetwork handling into a small closure or macro.

All four `tokio::select!` arms repeat the same three-line sequence: `set_state → handle_fatal_network_error → continue`. The only varying parameter is the `source: &str`. A local closure (or a small macro) could reduce duplication and prevent future drift (e.g., someone forgetting the `continue` in one arm).

♻️ Optional refactor

A closure won't work directly due to mutable borrow conflicts inside `tokio::select!`, but a `macro_rules!` would:

```rust
macro_rules! handle_fatal {
    ($self:ident, $ctx:ident, $identifier:ident, $source:expr, $msg:ident) => {{
        $self.set_state(SyncState::WaitingForConnections);
        $ctx.handle_fatal_network_error($identifier, $source, $self.progress(), &$msg);
        continue;
    }};
}
```

Then each arm becomes:

```rust
Err(SyncError::FatalNetwork(ref msg)) => handle_fatal!(self, context, identifier, "tick", msg),
```

Also applies to: 288-292, 320-324, 346-350
The JoinSet was moved out of the mutex via std::mem::take before draining. This created a race where in-flight tasks that had not yet observed the cancellation token could spawn new tasks into the original (now empty) JoinSet, leaving them orphaned. Since connect_to_peer() already uses select! with the cancellation token when acquiring the tasks lock, no deadlock can occur. Drain the JoinSet while holding the lock instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add shutdown_token check before connect_to_peer in the DNS discovery peer loop, and add an early-exit check before the ping/save section so the maintenance loop exits promptly during shutdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove FatalNetwork variant (never produced in production) and match on Network instead. Add 2-second cooldown after network-error recovery to prevent log/event flooding from the 100ms tick loop. Replace the error-prone handle_fatal_network_error (which required callers to call set_state first) with a recover_from_network_error default trait method that enforces correct ordering. Migrate all log:: calls to tracing:: in manager.rs to match project convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai full review
✅ Actions performed: Full review triggered.
🧹 Nitpick comments (1)
dash-spv/src/sync/sync_manager.rs (1)
239-242: Consider extracting the 2-second cooldown as a named constant.

The `Duration::from_secs(2)` at line 351 is a magic number. Extracting it as a `const NETWORK_ERROR_COOLDOWN: Duration = Duration::from_secs(2);` at the module level would improve readability and make it easier to tune later.

Proposed constant extraction

```diff
+/// Cooldown period after a network-error recovery to avoid log/event flooding.
+const NETWORK_ERROR_COOLDOWN: Duration = Duration::from_secs(2);
+
 // ... inside run():
-        if since.elapsed() < Duration::from_secs(2) {
+        if since.elapsed() < NETWORK_ERROR_COOLDOWN {
```

Also applies to: 346-355
This PR has merge conflicts with the base branch. Please rebase or merge the base branch into your branch to resolve them.
…-error-retry-loop

# Conflicts:
#	dash-spv/src/network/manager.rs
Replace inline `Duration::from_secs(2)` with `NETWORK_ERROR_COOLDOWN` constant for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upstream refactor (#430) extracted maintenance loop logic into maintenance_tick() and dns_fallback_tick() methods but dropped the shutdown_token checks that this branch had added inside the loops. Restore them so shutdown is not delayed by in-progress peer connections or pings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
xdustinface left a comment
Left a few comments. There is a lot of unrelated stuff mixed in here; I didn't review the rest properly yet. And I think the PR is mixing different topics; the coordinator shutdown, for example, could probably be in a separate PR too.
…y#440)

- Cargo.toml: use rust-dashcore branch fix/block-header-error-retry-loop (PR dashpay/rust-dashcore#440)
- Cargo.lock: regenerated via cargo update
- Keep both lint sections from each side
@lklimek I think all the recover_from_network stuff is kind of pointless? I'm not quite sure what you want to achieve with it. If you want to get the shutdown/deadlock fixes in, separate them into a targeted PR (I think they are reasonable) or drop the other stuff from here. Then look further into whatever you wanted to do with the network error stuff in a separate PR, if really needed.
Revert network error recovery, doc comment changes, and sync_manager additions so this branch contains only shutdown/deadlock fixes. Non-shutdown improvements are moved to fix/spv-network-error-recovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Fix shutdown-related deadlocks and ordering issues in the SPV client:
- `SyncCoordinator::shutdown()` is awaited in `stop()` before `network.disconnect()` so manager tasks drain gracefully instead of hitting channel-closed errors.
- `connect_to_peer()` now acquires the task `JoinSet` lock via `tokio::select!` with the shutdown token, preventing a deadlock where `PeerNetworkManager::shutdown()` would block draining the `JoinSet` while the maintenance loop held the same lock.
- `Peer::connect` calls are wrapped with a shutdown token check so they abort promptly on shutdown, with proper cleanup of the connecting state.
- `maintenance_tick()` returns early when the shutdown token is cancelled, avoiding unnecessary work after shutdown is signalled.

Non-shutdown changes (network error recovery, progress refactoring, etc.) have been moved to the `fix/spv-network-error-recovery` branch.

Used by dashpay/dash-evo-tool#577
Test plan
- `test_spv_client_start_stop`: 10/10 runs pass (previously hung intermittently due to deadlock)

🤖 Generated with Claude Code