Fail holding-cell AddHTLCs on Channel deser to match disconnection #754
TheBlueMatt wants to merge 1 commit into lightningdevkit:main
Conversation
Note that I only think we need to take this for 0.0.12 because of the risk of forwarding something when we don't have a current chain - the assertion is only a debug assertion, and as far as I can tell the behavior even when hitting it is correct.
As Channel::write says in the comment at the top: "we write out as if remove_uncommitted_htlcs_and_mark_paused had just been called", except that we previously deliberately included holding-cell AddHTLC events in the serialization. On the flip side, in remove_uncommitted_htlcs_and_mark_paused, we removed pending AddHTLC events under the assumption that, if we can't forward something ASAP, it's better to fail it back to the origin than to sit on it for a while.

Given there's likely to be just as large a time-lag between ser/deserialization as between when a peer dis/reconnects, there isn't much of a reason for this difference. Worse, we debug_assert that there are no pending AddHTLC holding cell events when doing a reconnect, so any tests or fuzzers which deserialized a ChannelManager with AddHTLC events would panic.

We resolve this by adding logic to fail any holding-cell AddHTLC events upon deserialization, in part because trying to forward them before we're sure we have an up-to-date chain is somewhat risky - the sender may have already gone to chain while our upstream has not.
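The fail-on-read behavior described above can be sketched roughly as follows. This is a hypothetical, heavily simplified illustration - the `Channel`, `HTLCUpdate`, and `fail_holding_cell_adds_on_read` names here are invented for the sketch and are not rust-lightning's actual types or API. The idea is simply that, when reading a channel back from disk, holding-cell AddHTLC updates are drained out and handed back so the caller can fail them to the origin, mirroring what disconnection handling does:

```rust
// Hypothetical, simplified types for illustration only.
#[derive(Debug, PartialEq)]
enum HTLCUpdate {
    AddHTLC { amount_msat: u64 },
    FailHTLC { htlc_id: u64 },
}

struct Channel {
    holding_cell_htlc_updates: Vec<HTLCUpdate>,
}

impl Channel {
    /// On deserialization, drain holding-cell AddHTLC updates and return
    /// them so the caller can fail them back to the origin, mirroring
    /// what happens when a peer disconnects. Other queued updates
    /// (fails/fulfills) must be kept, since we committed to them.
    fn fail_holding_cell_adds_on_read(&mut self) -> Vec<HTLCUpdate> {
        let (adds, keep): (Vec<HTLCUpdate>, Vec<HTLCUpdate>) =
            std::mem::take(&mut self.holding_cell_htlc_updates)
                .into_iter()
                .partition(|u| matches!(u, HTLCUpdate::AddHTLC { .. }));
        self.holding_cell_htlc_updates = keep;
        adds
    }
}

fn main() {
    let mut chan = Channel {
        holding_cell_htlc_updates: vec![
            HTLCUpdate::AddHTLC { amount_msat: 10_000 },
            HTLCUpdate::FailHTLC { htlc_id: 7 },
        ],
    };
    let to_fail = chan.fail_holding_cell_adds_on_read();
    // The AddHTLC is pulled out to be failed back; the FailHTLC stays queued.
    assert_eq!(to_fail, vec![HTLCUpdate::AddHTLC { amount_msat: 10_000 }]);
    assert_eq!(
        chan.holding_cell_htlc_updates,
        vec![HTLCUpdate::FailHTLC { htlc_id: 7 }]
    );
    println!("failed back {} holding-cell AddHTLC(s)", to_fail.len());
}
```

Note the asymmetry this sketch preserves: only AddHTLCs are failed back, because an un-forwarded add can always be safely failed toward the origin, while queued fail/fulfill updates reflect state we must still relay.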
Force-pushed from 4956ad6 to 1ecee17
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #754      +/-   ##
==========================================
+ Coverage   91.45%   91.49%   +0.04%
==========================================
  Files          37       37
  Lines       22249    22340      +91
==========================================
+ Hits        20347    20440      +93
+ Misses       1902     1900       -2
```

Continue to review full report at Codecov.
In terms of "concept," the one question I have is whether this could lead to unexpected behavior for users. It's not obvious that reading state from disk would lead to failing HTLCs back. For example, if a user has a method that reads all of RL's info from disk, and they batch Maybe this PR speaks to a need for (edit: possibly relevant tweet: https://twitter.com/jaredpalmer/status/1171415929865064449?s=20)
I'm a little confused by this - you have to provide the
Hopefully not. I tend to subscribe to the "having clean shutdown logic lulls you into a false sense of security" - user apps crash, especially on mobile, and getting a
Discussed it offline for a while with @valentinewallace. In short, I think this at least doesn't make sense because we should be going a different direction: solve #661 and the debug-assert failure by not failing back holding-cell HTLCs just because monitor updating has paused or we went to disk, and think harder about how to handle pre-chain-sync load-time HTLC relaying more generally (probably for now just say that you shouldn't connect to peers until the chain is synced).
Note that this bug was found thanks to the changes in #753, but it's reasonable to take this by itself for 0.0.12 without #753.