Prevent HTLC double-forwards, prune forwarded onions #4303
Conversation
👋 Thanks for assigning @joostjager as a reviewer!
Codecov Report

@@            Coverage Diff             @@
##             main    #4303      +/-   ##
==========================================
+ Coverage   86.03%   86.05%   +0.01%
==========================================
  Files         156      156
  Lines      103036   103334     +298
  Branches   103036   103334     +298
==========================================
+ Hits        88652    88919     +267
- Misses      11876    11902      +26
- Partials     2508     2513       +5
Force-pushed: 01aa4c0 → 4012864
Force-pushed: 4012864 → e2ea1ed
Force-pushed: e2ea1ed → c40741b
Pushed a rebase on
Force-pushed: c40741b → 3f0f811
Force-pushed: 3f0f811 → fce5071
Haven't reviewed yet, but the following question came up: in the previous PR and in this PR, double-forward bugs are fixed. Couldn't the fuzzer detect this?
joostjager left a comment
The logic in this PR is quite complex, with multiple interacting state machines (inbound HTLC state, outbound HTLC state, holding cells, monitor updates) that need to stay consistent across restarts. The fact that multiple double-forward bugs were discovered during development suggests the state space is large enough that targeted unit tests may not provide sufficient coverage.
I keep wondering if the existing fuzzer can exercise these restart scenarios with in-flight HTLCs at various stages. The current strategy of adding a specific regression test might be too reactive?
lightning/src/ln/channelmanager.rs (outdated)

    //
    // If 0.3 or 0.4 reads this manager version, it knows that the legacy maps were not written and
    // acts accordingly.
    const RECONSTRUCT_HTLCS_FROM_CHANS_VERSION: u8 = 2;
Didn't you want to make this version 10, to be able to insert other upgrades in between?
    Forwarded {
        /// Useful if we need to fail or claim this HTLC backwards after restart, if it's missing in the
        /// outbound edge.
        hop_data: HTLCPreviousHopData,
I think there is some data in here that is redundant with InboundHTLCOutput. Not sure if we care
I'm not sure either. @TheBlueMatt, any opinion? This is the simplest way, so I might save the dedup for a follow-up.
Hmm, yeah, it would still be quite nice to reduce the size here, but it's alright to do it in a follow-up (it just has to happen in 0.3!)
@TheBlueMatt @joostjager I sketched this out: 9c8a92e
It adds a fair amount of code and only saves the htlc_id and cltv_expiry fields. Does that seem worth it? I'm not super convinced, but I guess it could add up.
Hm, yes, I don't know if it is worth it. Note that in WithOnion there are also duplications. I think it could have been:

    WithOnion {
        onion_routing_packet: OnionPacket,
        skimmed_fee_msat: Option<u64>,
        blinding_point: Option<PublicKey>,
    },

Generally I like deduplicated data. No risk of an invalid state. But no strong opinion here.
Wait can't we also drop the channel_id, outpoint, counterparty_node_id, user_channel_id, and prev_outbound_scid_alias?
channel_id and prev_outbound_scid_alias are used in fail_htlc_backwards_internal. counterparty_node_id, user_channel_id, and outpoint are used in claim_funds_internal / claim_funds_from_hop.
Duh, we do have those fields already because we have the whole inbound Channel. Will proceed with de-duplicating in a follow-up.
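For illustration, a minimal sketch of the kind of de-duplicated variant discussed above; the names (InboundHtlcResolution, incoming_packet_shared_secret) are hypothetical stand-ins, not the actual LDK definitions:

```rust
/// Hypothetical, simplified stand-in for the real LDK enum: the `Forwarded`
/// variant keeps only data that cannot be recovered from the owning inbound
/// `Channel` (which already knows channel_id, htlc_id, cltv_expiry,
/// counterparty_node_id, user_channel_id, etc.).
enum InboundHtlcResolution {
    Forwarded {
        /// E.g. the shared secret needed to fail the HTLC backwards.
        incoming_packet_shared_secret: [u8; 32],
        /// Amount forwarded on the outbound edge, in millisatoshis.
        outbound_amt_msat: u64,
    },
}

fn main() {
    // Remaining context is read back from the Channel's own HTLC entry when
    // failing or claiming backwards.
    let _res = InboundHtlcResolution::Forwarded {
        incoming_packet_shared_secret: [0u8; 32],
        outbound_amt_msat: 1_000,
    };
}
```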
Force-pushed: fce5071 → 84093ca
Rebased since #4332 landed, haven't pushed updates addressing feedback yet
Force-pushed: 84093ca → 94815bd
TheBlueMatt left a comment
LGTM except for the last hunk in the second-to-last commit. Really have no idea what's going on there.
lightning/src/ln/channelmanager.rs (outdated)

    for (hash, hop_data, outbound_amt_msat) in
        mem::take(&mut already_forwarded_htlcs).drain(..)
    {
        if hash != payment_hash {
I'm really confused by this hunk. This loop is really just being used for inbound payments, not forwarded ones, so this seems like it should belong further up where we're creating pending_claims_to_replay. I'm not really sure why this is in the loop at all - it looks at the channel we're iterating over when scanning pending_claims_to_replay but not when looking at the already_forwarded_htlcs entry?
We're also keying by payment hash - skipping any failures that happen to have the same payment hash as a forwarded payment, possibly resulting in unrelated HTLCs (e.g. an MPP that got forwarded through us twice) getting forgotten (not that it's critical to forget them, just that it's weird).
Discussed quite a bit offline. I added a fixup converting already_forwarded_htlcs to a hashmap.
The conclusion of the discussion was, instead of the current approach, to: "when iterating through stored preimages, rather than using already_forwarded_htlcs, ask the channel for its list of [all] inbound htlcs and add the unclaimed ones in general to pending_claims_to_replay."
However, I think the only HTLCs we can easily add to pending_claims_to_replay are the ones that are already in already_forwarded_htlcs. All other types of inbound unclaimed HTLCs don't have an HTLCSource available. So the discussed approach seems like a decent amount of extra code to manually construct the HTLCSources, for what seem like unreachable codepaths (since if the monitor and the manager are that out-of-sync about HTLC state, then the manager is stale and we probably FC'd the channel anyway).
I think a big part of the complaint about this code hunk is that it wasn't accessing data that was specific to a channel, so that part should at least be fixed now with the hashmap update.
(Btw, I will push the ChannelMonitor docs updates we discussed on the next push.)
Discussed more offline yesterday, pushed the approach we seemed to land on.
It's not ideal that we claim an inbound HTLC that doesn't come from a specific outbound HTLC's source, but it doesn't seem worth worrying about too much here. Basically, we don't always want to claim an HTLC just because we have the preimage for it, because there can be an attack where a malicious intermediate node receives an HTLC to forward, then forwards a 1-msat HTLC to another node with the same payment_hash to see if they'll claim it, then claims the inbound HTLC without forwarding. It's unlikely to happen in this case because it relies on a node restart.
Force-pushed: 4e69514 → 18268f2
lightning/src/ln/channelmanager.rs (outdated)

    >,
     prev_hop: &HTLCPreviousHopData| {
        if let hash_map::Entry::Occupied(mut entry) =
            already_forwarded_htlcs.entry(prev_hop.channel_id)
It seems one other change here is that prev_outbound_scid_alias is no longer used?
Good point, will fix on the next push after resolving the discussion with Matt above.
Actually, I'm not sure it makes sense to check the scid since we're already checking the channel_id.
Force-pushed: 18268f2 → 1b1408c
Rebased due to conflicts, will address feedback on the next push
We recently added support for reconstructing ChannelManager::decode_update_add_htlcs on startup, using data present in the Channels. However, we failed to prune HTLCs from this rebuilt map if a given inbound HTLC was already forwarded to the outbound edge and sitting in the outbound holding cell (this bug could've caused us to double-forward HTLCs; fortunately, it never shipped).

As part of fixing this bug, we clean up the overall pruning approach:

1. If the Channel is open, then it is the source of truth for which HTLCs are outbound and pending (including pending in the holding cell).
2. If the Channel is closed, then the corresponding ChannelMonitor is the source of truth for which HTLCs are outbound and pending.

Previously, we would only consider the monitor's pending HTLCs, which ignored holding-cell HTLCs.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
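A rough sketch of the source-of-truth rule above, using simplified stand-in types rather than the real Channel/ChannelMonitor APIs:

```rust
use std::collections::HashSet;

// Simplified stand-ins for the real LDK structs; only the fields needed to
// illustrate the rule are included.
struct Channel {
    is_open: bool,
    /// Outbound HTLC ids pending in the channel, including any sitting in the
    /// holding cell.
    pending_outbound_htlcs: Vec<u64>,
}

struct ChannelMonitor {
    /// Outbound HTLC ids the monitor still considers pending.
    pending_outbound_htlcs: Vec<u64>,
}

/// Returns the set of outbound HTLC ids treated as already forwarded, so the
/// corresponding inbound HTLCs get pruned from the rebuilt
/// decode_update_add_htlcs map.
fn outbound_pending_htlcs(chan: &Channel, monitor: &ChannelMonitor) -> HashSet<u64> {
    if chan.is_open {
        // Open channel: the Channel itself (holding cell included) is the
        // source of truth.
        chan.pending_outbound_htlcs.iter().copied().collect()
    } else {
        // Closed channel: fall back to the ChannelMonitor.
        monitor.pending_outbound_htlcs.iter().copied().collect()
    }
}

fn main() {
    let chan = Channel { is_open: true, pending_outbound_htlcs: vec![1, 2] };
    let monitor = ChannelMonitor { pending_outbound_htlcs: vec![1] };
    // With the channel open, the holding-cell HTLC (id 2) is not missed.
    assert_eq!(outbound_pending_htlcs(&chan, &monitor).len(), 2);
}
```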
This cleanup falls out of the changes made in the previous commit. Separated out here for reviewability.
We recently added support for reconstructing ChannelManager::decode_update_add_htlcs on startup, using data present in the Channels. However, we failed to prune HTLCs from this rebuilt map if a given HTLC was already forwarded to, and removed from, the outbound edge, with its resolution sitting in the inbound edge's holding cell. Here we fix this bug, which would have caused us to double-forward inbound HTLCs; fortunately, it was never shipped.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
In 0.3+, we are taking steps to remove the requirement of regularly persisting
the ChannelManager and instead rebuild the set of HTLC forwards (and the
manager generally) from Channel{Monitor} data.
We previously merged support for reconstructing the
ChannelManager::decode_update_add_htlcs map from channel data, using a new
HTLC onion field that will be present for inbound HTLCs received on 0.3+ only.
However, we now want to add support for pruning this field once it's no longer
needed so it doesn't get persisted every time the manager gets persisted. At
the same time, in a future LDK version we need to detect whether the field was
ever present to begin with to prevent upgrading with legacy HTLCs present.
We accomplish both by converting the plain update_add option that was
previously serialized to an enum that can indicate whether the HTLC is from
0.2- versus 0.3+-with-onion-pruned (a variant for the latter is added in the
next commit).
Actual pruning of the new update_add field is added in the next commit.
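Roughly, the serialization change described here amounts to something like the following sketch; the variant and type names are illustrative, not the actual LDK identifiers:

```rust
// Stand-in for the serialized onion packet.
struct OnionPacket(Vec<u8>);

/// Replaces the plain `Option<OnionPacket>` that was previously serialized.
#[allow(dead_code)]
enum InboundHtlcOnion {
    /// HTLC was received on 0.2 or earlier; no onion was ever stored.
    Legacy,
    /// HTLC was received on 0.3+ and the onion is still needed to rebuild
    /// decode_update_add_htlcs on restart.
    WithOnion(OnionPacket),
    // A "forwarded, onion pruned" variant is introduced in the next commit.
}

fn main() {
    let onion = InboundHtlcOnion::WithOnion(OnionPacket(vec![0u8; 1366]));
    // Unlike a plain Option, the enum distinguishes "never had an onion"
    // (legacy 0.2- HTLC) from "had one and pruned it" (added next commit).
    match onion {
        InboundHtlcOnion::Legacy => println!("0.2- HTLC, cannot reconstruct"),
        InboundHtlcOnion::WithOnion(_) => println!("0.3+ HTLC, onion available"),
    }
}
```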
We store inbound committed HTLCs' onions in Channels for use in reconstructing the pending HTLC set on ChannelManager read. If an HTLC has been forwarded to the outbound edge, we no longer need to persist the inbound edge's onion and can prune it here.
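A minimal sketch of that pruning step (hypothetical names, building on the enum shape sketched above):

```rust
struct OnionPacket(Vec<u8>);

#[allow(dead_code)]
enum InboundHtlcOnion {
    Legacy,
    WithOnion(OnionPacket),
    /// Irrevocably forwarded to the outbound edge; onion dropped so it is no
    /// longer persisted with the Channel.
    ForwardedOnionPruned,
}

fn prune_if_forwarded(onion: &mut InboundHtlcOnion, irrevocably_forwarded: bool) {
    if irrevocably_forwarded {
        if let InboundHtlcOnion::WithOnion(_) = onion {
            *onion = InboundHtlcOnion::ForwardedOnionPruned;
        }
    }
}

fn main() {
    let mut onion = InboundHtlcOnion::WithOnion(OnionPacket(vec![0u8; 1366]));
    prune_if_forwarded(&mut onion, true);
    assert!(matches!(onion, InboundHtlcOnion::ForwardedOnionPruned));
}
```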
We recently merged (test-only, for now) support for the ChannelManager
reconstructing its set of pending HTLCs from Channel{Monitor} data, rather than
using its own persisted maps. But because we want test coverage of both the new
reconstruction codepaths as well as the old persisted map codepaths,
in tests we would decide between those two sets of codepaths randomly.
We now want to add some tests that require using the new codepaths, so here we
add an option to explicitly set whether to reconstruct or not rather than
choosing randomly.
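The knob could look something like the sketch below (names are hypothetical): an explicit override with a random fallback matching the old behavior:

```rust
enum HtlcReconstructionMode {
    /// Explicitly exercise the new reconstruct-from-Channel codepaths.
    Reconstruct,
    /// Explicitly exercise the legacy persisted-map codepaths.
    LegacyMaps,
    /// Pick one of the two at random, as tests did before this change.
    Random,
}

fn should_reconstruct(mode: HtlcReconstructionMode, coin_flip: bool) -> bool {
    match mode {
        HtlcReconstructionMode::Reconstruct => true,
        HtlcReconstructionMode::LegacyMaps => false,
        HtlcReconstructionMode::Random => coin_flip,
    }
}

fn main() {
    // Tests that require the new codepaths set the mode explicitly.
    assert!(should_reconstruct(HtlcReconstructionMode::Reconstruct, false));
}
```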
Cleans it up a bit in preparation for adding a new variant in the next commit.
In a recent commit, we added support for pruning an inbound HTLC's persisted onion once the HTLC has been irrevocably forwarded to the outbound edge. Here, we add a check on startup that those inbound HTLCs were actually handled. Specifically, we check that the inbound HTLC is either (a) currently present in the outbound edge or (b) was removed via claim. If neither of those is true, we infer that the HTLC was removed from the outbound edge via a fail, and we fail the inbound HTLC backwards.
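The decision described here boils down to roughly the following, sketched with simplified inputs rather than the real channel/monitor state:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum InboundHtlcAction {
    /// Still present on the outbound edge; nothing to do yet.
    Wait,
    /// Removed from the outbound edge via claim; claim backwards.
    ClaimBackwards,
    /// Not on the outbound edge and no preimage; infer a failure and fail the
    /// inbound HTLC backwards.
    FailBackwards,
}

fn resolve_forwarded_htlc(
    present_in_outbound_edge: bool,
    removed_via_claim: bool,
) -> InboundHtlcAction {
    if present_in_outbound_edge {
        InboundHtlcAction::Wait
    } else if removed_via_claim {
        InboundHtlcAction::ClaimBackwards
    } else {
        InboundHtlcAction::FailBackwards
    }
}

fn main() {
    assert_eq!(resolve_forwarded_htlc(false, false), InboundHtlcAction::FailBackwards);
    assert_eq!(resolve_forwarded_htlc(true, false), InboundHtlcAction::Wait);
}
```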
In 0.3+, we are taking steps to remove the requirement of regularly persisting
the ChannelManager and instead rebuild the set of HTLC forwards (and the
manager generally) from Channel{Monitor} data.
We previously merged support for reconstructing the
ChannelManager::decode_update_add_htlcs map from channel data, using a new
HTLC onion field that will be present for inbound HTLCs received on 0.3+ only.
The plan is that in upcoming LDK versions, the manager will reconstruct this
map and the other forward/claimable/pending HTLC maps will automatically
repopulate themselves on the next call to process_pending_htlc_forwards.
As such, once we're in a future version that reconstructs the pending HTLC set,
we can stop persisting the legacy ChannelManager maps such as forward_htlcs,
pending_intercepted_htlcs since they will never be used.
For 0.3 to be compatible with this future version, in this commit we detect
that the manager was last written on a version of LDK that doesn't persist the
legacy maps. In that case, we don't try to read the old forwards map and instead
run only the new reconstruction logic.
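In effect, the gate is just a comparison against the serialized manager version, along the lines of this sketch (the constant name mirrors the diff; its value and the helper function are illustrative only):

```rust
const RECONSTRUCT_HTLCS_FROM_CHANS_VERSION: u8 = 2;

/// Managers written at or above this version no longer persist forward_htlcs /
/// pending_intercepted_htlcs, so readers must skip the legacy maps and rely on
/// reconstruction from Channel data only.
fn should_read_legacy_forward_maps(serialized_manager_version: u8) -> bool {
    serialized_manager_version < RECONSTRUCT_HTLCS_FROM_CHANS_VERSION
}

fn main() {
    assert!(should_read_legacy_forward_maps(1));
    assert!(!should_read_legacy_forward_maps(2));
}
```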
Force-pushed: 1b1408c → 3b75eee
Addressed feedback (sorry, there's a spurious change from accidentally rebasing on main a 2nd time): diff
    // that it is handled.
    let mut already_forwarded_htlcs: HashMap<
        (ChannelId, PaymentHash),
        Vec<(HTLCPreviousHopData, u64)>,
@TheBlueMatt I think we can't avoid having a Vec here, because there can be multiple HTLCs corresponding to the same (ChannelId, PaymentHash) tuple due to MPP?
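For instance, under simplified stand-in types, two MPP parts of the same payment arriving over the same inbound channel collide on the key and both have to be kept:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real LDK types.
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct ChannelId([u8; 32]);
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct PaymentHash([u8; 32]);

/// Stand-in for HTLCPreviousHopData; only the HTLC id matters here.
#[allow(dead_code)]
struct PrevHopData {
    htlc_id: u64,
}

fn main() {
    let mut already_forwarded_htlcs: HashMap<(ChannelId, PaymentHash), Vec<(PrevHopData, u64)>> =
        HashMap::new();
    let key = (ChannelId([0; 32]), PaymentHash([1; 32]));

    // Two MPP parts of the same payment, forwarded over the same inbound
    // channel: same (ChannelId, PaymentHash) key, distinct HTLCs.
    already_forwarded_htlcs.entry(key).or_default().push((PrevHopData { htlc_id: 7 }, 1_000));
    already_forwarded_htlcs.entry(key).or_default().push((PrevHopData { htlc_id: 8 }, 1_000));

    assert_eq!(already_forwarded_htlcs[&key].len(), 2);
}
```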
Yea, that makes sense. Can claude add a test for having two HTLCs forwarded over the same path with MPP?
Yeah I'll go for that test. We'll want two claims for the same payment where both HTLC claims went into the inbound holding cell prior to shutdown (and the holding cell was lost after restart).
But now I'm wondering, since we need to loop anyway, is it really worth having a hashmap given that each value is a Vec? Seems like extra allocations...
Update: going to add the test in follow-up PR #4405
A lot less to loop over :). At some point we really need to do a prevec like Bitcoin Core's...
    //
    // If 0.3 or 0.4 reads this manager version, it knows that the legacy maps were not written and
    // acts accordingly.
    const RECONSTRUCT_HTLCS_FROM_CHANS_VERSION: u8 = 5;
Why 5 and not 2?
Requested by @joostjager in case we need to bump again in the meantime, cc #4303 (comment)
Given we haven't bumped it since we implemented persistence years ago I'm pretty skeptical lol. Also, if we do define a new version we'll need a similar migration gap so we can just call it 2 and any new version upgrades will presumably need to start at 3 and go to 0.6 or whatever.
Yeah, I mentioned that first bit; I think he was thinking we might want the flexibility of having a bump that's in-between? @joostjager I don't mind either way, let us know your thoughts.
the version field is only really useful for stuff like this - backwards-incompat changes where we're removing fields. Anything else can go through TLVs, so I don't really think we can have some change in between.
Why wouldn't it be possible that we want to remove other non-tlv fields on a different timeline, in between?
joostjager left a comment
Ack on hashmap key change. My concerns about the complexity of this PR and the absence of fuzzing coverage remain.
    payment_preimage,
    outbound_amt_msat,
    is_channel_closed,
    counterparty_node_id,
pending_claims_to_replay gets destructured further down into downstream_* fields (the outgoing side, ending up in PaymentForwarded as "next_*"?), but it is populated here with data from the inbound monitor? Maybe this was covered in previous discussion, but I'm not 100% sure.
You are right :( I think early on in the PR I thought this was the downstream channel and forgot to go back and correct it. I have an unaddressed todo that probably would've caught this... Unfortunately, fixing this means persisting more data in the inbound edge HTLC so we can remember more stuff about the outbound edge.
But this means that the event data is incorrect. Maybe it would then be better to make the event fields optional?
I fixed the incorrect data in #4405.
Landing, changes can happen in a follow-up.
ChannelManager forward maps and fully reconstruct them from Channel data. See the follow-up tagged 0.5: (0.5) Deprecate legacy pending HTLC ChannelManager persistence #4359. Closes #4280, partially addresses #4286.
Based on #4289. Channels in prod based on the serialized manager version: Fix double-forward, prefer legacy forward maps #4289 (comment).