LSP-Plugin: Add MPP support#8948
Conversation
dc5b671 to
44f666d
Compare
44f666d to
df7c132
Compare
73b18ed to
da925c6
Compare
|
Cache cleared, and restarting the CI while I review the code 👍 |
|
Nice PR, maybe a bit on the long side, and a bit of duplication, but the architecture is nice. ACK |
| #[derive(Debug, Clone, PartialEq, Eq, thiserror::Error)] | ||
| pub enum Error { | ||
| #[error("variable amount payments are not supported")] | ||
| UnimplementedVarAmount, |
There was a problem hiding this comment.
Interesting, I didn't know that LSPS2 has a varamount mode. How is the signalling handled? I.e., how does the LSP learn the total amount, and the fee amount it is allowed to retain?
There was a problem hiding this comment.
If the client doesn't specify payment_size_msat on the lsps2.buy request, the LSP expects a single HTLC with a variable amount for the payment hash.
| )] | ||
| InsufficientDeductibleCapacity { | ||
| opening_fee_msat: u64, | ||
| deductible_capacity_msat: u128, |
|
|
||
| #[derive(Debug, Clone, PartialEq, Eq)] | ||
| pub struct PaymentPart { | ||
| pub htlc_id: u64, |
There was a problem hiding this comment.
Just a quick doubt I had: this refers to which ID? As far as I remember the numbering of HTLCs was using a composite (channel_id, htlc_id) key because the protocol insists on numbering HTLCs, making either an alias necessary or the composite key to make them unique.
This is the DB HTLC ID that counts up monotonically, not the protocol ID which is per-channel, correct?
There was a problem hiding this comment.
Very good question indeed, I'll need to check what exactly the htlc_accepted hook provides.
| } | ||
|
|
||
| #[derive(Debug, Clone, PartialEq, Eq)] | ||
| pub enum SessionEvent { |
There was a problem hiding this comment.
That's quite the extensive list of potential errors and outcomes, nice listing 👍
| } | ||
| } | ||
|
|
||
| pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> { |
There was a problem hiding this comment.
You'll probably want to break this up into a dispatch method (containing the match (field_a, field_b, input)) and several handler methods that take the involved parts and pieces of information as explicit parameters. That'll help get a quick overview, and allow diving deep on specific operations if interested.
There was a problem hiding this comment.
do you mean something like the following? This would indeed make it more readable.
pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {
match(&mut self.stat) {
SessionState::Collecting => apply_on_collecting(input),
SessionState::AwaitingChannelReady => apply_on_awaiting_channel_ready(input),
....
}
}
// Maybe needs to get the state passed here and reinjected into the Session later
fn apply_on_collecting(&mut self, input: SessionInput) -> Result<ApplyResult> {
// Check what needs to be "taken" here to please the ownership model
match(input) {
....
}
}There was a problem hiding this comment.
Yep, pretty much, can be a cleanup though. Also a quick diagram on allowable state changes would make this much easier to follow as well. All followup things though, not a blocker.
There was a problem hiding this comment.
That is not to say I don't like the (state, event) matching, as that automatically gives you an indication whether all states and transitions are covered, via the exhaustive enumeration rule in Rust.
| scid: ShortChannelId, | ||
| datastore: D, | ||
| ) -> ActorInboxHandle { | ||
| let (tx, inbox) = mpsc::channel(128); // Should we use max_htlcs? |
There was a problem hiding this comment.
The backlog is mostly intended for bursty behavior, and should be set to the maximum number of events in flight. If there is no more room, we will drop block the sending side, no messages should be lost. If the sender does not need to make progress, and the expectation is that we can process messages in the order they arrive (no interleaving) it should be safe to set the backlog to 1. Don't think about elements queued up, rather consider if you need to make progress on the sendign side at all, while elements are being processed.
| } | ||
| } | ||
|
|
||
| fn execute_action(&mut self, action: SessionAction) { |
There was a problem hiding this comment.
Love the match () {} matrix, less so the deep nesting, as it pulls the matrix apart and makes reasoning about its relations and transitions harder :-)
There was a problem hiding this comment.
I'll clean it up and move stuff into handler functions
| return Ok(serde_json::json!({ | ||
| "result": "continue", | ||
| "mindepth": 0, | ||
| "reserve": 0, |
There was a problem hiding this comment.
Hm, this will break megalithic LSP? They do not have a way to set no reserve.
There was a problem hiding this comment.
Whoo! GOOD catch. totally forgot about megalith here. I'll make it an option with a sane (set nothing) default
| }; | ||
|
|
||
| // Main loop: process inbox events | ||
| loop { |
There was a problem hiding this comment.
Is this not duplicating the entire FSM logic, just because we enter the system through recovery, rather than kicking off a new session? We could just call into the dispatch of events here, and everything else would be the same, or am I missing something?
There was a problem hiding this comment.
Let me see if I can clean it up a bit
| ) -> Result<(String, String)> { | ||
| (**self) | ||
| .fund_channel(peer_id, channel_capacity_msat, opening_fee_params) | ||
| .fund_channel(peer_id, channel_capacity_msat, opening_fee_params, scid) |
There was a problem hiding this comment.
Are there "let's define ALL the operations on a variant of the original behavior" useful? I am failing to see how defining the operations on Arc<T> and then just having them forward to T could be useful 🤔
|
The CI failures appear to NOT be flaky tests, as I am seeing a lot of the new tests failing on us. |
da925c6 to
d0fc010
Compare
|
@nepet can you rebase, resolve conflicts and clear CI by the end of this week so we can add this to 26.06? |
Introduce a state-machine-based approach to managing LSPS2 JIT channel sessions. The FSM tracks payment collection from initial channel open through HTLC forwarding to completion, replacing the previous ad-hoc state tracking. Also adds Sum trait for Msat and PartialEq for protocol types needed by the FSM. Changelog-Experimental: LSPS2 session state machine for JIT channels
Introduce a session actor that runs the FSM in an async task and communicates side effects through an ActionExecutor trait. This separates state machine logic from I/O concerns like RPC calls and datastore writes.
Add SessionManager that routes incoming HTLCs to the correct session actor by payment hash, replacing the previous handler-based approach. Reworks the policy plugin API and integrates the CLN RPC executor, unifies HTLC handling into the session FSM, and removes the now deprecated handler.rs.
Add integration tests covering the full session lifecycle: channel opening, HTLC forwarding, payment collection, and session completion.
Implement crash recovery for LSPS2 sessions so that in-progress JIT channel sessions survive plugin restarts. Adds recovery traits and datastore methods, a RecoveryProvider implementation for ClnApiRpc, forward monitoring for recovered sessions, and integration tests for recovery scenarios. Makes broadcast_tx and abandon_session idempotent to handle replayed actions safely.
Reduce DatastoreProvider from many methods to 5, with the actor owning the DatastoreEntry and driving all writes through the actor loop. This makes the datastore boundary simpler and testable.
Add an EventSink trait that decouples session event reporting from the transport layer. Includes a composite sink and a channel-based implementation. Wires EventSink through SessionActor and SessionManager, and persists payment_hash in DatastoreEntry.
Replace CLN-specific types (cln_rpc PublicKey, ShortChannelId alias) with standalone alternatives, feature-gate CLN dependencies behind a "cln" feature flag, split ClnApiRpc into focused adapter structs, and refactor Lsps2ServiceHandler generics for cleaner trait boundaries. This makes the lsps2 core reusable outside of CLN.
Merge BlockheightProvider into Lsps2PolicyProvider, extract check_cltv_timeout helper in the session FSM, flatten recovery branching in SessionManager, simplify the actor loop with convert_input and tokio::select!, and remove the unused CollectTimeout ActorInput variant.
We actually only use this in tests Signed-off-by: Peter Neuroth <pet.v.ne@gmail.com>
After restart, recovered session actors were stored in a separate recovery_handles Vec, unreachable by the forward_event notification path that routes via the sessions HashMap. This caused intermittent CI failures where on_payment_settled could not find the session and the internal forward-monitoring loop failed to detect settlement. Register recovered sessions in the sessions HashMap keyed by payment_hash so forward_event notifications reach them directly. For already-settled forwards, recover into Broadcasting state so the actor self-drives to completion without needing forward_event re-delivery. Remove the now-redundant internal polling loop (get_forward_activity + wait_for_forward_resolution).
4d6bfcf to
16d662b
Compare
|
rebased on master |
| pub opening_fee_params: OpeningFeeParams, | ||
| #[serde(skip_serializing_if = "Option::is_none")] | ||
| pub expected_payment_size: Option<Msat>, | ||
| pub channel_capacity_msat: Msat, |
There was a problem hiding this comment.
These are guaranteed to be new entries, right? If you try to load an existing one with non-optional fields it'll likely fail.
| } | ||
| } | ||
|
|
||
| pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> { |
There was a problem hiding this comment.
Yep, pretty much, can be a cleanup though. Also a quick diagram on allowable state changes would make this much easier to follow as well. All followup things though, not a blocker.
| return Ok(ApplyResult { | ||
| actions: vec![ | ||
| SessionAction::FailHtlcs { | ||
| failure_code: UNKNOWN_NEXT_PEER, |
There was a problem hiding this comment.
You probably don't want to use UNKNOWN_NEXT_PEER as that is a permanent error, and the client is supposed to remember that the channel never existed, and future attempts with blocklist them. It's a common error lnd always gets wrong too.
| // We don't check for max parts here as we are in the middle of | ||
| // the channel funding. We'll check once we transitioned. | ||
|
|
||
| Ok(ApplyResult { |
There was a problem hiding this comment.
Sorry my eyes glazed over at this point, I trust the logic is sensible, as reconstructing it from code is painful.
| } | ||
| } | ||
|
|
||
| pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> { |
There was a problem hiding this comment.
That is not to say I don't like the (state, event) matching, as that automatically gives you an indication whether all states and transitions are covered, via the exhaustive enumeration rule in Rust.
The UNKNOWN_NEXT_PEER constant carried the value 0x4010 which is not a valid BOLT4 failure code; unknown_next_peer is PERM|10 = 0x400a. Every path that was supposed to fail with unknown_next_peer actually sent the deprecated incorrect_payment_amount code. Also stop using the permanent unknown_next_peer for retryable failures (too many parts, fee allocation failure, channel funding failure) as senders are expected to blocklist the route on permanent errors. These now fail with temporary_channel_failure; LSPS2 only mandates unknown_next_peer when the payment cannot cover the opening fee or valid_until has passed, and requires temporary_channel_failure for a client disconnect during channel establishment. Addresses review feedback by @cdecker on ElementsProject#8948.
HTLC ids are a per-channel counter, so MPP parts of one payment arriving over different incoming channels can share the same id -- the common case for multi-part payments. The session actor tracked pending HTLC replies in a map keyed by the bare id, so a colliding part silently replaced the earlier part's reply channel: the earlier hook call resolved to a generic continue (failing the part upstream with WIRE_UNKNOWN_NEXT_PEER), the manager misread the dropped reply as a dead actor and removed the live session, and forward amounts could be cross-wired between parts. Introduce HtlcId as the composite of the incoming channel scid and the per-channel id, carry it in PaymentPart/ForwardPart and key the pending-reply map with it. Addresses review feedback by @cdecker on ElementsProject#8948 (HTLC id ambiguity).
Both sides unconditionally requested a zero channel reserve on JIT channels: the client's openchannel hook always replied reserve=0 and the LSP's fundchannel_start always sent reserve=0. Some LSP implementations (e.g. Megalithic) cannot handle an explicit zero reserve. Leave the reserve at lightningd's default and add opt-in flags: experimental-lsps-client-zero-reserve on the client and experimental-lsps2-zero-reserve on the service. Also add uniform #[serde(default)] on DatastoreEntry's optional fields and a regression test pinning that entries persisted without the newer fields keep deserializing. Addresses review feedback by @cdecker on ElementsProject#8948 (Megalithic reserve).
The UNKNOWN_NEXT_PEER constant carried the value 0x4010 which is not a valid BOLT4 failure code; unknown_next_peer is PERM|10 = 0x400a. Every path that was supposed to fail with unknown_next_peer actually sent the deprecated incorrect_payment_amount code. Also stop using the permanent unknown_next_peer for retryable failures (too many parts, fee allocation failure, channel funding failure) as senders are expected to blocklist the route on permanent errors. These now fail with temporary_channel_failure; LSPS2 only mandates unknown_next_peer when the payment cannot cover the opening fee or valid_until has passed, and requires temporary_channel_failure for a client disconnect during channel establishment. Addresses review feedback by @cdecker on ElementsProject#8948.
HTLC ids are a per-channel counter, so MPP parts of one payment arriving over different incoming channels can share the same id -- the common case for multi-part payments. The session actor tracked pending HTLC replies in a map keyed by the bare id, so a colliding part silently replaced the earlier part's reply channel: the earlier hook call resolved to a generic continue (failing the part upstream with WIRE_UNKNOWN_NEXT_PEER), the manager misread the dropped reply as a dead actor and removed the live session, and forward amounts could be cross-wired between parts. Introduce HtlcId as the composite of the incoming channel scid and the per-channel id, carry it in PaymentPart/ForwardPart and key the pending-reply map with it. Addresses review feedback by @cdecker on ElementsProject#8948 (HTLC id ambiguity).
On restart CLN re-plays the htlc_accepted hook for incoming HTLCs that were held but not resolved before the crash. The recovered session actor silently dropped these AddPart inputs, which also dropped the hook's reply channel: the manager misread that as a dead actor, removed the recovered session from its map, and subsequent forward_event notifications found no session -- so the withheld funding transaction was never broadcast even though the client could still settle the payment. Route AddPart through the same convert/apply pipeline as the other recovered inputs (the FSM already forwards late parts in the AwaitingSettlement and Broadcasting states) and release any pending replies when a recovered actor terminates.
Three related races allowed a second on-chain channel to be funded for a single lsps2.buy: - create_session did not check whether the active datastore entry already carried a channel_id, so any part arriving for a hash without a live session (payer retry, late relay, replay) started a fresh Collecting session that could reach the threshold and fund again. Refuse with a new SessionAlreadyFunded error instead. - on_payment_settled/on_payment_failed removed the session handle from the map while the actor was still alive, making the FSM's late-part forwarding paths (AwaitingSettlement/Broadcasting + AddPart) unreachable and opening the retry window above. Keep the handle registered and prune dead handles lazily, as on_part and on_new_block already do. - Map-entry removal on a dead handle was not identity-checked and could tear down a newer live session created concurrently for the same hash. Only remove the entry when it still refers to the same actor. The htlc_accepted hook now fails parts hitting a terminated or already-funded session with temporary_channel_failure instead of falling through to continue (which surfaced upstream as the permanent unknown_next_peer).
The CLTV check failed held HTLCs only once the chain tip was already past the earliest cltv_expiry. At that point the upstream peer is entitled to force-close to claim the HTLC timeout on-chain, so failing off-chain is too late. Fail once fewer than CLTV_SAFETY_BUFFER (6) blocks remain before the earliest expiry.
Every ForwardHtlcs action started a new 5-second listpeerchannels poller and overwrote the previous JoinHandle without aborting it, so each late-arriving part leaked a poll task that keeps running for the lifetime of the channel on successful sessions. Start the poller only once and abort it when the actor terminates.
Recovery aborted on the first entry that failed to recover (RPC error, malformed entry), leaving every other persisted session unrecovered while CLN replays their HTLCs. Log and continue per entry instead. An overflowing opening-fee computation surfaced as an FSM error that the actor only logged, leaving the HTLCs hanging until the collect timeout failed them with the wrong code. Fail the session immediately with unknown_next_peer as LSPS2 mandates.
… range The buy response advertised client_trusts_lsp: false while the funding flow withholds the funding transaction until the preimage is revealed -- which is exactly the client-trusts-LSP model. A spec-following client is entitled to wait for the funding tx in its mempool before settling when the flag is false, which would deadlock against the withhold. Advertise true until the broadcast-on-channel_ready mode is implemented. Also add the LSPS2-mandated check that a requested payment_size_msat lies within min/max_payment_size_msat, returning error 202/203. The existing tests for this passed for the wrong reasons (the mock had no blockheight, and the below-min case tripped the fee check instead); they now pin the exact error codes with a fully configured mock.
Compute the opening fee from the negotiated payment_size_msat instead of the sum of collected parts. BOLT4 permits the payer to overpay, and the client verifies the deducted extra_fees against the fee promised for payment_size_msat -- charging on parts_sum makes a compliant client reject overpaid payments. Var-invoice mode keeps using the first HTLC's value as mandated. Only attach the extra_fee TLV (type 65537) to forwarded parts that actually have a fee deducted; LSPS2 forbids including it otherwise, and we attached a zero-value TLV to every part.
Both sides unconditionally requested a zero channel reserve on JIT channels: the client's openchannel hook always replied reserve=0 and the LSP's fundchannel_start always sent reserve=0. Some LSP implementations (e.g. Megalithic) cannot handle an explicit zero reserve. Leave the reserve at lightningd's default and add opt-in flags: experimental-lsps-client-zero-reserve on the client and experimental-lsps2-zero-reserve on the service. Also add uniform #[serde(default)] on DatastoreEntry's optional fields and a regression test pinning that entries persisted without the newer fields keep deserializing. Addresses review feedback by @cdecker on ElementsProject#8948 (Megalithic reserve).
A single failed forward (e.g. a part exceeding the channel's max_accepted_htlcs) abandoned the whole session and closed the channel even though the client could still settle the other offered parts. Thread the failed part's (channel, htlc id) through the forward_event path and drop just that part from the session; abandon only when no forwarded parts are left (or the failed part is unknown, as for recovered sessions, preserving the previous behavior). Resolves the TODO about #HTLCs > max_accepted_htlcs.
- Recovered session actors now carry the peer id from the datastore entry so the Disconnect action works after recovery, instead of warning on an empty public key. - Move tokio's test-util feature to dev-dependencies. - Resolve the inbox-capacity TODO with a rationale comment: senders use backpressure, the buffer only needs to cover in-flight inputs. - Document the Msat/u64/u128 amount conventions in the session module. - Drop the unused LspsBuyJitChannelResponse struct and unused test helper, fix needless returns, a clone on a Copy type, and a while-let loop in code this branch introduced.
36c53dd to
4c58bff
Compare
Important
26.04 FREEZE March 11th: Non-bugfix PRs not ready by this date will wait for 26.06.
RC1 is scheduled on March 23rd
The final release is scheduled for April 15th.
Checklist
Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:
tools/lightning-downgradeIntroduces a state-machine-based approach to managing LSPS2 JIT channel sessions, replacing the previous ad-hoc state tracking with a structured FSM that tracks payment collection from initial channel open through HTLC forwarding to completion.