LSP-Plugin: Add MPP support by nepet · Pull Request #8948 · ElementsProject/lightning

nepet · 2026-03-18T23:28:44Z

Important

26.04 FREEZE March 11th: Non-bugfix PRs not ready by this date will wait for 26.06.

RC1 is scheduled on March 23rd

The final release is scheduled for April 15th.

Checklist

Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:

The changelog has been updated in the relevant commit(s) according to the guidelines.
Tests have been added or modified to reflect the changes.
Documentation has been reviewed and updated as needed.
Related issues have been listed and linked, including any that this PR closes.
Important All PRs must consider how to reverse any persistent changes for tools/lightning-downgrade

Introduces a state-machine-based approach to managing LSPS2 JIT channel sessions, replacing the previous ad-hoc state tracking with a structured FSM that tracks payment collection from initial channel open through HTLC forwarding to completion.

Is based on, and requires Withheld channel close: fail back inflight HTLCs and free UTXOs #8943 to be merged first.

cdecker · 2026-03-23T13:15:29Z

Cache cleared, and restarting the CI while I review the code 👍

cdecker · 2026-03-23T13:49:37Z

Nice PR, maybe a bit on the long side, and a bit of duplication, but the architecture is nice.

ACK

cdecker · 2026-03-23T13:18:06Z

+#[derive(Debug, Clone, PartialEq, Eq, thiserror::Error)]
+pub enum Error {
+    #[error("variable amount payments are not supported")]
+    UnimplementedVarAmount,


Interesting, I didn't know that LSPS2 has a varamount mode. How is the signalling handled? I.e., how does the LSP learn the total amount, and the fee amount it is allowed to retain?

If the client doesn't specify payment_size_msat on the lsps2.buy request, the LSP expects a single HTLC with a variable amount for the payment hash.

cdecker · 2026-03-23T13:18:38Z

+    )]
+    InsufficientDeductibleCapacity {
+        opening_fee_msat: u64,
+        deductible_capacity_msat: u128,


Why the different types here?

cdecker · 2026-03-23T13:20:48Z

+
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct PaymentPart {
+    pub htlc_id: u64,


Just a quick doubt I had: this refers to which ID? As far as I remember the numbering of HTLCs was using a composite (channel_id, htlc_id) key because the protocol insists on numbering HTLCs, making either an alias necessary or the composite key to make them unique.

This is the DB HTLC ID that counts up monotonically, not the protocol ID which is per-channel, correct?

Very good question indeed, I'll need to check what exactly the htlc_accepted hook provides.

cdecker · 2026-03-23T13:21:49Z

+}
+
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum SessionEvent {


That's quite the extensive list of potential errors and outcomes, nice listing 👍

cdecker · 2026-03-23T13:25:30Z

+        }
+    }
+
+    pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {


You'll probably want to break this up into a dispatch method (containing the match (field_a, field_b, input)) and several handler methods that take the involved parts and pieces of information as explicit parameters. That'll help get a quick overview, and allow diving deep on specific operations if interested.

do you mean something like the following? This would indeed make it more readable.

pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> { match(&mut self.stat) { SessionState::Collecting => apply_on_collecting(input), SessionState::AwaitingChannelReady => apply_on_awaiting_channel_ready(input), .... } } // Maybe needs to get the state passed here and reinjected into the Session later fn apply_on_collecting(&mut self, input: SessionInput) -> Result<ApplyResult> { // Check what needs to be "taken" here to please the ownership model match(input) { .... } }

Yep, pretty much, can be a cleanup though. Also a quick diagram on allowable state changes would make this much easier to follow as well. All followup things though, not a blocker.

That is not to say I don't like the (state, event) matching, as that automatically gives you an indication whether all states and transitions are covered, via the exhaustive enumeration rule in Rust.

cdecker · 2026-03-23T13:30:25Z

+        scid: ShortChannelId,
+        datastore: D,
+    ) -> ActorInboxHandle {
+        let (tx, inbox) = mpsc::channel(128); // Should we use max_htlcs?


The backlog is mostly intended for bursty behavior, and should be set to the maximum number of events in flight. If there is no more room, we will drop block the sending side, no messages should be lost. If the sender does not need to make progress, and the expectation is that we can process messages in the order they arrive (no interleaving) it should be safe to set the backlog to 1. Don't think about elements queued up, rather consider if you need to make progress on the sendign side at all, while elements are being processed.

cdecker · 2026-03-23T13:31:58Z

+        }
+    }
+
+    fn execute_action(&mut self, action: SessionAction) {


Love the match () {} matrix, less so the deep nesting, as it pulls the matrix apart and makes reasoning about its relations and transitions harder :-)

I'll clean it up and move stuff into handler functions

cdecker · 2026-03-23T13:36:12Z

        return Ok(serde_json::json!({
            "result": "continue",
            "mindepth": 0,
+            "reserve": 0,


Hm, this will break megalithic LSP? They do not have a way to set no reserve.

Whoo! GOOD catch. totally forgot about megalith here. I'll make it an option with a sane (set nothing) default

cdecker · 2026-03-23T13:40:50Z

+        };
+
+        // Main loop: process inbox events
+        loop {


Is this not duplicating the entire FSM logic, just because we enter the system through recovery, rather than kicking off a new session? We could just call into the dispatch of events here, and everything else would be the same, or am I missing something?

Let me see if I can clean it up a bit

cdecker · 2026-03-23T13:42:42Z

    ) -> Result<(String, String)> {
        (**self)
-            .fund_channel(peer_id, channel_capacity_msat, opening_fee_params)
+            .fund_channel(peer_id, channel_capacity_msat, opening_fee_params, scid)


Are there "let's define ALL the operations on a variant of the original behavior" useful? I am failing to see how defining the operations on Arc<T> and then just having them forward to T could be useful 🤔

cdecker · 2026-03-23T13:53:09Z

The CI failures appear to NOT be flaky tests, as I am seeing a lot of the new tests failing on us.

madelinevibes · 2026-05-04T06:38:03Z

@nepet can you rebase, resolve conflicts and clear CI by the end of this week so we can add this to 26.06?
Release candidate planned for Monday 11 May.

Introduce a state-machine-based approach to managing LSPS2 JIT channel sessions. The FSM tracks payment collection from initial channel open through HTLC forwarding to completion, replacing the previous ad-hoc state tracking. Also adds Sum trait for Msat and PartialEq for protocol types needed by the FSM. Changelog-Experimental: LSPS2 session state machine for JIT channels

Introduce a session actor that runs the FSM in an async task and communicates side effects through an ActionExecutor trait. This separates state machine logic from I/O concerns like RPC calls and datastore writes.

Add SessionManager that routes incoming HTLCs to the correct session actor by payment hash, replacing the previous handler-based approach. Reworks the policy plugin API and integrates the CLN RPC executor, unifies HTLC handling into the session FSM, and removes the now deprecated handler.rs.

Add integration tests covering the full session lifecycle: channel opening, HTLC forwarding, payment collection, and session completion.

Implement crash recovery for LSPS2 sessions so that in-progress JIT channel sessions survive plugin restarts. Adds recovery traits and datastore methods, a RecoveryProvider implementation for ClnApiRpc, forward monitoring for recovered sessions, and integration tests for recovery scenarios. Makes broadcast_tx and abandon_session idempotent to handle replayed actions safely.

Reduce DatastoreProvider from many methods to 5, with the actor owning the DatastoreEntry and driving all writes through the actor loop. This makes the datastore boundary simpler and testable.

Add an EventSink trait that decouples session event reporting from the transport layer. Includes a composite sink and a channel-based implementation. Wires EventSink through SessionActor and SessionManager, and persists payment_hash in DatastoreEntry.

Replace CLN-specific types (cln_rpc PublicKey, ShortChannelId alias) with standalone alternatives, feature-gate CLN dependencies behind a "cln" feature flag, split ClnApiRpc into focused adapter structs, and refactor Lsps2ServiceHandler generics for cleaner trait boundaries. This makes the lsps2 core reusable outside of CLN.

Merge BlockheightProvider into Lsps2PolicyProvider, extract check_cltv_timeout helper in the session FSM, flatten recovery branching in SessionManager, simplify the actor loop with convert_input and tokio::select!, and remove the unused CollectTimeout ActorInput variant.

We actually only use this in tests Signed-off-by: Peter Neuroth <pet.v.ne@gmail.com>

After restart, recovered session actors were stored in a separate recovery_handles Vec, unreachable by the forward_event notification path that routes via the sessions HashMap. This caused intermittent CI failures where on_payment_settled could not find the session and the internal forward-monitoring loop failed to detect settlement. Register recovered sessions in the sessions HashMap keyed by payment_hash so forward_event notifications reach them directly. For already-settled forwards, recover into Broadcasting state so the actor self-drives to completion without needing forward_event re-delivery. Remove the now-redundant internal polling loop (get_forward_activity + wait_for_forward_resolution).

daywalker90 · 2026-05-05T07:35:31Z

rebased on master

cdecker · 2026-05-07T10:23:35Z

    pub opening_fee_params: OpeningFeeParams,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub expected_payment_size: Option<Msat>,
+    pub channel_capacity_msat: Msat,


These are guaranteed to be new entries, right? If you try to load an existing one with non-optional fields it'll likely fail.

cdecker · 2026-05-07T10:36:23Z

+        }
+    }
+
+    pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {


Yep, pretty much, can be a cleanup though. Also a quick diagram on allowable state changes would make this much easier to follow as well. All followup things though, not a blocker.

cdecker · 2026-05-07T10:38:01Z

+                    return Ok(ApplyResult {
+                        actions: vec![
+                            SessionAction::FailHtlcs {
+                                failure_code: UNKNOWN_NEXT_PEER,


You probably don't want to use UNKNOWN_NEXT_PEER as that is a permanent error, and the client is supposed to remember that the channel never existed, and future attempts with blocklist them. It's a common error lnd always gets wrong too.

cdecker · 2026-05-07T11:02:36Z

+                // We don't check for max parts here as we are in the middle of
+                // the channel funding. We'll check once we transitioned.
+
+                Ok(ApplyResult {


Sorry my eyes glazed over at this point, I trust the logic is sensible, as reconstructing it from code is painful.

cdecker · 2026-05-07T11:04:37Z

+        }
+    }
+
+    pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {


That is not to say I don't like the (state, event) matching, as that automatically gives you an indication whether all states and transitions are covered, via the exhaustive enumeration rule in Rust.

@cdecker

The UNKNOWN_NEXT_PEER constant carried the value 0x4010 which is not a valid BOLT4 failure code; unknown_next_peer is PERM|10 = 0x400a. Every path that was supposed to fail with unknown_next_peer actually sent the deprecated incorrect_payment_amount code. Also stop using the permanent unknown_next_peer for retryable failures (too many parts, fee allocation failure, channel funding failure) as senders are expected to blocklist the route on permanent errors. These now fail with temporary_channel_failure; LSPS2 only mandates unknown_next_peer when the payment cannot cover the opening fee or valid_until has passed, and requires temporary_channel_failure for a client disconnect during channel establishment. Addresses review feedback by @cdecker on ElementsProject#8948.

@cdecker

HTLC ids are a per-channel counter, so MPP parts of one payment arriving over different incoming channels can share the same id -- the common case for multi-part payments. The session actor tracked pending HTLC replies in a map keyed by the bare id, so a colliding part silently replaced the earlier part's reply channel: the earlier hook call resolved to a generic continue (failing the part upstream with WIRE_UNKNOWN_NEXT_PEER), the manager misread the dropped reply as a dead actor and removed the live session, and forward amounts could be cross-wired between parts. Introduce HtlcId as the composite of the incoming channel scid and the per-channel id, carry it in PaymentPart/ForwardPart and key the pending-reply map with it. Addresses review feedback by @cdecker on ElementsProject#8948 (HTLC id ambiguity).

@cdecker

Both sides unconditionally requested a zero channel reserve on JIT channels: the client's openchannel hook always replied reserve=0 and the LSP's fundchannel_start always sent reserve=0. Some LSP implementations (e.g. Megalithic) cannot handle an explicit zero reserve. Leave the reserve at lightningd's default and add opt-in flags: experimental-lsps-client-zero-reserve on the client and experimental-lsps2-zero-reserve on the service. Also add uniform #[serde(default)] on DatastoreEntry's optional fields and a regression test pinning that entries persisted without the newer fields keep deserializing. Addresses review feedback by @cdecker on ElementsProject#8948 (Megalithic reserve).

@cdecker

The UNKNOWN_NEXT_PEER constant carried the value 0x4010 which is not a valid BOLT4 failure code; unknown_next_peer is PERM|10 = 0x400a. Every path that was supposed to fail with unknown_next_peer actually sent the deprecated incorrect_payment_amount code. Also stop using the permanent unknown_next_peer for retryable failures (too many parts, fee allocation failure, channel funding failure) as senders are expected to blocklist the route on permanent errors. These now fail with temporary_channel_failure; LSPS2 only mandates unknown_next_peer when the payment cannot cover the opening fee or valid_until has passed, and requires temporary_channel_failure for a client disconnect during channel establishment. Addresses review feedback by @cdecker on ElementsProject#8948.

@cdecker

HTLC ids are a per-channel counter, so MPP parts of one payment arriving over different incoming channels can share the same id -- the common case for multi-part payments. The session actor tracked pending HTLC replies in a map keyed by the bare id, so a colliding part silently replaced the earlier part's reply channel: the earlier hook call resolved to a generic continue (failing the part upstream with WIRE_UNKNOWN_NEXT_PEER), the manager misread the dropped reply as a dead actor and removed the live session, and forward amounts could be cross-wired between parts. Introduce HtlcId as the composite of the incoming channel scid and the per-channel id, carry it in PaymentPart/ForwardPart and key the pending-reply map with it. Addresses review feedback by @cdecker on ElementsProject#8948 (HTLC id ambiguity).

On restart CLN re-plays the htlc_accepted hook for incoming HTLCs that were held but not resolved before the crash. The recovered session actor silently dropped these AddPart inputs, which also dropped the hook's reply channel: the manager misread that as a dead actor, removed the recovered session from its map, and subsequent forward_event notifications found no session -- so the withheld funding transaction was never broadcast even though the client could still settle the payment. Route AddPart through the same convert/apply pipeline as the other recovered inputs (the FSM already forwards late parts in the AwaitingSettlement and Broadcasting states) and release any pending replies when a recovered actor terminates.

Three related races allowed a second on-chain channel to be funded for a single lsps2.buy: - create_session did not check whether the active datastore entry already carried a channel_id, so any part arriving for a hash without a live session (payer retry, late relay, replay) started a fresh Collecting session that could reach the threshold and fund again. Refuse with a new SessionAlreadyFunded error instead. - on_payment_settled/on_payment_failed removed the session handle from the map while the actor was still alive, making the FSM's late-part forwarding paths (AwaitingSettlement/Broadcasting + AddPart) unreachable and opening the retry window above. Keep the handle registered and prune dead handles lazily, as on_part and on_new_block already do. - Map-entry removal on a dead handle was not identity-checked and could tear down a newer live session created concurrently for the same hash. Only remove the entry when it still refers to the same actor. The htlc_accepted hook now fails parts hitting a terminated or already-funded session with temporary_channel_failure instead of falling through to continue (which surfaced upstream as the permanent unknown_next_peer).

The CLTV check failed held HTLCs only once the chain tip was already past the earliest cltv_expiry. At that point the upstream peer is entitled to force-close to claim the HTLC timeout on-chain, so failing off-chain is too late. Fail once fewer than CLTV_SAFETY_BUFFER (6) blocks remain before the earliest expiry.

Every ForwardHtlcs action started a new 5-second listpeerchannels poller and overwrote the previous JoinHandle without aborting it, so each late-arriving part leaked a poll task that keeps running for the lifetime of the channel on successful sessions. Start the poller only once and abort it when the actor terminates.

Recovery aborted on the first entry that failed to recover (RPC error, malformed entry), leaving every other persisted session unrecovered while CLN replays their HTLCs. Log and continue per entry instead. An overflowing opening-fee computation surfaced as an FSM error that the actor only logged, leaving the HTLCs hanging until the collect timeout failed them with the wrong code. Fail the session immediately with unknown_next_peer as LSPS2 mandates.

… range The buy response advertised client_trusts_lsp: false while the funding flow withholds the funding transaction until the preimage is revealed -- which is exactly the client-trusts-LSP model. A spec-following client is entitled to wait for the funding tx in its mempool before settling when the flag is false, which would deadlock against the withhold. Advertise true until the broadcast-on-channel_ready mode is implemented. Also add the LSPS2-mandated check that a requested payment_size_msat lies within min/max_payment_size_msat, returning error 202/203. The existing tests for this passed for the wrong reasons (the mock had no blockheight, and the below-min case tripped the fee check instead); they now pin the exact error codes with a fully configured mock.

Compute the opening fee from the negotiated payment_size_msat instead of the sum of collected parts. BOLT4 permits the payer to overpay, and the client verifies the deducted extra_fees against the fee promised for payment_size_msat -- charging on parts_sum makes a compliant client reject overpaid payments. Var-invoice mode keeps using the first HTLC's value as mandated. Only attach the extra_fee TLV (type 65537) to forwarded parts that actually have a fee deducted; LSPS2 forbids including it otherwise, and we attached a zero-value TLV to every part.

@cdecker

Both sides unconditionally requested a zero channel reserve on JIT channels: the client's openchannel hook always replied reserve=0 and the LSP's fundchannel_start always sent reserve=0. Some LSP implementations (e.g. Megalithic) cannot handle an explicit zero reserve. Leave the reserve at lightningd's default and add opt-in flags: experimental-lsps-client-zero-reserve on the client and experimental-lsps2-zero-reserve on the service. Also add uniform #[serde(default)] on DatastoreEntry's optional fields and a regression test pinning that entries persisted without the newer fields keep deserializing. Addresses review feedback by @cdecker on ElementsProject#8948 (Megalithic reserve).

A single failed forward (e.g. a part exceeding the channel's max_accepted_htlcs) abandoned the whole session and closed the channel even though the client could still settle the other offered parts. Thread the failed part's (channel, htlc id) through the forward_event path and drop just that part from the session; abandon only when no forwarded parts are left (or the failed part is unknown, as for recovered sessions, preserving the previous behavior). Resolves the TODO about #HTLCs > max_accepted_htlcs.

- Recovered session actors now carry the peer id from the datastore entry so the Disconnect action works after recovery, instead of warning on an empty public key. - Move tokio's test-util feature to dev-dependencies. - Resolve the inbox-capacity TODO with a rationale comment: senders use backpressure, the buffer only needs to cover in-flight inputs. - Document the Msat/u64/u128 amount conventions in the session module. - Drop the unused LspsBuyJitChannelResponse struct and unused test helper, fix needless returns, a clone on a Copy type, and a while-let loop in code this branch introduced.

nepet force-pushed the plugins/lsps2/mpp-fsm branch from dc5b671 to 44f666d Compare March 18, 2026 23:33

nepet requested a review from cdecker March 18, 2026 23:36

nepet force-pushed the plugins/lsps2/mpp-fsm branch from 44f666d to df7c132 Compare March 19, 2026 10:33

madelinevibes added this to the v26.04 milestone Mar 20, 2026

nepet force-pushed the plugins/lsps2/mpp-fsm branch from 73b18ed to da925c6 Compare March 22, 2026 10:06

madelinevibes added the PLEASE clear CI 🫠 label Mar 22, 2026

madelinevibes modified the milestones: v26.04, 26.06 Mar 22, 2026

cdecker removed the PLEASE clear CI 🫠 label Mar 23, 2026

cdecker approved these changes Mar 23, 2026

View reviewed changes

sangbida force-pushed the plugins/lsps2/mpp-fsm branch from da925c6 to d0fc010 Compare April 1, 2026 03:07

madelinevibes added the needs-rebase label May 4, 2026

nepet and others added 12 commits May 5, 2026 09:27

plugins(lsps2): add session actor with action executor boundary

ecdbc64

Introduce a session actor that runs the FSM in an async task and communicates side effects through an ActionExecutor trait. This separates state machine logic from I/O concerns like RPC calls and datastore writes.

plugins(lsps2): add integration tests for session lifecycle

c45e5cb

Add integration tests covering the full session lifecycle: channel opening, HTLC forwarding, payment collection, and session completion.

plugins(lsps2): simplify DatastoreProvider to focused trait

8b35b6a

Reduce DatastoreProvider from many methods to 5, with the actor owning the DatastoreEntry and driving all writes through the actor loop. This makes the datastore boundary simpler and testable.

plugins(lsps2): make collect timeout dev config

662fdf0

We actually only use this in tests Signed-off-by: Peter Neuroth <pet.v.ne@gmail.com>

lsps2: adding list_finalized_sessions to the DatastoreProvider

16d662b

daywalker90 force-pushed the plugins/lsps2/mpp-fsm branch from 4d6bfcf to 16d662b Compare May 5, 2026 07:34

cdecker reviewed May 7, 2026

View reviewed changes

cdecker approved these changes May 7, 2026

View reviewed changes

madelinevibes modified the milestones: v26.06, v26.09 May 11, 2026

madelinevibes assigned nepet Jun 23, 2026

nepet added 12 commits July 3, 2026 18:01

nepet force-pushed the plugins/lsps2/mpp-fsm branch from 36c53dd to 4c58bff Compare July 3, 2026 16:04

Uh oh!

Conversation

nepet commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

cdecker commented Mar 23, 2026

Uh oh!

cdecker commented Mar 23, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cdecker commented Mar 23, 2026

Uh oh!

madelinevibes commented May 4, 2026

Uh oh!

daywalker90 commented May 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nepet commented Mar 18, 2026 •

edited

Loading