Skip to content

LSP-Plugin: Add MPP support#8948

Open
nepet wants to merge 24 commits into
ElementsProject:masterfrom
nepet:plugins/lsps2/mpp-fsm
Open

LSP-Plugin: Add MPP support#8948
nepet wants to merge 24 commits into
ElementsProject:masterfrom
nepet:plugins/lsps2/mpp-fsm

Conversation

@nepet

@nepet nepet commented Mar 18, 2026

Copy link
Copy Markdown
Member

Important

26.04 FREEZE March 11th: Non-bugfix PRs not ready by this date will wait for 26.06.

RC1 is scheduled on March 23rd

The final release is scheduled for April 15th.

Checklist

Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:

  • The changelog has been updated in the relevant commit(s) according to the guidelines.
  • Tests have been added or modified to reflect the changes.
  • Documentation has been reviewed and updated as needed.
  • Related issues have been listed and linked, including any that this PR closes.
  • Important All PRs must consider how to reverse any persistent changes for tools/lightning-downgrade

Introduces a state-machine-based approach to managing LSPS2 JIT channel sessions, replacing the previous ad-hoc state tracking with a structured FSM that tracks payment collection from initial channel open through HTLC forwarding to completion.

@nepet nepet force-pushed the plugins/lsps2/mpp-fsm branch from dc5b671 to 44f666d Compare March 18, 2026 23:33
@nepet nepet requested a review from cdecker March 18, 2026 23:36
@nepet nepet force-pushed the plugins/lsps2/mpp-fsm branch from 44f666d to df7c132 Compare March 19, 2026 10:33
@madelinevibes madelinevibes added this to the v26.04 milestone Mar 20, 2026
@nepet nepet force-pushed the plugins/lsps2/mpp-fsm branch from 73b18ed to da925c6 Compare March 22, 2026 10:06
@madelinevibes madelinevibes modified the milestones: v26.04, 26.06 Mar 22, 2026
@cdecker

cdecker commented Mar 23, 2026

Copy link
Copy Markdown
Member

Cache cleared, and restarting the CI while I review the code 👍

@cdecker

cdecker commented Mar 23, 2026

Copy link
Copy Markdown
Member

Nice PR, maybe a bit on the long side, and a bit of duplication, but the architecture is nice.

ACK

#[derive(Debug, Clone, PartialEq, Eq, thiserror::Error)]
pub enum Error {
#[error("variable amount payments are not supported")]
UnimplementedVarAmount,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I didn't know that LSPS2 has a varamount mode. How is the signalling handled? I.e., how does the LSP learn the total amount, and the fee amount it is allowed to retain?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the client doesn't specify payment_size_msat on the lsps2.buy request, the LSP expects a single HTLC with a variable amount for the payment hash.

)]
InsufficientDeductibleCapacity {
opening_fee_msat: u64,
deductible_capacity_msat: u128,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the different types here?


#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PaymentPart {
pub htlc_id: u64,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick doubt I had: this refers to which ID? As far as I remember the numbering of HTLCs was using a composite (channel_id, htlc_id) key because the protocol insists on numbering HTLCs, making either an alias necessary or the composite key to make them unique.

This is the DB HTLC ID that counts up monotonically, not the protocol ID which is per-channel, correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good question indeed, I'll need to check what exactly the htlc_accepted hook provides.

}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SessionEvent {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's quite the extensive list of potential errors and outcomes, nice listing 👍

}
}

pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably want to break this up into a dispatch method (containing the match (field_a, field_b, input)) and several handler methods that take the involved parts and pieces of information as explicit parameters. That'll help get a quick overview, and allow diving deep on specific operations if interested.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean something like the following? This would indeed make it more readable.

pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {
    match(&mut self.stat) {
        SessionState::Collecting => apply_on_collecting(input),
        SessionState::AwaitingChannelReady => apply_on_awaiting_channel_ready(input),
        ....
    }
}

// Maybe needs to get the state passed here and reinjected into the Session later
fn apply_on_collecting(&mut self, input: SessionInput) -> Result<ApplyResult> {
    // Check what needs to be "taken" here to please the ownership model
    match(input) {
        ....
    }
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, pretty much, can be a cleanup though. Also a quick diagram on allowable state changes would make this much easier to follow as well. All followup things though, not a blocker.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not to say I don't like the (state, event) matching, as that automatically gives you an indication whether all states and transitions are covered, via the exhaustive enumeration rule in Rust.

scid: ShortChannelId,
datastore: D,
) -> ActorInboxHandle {
let (tx, inbox) = mpsc::channel(128); // Should we use max_htlcs?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backlog is mostly intended for bursty behavior, and should be set to the maximum number of events in flight. If there is no more room, we will drop block the sending side, no messages should be lost. If the sender does not need to make progress, and the expectation is that we can process messages in the order they arrive (no interleaving) it should be safe to set the backlog to 1. Don't think about elements queued up, rather consider if you need to make progress on the sendign side at all, while elements are being processed.

}
}

fn execute_action(&mut self, action: SessionAction) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the match () {} matrix, less so the deep nesting, as it pulls the matrix apart and makes reasoning about its relations and transitions harder :-)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clean it up and move stuff into handler functions

Comment thread plugins/lsps-plugin/src/client.rs Outdated
return Ok(serde_json::json!({
"result": "continue",
"mindepth": 0,
"reserve": 0,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this will break megalithic LSP? They do not have a way to set no reserve.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoo! GOOD catch. totally forgot about megalith here. I'll make it an option with a sane (set nothing) default

};

// Main loop: process inbox events
loop {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not duplicating the entire FSM logic, just because we enter the system through recovery, rather than kicking off a new session? We could just call into the dispatch of events here, and everything else would be the same, or am I missing something?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I can clean it up a bit

) -> Result<(String, String)> {
(**self)
.fund_channel(peer_id, channel_capacity_msat, opening_fee_params)
.fund_channel(peer_id, channel_capacity_msat, opening_fee_params, scid)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there "let's define ALL the operations on a variant of the original behavior" useful? I am failing to see how defining the operations on Arc<T> and then just having them forward to T could be useful 🤔

@cdecker

cdecker commented Mar 23, 2026

Copy link
Copy Markdown
Member

The CI failures appear to NOT be flaky tests, as I am seeing a lot of the new tests failing on us.

@sangbida sangbida force-pushed the plugins/lsps2/mpp-fsm branch from da925c6 to d0fc010 Compare April 1, 2026 03:07
@madelinevibes

Copy link
Copy Markdown
Collaborator

@nepet can you rebase, resolve conflicts and clear CI by the end of this week so we can add this to 26.06?
Release candidate planned for Monday 11 May.

nepet and others added 12 commits May 5, 2026 09:27
Introduce a state-machine-based approach to managing LSPS2 JIT channel
sessions. The FSM tracks payment collection from initial channel open
through HTLC forwarding to completion, replacing the previous ad-hoc
state tracking.

Also adds Sum trait for Msat and PartialEq for protocol types needed
by the FSM.

Changelog-Experimental: LSPS2 session state machine for JIT channels
Introduce a session actor that runs the FSM in an async task and
communicates side effects through an ActionExecutor trait. This
separates state machine logic from I/O concerns like RPC calls and
datastore writes.
Add SessionManager that routes incoming HTLCs to the correct session
actor by payment hash, replacing the previous handler-based approach.

Reworks the policy plugin API and integrates the CLN RPC executor,
unifies HTLC handling into the session FSM, and removes the now
deprecated handler.rs.
Add integration tests covering the full session lifecycle: channel
opening, HTLC forwarding, payment collection, and session completion.
Implement crash recovery for LSPS2 sessions so that in-progress JIT
channel sessions survive plugin restarts.

Adds recovery traits and datastore methods, a RecoveryProvider
implementation for ClnApiRpc, forward monitoring for recovered
sessions, and integration tests for recovery scenarios. Makes
broadcast_tx and abandon_session idempotent to handle replayed
actions safely.
Reduce DatastoreProvider from many methods to 5, with the actor
owning the DatastoreEntry and driving all writes through the actor
loop. This makes the datastore boundary simpler and testable.
Add an EventSink trait that decouples session event reporting from the
transport layer. Includes a composite sink and a channel-based
implementation. Wires EventSink through SessionActor and
SessionManager, and persists payment_hash in DatastoreEntry.
Replace CLN-specific types (cln_rpc PublicKey, ShortChannelId alias)
with standalone alternatives, feature-gate CLN dependencies behind a
"cln" feature flag, split ClnApiRpc into focused adapter structs, and
refactor Lsps2ServiceHandler generics for cleaner trait boundaries.

This makes the lsps2 core reusable outside of CLN.
Merge BlockheightProvider into Lsps2PolicyProvider, extract
check_cltv_timeout helper in the session FSM, flatten recovery
branching in SessionManager, simplify the actor loop with
convert_input and tokio::select!, and remove the unused
CollectTimeout ActorInput variant.
We actually only use this in tests

Signed-off-by: Peter Neuroth <pet.v.ne@gmail.com>
After restart, recovered session actors were stored in a separate
recovery_handles Vec, unreachable by the forward_event notification
path that routes via the sessions HashMap. This caused intermittent
CI failures where on_payment_settled could not find the session and
the internal forward-monitoring loop failed to detect settlement.

Register recovered sessions in the sessions HashMap keyed by
payment_hash so forward_event notifications reach them directly.
For already-settled forwards, recover into Broadcasting state so
the actor self-drives to completion without needing forward_event
re-delivery. Remove the now-redundant internal polling loop
(get_forward_activity + wait_for_forward_resolution).
@daywalker90 daywalker90 force-pushed the plugins/lsps2/mpp-fsm branch from 4d6bfcf to 16d662b Compare May 5, 2026 07:34
@daywalker90

Copy link
Copy Markdown
Collaborator

rebased on master

pub opening_fee_params: OpeningFeeParams,
#[serde(skip_serializing_if = "Option::is_none")]
pub expected_payment_size: Option<Msat>,
pub channel_capacity_msat: Msat,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are guaranteed to be new entries, right? If you try to load an existing one with non-optional fields it'll likely fail.

}
}

pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, pretty much, can be a cleanup though. Also a quick diagram on allowable state changes would make this much easier to follow as well. All followup things though, not a blocker.

return Ok(ApplyResult {
actions: vec![
SessionAction::FailHtlcs {
failure_code: UNKNOWN_NEXT_PEER,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't want to use UNKNOWN_NEXT_PEER as that is a permanent error, and the client is supposed to remember that the channel never existed, and future attempts with blocklist them. It's a common error lnd always gets wrong too.

// We don't check for max parts here as we are in the middle of
// the channel funding. We'll check once we transitioned.

Ok(ApplyResult {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my eyes glazed over at this point, I trust the logic is sensible, as reconstructing it from code is painful.

}
}

pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not to say I don't like the (state, event) matching, as that automatically gives you an indication whether all states and transitions are covered, via the exhaustive enumeration rule in Rust.

@madelinevibes madelinevibes modified the milestones: v26.06, v26.09 May 11, 2026
nepet added a commit to nepet/lightning that referenced this pull request Jul 3, 2026
The UNKNOWN_NEXT_PEER constant carried the value 0x4010 which is not a
valid BOLT4 failure code; unknown_next_peer is PERM|10 = 0x400a. Every
path that was supposed to fail with unknown_next_peer actually sent the
deprecated incorrect_payment_amount code.

Also stop using the permanent unknown_next_peer for retryable failures
(too many parts, fee allocation failure, channel funding failure) as
senders are expected to blocklist the route on permanent errors. These
now fail with temporary_channel_failure; LSPS2 only mandates
unknown_next_peer when the payment cannot cover the opening fee or
valid_until has passed, and requires temporary_channel_failure for a
client disconnect during channel establishment.

Addresses review feedback by @cdecker on ElementsProject#8948.
nepet added a commit to nepet/lightning that referenced this pull request Jul 3, 2026
HTLC ids are a per-channel counter, so MPP parts of one payment
arriving over different incoming channels can share the same id -- the
common case for multi-part payments. The session actor tracked pending
HTLC replies in a map keyed by the bare id, so a colliding part
silently replaced the earlier part's reply channel: the earlier hook
call resolved to a generic continue (failing the part upstream with
WIRE_UNKNOWN_NEXT_PEER), the manager misread the dropped reply as a
dead actor and removed the live session, and forward amounts could be
cross-wired between parts.

Introduce HtlcId as the composite of the incoming channel scid and the
per-channel id, carry it in PaymentPart/ForwardPart and key the
pending-reply map with it.

Addresses review feedback by @cdecker on ElementsProject#8948 (HTLC id ambiguity).
nepet added a commit to nepet/lightning that referenced this pull request Jul 3, 2026
Both sides unconditionally requested a zero channel reserve on JIT
channels: the client's openchannel hook always replied reserve=0 and
the LSP's fundchannel_start always sent reserve=0. Some LSP
implementations (e.g. Megalithic) cannot handle an explicit zero
reserve. Leave the reserve at lightningd's default and add opt-in
flags: experimental-lsps-client-zero-reserve on the client and
experimental-lsps2-zero-reserve on the service.

Also add uniform #[serde(default)] on DatastoreEntry's optional fields
and a regression test pinning that entries persisted without the newer
fields keep deserializing.

Addresses review feedback by @cdecker on ElementsProject#8948 (Megalithic reserve).
nepet added 12 commits July 3, 2026 18:01
The UNKNOWN_NEXT_PEER constant carried the value 0x4010 which is not a
valid BOLT4 failure code; unknown_next_peer is PERM|10 = 0x400a. Every
path that was supposed to fail with unknown_next_peer actually sent the
deprecated incorrect_payment_amount code.

Also stop using the permanent unknown_next_peer for retryable failures
(too many parts, fee allocation failure, channel funding failure) as
senders are expected to blocklist the route on permanent errors. These
now fail with temporary_channel_failure; LSPS2 only mandates
unknown_next_peer when the payment cannot cover the opening fee or
valid_until has passed, and requires temporary_channel_failure for a
client disconnect during channel establishment.

Addresses review feedback by @cdecker on ElementsProject#8948.
HTLC ids are a per-channel counter, so MPP parts of one payment
arriving over different incoming channels can share the same id -- the
common case for multi-part payments. The session actor tracked pending
HTLC replies in a map keyed by the bare id, so a colliding part
silently replaced the earlier part's reply channel: the earlier hook
call resolved to a generic continue (failing the part upstream with
WIRE_UNKNOWN_NEXT_PEER), the manager misread the dropped reply as a
dead actor and removed the live session, and forward amounts could be
cross-wired between parts.

Introduce HtlcId as the composite of the incoming channel scid and the
per-channel id, carry it in PaymentPart/ForwardPart and key the
pending-reply map with it.

Addresses review feedback by @cdecker on ElementsProject#8948 (HTLC id ambiguity).
On restart CLN re-plays the htlc_accepted hook for incoming HTLCs that
were held but not resolved before the crash. The recovered session
actor silently dropped these AddPart inputs, which also dropped the
hook's reply channel: the manager misread that as a dead actor,
removed the recovered session from its map, and subsequent
forward_event notifications found no session -- so the withheld
funding transaction was never broadcast even though the client could
still settle the payment.

Route AddPart through the same convert/apply pipeline as the other
recovered inputs (the FSM already forwards late parts in the
AwaitingSettlement and Broadcasting states) and release any pending
replies when a recovered actor terminates.
Three related races allowed a second on-chain channel to be funded for
a single lsps2.buy:

- create_session did not check whether the active datastore entry
  already carried a channel_id, so any part arriving for a hash without
  a live session (payer retry, late relay, replay) started a fresh
  Collecting session that could reach the threshold and fund again.
  Refuse with a new SessionAlreadyFunded error instead.

- on_payment_settled/on_payment_failed removed the session handle from
  the map while the actor was still alive, making the FSM's late-part
  forwarding paths (AwaitingSettlement/Broadcasting + AddPart)
  unreachable and opening the retry window above. Keep the handle
  registered and prune dead handles lazily, as on_part and on_new_block
  already do.

- Map-entry removal on a dead handle was not identity-checked and could
  tear down a newer live session created concurrently for the same
  hash. Only remove the entry when it still refers to the same actor.

The htlc_accepted hook now fails parts hitting a terminated or
already-funded session with temporary_channel_failure instead of
falling through to continue (which surfaced upstream as the permanent
unknown_next_peer).
The CLTV check failed held HTLCs only once the chain tip was already
past the earliest cltv_expiry. At that point the upstream peer is
entitled to force-close to claim the HTLC timeout on-chain, so failing
off-chain is too late. Fail once fewer than CLTV_SAFETY_BUFFER (6)
blocks remain before the earliest expiry.
Every ForwardHtlcs action started a new 5-second listpeerchannels
poller and overwrote the previous JoinHandle without aborting it, so
each late-arriving part leaked a poll task that keeps running for the
lifetime of the channel on successful sessions. Start the poller only
once and abort it when the actor terminates.
Recovery aborted on the first entry that failed to recover (RPC error,
malformed entry), leaving every other persisted session unrecovered
while CLN replays their HTLCs. Log and continue per entry instead.

An overflowing opening-fee computation surfaced as an FSM error that
the actor only logged, leaving the HTLCs hanging until the collect
timeout failed them with the wrong code. Fail the session immediately
with unknown_next_peer as LSPS2 mandates.
… range

The buy response advertised client_trusts_lsp: false while the funding
flow withholds the funding transaction until the preimage is revealed
-- which is exactly the client-trusts-LSP model. A spec-following
client is entitled to wait for the funding tx in its mempool before
settling when the flag is false, which would deadlock against the
withhold. Advertise true until the broadcast-on-channel_ready mode is
implemented.

Also add the LSPS2-mandated check that a requested payment_size_msat
lies within min/max_payment_size_msat, returning error 202/203. The
existing tests for this passed for the wrong reasons (the mock had no
blockheight, and the below-min case tripped the fee check instead);
they now pin the exact error codes with a fully configured mock.
Compute the opening fee from the negotiated payment_size_msat instead
of the sum of collected parts. BOLT4 permits the payer to overpay, and
the client verifies the deducted extra_fees against the fee promised
for payment_size_msat -- charging on parts_sum makes a compliant
client reject overpaid payments. Var-invoice mode keeps using the
first HTLC's value as mandated.

Only attach the extra_fee TLV (type 65537) to forwarded parts that
actually have a fee deducted; LSPS2 forbids including it otherwise,
and we attached a zero-value TLV to every part.
Both sides unconditionally requested a zero channel reserve on JIT
channels: the client's openchannel hook always replied reserve=0 and
the LSP's fundchannel_start always sent reserve=0. Some LSP
implementations (e.g. Megalithic) cannot handle an explicit zero
reserve. Leave the reserve at lightningd's default and add opt-in
flags: experimental-lsps-client-zero-reserve on the client and
experimental-lsps2-zero-reserve on the service.

Also add uniform #[serde(default)] on DatastoreEntry's optional fields
and a regression test pinning that entries persisted without the newer
fields keep deserializing.

Addresses review feedback by @cdecker on ElementsProject#8948 (Megalithic reserve).
A single failed forward (e.g. a part exceeding the channel's
max_accepted_htlcs) abandoned the whole session and closed the channel
even though the client could still settle the other offered parts.
Thread the failed part's (channel, htlc id) through the forward_event
path and drop just that part from the session; abandon only when no
forwarded parts are left (or the failed part is unknown, as for
recovered sessions, preserving the previous behavior).

Resolves the TODO about #HTLCs > max_accepted_htlcs.
- Recovered session actors now carry the peer id from the datastore
  entry so the Disconnect action works after recovery, instead of
  warning on an empty public key.
- Move tokio's test-util feature to dev-dependencies.
- Resolve the inbox-capacity TODO with a rationale comment: senders
  use backpressure, the buffer only needs to cover in-flight inputs.
- Document the Msat/u64/u128 amount conventions in the session module.
- Drop the unused LspsBuyJitChannelResponse struct and unused test
  helper, fix needless returns, a clone on a Copy type, and a
  while-let loop in code this branch introduced.
@nepet nepet force-pushed the plugins/lsps2/mpp-fsm branch from 36c53dd to 4c58bff Compare July 3, 2026 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants