[Feat] Important updates to the experimental unified trainer #398

Merged
listar2000 merged 9 commits into main from refactor-unified-trainer on Feb 17, 2026
Conversation

@listar2000 (Collaborator) commented Feb 17, 2026

What does this PR do?

This PR contains a few important refactorings to the experimental unified_trainer features and the Tinker backend support. Most of the changes are non-breaking since they live under the experimental/ folder -- so I will just give a high-level summary of the important changes.

Feat 1: Deprecation of the per_step mode

Related files: rllm/experimental/common/advantage.py, rllm/experimental/common/config.py

During config creation, we now automatically enforce stepwise_mode = "broadcast" in the AlgorithmConfig and issue a DeprecationWarning for backward compatibility (i.e. the code doesn't directly fail). All other functions using the AlgorithmConfig will strictly check for this mode.
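
A minimal sketch of what this config-time coercion could look like, assuming a dataclass-style config (AlgorithmConfig and stepwise_mode are the real names from this PR; everything else is illustrative):

```python
import warnings
from dataclasses import dataclass

@dataclass
class AlgorithmConfig:
    stepwise_mode: str = "broadcast"

    def __post_init__(self):
        # Backward compatibility: warn and coerce instead of failing outright.
        if self.stepwise_mode == "per_step":
            warnings.warn(
                "stepwise_mode='per_step' is deprecated and will be coerced to "
                "'broadcast'; use trajectory_grouping_hook to customize rewards.",
                DeprecationWarning,
                stacklevel=2,
            )
            self.stepwise_mode = "broadcast"
```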

TODO: we currently have a trajectory_grouping_hook kwarg for UnifiedTrainer that lets users customize the reward computation logic -- i.e. implement per_step mode themselves -- but this is not yet documented. Documentation is needed; a sketch of possible usage follows below.
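
For illustration only: the hook signature is not documented yet, so the trajectory/step shapes below are assumptions about how a user-provided hook might emulate the old per_step behavior.

```python
def per_step_grouping_hook(trajectories):
    """Emulate the deprecated per_step mode by making every step its own
    group, so rewards are computed per step instead of broadcast."""
    return [[step] for traj in trajectories for step in traj.steps]

# Hypothetical wiring (the kwarg name is real; other args are elided):
# trainer = UnifiedTrainer(..., trajectory_grouping_hook=per_step_grouping_hook)
```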

Feat 2: Validation and bypass mode for pre-computed advantages

Certain algorithms, such as on-policy (self-)distillation or SFT, can already perform the advantage/loss calculation at the "workflow rollout stage", since they only rely on the logprobs/tokens of a single step (rather than requiring a grouping operation as in GRPO-style RL algorithms). As suggested by @kylemontgomery1, it is easy to provide users an option to compute step.advantage (either a scalar or a list with length equal to step.logprobs) on their own.
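
A sketch of the user-facing side, assuming steps expose the advantage and logprobs fields described above (the helper name and the constant weight are made up):

```python
def attach_precomputed_advantages(episode):
    """Fill in step.advantage during the workflow rollout stage so the
    trainer can bypass its own advantage computation."""
    for step in episode.steps:
        # Either a scalar, or a list with length equal to step.logprobs;
        # the 1.0 here is a placeholder (e.g. an SFT-style uniform weight).
        step.advantage = [1.0] * len(step.logprobs)
    return episode
```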

In this PR, we add support for this feature while being very "cautious" about the user-customized input: we carefully validate whether the steps in a TrajectoryGroup already have their advantages computed (via _check_advantage_already_computed), and issue a warning with mandatory fallback logic for special cases (e.g. half of the steps have advantages while the other half do not). If the advantages are indeed valid, we directly skip the compute_advantage stage altogether. A sketch of this policy is shown below.
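
A minimal sketch of the validation policy, under assumptions: the real _check_advantage_already_computed lives in the experimental trainer and may differ, and the TrajectoryGroup attributes below are illustrative.

```python
import warnings

def _check_advantage_already_computed(group):
    """Return True iff every step in the TrajectoryGroup has an advantage."""
    flags = [step.advantage is not None
             for traj in group.trajectories
             for step in traj.steps]
    if flags and all(flags):
        return True   # valid: skip the compute_advantage stage altogether
    if any(flags):
        # Special case: only some steps carry pre-computed advantages.
        warnings.warn(
            "Only some steps have pre-computed advantages; falling back to "
            "the standard compute_advantage path for the whole group.",
            stacklevel=2,
        )
    return False      # mandatory fallback: recompute for the entire group
```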

Feat 3: On-policy self-distillation (OPSD; Ongoing)

Previously I proposed reproducing results from papers like Self-distilled Reasoner or Self-Distillation Policy Optimization via a dedicated TinkerOSPDBackend. However, this introduces unnecessary overhead and code complexity (why a new backend?).

In this PR, with support from Feature 2, we can abstract the OPSD logic via a postprocess_opsd decorator that directly wraps the run function of a workflow. This greatly reduces the user's mental burden and potentially allows better extensibility to more backends (yes, the current wrapper still only supports Tinker).
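
A sketch of the decorator pattern: postprocess_opsd is the real name from this PR, but the body below is an assumption, and compute_opsd_advantage is a hypothetical placeholder for the teacher-scoring logic.

```python
import functools

def postprocess_opsd(run_fn):
    """Wrap a workflow's run function and attach OPSD advantages on the fly."""
    @functools.wraps(run_fn)
    async def wrapped(self, *args, **kwargs):
        episode = await run_fn(self, *args, **kwargs)
        for step in episode.steps:
            # Placeholder: score the sampled tokens under the teacher policy
            # (currently via the Tinker backend) and store the distillation
            # signal as a pre-computed advantage, letting Feature 2 bypass
            # the trainer's compute_advantage stage for these steps.
            step.advantage = compute_opsd_advantage(step)  # hypothetical helper
        return episode
    return wrapped
```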

I will also discuss with @kylemontgomery1 how to integrate his OPD code (for both Tinker and Verl) into this module.

Feat 4: Documentation (Ongoing)

Added two docs, experimental/unified_trainer.md and experimental/backend-protocol.md, to clearly explain the API design (for developers). For general users, I will be writing a more high-level "motivation" doc soon.

listar2000 merged commit ecddeae into main Feb 17, 2026
1 check passed
listar2000 deleted the refactor-unified-trainer branch on February 17, 2026 at 22:47