[Feat] Important updates to the experimental unified trainer #398

Merged
listar2000 merged 9 commits into main from refactor-unified-trainer on Feb 17, 2026
Conversation

@listar2000 (Collaborator) commented Feb 17, 2026

What does this PR do?

This PR contains a few important refactorings to the experimental unified_trainer features and the Tinker backend support. Most of the changes are non-breaking since they live under the experimental/ folder -- so I will just give a high-level summary of the important changes.

Feat 1: Deprecation of the per_step mode

Related files: rllm/experimental/common/advantage.py, rllm/experimental/common/config.py

During config creation, we now automatically enforce stepwise_mode = "broadcast" in the AlgorithmConfig and issue a DeprecationWarning for backward compatibility (i.e. the code doesn't directly fail). All other functions using the AlgorithmConfig will strictly check for this mode.
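
A minimal sketch of what this config-time coercion could look like, assuming a dataclass-style config (AlgorithmConfig and stepwise_mode are the real names from this PR; everything else is illustrative):

```python
import warnings
from dataclasses import dataclass

@dataclass
class AlgorithmConfig:
    stepwise_mode: str = "broadcast"

    def __post_init__(self):
        # Backward compatibility: warn and coerce instead of failing outright.
        if self.stepwise_mode == "per_step":
            warnings.warn(
                "stepwise_mode='per_step' is deprecated and will be coerced to "
                "'broadcast'; use trajectory_grouping_hook to customize rewards.",
                DeprecationWarning,
                stacklevel=2,
            )
            self.stepwise_mode = "broadcast"
```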

TODO: we currently have a trajectory_grouping_hook kwarg for UnifiedTrainer that lets users customize the reward computation logic -- i.e. implement per_step mode themselves -- but this is not yet documented. Documentation is needed; a sketch of possible usage follows below.
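
For illustration only: the hook signature is not documented yet, so the trajectory/step shapes below are assumptions about how a user-provided hook might emulate the old per_step behavior.

```python
def per_step_grouping_hook(trajectories):
    """Emulate the deprecated per_step mode by making every step its own
    group, so rewards are computed per step instead of broadcast."""
    return [[step] for traj in trajectories for step in traj.steps]

# Hypothetical wiring (the kwarg name is real; other args are elided):
# trainer = UnifiedTrainer(..., trajectory_grouping_hook=per_step_grouping_hook)
```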

Feat 2: Validation and bypass mode for pre-computed advantages

Certain algorithms, such as on-policy (self-)distillation or SFT, can already perform the advantage/loss calculation at the "workflow rollout stage", since they only rely on the logprobs/tokens of a single step (rather than requiring a grouping operation as in GRPO-style RL algorithms). As suggested by @kylemontgomery1, it is easy to provide users an option to compute step.advantage (either a scalar or a list with length equal to step.logprobs) on their own.
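
A sketch of the user-facing side, assuming steps expose the advantage and logprobs fields described above (the helper name and the constant weight are made up):

```python
def attach_precomputed_advantages(episode):
    """Fill in step.advantage during the workflow rollout stage so the
    trainer can bypass its own advantage computation."""
    for step in episode.steps:
        # Either a scalar, or a list with length equal to step.logprobs;
        # the 1.0 here is a placeholder (e.g. an SFT-style uniform weight).
        step.advantage = [1.0] * len(step.logprobs)
    return episode
```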

In this PR, we add support for this feature while being very "cautious" about the user-customized input: we carefully validate whether the steps in a TrajectoryGroup already have their advantages computed (via _check_advantage_already_computed), and issue a warning with mandatory fallback logic for special cases (e.g. half of the steps have advantages while the other half do not). If the advantages are indeed valid, we directly skip the compute_advantage stage altogether. A sketch of this policy is shown below.
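
A minimal sketch of the validation policy, under assumptions: the real _check_advantage_already_computed lives in the experimental trainer and may differ, and the TrajectoryGroup attributes below are illustrative.

```python
import warnings

def _check_advantage_already_computed(group):
    """Return True iff every step in the TrajectoryGroup has an advantage."""
    flags = [step.advantage is not None
             for traj in group.trajectories
             for step in traj.steps]
    if flags and all(flags):
        return True   # valid: skip the compute_advantage stage altogether
    if any(flags):
        # Special case: only some steps carry pre-computed advantages.
        warnings.warn(
            "Only some steps have pre-computed advantages; falling back to "
            "the standard compute_advantage path for the whole group.",
            stacklevel=2,
        )
    return False      # mandatory fallback: recompute for the entire group
```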

Feat 3: On-policy self-distillation (OPSD; Ongoing)

Previously I proposed reproducing results from papers like Self-distilled Reasoner or Self-Distillation Policy Optimization via a dedicated TinkerOSPDBackend. However, this introduces unnecessary overhead and code complexity (why a new backend?).

In this PR, with support from Feature 2, we can abstract the OPSD logic via a postprocess_opsd decorator that directly wraps the run function of a workflow. This greatly reduces the user's mental burden and potentially allows better extensibility to more backends (yes, the current wrapper still only supports Tinker).
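
A sketch of the decorator pattern: postprocess_opsd is the real name from this PR, but the body below is an assumption, and compute_opsd_advantage is a hypothetical placeholder for the teacher-scoring logic.

```python
import functools

def postprocess_opsd(run_fn):
    """Wrap a workflow's run function and attach OPSD advantages on the fly."""
    @functools.wraps(run_fn)
    async def wrapped(self, *args, **kwargs):
        episode = await run_fn(self, *args, **kwargs)
        for step in episode.steps:
            # Placeholder: score the sampled tokens under the teacher policy
            # (currently via the Tinker backend) and store the distillation
            # signal as a pre-computed advantage, letting Feature 2 bypass
            # the trainer's compute_advantage stage for these steps.
            step.advantage = compute_opsd_advantage(step)  # hypothetical helper
        return episode
    return wrapped
```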

I will also discuss with @kylemontgomery1 how to integrate his OPD code (for both Tinker and Verl) into this module.

Feat 4: Documentation (Ongoing)

Added two docs, experimental/unified_trainer.md and experimental/backend-protocol.md, to clearly explain the API design (for developers). For general users, I will be writing a more high-level "motivation" doc soon.

listar2000 merged commit ecddeae into main Feb 17, 2026
1 check passed
listar2000 deleted the refactor-unified-trainer branch on February 17, 2026 at 22:47