[Feat] Important updates to the experimental unified trainer #398
Merged
## What does this PR do?
This PR contains a few important refactorings to the experimental `unified_trainer` features and the `Tinker` backend support. Most of the changes are non-breaking since they live under the `experimental/` folder, so I will just give a high-level summary of the important ones.

### Feat 1: Deprecation of the `per_step` mode

Related files: `rllm/experimental/common/advantage.py`, `rllm/experimental/common/config.py`

During config creation, we now automatically enforce `stepwise_mode = "broadcast"` in the `AlgorithmConfig` and issue a `DeprecationWarning` for backward compatibility (i.e. the code does not directly fail). All other functions using the `AlgorithmConfig` strictly check for this mode.

TODO: we currently have a `trajectory_grouping_hook` kwarg for `UnifiedTrainer` that allows users to customize the reward computation logic, i.e. implement the `per_step` mode themselves. This is not made clear yet and will need documentation.

### Feat 2: Validation and bypassing mode for pre-computed advantages
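Before getting into Feat 2: the Feat 1 config-time enforcement described above might be sketched roughly as follows. `AlgorithmConfig`, `stepwise_mode`, and `trajectory_grouping_hook` are the names from this PR; modeling the config as a plain dataclass and the exact warning wording are assumptions, not the actual implementation.

```python
import warnings
from dataclasses import dataclass


@dataclass
class AlgorithmConfig:
    """Hypothetical stand-in for the real AlgorithmConfig."""

    stepwise_mode: str = "broadcast"

    def __post_init__(self):
        # Deprecated mode: warn instead of failing, then coerce to
        # "broadcast" so that downstream code can rely on a single mode.
        if self.stepwise_mode == "per_step":
            warnings.warn(
                'stepwise_mode="per_step" is deprecated; forcing "broadcast". '
                "Use the trajectory_grouping_hook kwarg of UnifiedTrainer to "
                "implement per-step reward logic instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            self.stepwise_mode = "broadcast"
```

Constructing the config with the deprecated mode warns and silently coerces the field, so existing configs keep working while all other code can assume `"broadcast"`.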
Certain algorithms, such as on-policy (self-)distillation or SFT, can already perform the advantage/loss calculation at the workflow rollout stage, since they only rely on the logprobs/tokens of a single step (rather than requiring a grouping operation as in GRPO-style RL algorithms). As suggested by @kylemontgomery1, it is easy to give the user an option to compute `step.advantage` (either a scalar or a list with length equal to `step.logprobs`) on their own.

In this PR, we add support for this feature while being very cautious about the user-customized input: we carefully validate whether the steps in a `TrajectoryGroup` already have their advantages computed (via `_check_advantage_already_computed`), and issue a warning with mandatory fallback logic for special cases (e.g. half of the steps have advantages while the other half do not). If the advantages are indeed valid, we directly skip the `compute_advantage` stage altogether.

### Feat 3: On-policy self-distillation (OPSD; Ongoing)
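As a side note before the OPSD details: the Feat 2 validation just described might look roughly like the standalone sketch below. The real helper is `_check_advantage_already_computed` and operates on the steps of a `TrajectoryGroup`; the dict-based step layout here is a simplifying assumption.

```python
import warnings


def check_advantage_already_computed(steps):
    """Sketch of the Feat 2 validation, assuming each step is a dict with a
    required "logprobs" key and an optional precomputed "advantage".

    Returns True only if *every* step carries a valid advantage (a scalar,
    or a list matching the logprobs length); only then may the trainer skip
    its compute_advantage stage.
    """
    n_with_adv = 0
    for step in steps:
        adv = step.get("advantage")
        if isinstance(adv, (int, float)):
            n_with_adv += 1
        elif isinstance(adv, list) and len(adv) == len(step["logprobs"]):
            n_with_adv += 1
    if n_with_adv == 0:
        return False  # normal path: run compute_advantage as usual
    if n_with_adv < len(steps):
        # Special case: mandatory fallback with a warning.
        warnings.warn(
            f"Only {n_with_adv}/{len(steps)} steps have valid precomputed "
            "advantages; falling back to compute_advantage for the group."
        )
        return False
    return True
```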
Previously I proposed reproducing results from papers like Self-Distilled Reasoner or Self-Distillation Policy Optimization via a dedicated `TinkerOSPDBackend`. However, this introduces unnecessary overhead and code complexity (why a new backend?).

In this PR, with support from Feat 2, we are able to abstract the OPSD logic into a `postprocess_opsd` decorator that directly wraps the `run` function of a workflow. This greatly reduces the user's mental burden and potentially allows better extensibility to more backends (yes, the current wrapper still only supports `Tinker`).

I will also be discussing with @kylemontgomery1 how to integrate his OPD code (for both `Tinker` and `Verl`) into this module.

### Feat 4: Documentation (Ongoing)
Added two docs, `experimental/unified_trainer.md` and `experimental/backend-protocol.md`, to clearly explain the API design (for developers). For generic users, I will write a more high-level "motivation" doc soon.
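To make the Feat 3 design concrete, here is a minimal sketch of what a `postprocess_opsd`-style decorator could look like. Only the decorator name and the idea of wrapping a workflow's `run` function come from this PR; the dict-based steps and the per-token `teacher - student` logprob advantage are illustrative assumptions, not the actual implementation.

```python
import functools


def postprocess_opsd(run_fn):
    """Hypothetical sketch of the Feat 3 decorator: wrap a workflow's `run`
    so each returned step gets a per-token advantage derived from teacher
    and student logprobs. Steps are assumed to be dicts with "logprobs"
    (student) and "teacher_logprobs" keys."""

    @functools.wraps(run_fn)
    def wrapped(*args, **kwargs):
        steps = run_fn(*args, **kwargs)
        for step in steps:
            # One common self-distillation choice (an assumption here):
            # advantage = teacher logprob - student logprob, per token.
            step["advantage"] = [
                t - s
                for t, s in zip(step["teacher_logprobs"], step["logprobs"])
            ]
        return steps

    return wrapped


@postprocess_opsd
def run(prompt):
    # Stand-in for a real workflow rollout returning a single step.
    return [{"logprobs": [-1.0, -2.0], "teacher_logprobs": [-0.5, -1.0]}]
```

Because the wrapped `run` returns steps whose `step.advantage` is already filled in, the Feat 2 validation then lets the trainer bypass its own `compute_advantage` stage.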