
Refactor RL advantage estimators and add REINFORCE++ baseline/RLOO support #410

Merged
listar2000 merged 3 commits into main from improve-adv-estimator on Mar 1, 2026
Conversation

@listar2000
Collaborator

Summary

This PR refactors the experimental rLLM advantage estimation flow and adds new estimator options.

Changes

  • Refactored advantage estimator interface to return (advantages, returns) per group.
  • Added new estimators:
    • reinforce_plus_plus_baseline
    • rloo
  • Added rllm/experimental/common/rl_algo.py with grouped GRPO/RLOO advantage helpers (a first sketch of the estimator math follows this list).
  • Updated trajectory-group advantage collection to:
    • batch computation by role,
    • validate shape/length consistency with strict checks,
    • keep role-level reward/advantage metrics (see the second sketch below).
  • Extended estimator enum + config comments/options in:
    • rllm/experimental/common/config.py
    • rllm/experimental/config/rllm/base.yaml
  • Updated the Tinker loss-function auto-mapping for the new estimators and added a safe default fallback (see the third sketch below).
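
For reference, the first sketch below illustrates the per-group math behind the new estimators. It is a minimal sketch: the function names and exact normalization details are assumptions for illustration, not the actual rllm/experimental/common/rl_algo.py API.

```python
# Minimal sketch of grouped advantage estimation, assuming each group holds the
# scalar rewards of several rollouts for the same prompt. Function names are
# hypothetical, not the actual helpers in rllm/experimental/common/rl_algo.py.
import numpy as np


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: z-score each reward against its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """RLOO: baseline each rollout with the mean of the other rollouts."""
    k = rewards.shape[0]
    if k < 2:
        return np.zeros_like(rewards)  # a single rollout has no peers to compare against
    baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - baseline


def reinforce_pp_baseline_advantages(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE++ baseline: subtract the group mean without std normalization."""
    return rewards - rewards.mean()


if __name__ == "__main__":
    group = np.array([1.0, 0.0, 0.5, 1.0])
    print(rloo_advantages(group))  # [ 0.5  -0.833...  -0.167...  0.5 ]
```

Since the refactored interface returns (advantages, returns) per group, a real estimator would additionally broadcast these per-rollout scalars over the response tokens and hand back matching returns.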
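
The role-batched collection with strict checks might look roughly like the second sketch below; the Trajectory shape, estimator signature, and metric names are assumptions, not the actual rLLM data structures.

```python
# Sketch of batching advantage computation by role with strict shape checks
# and role-level metrics. Data structures here are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

import numpy as np


@dataclass
class Trajectory:
    role: str
    reward: float
    advantage: float = 0.0


def collect_group_advantages(trajectories: list[Trajectory], estimator) -> dict[str, float]:
    by_role: dict[str, list[Trajectory]] = defaultdict(list)
    for traj in trajectories:
        by_role[traj.role].append(traj)

    metrics: dict[str, float] = {}
    for role, trajs in by_role.items():
        rewards = np.array([t.reward for t in trajs])
        advantages = estimator(rewards)
        # Strict consistency check: exactly one advantage per trajectory in the role batch.
        assert advantages.shape == rewards.shape, (
            f"estimator returned shape {advantages.shape}, expected {rewards.shape}"
        )
        for traj, adv in zip(trajs, advantages):
            traj.advantage = float(adv)
        # Keep role-level reward/advantage metrics.
        metrics[f"{role}/reward_mean"] = float(rewards.mean())
        metrics[f"{role}/advantage_mean"] = float(advantages.mean())
    return metrics
```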
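
Finally, the third sketch shows one way the estimator enum and the Tinker loss-function auto-mapping with a safe default could fit together. The enum values mirror the option names above, but the loss-function identifiers and the default are placeholders, not necessarily what config.py or tinker_policy_trainer.py actually uses.

```python
# Illustrative only: the loss-function names and the default are placeholders,
# not confirmed Tinker identifiers.
from enum import Enum


class AdvantageEstimator(str, Enum):
    GRPO = "grpo"
    REINFORCE_PLUS_PLUS_BASELINE = "reinforce_plus_plus_baseline"
    RLOO = "rloo"


_ESTIMATOR_TO_LOSS_FN = {
    AdvantageEstimator.GRPO: "ppo",
    AdvantageEstimator.REINFORCE_PLUS_PLUS_BASELINE: "importance_sampling",
    AdvantageEstimator.RLOO: "importance_sampling",
}


def resolve_loss_fn(estimator: AdvantageEstimator, default: str = "importance_sampling") -> str:
    # Safe fallback: an estimator without a dedicated entry falls through to the
    # default instead of raising at training time.
    return _ESTIMATOR_TO_LOSS_FN.get(estimator, default)
```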

Testing

  • python -m compileall rllm/experimental/common/advantage.py rllm/experimental/common/config.py rllm/experimental/common/rl_algo.py rllm/trainer/tinker/tinker_policy_trainer.py

listar2000 marked this pull request as ready for review on March 1, 2026, 22:54
listar2000 merged commit 1d0f821 into main on Mar 1, 2026 (1 check passed)
listar2000 deleted the improve-adv-estimator branch on March 9, 2026, 18:45