
Refactor RL advantage estimators and add REINFORCE++ baseline/RLOO support #410

Merged
listar2000 merged 3 commits into main from improve-adv-estimator on Mar 1, 2026
Conversation

@listar2000
Collaborator

Summary

This PR refactors the experimental rLLM advantage estimation flow and adds new estimator options.

Changes

  • Refactored advantage estimator interface to return (advantages, returns) per group.
  • Added new estimators:
    • reinforce_plus_plus_baseline
    • rloo
  • Added rllm/experimental/common/rl_algo.py with grouped GRPO/RLOO advantage helpers (a first sketch of the estimator math follows this list).
  • Updated trajectory-group advantage collection to:
    • batch computation by role,
    • validate shape/length consistency with strict checks,
    • keep role-level reward/advantage metrics (see the second sketch below).
  • Extended estimator enum + config comments/options in:
    • rllm/experimental/common/config.py
    • rllm/experimental/config/rllm/base.yaml
  • Updated the Tinker loss-function auto-mapping for the new estimators and added a safe default fallback (see the third sketch below).
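
For reference, the first sketch below illustrates the per-group math behind the new estimators. It is a minimal sketch: the function names and exact normalization details are assumptions for illustration, not the actual rllm/experimental/common/rl_algo.py API.

```python
# Minimal sketch of grouped advantage estimation, assuming each group holds the
# scalar rewards of several rollouts for the same prompt. Function names are
# hypothetical, not the actual helpers in rllm/experimental/common/rl_algo.py.
import numpy as np


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: z-score each reward against its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """RLOO: baseline each rollout with the mean of the other rollouts."""
    k = rewards.shape[0]
    if k < 2:
        return np.zeros_like(rewards)  # a single rollout has no peers to compare against
    baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - baseline


def reinforce_pp_baseline_advantages(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE++ baseline: subtract the group mean without std normalization."""
    return rewards - rewards.mean()


if __name__ == "__main__":
    group = np.array([1.0, 0.0, 0.5, 1.0])
    print(rloo_advantages(group))  # [ 0.5  -0.833...  -0.167...  0.5 ]
```

Since the refactored interface returns (advantages, returns) per group, a real estimator would additionally broadcast these per-rollout scalars over the response tokens and hand back matching returns.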
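
The role-batched collection with strict checks might look roughly like the second sketch below; the Trajectory shape, estimator signature, and metric names are assumptions, not the actual rLLM data structures.

```python
# Sketch of batching advantage computation by role with strict shape checks
# and role-level metrics. Data structures here are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

import numpy as np


@dataclass
class Trajectory:
    role: str
    reward: float
    advantage: float = 0.0


def collect_group_advantages(trajectories: list[Trajectory], estimator) -> dict[str, float]:
    by_role: dict[str, list[Trajectory]] = defaultdict(list)
    for traj in trajectories:
        by_role[traj.role].append(traj)

    metrics: dict[str, float] = {}
    for role, trajs in by_role.items():
        rewards = np.array([t.reward for t in trajs])
        advantages = estimator(rewards)
        # Strict consistency check: exactly one advantage per trajectory in the role batch.
        assert advantages.shape == rewards.shape, (
            f"estimator returned shape {advantages.shape}, expected {rewards.shape}"
        )
        for traj, adv in zip(trajs, advantages):
            traj.advantage = float(adv)
        # Keep role-level reward/advantage metrics.
        metrics[f"{role}/reward_mean"] = float(rewards.mean())
        metrics[f"{role}/advantage_mean"] = float(advantages.mean())
    return metrics
```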
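
Finally, the third sketch shows one way the estimator enum and the Tinker loss-function auto-mapping with a safe default could fit together. The enum values mirror the option names above, but the loss-function identifiers and the default are placeholders, not necessarily what config.py or tinker_policy_trainer.py actually uses.

```python
# Illustrative only: the loss-function names and the default are placeholders,
# not confirmed Tinker identifiers.
from enum import Enum


class AdvantageEstimator(str, Enum):
    GRPO = "grpo"
    REINFORCE_PLUS_PLUS_BASELINE = "reinforce_plus_plus_baseline"
    RLOO = "rloo"


_ESTIMATOR_TO_LOSS_FN = {
    AdvantageEstimator.GRPO: "ppo",
    AdvantageEstimator.REINFORCE_PLUS_PLUS_BASELINE: "importance_sampling",
    AdvantageEstimator.RLOO: "importance_sampling",
}


def resolve_loss_fn(estimator: AdvantageEstimator, default: str = "importance_sampling") -> str:
    # Safe fallback: an estimator without a dedicated entry falls through to the
    # default instead of raising at training time.
    return _ESTIMATOR_TO_LOSS_FN.get(estimator, default)
```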

Testing

  • python -m compileall rllm/experimental/common/advantage.py rllm/experimental/common/config.py rllm/experimental/common/rl_algo.py rllm/trainer/tinker/tinker_policy_trainer.py

listar2000 marked this pull request as ready for review on March 1, 2026, 22:54
listar2000 merged commit 1d0f821 into main on Mar 1, 2026 (1 check passed)
listar2000 deleted the improve-adv-estimator branch on March 9, 2026, 18:45