Refactor unified trainer async flow and add precomputed-advantage + Tinker rollout fixes #401

Merged

listar2000 merged 6 commits into main from unified-trainer-cleanup on Feb 25, 2026
Conversation

@listar2000
Collaborator

Summary

This PR streamlines the experimental trainer execution path, standardizes episode/task metadata handling, and adds support for workflows that provide pre-computed per-token advantages. It also includes several Tinker integration fixes (rollout parsing, model-length wiring, and resume behavior).

What Changed

Unified trainer lifecycle

  • Removed the background event-loop thread / thread-safe coroutine wrapper flow.
  • Kept fit() as the sync entry point, now implemented via asyncio.run(...).
  • Added explicit async entrypoint (fit_async) and moved workflow-pool initialization into async startup.
  • Passed use_precomputed_advantage through trainer setup.
  • Switched validation UID grouping to use normalized episode.task_id.
  • Updated tracking payload to serialize self.config (full config object used by trainer).
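The lifecycle change above can be sketched as a sync entry point wrapping an explicit async one. This is a minimal illustration, assuming hypothetical names (`UnifiedTrainer`, `_startup`) beyond the `fit`/`fit_async` pair named in the PR:

```python
import asyncio


class UnifiedTrainer:
    """Minimal sketch of the sync-over-async lifecycle; everything
    except fit()/fit_async() is a placeholder, not the real trainer."""

    def __init__(self, config):
        self.config = config
        self.workflow_pool = None

    async def _startup(self):
        # Workflow-pool initialization now happens inside the running
        # loop rather than on a background event-loop thread.
        self.workflow_pool = ["worker-0", "worker-1"]  # placeholder pool

    async def fit_async(self):
        # Explicit async entrypoint for callers that already own a loop.
        await self._startup()
        return f"trained with pool of {len(self.workflow_pool)}"

    def fit(self):
        # Sync entry point: a single asyncio.run(...) replaces the
        # thread-safe coroutine wrapper flow.
        return asyncio.run(self.fit_async())
```

Callers without an event loop use `fit()`; embedded async callers can `await fit_async()` directly.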

Agent / episode model cleanup

  • Added Step.from_model_output(...) to centralize Step construction.
  • Updated completers to use the shared constructor.
  • Relaxed Step.prompt_ids typing to allow non-int prompt blocks used by some backends.
  • Added cached Episode.task_id and Episode.rollout_idx.
  • Switched trajectory-group identifiers/properties to cached-property usage.
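A centralized `Step.from_model_output(...)` constructor might look like the following sketch; the output-dict keys and the `Step` field set here are assumptions, not the real schema:

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class Step:
    # prompt_ids relaxed from list[int] to a generic sequence so that
    # backends passing non-int prompt blocks still type-check.
    prompt_ids: Sequence[Any]
    response_ids: list
    advantage: Any = None

    @classmethod
    def from_model_output(cls, output: dict) -> "Step":
        # Shared constructor so all completers build Steps uniformly
        # instead of each assembling fields ad hoc.
        return cls(
            prompt_ids=output["prompt_ids"],
            response_ids=output["response_ids"],
        )
```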

Advantage collection behavior

  • Added algorithm.use_precomputed_advantage (default false) to config/dataclass.
  • When enabled, collector consumes per-token step.advantage lists directly.
  • Missing/mismatched per-step advantages are defaulted to zeros with warnings.
  • Scalar step.advantage in precomputed mode now raises an error.
  • In normal RL mode, advantages are still computed from rewards; a warning is logged if precomputed values are present and being overwritten.
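The branching above can be sketched as follows. This is a hedged illustration: the helper name, dict-shaped steps, and the reward-broadcast fallback are assumptions, not the collector's real API.

```python
import warnings


def collect_advantages(steps, use_precomputed_advantage=False):
    """Sketch of precomputed vs. reward-derived advantage collection.
    `steps` is a list of dicts with "response_ids" and optional
    "advantage"/"reward" keys (names are assumptions)."""
    out = []
    for step in steps:
        n = len(step["response_ids"])
        adv = step.get("advantage")
        if use_precomputed_advantage:
            if isinstance(adv, (int, float)):
                # A scalar is ambiguous when per-token values are expected.
                raise ValueError("expected per-token advantage list, got scalar")
            if adv is None or len(adv) != n:
                warnings.warn("missing/mismatched per-step advantages; using zeros")
                adv = [0.0] * n
            out.append(list(adv))
        else:
            if adv is not None:
                warnings.warn("precomputed advantages present; overwriting from rewards")
            # Normal RL mode: derive advantages from rewards
            # (broadcast per token here purely for illustration).
            out.append([step.get("reward", 0.0)] * n)
    return out
```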

Transform / metadata consistency

  • Replaced ad-hoc episode.id.split(":") usage with episode.task_id / episode.rollout_idx.
  • Updated metadata emitted for visualization/transform paths accordingly.
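The cached identifiers replacing the ad-hoc `episode.id.split(":")` call sites could be sketched like this; the `"task:rollout"` id layout is an assumption inferred from the split usage, not a documented format:

```python
from functools import cached_property


class Episode:
    """Sketch: cached task_id/rollout_idx derived once from the episode id,
    assuming a hypothetical "<task_id>:<rollout_idx>" layout."""

    def __init__(self, episode_id: str):
        self.id = episode_id

    @cached_property
    def task_id(self) -> str:
        # Parsed once and cached; call sites no longer split the raw id.
        return self.id.rsplit(":", 1)[0]

    @cached_property
    def rollout_idx(self) -> int:
        return int(self.id.rsplit(":", 1)[1])
```

Metadata emission then reads the cached properties, e.g. `{"task_id": ep.task_id, "rollout_idx": ep.rollout_idx}`.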

Tinker backend and rollout fixes

  • Wired training.max_length into Tinker engine as max_model_length.
  • Added optional rollout_engine.renderer_name config.
  • Corrected stop-sequence fallback to use tokenizer eos_token_id when available.
  • Improved sampling-param handling (max_tokens precedence and copy semantics).
  • Added backend-specific completion-id extraction in TITO completer.
  • Added parser-bypass response parsing path when configured.
  • Added async compute_logprobs(...) API to Tinker engine.
  • Added resume support from training.resume_from_tinker_id (tinker://.../weights/<batch>).
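Two of the fixes above (the eos fallback and the resume-id handling) can be sketched as small helpers. These are illustrations under assumed names, not Tinker's actual API:

```python
def resolve_stop_tokens(configured_stop, tokenizer):
    # Fall back to the tokenizer's eos_token_id when no stop
    # sequence is configured (helper name/shape are assumptions).
    if configured_stop:
        return list(configured_stop)
    eos = getattr(tokenizer, "eos_token_id", None)
    return [eos] if eos is not None else []


def parse_resume_batch(resume_id: str) -> str:
    # training.resume_from_tinker_id has the form
    # tinker://.../weights/<batch>; extract the batch suffix.
    if not resume_id.startswith("tinker://") or "/weights/" not in resume_id:
        raise ValueError(f"unrecognized resume id: {resume_id}")
    return resume_id.rsplit("/weights/", 1)[1]
```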

@listar2000 listar2000 merged commit 4c3acae into main Feb 25, 2026
1 check passed
