Refactor unified trainer async flow and add precomputed-advantage + Tinker rollout fixes #401

Merged

listar2000 merged 6 commits into main from unified-trainer-cleanup on Feb 25, 2026
Conversation

@listar2000
Collaborator

Summary

This PR streamlines the experimental trainer execution path, standardizes episode/task metadata handling, and adds support for workflows that provide pre-computed per-token advantages. It also includes several Tinker integration fixes (rollout parsing, model-length wiring, and resume behavior).

What Changed

Unified trainer lifecycle

  • Removed the background event-loop thread / thread-safe coroutine wrapper flow.
  • Kept fit() as the sync entry point, now implemented via asyncio.run(...).
  • Added explicit async entrypoint (fit_async) and moved workflow-pool initialization into async startup.
  • Passed use_precomputed_advantage through trainer setup.
  • Switched validation UID grouping to use normalized episode.task_id.
  • Updated tracking payload to serialize self.config (full config object used by trainer).
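The lifecycle change above can be sketched as a sync entry point wrapping an explicit async one. This is a minimal illustration, assuming hypothetical names (`UnifiedTrainer`, `_startup`) beyond the `fit`/`fit_async` pair named in the PR:

```python
import asyncio


class UnifiedTrainer:
    """Minimal sketch of the sync-over-async lifecycle; everything
    except fit()/fit_async() is a placeholder, not the real trainer."""

    def __init__(self, config):
        self.config = config
        self.workflow_pool = None

    async def _startup(self):
        # Workflow-pool initialization now happens inside the running
        # loop rather than on a background event-loop thread.
        self.workflow_pool = ["worker-0", "worker-1"]  # placeholder pool

    async def fit_async(self):
        # Explicit async entrypoint for callers that already own a loop.
        await self._startup()
        return f"trained with pool of {len(self.workflow_pool)}"

    def fit(self):
        # Sync entry point: a single asyncio.run(...) replaces the
        # thread-safe coroutine wrapper flow.
        return asyncio.run(self.fit_async())
```

Callers without an event loop use `fit()`; embedded async callers can `await fit_async()` directly.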

Agent / episode model cleanup

  • Added Step.from_model_output(...) to centralize Step construction.
  • Updated completers to use the shared constructor.
  • Relaxed Step.prompt_ids typing to allow non-int prompt blocks used by some backends.
  • Added cached Episode.task_id and Episode.rollout_idx.
  • Switched trajectory-group identifiers/properties to cached-property usage.
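A centralized `Step.from_model_output(...)` constructor might look like the following sketch; the output-dict keys and the `Step` field set here are assumptions, not the real schema:

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class Step:
    # prompt_ids relaxed from list[int] to a generic sequence so that
    # backends passing non-int prompt blocks still type-check.
    prompt_ids: Sequence[Any]
    response_ids: list
    advantage: Any = None

    @classmethod
    def from_model_output(cls, output: dict) -> "Step":
        # Shared constructor so all completers build Steps uniformly
        # instead of each assembling fields ad hoc.
        return cls(
            prompt_ids=output["prompt_ids"],
            response_ids=output["response_ids"],
        )
```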

Advantage collection behavior

  • Added algorithm.use_precomputed_advantage (default false) to config/dataclass.
  • When enabled, collector consumes per-token step.advantage lists directly.
  • Missing/mismatched per-step advantages are defaulted to zeros with warnings.
  • Scalar step.advantage in precomputed mode now raises an error.
  • In normal RL mode, advantages are still computed from rewards; a warning is logged if precomputed values are present and being overwritten.
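The branching above can be sketched as follows. This is a hedged illustration: the helper name, dict-shaped steps, and the reward-broadcast fallback are assumptions, not the collector's real API.

```python
import warnings


def collect_advantages(steps, use_precomputed_advantage=False):
    """Sketch of precomputed vs. reward-derived advantage collection.
    `steps` is a list of dicts with "response_ids" and optional
    "advantage"/"reward" keys (names are assumptions)."""
    out = []
    for step in steps:
        n = len(step["response_ids"])
        adv = step.get("advantage")
        if use_precomputed_advantage:
            if isinstance(adv, (int, float)):
                # A scalar is ambiguous when per-token values are expected.
                raise ValueError("expected per-token advantage list, got scalar")
            if adv is None or len(adv) != n:
                warnings.warn("missing/mismatched per-step advantages; using zeros")
                adv = [0.0] * n
            out.append(list(adv))
        else:
            if adv is not None:
                warnings.warn("precomputed advantages present; overwriting from rewards")
            # Normal RL mode: derive advantages from rewards
            # (broadcast per token here purely for illustration).
            out.append([step.get("reward", 0.0)] * n)
    return out
```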

Transform / metadata consistency

  • Replaced ad-hoc episode.id.split(":") usage with episode.task_id / episode.rollout_idx.
  • Updated metadata emitted for visualization/transform paths accordingly.
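The cached identifiers replacing the ad-hoc `episode.id.split(":")` call sites could be sketched like this; the `"task:rollout"` id layout is an assumption inferred from the split usage, not a documented format:

```python
from functools import cached_property


class Episode:
    """Sketch: cached task_id/rollout_idx derived once from the episode id,
    assuming a hypothetical "<task_id>:<rollout_idx>" layout."""

    def __init__(self, episode_id: str):
        self.id = episode_id

    @cached_property
    def task_id(self) -> str:
        # Parsed once and cached; call sites no longer split the raw id.
        return self.id.rsplit(":", 1)[0]

    @cached_property
    def rollout_idx(self) -> int:
        return int(self.id.rsplit(":", 1)[1])
```

Metadata emission then reads the cached properties, e.g. `{"task_id": ep.task_id, "rollout_idx": ep.rollout_idx}`.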

Tinker backend and rollout fixes

  • Wired training.max_length into Tinker engine as max_model_length.
  • Added optional rollout_engine.renderer_name config.
  • Corrected stop-sequence fallback to use tokenizer eos_token_id when available.
  • Improved sampling-param handling (max_tokens precedence and copy semantics).
  • Added backend-specific completion-id extraction in TITO completer.
  • Added parser-bypass response parsing path when configured.
  • Added async compute_logprobs(...) API to Tinker engine.
  • Added resume support from training.resume_from_tinker_id (tinker://.../weights/<batch>).
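Two of the fixes above (the eos fallback and the resume-id handling) can be sketched as small helpers. These are illustrations under assumed names, not Tinker's actual API:

```python
def resolve_stop_tokens(configured_stop, tokenizer):
    # Fall back to the tokenizer's eos_token_id when no stop
    # sequence is configured (helper name/shape are assumptions).
    if configured_stop:
        return list(configured_stop)
    eos = getattr(tokenizer, "eos_token_id", None)
    return [eos] if eos is not None else []


def parse_resume_batch(resume_id: str) -> str:
    # training.resume_from_tinker_id has the form
    # tinker://.../weights/<batch>; extract the batch suffix.
    if not resume_id.startswith("tinker://") or "/weights/" not in resume_id:
        raise ValueError(f"unrecognized resume id: {resume_id}")
    return resume_id.rsplit("/weights/", 1)[1]
```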

@listar2000 listar2000 merged commit 4c3acae into main Feb 25, 2026
1 check passed
