docs(AGENTS): warn against raw tokenizer.encode on chat-tuned models#706
Merged
dphuang2 merged 1 commit on May 14, 2026
Conversation
Add a pitfall note clarifying that calling `tokenizer.encode` on the prompt directly (instead of `apply_chat_template` or a cookbook renderer) produces OOD prompt tokens for chat-tuned models like gpt-oss-120b, Llama-3-Instruct, Qwen-Instruct, etc. Empirically, the sampler and trainer disagree by 5×+ on KL, with max per-token ratios in the tens on such inputs, which silently breaks PPO/CISPO/GRPO importance ratios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
derek-tml approved these changes on May 14, 2026
Summary
Add a pitfall note to `AGENTS.md` (and `CLAUDE.md` via its symlink) clarifying that calling `tokenizer.encode(prompt)` directly on a chat-tuned model, instead of `apply_chat_template` or a cookbook renderer, produces OOD prompt tokens.

For models like `gpt-oss-120b`, `Llama-3-Instruct`, `Qwen-Instruct`, etc., the sampler and trainer take subtly different code paths on OOD inputs, and per-token sampler/trainer logprob KL can inflate by 5×+ with max ratios in the tens. This silently breaks PPO/CISPO/GRPO importance ratios.
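A minimal sketch of the pitfall and the fix, assuming an HF-style tokenizer (the prompt string is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
prompt = "What is the capital of France?"  # illustrative

# ❌ Pitfall: raw encode bypasses the chat template, so the model sees the
# prompt without the role headers / special tokens it was trained on.
bad_ids = tokenizer.encode(prompt)

# ✅ Fix: wrap the prompt in a messages list and let the chat template add
# the role structure; add_generation_prompt=True appends the assistant
# header so sampling starts in-distribution.
messages = [{"role": "user", "content": prompt}]
good_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True
)
```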
Empirical evidence

Small repro on `openai/gpt-oss-120b` with `loss_fn=cispo`, sampling-path runs, comparing `tokenizer.encode` against `apply_chat_template`.

(Forced-completion paths require both prompt and completion to be in the chat format; applying the template to only the prompt half makes that path worse, which the note acknowledges by guiding users to renderers / proper `messages` lists rather than ad-hoc string concatenation.)
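What "both halves through the renderer" can look like, as a hedged sketch (hypothetical `prompt`/`completion` strings, `tokenizer` as in the sketch above; assumes the rendered prompt is a strict prefix of the rendered full conversation, which holds for most templates but is worth asserting):

```python
prompt = "What is the capital of France?"
completion = "Paris."
messages = [{"role": "user", "content": prompt}]

# Prompt half: template plus assistant header.
prompt_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True
)

# Full conversation: same template with the assistant turn included,
# never ad-hoc string concatenation.
full_ids = tokenizer.apply_chat_template(
    messages + [{"role": "assistant", "content": completion}], tokenize=True
)

# Completion tokens are the suffix; assert the prefix assumption.
assert full_ids[: len(prompt_ids)] == prompt_ids
completion_ids = full_ids[len(prompt_ids):]
```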
Does the doc change actually shift agent behavior?
Ran a small A/B with fresh agents (general-purpose subagents, no other context), N=2 per condition. Both conditions embed the full `AGENTS.md` in the prompt; the only difference is whether the new pitfall hunk is present. The task is identical and uses a naive framing: what an author actually thinks when writing a tokenization helper, not "I am writing a logprob-parity script".

| Condition | Trial | Agent's tokenization | Result |
| --- | --- | --- | --- |
| Before | 1 | `return tokenizer.encode(prompt).ids` | ❌ |
| Before | 2 | `return tokenizer.encode(prompt)` | ❌ |
| After | 1 | `apply_chat_template(messages, tokenize=True, add_generation_prompt=True)["input_ids"]` | ✅ |
| After | 2 | `apply_chat_template(messages, tokenize=True, add_generation_prompt=True)["input_ids"]` | ✅ |
Both *Before* trials reproduced the raw-encode bug (via slightly different APIs: HF's `PreTrainedTokenizer.encode` and the tokenizers-library `.encode().ids`, but the same conceptual failure). Both *After* trials reached for the chat template and correctly unwrapped `["input_ids"]` from the `BatchEncoding`, the exact wrinkle the pitfall's parenthetical calls out.

The difference is categorical, not subtle. N=2 isn't statistical proof, but the four outputs are opposite enough to suggest the note actually shifts behavior on the framing that produces real-world bugs.
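For reference, the unwrap the *After* agents landed on looks roughly like this (a sketch; note that recent `transformers` versions return a plain id list from `apply_chat_template(..., tokenize=True)` unless `return_dict=True` is passed):

```python
# Dict-style output requires return_dict=True; enc is then a BatchEncoding.
enc = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_dict=True
)
input_ids = enc["input_ids"]  # the wrinkle: unwrap, don't pass enc itself
```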
Test plan
- Fresh-agent A/B (documented above): with the pitfall hunk present, agent behavior flips from raw `encode` to `apply_chat_template` on the naive-framing task.
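For context on the repro numbers in the Empirical evidence section, the sampler/trainer parity check they refer to has roughly this shape (hypothetical helper; `sampler_lps` and `trainer_lps` stand in for aligned per-token logprobs of the same sampled completion):

```python
import math

def parity_report(sampler_lps: list[float], trainer_lps: list[float]) -> dict:
    """Compare logprobs the sampler recorded at generation time against the
    trainer's recompute of the same tokens. Hypothetical helper."""
    assert len(sampler_lps) == len(trainer_lps)
    # Per-token importance ratio pi_trainer / pi_sampler, as consumed by
    # PPO/CISPO/GRPO objectives.
    ratios = [math.exp(t - s) for s, t in zip(sampler_lps, trainer_lps)]
    # Single-sample per-token estimate of KL(sampler || trainer).
    kl = sum(s - t for s, t in zip(sampler_lps, trainer_lps)) / len(sampler_lps)
    return {"kl": kl, "max_ratio": max(ratios)}
```

On in-distribution prompt tokens this stays near `kl ≈ 0` and `max_ratio ≈ 1`; the raw-encode runs described above inflate both.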