
docs(AGENTS): warn against raw tokenizer.encode on chat-tuned models #706

Merged

dphuang2 merged 1 commit into thinking-machines-lab:main from dphuang2:dphuang2/claude-md-chat-template-pitfall on May 14, 2026

Conversation

dphuang2 (Collaborator) commented on May 14, 2026

Summary

Add a pitfall note to AGENTS.md (and CLAUDE.md via its symlink) clarifying that calling tokenizer.encode(prompt) directly on a chat-tuned model, instead of going through apply_chat_template or a cookbook renderer, produces out-of-distribution (OOD) prompt tokens.

For models like gpt-oss-120b, Llama-3-Instruct, Qwen-Instruct, etc., the sampler and trainer take subtly different code paths on OOD inputs, and per-token sampler/trainer logprob KL can inflate by 5×+ with max ratios in the tens. This silently breaks PPO/CISPO/GRPO importance ratios.
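For concreteness, a minimal sketch of the two paths (assuming a standard Hugging Face tokenizer with a chat template configured; the model name and prompt string are just the ones used in the repro below, and return_dict=True is added so the BatchEncoding unwrap is explicit across transformers versions):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
prompt = "Question: What is 2+2?\nAnswer:"

# Pitfall: raw encode yields a token prefix the chat-tuned model never saw at
# train time, so sampler and trainer logprobs diverge on it.
bad_ids = tok.encode(prompt)

# Recommended: wrap the string in a messages list and render it through the
# chat template, which adds the special tokens the model expects.
messages = [{"role": "user", "content": prompt}]
good = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_dict=True
)
good_ids = good["input_ids"]
```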

Empirical evidence

Small repro on openai/gpt-oss-120b with loss_fn=cispo, comparing sampling-path runs:

                         kl_v2     max_abs_diff   max_ratio
Raw tokenizer.encode     0.0211    0.83           2.29
apply_chat_template      0.0057    0.48           1.37
Improvement              ~3.7×     ~1.7×          ~1.7×

(Forced-completion paths require both prompt and completion to be in the chat format; applying the template to only the prompt half makes that path worse, which the note acknowledges by guiding users to renderers / proper messages lists rather than ad-hoc string concatenation.)
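As a hedged illustration of that guidance, continuing the sketch above (this is the same idea expressed with apply_chat_template, not the cookbook renderer itself; the exact special tokens depend on the model's chat template):

```python
# Sketch only: a forced completion rendered as a proper messages list.
completion = " 4"

full_messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": completion},
]

# Prompt and completion are both inside the chat format, so neither half is OOD.
full = tok.apply_chat_template(full_messages, tokenize=True, return_dict=True)
full_ids = full["input_ids"]

# Anti-pattern: templated prompt plus raw-encoded completion glued together.
# The completion tokens then sit outside the chat format the model was tuned on.
# bad_ids = good_ids + tok.encode(completion)
```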

Does the doc change actually shift agent behavior?

Ran a small A/B with fresh agents (general-purpose subagents, no other context), N=2 per condition. Both conditions embed the full AGENTS.md in the prompt; the only difference is whether the new pitfall hunk is present. The task is identical and uses a naive framing — what an author actually thinks when writing a tokenization helper, not "I am writing a logprob-parity script":

Write def make_prompt_tokens(tokenizer, prompt: str) -> list[int] for a Tinker prototype. Take a user-provided string like "Question: What is 2+2?\nAnswer:" and turn it into the list of token IDs to pass to sampling_client.sample_async(...). The model is openai/gpt-oss-120b. Keep it minimal.

Trial   Condition              Output
1       Before (no pitfall)    return tokenizer.encode(prompt).ids
2       Before                 return tokenizer.encode(prompt)
1       After (with pitfall)   apply_chat_template(messages, tokenize=True, add_generation_prompt=True)["input_ids"]
2       After                  apply_chat_template(messages, tokenize=True, add_generation_prompt=True)["input_ids"]

Both Before trials reproduced the raw-encode bug (via slightly different APIs — HF's PreTrainedTokenizer.encode and the tokenizers-library .encode().ids, but the same conceptual failure). Both After trials reached for the chat template and correctly unwrapped ["input_ids"] from the BatchEncoding — the exact wrinkle the pitfall's parenthetical calls out.
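For reference, a sketch of what the After-condition helper amounts to (reconstructed from the outputs above, not the agents' verbatim code; return_dict=True is added here to make the BatchEncoding unwrap explicit):

```python
def make_prompt_tokens(tokenizer, prompt: str) -> list[int]:
    # Render the user string through the chat template rather than raw-encoding it.
    messages = [{"role": "user", "content": prompt}]
    encoded = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
    )
    # apply_chat_template returns a BatchEncoding here; unwrap the token IDs.
    return encoded["input_ids"]
```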

Difference is categorical, not subtle. N=2 isn't statistical proof, but the four outputs are opposite enough to suggest the note actually shifts behavior on the framing that produces real-world bugs.

Test plan

  • Doc-only change; rendering verified locally.
  • A/B eval with fresh agents shows the note flips behavior from raw encode to apply_chat_template on the naive-framing task.
  • Reviewer sanity-check that the wording fits alongside the existing renderer-mismatch pitfall.

Add a pitfall note clarifying that calling tokenizer.encode on the
prompt directly (instead of apply_chat_template or a cookbook renderer)
produces OOD prompt tokens for chat-tuned models like gpt-oss-120b,
Llama-3-Instruct, Qwen-Instruct, etc. Empirically the sampler and
trainer disagree by 5x+ on KL with max per-token ratios in the tens on
such inputs, which silently breaks PPO/CISPO/GRPO importance ratios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dphuang2 merged commit 753433a into thinking-machines-lab:main on May 14, 2026
5 checks passed