most multi-turn RL loops have a silent bug: you decode the model's output to detect tool calls, then re-tokenize the whole conversation for the next turn. BPE encoding isn't a round trip: several id sequences can decode to the same text, so decode then re-encode can land on different ids than the ones the model sampled. the gradient then ends up on tokens the model never sampled. no crash, just quietly wrong math and broken training
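a minimal sketch of the mismatch, assuming a toy three-token vocabulary and a greedy longest-match encoder standing in for real BPE merges (`VOCAB`, `encode`, and `decode` are hypothetical helpers, not any library's API):

```python
# Toy vocabulary (hypothetical): real BPE vocabularies are far larger.
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def decode(ids):
    """Concatenate the text of each token id."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding, a stand-in for BPE merges."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink.
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode {text[pos:]!r}")
    return ids


# The model sampled "a" then "b" as two separate tokens.
sampled_ids = [0, 1]

# Message-in style: decode to text, then re-tokenize for the next turn.
retokenized_ids = encode(decode(sampled_ids))
print(sampled_ids, retokenized_ids)  # [0, 1] [2]

# Token-in style: keep the sampled ids, tokenize only the newly added text.
tool_result_ids = encode("b")
next_turn_ids = sampled_ids + tool_result_ids
print(next_turn_ids)  # [0, 1, 1]
```

both sequences decode to the same string, but only `sampled_ids` matches what the policy actually produced, so the loss should be computed on those ids. encoding the new text separately and concatenating is just one way to keep them intact; a real setup also has to handle chat-template tokens around each turn.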
@qgallouedec wrote a super educational blog on MITO (message-in, token-out) vs TITO (token-in, token-out) and how you might fix the problem above
go read it
https://qgallouedec-tito.hf.space/