Emergent retokenization symmetry in large language models: phenomenology and applications

Tokenization introduces representational redundancy: under a fixed token vocabulary, a byte string typically admits many valid token encodings, or segmentations, that decode to the same surface string. Given a prompt, however, most language model tokenizers break this representational symmetry by returning a single canonical segmentation. Because models are trained only on canonical segmentations, there is little reason to expect them to respect segmentation symmetry on downstream tasks. We find that this symmetry nevertheless partially emerges during training. Here, we probe this emergent symmetry through experiments testing compositional understanding of tokens, representation diversity, and task-focused benchmark performance. Our primary tool is \textbf{retokenization} -- replacing a prompt's canonical tokenization with an alternative segmentation while preserving its bytes exactly. Relative to other prompt perturbations, retokenization is unusually clean because it isolates segmentation effects without changing syntax, semantics, or surface form. We use retokenization to study sensitivity and robustness to semantically identical input representations across pretraining and post-training. Moreover, this partial retokenization symmetry suggests a distinct inference-time sampling axis. While temperature sampling generates diverse outputs by drawing from the model's next-token probability distribution, retokenization generates diversity from the model's internal computations by varying semantically equivalent input representations. We find that while this retokenization sampling strategy can hurt performance on easy problems, it can also recover solutions that conventional sampling does not find. Overall, our work presents retokenization as a simple yet powerful probe of large language models, shedding light on compositional understanding and prompt sensitivity, and offering a novel sampling strategy.
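
To make the notion of retokenization concrete, the following is a minimal Python sketch rather than the procedure used in this work. It assumes a hypothetical toy vocabulary `VOCAB`, operates on characters instead of bytes for readability, and samples uniformly over segmentations; the helper names `segmentations` and `sample_retokenization` are likewise illustrative. It enumerates every segmentation of a string under the vocabulary and checks that each one decodes to the same surface string.

```python
import random
from functools import lru_cache

# Hypothetical toy vocabulary (an assumption for illustration only).
# Real tokenizers operate on bytes and have far larger vocabularies.
VOCAB = {"t", "o", "k", "e", "n", "to", "ke", "en", "tok", "ken", "token"}


def segmentations(text, vocab):
    """Return every split of `text` into pieces that all belong to `vocab`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Base case: reaching the end of the string yields one empty split.
        if start == len(text):
            return [()]
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                # Prepend this piece to every segmentation of the remainder.
                results.extend((piece,) + rest for rest in split_from(end))
        return results

    return [list(seg) for seg in split_from(0)]


def sample_retokenization(text, vocab, rng):
    """Pick one segmentation uniformly at random (an assumed sampling scheme)."""
    return rng.choice(segmentations(text, vocab))


all_segs = segmentations("token", VOCAB)

# Every segmentation decodes to the same surface string.
assert all("".join(seg) == "token" for seg in all_segs)

# A retokenized prompt is one of these alternatives to the canonical encoding.
alternative = sample_retokenization("token", VOCAB, random.Random(0))
```

In this toy setting, the single-piece segmentation `["token"]` plays the role of the canonical encoding, while a split such as `["tok", "en"]` is a retokenization: the pieces differ but the decoded string is identical. Exhaustive enumeration grows quickly with string length, so a practical sampler would likely draw segmentations without listing them all.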