Subword tokenization introduces a computational layer in language models in which many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained on a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or the token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse evaluation datasets. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality: gains are strongest when canonical tokens are highly compressed and diminish when the tokenizer already over-fragments the input. Overall, homotokens provide a simple and modular mechanism for inducing tokenization invariance in language models.
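To make the notion of homotokens concrete, the following minimal sketch (our illustration, not the paper's code) enumerates all valid subword segmentations of a word under a toy vocabulary and contrasts them with the canonical greedy longest-prefix tokenization; the vocabulary and the example word are hypothetical.

```python
# Homotokens: all in-vocabulary segmentations of a word that decode
# to the same surface form. Toy vocabulary for illustration only.
VOCAB = {"un", "hap", "py", "happy", "unhappy"}

def segmentations(word, vocab):
    """Return every way to split `word` into in-vocabulary subwords."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in vocab:
            for rest in segmentations(word[i:], vocab):
                results.append([prefix] + rest)
    return results

def canonical(word, vocab):
    """Greedy longest-prefix tokenization: the single canonical split."""
    tokens = []
    while word:
        for i in range(len(word), 0, -1):
            if word[:i] in vocab:
                tokens.append(word[:i])
                word = word[i:]
                break
        else:
            raise ValueError("word cannot be segmented with this vocab")
    return tokens

print(segmentations("unhappy", VOCAB))
# Three homotoken variants all decode to "unhappy";
# canonical("unhappy", VOCAB) yields only the longest-prefix one.
print(canonical("unhappy", VOCAB))
```

Under the abstract's framing, the non-canonical variants returned by `segmentations` are the sampled homotokens on which canonical next-token prediction is conditioned; they alter the token sequence the model computes over without changing the underlying text.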