Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that occur frequently enough during merge learning to be retained in the final vocabulary, but that are almost always merged further and therefore rarely emitted when the tokenizer is actually applied to a corpus. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces the parameter count, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
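To make the notion of an intermediate merge residue concrete, the following is a minimal toy BPE sketch (not the paper's method or LiteToken itself; the corpus, function names, and merge budget are illustrative). With a corpus of "low" and "lowest", merge learning first produces the token "lo" and then "low"; "lo" stays in the vocabulary, yet when the learned merges are applied to the same corpus it is always merged onward into "low" and never surfaces as an output token:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from {symbol-tuple: frequency}; every merge result
    becomes a vocabulary token, including intermediate ones."""
    merges = []
    vocab = dict(words)
    for _ in range(num_merges):
        pairs = Counter()
        for w, f in vocab.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for w, f in vocab.items():
            out, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1]); i += 2
                else:
                    out.append(w[i]); i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    syms = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b); i += 2
            else:
                out.append(syms[i]); i += 1
        syms = out
    return syms

# Toy corpus: "low" x10, "lowest" x5.
words = {('l', 'o', 'w'): 10, ('l', 'o', 'w', 'e', 's', 't'): 5}
merges = learn_bpe(words, num_merges=2)
print(merges)                  # [('l', 'o'), ('lo', 'w')] -> "lo" is in the vocab
print(encode("low", merges))   # ['low'] -- "lo" is always merged further
print(encode("lot", merges))   # ['lo', 't'] -- "lo" only appears on rare inputs
```

Here "lo" is a residue token in the sense of the abstract: it consumes a vocabulary slot but is emitted only for out-of-corpus strings such as "lot", which is exactly the capacity waste and atypical-input exposure the paper characterizes.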