The efficiency and safety of Large Language Models (LLMs) depend, among other factors, on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides an extra line of defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data-source diversity. We demonstrate that code tokenizers are prone to producing unused, and thus under-trained, tokens due to imbalances in repository and language diversity in the training data, as well as the dominance of source-specific, repetitive tokens that are rarely usable at inference time. By modifying the BPE objective and introducing merge skipping, we implement several techniques under the name Source-Attributed BPE (SA-BPE) that regularize BPE training and mitigate overfitting, substantially reducing the number of under-trained tokens while keeping the inference procedure identical to regular BPE. The result is an effective tool suitable for production use.
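The merge-skipping idea can be illustrated with a toy sketch: a plain BPE merge loop that additionally tracks pair frequencies per source and refuses any merge whose frequency is concentrated in a single repository. The `max_share` threshold and the exact skipping rule below are illustrative assumptions, not the paper's actual SA-BPE objective.

```python
from collections import Counter, defaultdict

def pair_stats(corpora):
    """Count symbol-pair frequencies overall and per source.
    `corpora` maps source name -> Counter of words (tuples of symbols)."""
    total = Counter()
    per_source = defaultdict(Counter)
    for source, words in corpora.items():
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                total[pair] += freq
                per_source[source][pair] += freq
    return total, per_source

def merge_word(word, pair):
    """Replace adjacent occurrences of `pair` in a symbol tuple."""
    a, b = pair
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def train_sa_bpe(corpora, num_merges, max_share=0.9):
    """Toy BPE trainer with merge skipping: a candidate merge whose
    frequency share in its dominant source exceeds `max_share` is
    skipped, so source-specific repetitive strings never become tokens.
    (Hypothetical regularization rule, for illustration only.)"""
    merges = []
    while len(merges) < num_merges:
        total, per_source = pair_stats(corpora)
        if not total:
            break
        # Rank candidates by global frequency, as in plain BPE,
        # but take the first one that is diverse across sources.
        for pair, freq in total.most_common():
            share = max(c[pair] for c in per_source.values()) / freq
            if share <= max_share:
                break
        else:
            break  # every remaining pair is source-specific
        merges.append(pair)
        new_corpora = {}
        for src, words in corpora.items():
            merged = Counter()
            for word, freq in words.items():
                merged[merge_word(word, pair)] += freq
            new_corpora[src] = merged
        corpora = new_corpora
    return merges
```

Running the trainer on a corpus where one repository repeats a single string (`deadbeef`) shows the effect: cross-source pairs such as `("d", "e")` are merged, while pairs occurring only inside the repetitive repository, such as `("e", "e")`, are skipped and never become tokens.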