Although compression is the cornerstone of BPE, the most common tokenization algorithm, its importance in the tokenization process remains unclear. In this paper, we argue for the theoretical importance of compression, which can be viewed as 0-gram language modeling in which equal probability is assigned to all tokens. We also demonstrate the empirical importance of compression for the downstream success of pre-trained language models. We control the compression ability of several BPE tokenizers by varying the number of documents available during their training: from 1 million documents down to a character-based tokenizer, which is equivalent to no training data at all. We then pre-train English language models based on those tokenizers and fine-tune them on several tasks. We show that there is a correlation between tokenizers' compression and models' downstream performance, suggesting that compression is a reliable intrinsic indicator of tokenization quality. These correlations are more pronounced for generation tasks (compared to classification) and for smaller models (compared to larger ones). We replicated a representative part of our experiments on Turkish and found similar results, confirming that our results hold for languages with typological characteristics dissimilar to English. We conclude that building better-compressing tokenizers is a fruitful avenue for further research and for improving overall model performance.
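To make the link between compression and 0-gram language modeling concrete, here is a minimal sketch in Python. It is a toy, not the setup used in the paper: the BPE trainer is a simplified whitespace-split variant, the function names (`train_bpe`, `encode`, `tokens_per_char`, `zero_gram_log_likelihood`) are hypothetical, and measuring compression as tokens per non-space character is an assumption made for illustration. Under a 0-gram model every token has probability 1/|V|, so a text of N tokens has log-likelihood -N log|V|; for a fixed vocabulary size, fewer tokens means higher likelihood.

```python
import math
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of strings (toy BPE)."""
    words = Counter(tuple(w) for doc in corpus for w in doc.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode(text, merges):
    """Tokenize `text` by applying the learned merges in order; no merges gives characters."""
    tokens = []
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


def tokens_per_char(text, merges):
    """Assumed compression measure: tokens per non-space character (lower is better)."""
    n_chars = sum(len(w) for w in text.split())
    return len(encode(text, merges)) / n_chars


def zero_gram_log_likelihood(n_tokens, vocab_size):
    """Log-likelihood of n_tokens under a uniform (0-gram) model over vocab_size tokens."""
    return -n_tokens * math.log(vocab_size)


corpus = ["low lower lowest", "new newer newest"]
text = "lower newest"
char_level = encode(text, [])                    # no training data: character tokenizer
bpe_level = encode(text, train_bpe(corpus, 10))  # tokenizer trained on the toy corpus
print(len(char_level), len(bpe_level))
```

Passing an empty merge list mimics the character-based tokenizer described above as equivalent to no training data, while a larger corpus or more merges generally yields fewer tokens for the same text.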