Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding (BPE) are widely used, questions remain about their optimality across model scales and languages. In this work, we demonstrate through extensive experiments that a compression-optimal BPE configuration significantly reduces token count compared to greedy segmentation, yielding higher token-saving percentages and downstream performance benefits, particularly for smaller models. We evaluate tokenization performance across a range of intrinsic and extrinsic tasks, including generation and classification. Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource language applications, highlighting a promising direction for further research and more inclusive NLP.
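To illustrate the distinction the abstract draws, the sketch below contrasts greedy longest-match segmentation with a compression-optimal segmentation that minimizes token count via dynamic programming. The toy vocabulary and word are illustrative assumptions, not taken from the paper's experiments, and the functions are a minimal sketch rather than a production BPE implementation.

```python
def greedy_segment(word, vocab):
    """Left-to-right longest-match segmentation over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

def optimal_segment(word, vocab):
    """Dynamic program: segmentation with the fewest tokens (shortest path)."""
    n = len(word)
    best = [None] * (n + 1)  # best[i] = shortest token list covering word[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            # single characters are always allowed, mirroring the fallback above
            if best[j] is not None and (piece in vocab or len(piece) == 1):
                cand = best[j] + [piece]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

# Toy vocabulary where greedy segmentation is suboptimal.
vocab = {"abc", "ab", "cbc", "b", "c"}
print(greedy_segment("abcbc", vocab))   # greedy commits to "abc" early: 3 tokens
print(optimal_segment("abcbc", vocab))  # optimal finds "ab" + "cbc": 2 tokens
```

The example shows how a greedy left-to-right pass can commit to a long prefix that forces extra tokens later, while an optimal segmentation over the same vocabulary saves a token; the paper's token-saving percentages aggregate exactly this kind of difference over a corpus.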