Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of the tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as the key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific ones, excelling in 14 of the 22 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
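For concreteness, the sketch below shows one way an NSL-style comparison can be computed with Hugging Face tokenizers, assuming NSL is the ratio of a tokenizer's token count to a baseline tokenizer's token count, averaged over a corpus. The model names, the baseline choice, and the sample sentence are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal sketch of a Normalized Sequence Length (NSL) comparison.
# Assumption: NSL here is the ratio of the candidate tokenizer's token
# count to a baseline tokenizer's token count, averaged over a corpus.
# The model names and sample text below are hypothetical stand-ins, not
# the paper's actual evaluation setup.
from transformers import AutoTokenizer

def avg_nsl(model_name: str, baseline_name: str, texts: list[str]) -> float:
    """Average NSL of `model_name` relative to `baseline_name` over `texts`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    base = AutoTokenizer.from_pretrained(baseline_name)
    ratios = []
    for text in texts:
        n_model = len(tok.encode(text, add_special_tokens=False))
        n_base = len(base.encode(text, add_special_tokens=False))
        ratios.append(n_model / max(n_base, 1))  # guard against empty encodings
    return sum(ratios) / len(ratios)

if __name__ == "__main__":
    # Hypothetical Hindi sample; a real evaluation would draw text from
    # corpora spanning all 22 official Indian languages.
    samples = ["भारत एक विशाल देश है।"]
    score = avg_nsl("google/muril-base-cased", "bert-base-multilingual-cased", samples)
    print(f"Average NSL: {score:.3f}")  # below 1.0 means fewer tokens than baseline
```

Under this reading of the metric, a lower average NSL indicates a more compact (more efficient) tokenization of the target language relative to the chosen baseline.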