Tokenizers play a crucial role in determining the performance, training efficiency, and inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods like Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present MUTANT, a recipe for building multilingual tokenizers that combines careful vocabulary and training data design, language-aware pre-tokenization, and subword- and multiword-aware training. We also introduce MUTANT-Indic, a tokenizer for India-specific multilingual LLMs that produces linguistically coherent tokens and achieves state-of-the-art performance. Evaluated across English, 22 Indian languages, and code data, our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra (the current best). This translates to a 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
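To make the fertility comparison concrete, the following is a minimal sketch, not the evaluation code used for MUTANT. It assumes the common definition of fertility as the average number of tokens produced per whitespace-delimited word (lower is better), and it assumes that an "improvement" is the relative reduction in fertility against a baseline. The two toy tokenizers (`char_tokenize`, `chunk_tokenize`) and the function names are hypothetical stand-ins for real tokenizers.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional reduction in fertility relative to the baseline (assumed convention)."""
    return (baseline - candidate) / baseline


# Hypothetical toy tokenizers, standing in for real ones.
def char_tokenize(text: str) -> List[str]:
    # One token per non-space character: a deliberately poor baseline.
    return [c for c in text if not c.isspace()]


def chunk_tokenize(text: str, size: int = 4) -> List[str]:
    # Split each word into fixed-size chunks of up to `size` characters.
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


corpus = ["tokenizers matter", "multilingual models"]
baseline = fertility(corpus, char_tokenize)    # 34 tokens / 4 words = 8.5
candidate = fertility(corpus, chunk_tokenize)  # 10 tokens / 4 words = 2.5
print(f"baseline={baseline:.2f} candidate={candidate:.2f} "
      f"improvement={relative_improvement(baseline, candidate):.1%}")
```

Whitespace splitting is a simplification: for scripts without clear word boundaries, a language-specific word segmenter would be needed before the ratio is meaningful.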