Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representations. State-of-the-art multilingual LLMs often use byte-level Byte-Pair Encoding (BPE) tokenizers, which structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those of Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training on identical training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at higher computational cost. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions are poorly matched to the limited training data available for low-resource languages. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and they offer practical guidance for designing equitable multilingual models.
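To make the script bias concrete, the following minimal sketch compares raw UTF-8 byte lengths, which are the starting units of any byte-level tokenizer before merges are applied. It is an illustration only, not the benchmark's metric or any of the compared tokenizers: the function names `utf8_bytes_per_char` and `parity_ratios` are hypothetical, and the two sample greetings are not drawn from the paper's data.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


def parity_ratios(lengths: dict[str, int], reference: str) -> dict[str, float]:
    """Sequence length of each language relative to a reference language."""
    base = lengths[reference]
    return {lang: n / base for lang, n in lengths.items()}


# Hypothetical sample greetings (assumption: chosen for illustration,
# not parallel sentences from the benchmark).
samples = {"en": "hello", "th": "สวัสดี"}

# Byte length is the sequence length a byte-level model sees before any merges.
byte_lengths = {lang: len(s.encode("utf-8")) for lang, s in samples.items()}
ratios = parity_ratios(byte_lengths, reference="en")

for lang, text in samples.items():
    print(lang, byte_lengths[lang], utf8_bytes_per_char(text), ratios[lang])
```

Thai code points occupy three bytes each in UTF-8, while ASCII letters occupy one, so a byte-level vocabulary has to learn more merges for Thai text to reach the same sequence length. A ratio close to 1.0 across languages is the intuition behind compression parity, although the paper's exact definition may differ from this sketch.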