Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TReX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TReX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TReX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
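The proxy-then-regress workflow described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the number of languages, the `proxy_compression` stand-in (real compression scores come from actually training small tokenizers), and the quadratic feature map are all assumptions made for the sake of a runnable example.

```python
# Illustrative sketch of a TReX-style workflow (hypothetical stand-ins
# throughout): 1) sample random language mixtures on the simplex,
# 2) gather compression statistics from small proxy tokenizers,
# 3) fit a regressor mapping mixture -> compression, 4) search many
# candidate mixtures with the cheap learned model instead of training
# a large tokenizer per candidate.
import numpy as np

rng = np.random.default_rng(0)
n_langs = 4  # hypothetical number of languages

def proxy_compression(mixture):
    """Stand-in for training a small proxy tokenizer and measuring its
    compression; in the paper this quantity is measured empirically,
    not computed from a formula."""
    ideal = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical best mixture
    return 1.0 - np.sum((mixture - ideal) ** 2)

# Steps 1-2: random mixtures and their (proxy) compression scores.
mixtures = rng.dirichlet(np.ones(n_langs), size=200)
scores = np.array([proxy_compression(m) for m in mixtures])

# Step 3: least-squares fit with quadratic features, so the model can
# represent an interior optimum on the simplex.
def features(M):
    return np.hstack([M, M ** 2, np.ones((len(M), 1))])

w, *_ = np.linalg.lstsq(features(mixtures), scores, rcond=None)

# Step 4: scalable search -- score many candidate mixtures with the
# learned model and keep the best predicted one.
candidates = rng.dirichlet(np.ones(n_langs), size=10_000)
best = candidates[np.argmax(features(candidates) @ w)]
print(best.round(2))
```

Because scoring a candidate with the regressor costs only a dot product, the search over mixtures scales to far more candidates than training even small tokenizers would allow, which is the accuracy-cost trade-off the framework targets.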