We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, tackling the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of the other languages in the mixture. This insight reduces the complexity of multilingual scaling and makes the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size, and sampling ratios. This relationship enables us to predict performance across various combinations of these three quantities, and to derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.
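To make the family-level hypothesis concrete, one plausible Chinchilla-style instantiation is sketched below; the symbols $E_f$, $A_f$, $B_f$, $\alpha_f$, $\beta_f$ and the particular factorization through $p_f D$ are illustrative assumptions, not the paper's exact parameterization:

$$
L_f(N, D, p_f) \;=\; E_f \;+\; \frac{A_f}{N^{\alpha_f}} \;+\; \frac{B_f}{\bigl(p_f D\bigr)^{\beta_f}},
$$

where $L_f$ is the test cross-entropy loss of language family $f$, $N$ is the model size, $D$ is the total number of training tokens, and $p_f$ is the family's sampling ratio, so that $p_f D$ is the effective data seen for that family. Under a sketch of this form, each family's loss depends only on its own ratio, and the mixture $\{p_f\}$ that minimizes a weighted sum of the family losses can be obtained by fitting the constants on small models and solving the resulting constrained optimization ($\sum_f p_f = 1$) at the target scale.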