English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: leading LLMs still underperform on non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained on closed data, such as Llama3.2 and Gemma2, despite using an order of magnitude less data (about 6% of the tokens used for Llama3.2's training). We further demonstrate that with additional domain-specific pretraining, amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning. To promote reproducibility, we release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.