English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: leading LLMs still underperform on non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained on closed data, such as Llama3.2 and Gemma2, despite using an order of magnitude less data (about 6% of the tokens used for Llama3.2's training). We further demonstrate that with additional domain-specific pretraining, amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning. To promote reproducibility, we release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.