In this paper, we explore the utility of Translationese, i.e., synthetic data created via machine translation, for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been growing interest in using synthetic data to address this scarcity. Taking the case of English and Indic languages, we translate web-crawled monolingual documents (clean) into the target languages. We then train language models with 28M and 85M parameters on this translationese data (synthetic). We show that their downstream performance is only 3.56% poorer on natural language understanding (NLU) tasks and 1.51% poorer on natural language generation (NLG) tasks than that of LMs pre-trained on clean data. Further, we propose using lightweight TinyLMs pre-trained on clean data to efficiently filter synthetic data, which significantly improves the performance of our models. We also find that LMs trained on synthetic data benefit strongly from extended pretraining on a tiny fraction (10%) of clean data. We release the data we collected and created as part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the performance gap between English and non-English large language models.