In this paper, we explore the utility of \textit{Translationese}, synthetic data created using machine translation, for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. We then train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their downstream performance is only 3.56\% poorer on natural language understanding (NLU) tasks and 1.51\% poorer on natural language generation (NLG) tasks than that of LMs pre-trained on clean data. Further, we propose using lightweight \textit{TinyLMs} pre-trained on clean data to filter synthetic data efficiently, which significantly improves the performance of our models. We also find that LMs trained on synthetic data benefit strongly from extended pre-training on a tiny fraction (10\%) of clean data. We release the data we collected and created as part of this work, \textit{IndicMonoDoc}, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.
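The filtering step can be illustrated with a minimal sketch. The abstract does not state the filtering criterion, so scoring documents by perplexity and keeping those below a threshold is an assumption here. The add-one smoothed unigram model is only a toy stand-in for a TinyLM, and the names \textit{UnigramLM}, \textit{filter\_synthetic}, and \textit{max\_perplexity} are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy stand-in for a TinyLM (assumption): an add-one smoothed
    unigram model over whitespace tokens, fitted on clean documents."""

    def __init__(self, clean_docs):
        self.counts = Counter(tok for doc in clean_docs for tok in doc.split())
        self.total = sum(self.counts.values())
        # One extra slot so unseen tokens receive non-zero probability.
        self.vocab_size = len(self.counts) + 1

    def perplexity(self, doc):
        """Per-token perplexity of a document under the unigram model."""
        tokens = doc.split()
        if not tokens:
            return float("inf")  # empty documents are never kept
        denom = self.total + self.vocab_size
        log_prob = sum(math.log((self.counts[tok] + 1) / denom) for tok in tokens)
        return math.exp(-log_prob / len(tokens))


def filter_synthetic(lm, synthetic_docs, max_perplexity):
    """Keep synthetic documents the clean-data model finds plausible.

    The threshold criterion is an illustrative assumption, not the
    paper's stated method.
    """
    return [doc for doc in synthetic_docs if lm.perplexity(doc) <= max_perplexity]


# Hypothetical toy data: "clean" text and machine-translated candidates.
clean = ["the cat sat on the mat", "the dog sat on the rug"]
synthetic = ["the cat sat on the rug", "zxq vbn qwe rty"]

lm = UnigramLM(clean)
kept = filter_synthetic(lm, synthetic, max_perplexity=10.0)
```

In practice the scorer would be a small neural LM and the threshold would be tuned on held-out clean data; the sketch only shows the shape of the score-then-threshold loop.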