Multilingual machine translation (MMT), trained on a mixture of parallel and monolingual data, is key to improving translation for low-resource language pairs. However, the literature offers conflicting results on how well different methods of incorporating monolingual data perform. To resolve this, we examine how denoising autoencoding (DAE) and backtranslation (BT) impact MMT under different data conditions and model scales. Unlike prior studies, we use a realistic dataset of 100 translation directions and consider many domain combinations of monolingual and test data. We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales. BT is beneficial when the parallel, monolingual, and test data sources are similar but can be detrimental otherwise, while DAE is less effective than previously reported. Next, we analyze the impact of scale (from 90M to 1.6B parameters) and find that it is important for both methods, particularly DAE. As scale increases, DAE goes from underperforming the parallel-only baseline at 90M parameters to matching BT performance at 1.6B, and even surpassing it in low-resource settings. These results offer new insights into how best to use monolingual data in MMT.