This paper investigates the impact of data volume and the use of similar languages on transfer learning in a machine translation task. We find that more data generally leads to better performance, as it allows the model to learn more patterns and generalizations from the data. However, related languages can be particularly effective when only limited data is available for a specific language pair, as the model can leverage the similarities between the languages to improve performance. To demonstrate this, we fine-tune the mBART model on a Polish-English translation task using the OPUS-100 dataset. We evaluate the model under various transfer learning configurations, including different transfer source languages and different shot levels for Polish, and report the results. Our experiments show that combining related languages with larger amounts of data outperforms using either related languages or larger amounts of data alone. Additionally, we show the importance of related languages in zero-shot and few-shot configurations.
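The evaluation grid described in the abstract (transfer source languages crossed with shot levels for Polish) could be organized roughly as in the following sketch. It is an illustration only: the language codes, shot counts, and the helper names `sample_shots` and `build_configs` are assumptions made for this example, since the abstract does not list the actual values, and no model training is performed here.

```python
import random
from itertools import product

# Hypothetical values for illustration; the abstract does not state which
# source languages or shot levels were actually used.
SOURCE_LANGS = ["cs", "ru", "de"]   # assumed candidate transfer source languages
SHOT_LEVELS = [0, 10, 100, 1000]    # assumed numbers of Polish-English pairs

def sample_shots(pairs, k, seed=0):
    """Draw k sentence pairs without replacement; k=0 is the zero-shot case."""
    if k > len(pairs):
        raise ValueError("not enough pairs for the requested shot level")
    rng = random.Random(seed)  # fixed seed so each run sees the same subset
    return rng.sample(pairs, k)

def build_configs(source_langs, shot_levels):
    """Enumerate every (source language, shot level) combination."""
    return [
        {"source": src, "target": "pl", "shots": k}
        for src, k in product(source_langs, shot_levels)
    ]

# Toy stand-in for Polish-English training pairs (not real OPUS-100 data).
toy_pairs = [(f"pl sentence {i}", f"en sentence {i}") for i in range(1000)]

configs = build_configs(SOURCE_LANGS, SHOT_LEVELS)
few_shot = sample_shots(toy_pairs, 10)
```

Each entry in `configs` would then correspond to one fine-tuning run: train on the source language pair first, then continue on the sampled Polish-English subset of the given size.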