Improving Polish to English Neural Machine Translation with Transfer Learning: Effects of Data Volume and Language Similarity

This paper investigates how data volume and the use of similar languages affect transfer learning in a machine translation task. We find that more data generally leads to better performance, as it allows the model to learn more patterns and generalizations. However, related languages can be particularly effective when only limited data is available for a specific language pair, because the model can leverage the similarities between the languages to improve performance. To demonstrate this, we fine-tune the mBART model for a Polish-English translation task using the OPUS-100 dataset. We evaluate the model under various transfer learning configurations, including different transfer source languages and different shot levels for Polish, and report the results. Our experiments show that combining related languages with larger amounts of data outperforms using either related languages or larger amounts of data alone. Additionally, we show the importance of related languages in zero-shot and few-shot configurations.
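As a rough illustration of what a "transfer source language plus shot level" configuration could look like, the sketch below assembles a training mix from all pairs of one or more transfer languages plus a fixed number of sampled Polish pairs. It is not the authors' pipeline: the function name `build_training_mix`, the choice of Czech (`cs`) as a related language, the toy sentences, and the sampling scheme are all assumptions made for illustration, and the mBART fine-tuning step itself is omitted.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, English sentence)


def build_training_mix(
    corpora: Dict[str, List[Pair]],
    transfer_langs: List[str],
    target_lang: str,
    shots: int,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: return (lang, source, english) triples made of
    every pair from the transfer languages plus `shots` randomly sampled
    pairs from the target language (shots=0 gives a zero-shot mix)."""
    rng = random.Random(seed)
    mix = [
        (lang, src, en)
        for lang in transfer_langs
        for src, en in corpora[lang]
    ]
    target_pairs = corpora[target_lang]
    k = min(shots, len(target_pairs))  # never ask for more pairs than exist
    mix += [(target_lang, src, en) for src, en in rng.sample(target_pairs, k)]
    rng.shuffle(mix)  # interleave languages so batches are not language-sorted
    return mix


# Toy data standing in for OPUS-100 pairs; Czech is an assumed related language.
corpora = {
    "pl": [
        ("Dziękuję", "Thank you"),
        ("Dzień dobry", "Good morning"),
        ("Do widzenia", "Goodbye"),
    ],
    "cs": [
        ("Děkuji", "Thank you"),
        ("Dobrý den", "Good day"),
    ],
}

zero_shot = build_training_mix(corpora, ["cs"], "pl", shots=0)
few_shot = build_training_mix(corpora, ["cs"], "pl", shots=2)
```

Here `zero_shot` contains only Czech pairs, while `few_shot` adds two sampled Polish pairs to them; varying `transfer_langs` and `shots` would enumerate the kind of configurations the abstract describes.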
