Recent advances in neural machine translation (NMT) have revolutionized the field, yet the dependence on extensive parallel corpora limits progress for low-resource languages. Cross-lingual transfer learning offers a promising solution by leveraging data from high-resource languages, but it often struggles with in-domain NMT. In this paper, we investigate three pivotal aspects: enhancing the domain-specific quality of NMT by fine-tuning on domain-relevant data from different language pairs, identifying which domains transfer in zero-shot scenarios, and assessing the impact of language-specific versus domain-specific factors on adaptation effectiveness. Using English as the source language and Spanish as the fine-tuning target language, we evaluate translation into multiple target languages, including Portuguese, Italian, French, Czech, Polish, and Greek. Our findings reveal significant improvements in domain-specific translation quality, especially in specialized fields such as medical, legal, and IT, underscoring the importance of well-defined domain data and a transparent experimental setup for in-domain transfer learning.