Does multilingual Neural Machine Translation (NMT) lead to the Curse of Multilinguality, or does it provide cross-lingual knowledge transfer within a language family? In this study, we explore multiple approaches for extending the available data regime in NMT, and we demonstrate cross-lingual benefits even in the zero-shot translation regime for low-resource languages. With this paper, we provide state-of-the-art open-source NMT models for translating between selected Slavic languages. We released our models on the HuggingFace Hub (https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under the CC BY 4.0 license.

The Slavic language family comprises morphologically rich Central and Eastern European languages. Despite counting hundreds of millions of native speakers, Slavic Neural Machine Translation is, in our view, under-studied. Recently, most NMT research has focused on: high-resource languages such as English, Spanish, and German (in the WMT23 General Translation Task, 7 out of 8 task directions are from or to English); massively multilingual models covering multiple language groups; or evaluation techniques.