Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models can achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including with aspects of translation that are easiest for a human, such as correctly translating common nouns. This work explores a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world setup, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Finally, we open-source GATITOS (available at https://github.com/google-research/url-nlp/tree/main/gatitos), a new multilingual lexicon for 26 low-resource languages, which had the highest performance among lexica in our experiments.
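To make "lexical data augmentation" concrete, the following is a minimal sketch of one commonly used form, codeswitching, in which some words of a monolingual sentence are replaced by their lexicon translations. The abstract does not say which augmentation families were used or how, so this is an illustration under assumptions rather than the method evaluated here: the function name `codeswitch`, the `rate` parameter, and the toy English-Spanish lexicon are all hypothetical, and the dictionary layout is not the GATITOS file format.

```python
import random


def codeswitch(sentence, lexicon, rate=0.5, rng=None):
    """Replace lexicon-covered tokens with a translation, each with probability `rate`.

    Hypothetical helper for illustration only. `lexicon` maps a lowercased
    source word to a list of candidate translations.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    out = []
    for token in sentence.split():
        translations = lexicon.get(token.lower())
        # Swap only if the word is in the lexicon and the coin flip succeeds.
        if translations and rng.random() < rate:
            out.append(rng.choice(translations))
        else:
            out.append(token)
    return " ".join(out)


# Toy lexicon (assumed data, not taken from GATITOS).
toy_lexicon = {"cat": ["gato"], "house": ["casa"], "water": ["agua"]}

# With rate=1.0 every covered word is replaced.
print(codeswitch("the cat drinks water in the house", toy_lexicon, rate=1.0))
# the gato drinks agua in the casa
```

Whitespace tokenization and exact-match lookup keep the sketch short; a real pipeline would likely need to handle morphology, punctuation, and multi-word entries, none of which are attempted here.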