Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation

Using a shared vocabulary is common practice in Multilingual Neural Machine Translation (MNMT). Beyond the simplicity of the design, shared tokens play an important role in positive knowledge transfer, which arises naturally when the shared tokens carry similar meanings across languages. However, this design also has inherent flaws: 1) when languages use different writing systems, transfer is inhibited, and 2) even when languages use similar writing systems, shared tokens may have completely different meanings in different languages, increasing ambiguity. In this paper, we propose a re-parameterized method for building embeddings to alleviate the first problem. More specifically, we define word-level information transfer pathways via word equivalence classes and rely on graph networks to fuse word embeddings across languages. Our experiments demonstrate the advantages of our approach: 1) the semantics of embeddings are better aligned across languages, 2) our method achieves significant BLEU improvements on high- and low-resource MNMT, and 3) it requires less than 1.0% additional trainable parameters, with a limited increase in computational cost.
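The idea of fusing embeddings along word equivalence classes can be illustrated with a toy sketch. This is not the paper's architecture: the learned graph network is replaced here by a single fixed mean-aggregation step, and the function names, the `mix` parameter, the language-prefixed word keys, and the two-dimensional vectors are all assumptions made for illustration.

```python
from collections import defaultdict


def build_neighbors(equivalent_pairs):
    """Build an undirected graph linking each word to its equivalents and itself."""
    neighbors = defaultdict(set)
    for a, b in equivalent_pairs:
        neighbors[a].update({a, b})
        neighbors[b].update({a, b})
    return neighbors


def fuse_embeddings(embeddings, neighbors, mix=0.5):
    """Blend each word vector with the mean vector of its graph neighborhood.

    `mix` (hypothetical) controls how much of the neighborhood mean is used;
    a word with no equivalents only has itself as a neighbor and is unchanged.
    """
    fused = {}
    for word, vec in embeddings.items():
        group = sorted(neighbors.get(word, {word}))
        mean = [
            sum(embeddings[w][d] for w in group) / len(group)
            for d in range(len(vec))
        ]
        fused[word] = [(1 - mix) * v + mix * m for v, m in zip(vec, mean)]
    return fused


# Toy vocabulary: "dog" and "Hund" are declared equivalent, "house" has no link.
embeddings = {
    "en:dog": [1.0, 0.0],
    "de:Hund": [0.0, 1.0],
    "en:house": [2.0, 2.0],
}
pairs = [("en:dog", "de:Hund")]

fused = fuse_embeddings(embeddings, build_neighbors(pairs))
print(fused)
```

In this toy setting the two equivalent words move toward each other while the unlinked word stays where it was, which is the qualitative effect the abstract describes as better cross-lingual alignment. A trainable graph network would learn the aggregation weights rather than use a fixed mean.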
