Using a shared vocabulary is common practice in Multilingual Neural Machine Translation (MNMT). Beyond keeping the design simple, shared tokens play an important role in positive knowledge transfer, which arises naturally when the shared tokens have similar meanings across languages. However, such a design also has inherent flaws: 1) when languages use different writing systems, transfer is inhibited, and 2) even if languages use similar writing systems, shared tokens may have completely different meanings in different languages, increasing ambiguity. In this paper, we propose a re-parameterized method for building embeddings to alleviate the first problem. More specifically, we define word-level information transfer pathways via word equivalence classes and rely on graph networks to fuse word embeddings across languages. Our experiments demonstrate the advantages of our approach: 1) the semantics of embeddings are better aligned across languages, 2) our method achieves significant BLEU improvements on high- and low-resource MNMT, and 3) fewer than 1.0% additional trainable parameters are required, with a limited increase in computational cost.
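To make the idea of fusing embeddings along word equivalence classes concrete, the following is a minimal sketch and not the architecture used in the paper. It assumes a generic graph-convolution-style step with mean aggregation and a residual connection. The function names `build_adjacency` and `fuse_embeddings`, the toy vocabulary, and the choice of `tanh` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def build_adjacency(vocab_size, equivalence_classes):
    """Build a row-normalized adjacency matrix (assumed design, not the paper's).

    Every word is linked to itself and to all words in its equivalence class.
    """
    adj = np.eye(vocab_size)
    for cls in equivalence_classes:
        for i in cls:
            for j in cls:
                adj[i, j] = 1.0
    # Row-normalize so that aggregation is a mean over linked words.
    return adj / adj.sum(axis=1, keepdims=True)


def fuse_embeddings(emb, adj, weight):
    """One graph-convolution-style step with a residual connection.

    emb:    (vocab_size, dim) original embedding table
    adj:    (vocab_size, vocab_size) row-normalized adjacency
    weight: (dim, dim) the only extra trainable parameters in this sketch
    """
    return emb + np.tanh(adj @ emb @ weight)


# Toy vocabulary: three words assumed to be translations of each other,
# written in different scripts, plus one unrelated word.
vocab = ["dog", "Hund", "собака", "cat"]
dim = 8

rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), dim))
weight = rng.normal(scale=0.1, size=(dim, dim))

adj = build_adjacency(len(vocab), [[0, 1, 2]])
fused = fuse_embeddings(emb, adj, weight)
```

In this sketch the three linked words receive the same aggregated message even though they share no surface tokens, while the unlinked word is only transformed through its self-loop. The extra parameters are confined to `weight`, whose size depends on the embedding dimension and not on the vocabulary size.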