Complete Multilingual Neural Machine Translation (C-MNMT) achieves superior performance over conventional MNMT by constructing a multi-way aligned corpus, i.e., by aligning bilingual training examples from different language pairs whenever their source or target sides are identical. However, since exactly identical sentences across different language pairs are scarce, the power of the multi-way aligned corpus is limited by its scale. To address this problem, this paper proposes "Extract and Generate" (EAG), a two-step approach to constructing a large-scale, high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing bilingual examples from different language pairs whose source or target sentences are highly similar; we then generate the final aligned examples from these candidates with a well-trained generation model. With this two-step pipeline, EAG constructs a large-scale multi-way aligned corpus whose diversity is almost identical to that of the original bilingual corpus. Experiments on two publicly available datasets, i.e., WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with gains of +1.1 and +1.4 BLEU points on the two datasets, respectively.
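The extraction step of EAG can be illustrated with a minimal sketch. Here two toy bilingual corpora (German-English and French-English, both sharing English on one side) are paired whenever their English sentences exceed a similarity threshold, yielding candidate three-way aligned examples. All data, names, and the character-level similarity measure (difflib) are illustrative assumptions; the paper's actual extraction criterion and generation model are not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical toy corpora of (source, target) pairs; both language pairs
# share English on the target side (all sentences are illustrative).
de_en = [("Guten Morgen", "Good morning"), ("Danke schön", "Thank you very much")]
fr_en = [("Bonjour", "Good morning!"), ("Merci beaucoup", "Thanks a lot")]

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (a stand-in for the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def extract_candidates(corpus_a, corpus_b, threshold=0.8):
    """Step 1 of EAG (sketch): pair examples from two language pairs whose
    shared-side sentences are highly similar, not necessarily identical."""
    candidates = []
    for src_a, tgt_a in corpus_a:
        for src_b, tgt_b in corpus_b:
            if similarity(tgt_a, tgt_b) >= threshold:
                # Candidate multi-way aligned example: (German, French, English).
                candidates.append((src_a, src_b, tgt_a))
    return candidates

print(extract_candidates(de_en, fr_en))
# → [('Guten Morgen', 'Bonjour', 'Good morning')]
```

In the full method, such candidates still differ slightly on the shared side; the second step feeds them to a generation model that produces the final, fully aligned examples.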