Understanding representation transfer in multilingual neural machine translation can reveal the representational issues that cause the zero-shot translation deficiency. In this work, we introduce the identity pair, a sentence translated into itself, to address the lack of a baseline measure in multilingual investigations, since the identity pair represents the optimal state of representation for any language transfer. Our analysis demonstrates that the encoder transfers the source language into the representational subspace of the target language rather than into a language-agnostic state. The zero-shot translation deficiency thus arises because representations become entangled with other languages and are not transferred effectively to the target language. Based on these findings, we propose two methods: 1) low-rank language-specific embedding at the encoder, and 2) language-specific contrastive learning of representations at the decoder. Experimental results on the Europarl-15, TED-19, and OPUS-100 datasets show that our methods substantially improve zero-shot translation performance by enhancing language transfer capacity, providing practical evidence for our conclusions.
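The abstract does not specify how the low-rank language-specific embedding is parameterized. A minimal sketch of one plausible formulation, assuming each language embedding is factored through a small shared basis (all names and dimensions here are illustrative, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank, n_langs = 512, 8, 15

# Low-rank factorization: the embedding for language l is e_l = A @ b_l,
# where A (d_model x rank) is shared across languages and b_l (rank,)
# is language-specific. This uses far fewer language-specific parameters
# than storing a full d_model vector per language.
A = rng.normal(size=(d_model, rank)) / np.sqrt(rank)  # shared basis
B = rng.normal(size=(n_langs, rank))                  # per-language coefficients

def lang_embedding(lang_id: int) -> np.ndarray:
    """Low-rank language embedding added to the encoder input."""
    return A @ B[lang_id]

# Toy encoder input (seq_len x d_model) for a sentence in language 3;
# the language embedding is broadcast-added to every token position.
tokens = rng.normal(size=(10, d_model))
tokens_with_lang = tokens + lang_embedding(3)

# Parameter comparison against full per-language embeddings.
low_rank_params = d_model * rank + n_langs * rank
full_params = n_langs * d_model
```

With these toy dimensions the low-rank variant stores 4,216 parameters versus 7,680 for full per-language embeddings, and the gap widens as `n_langs` and `d_model` grow.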