Zero-shot translation (ZST), which is generally based on a multilingual neural machine translation model, aims to translate between language pairs unseen in the training data. The common practice to guide the zero-shot language mapping during inference is to deliberately insert the source and target language IDs, e.g., <EN> for English and <DE> for German. Recent studies have shown that language IDs sometimes fail to navigate the ZST task, causing models to suffer from the off-target problem (non-target-language words appear in the generated translation) and, therefore, making it difficult to apply current multilingual translation models to a broad range of zero-shot language scenarios. To understand when and why the navigation capabilities of language IDs are weakened, we compare two extreme decoder input cases in the ZST directions: the Off-Target (OFF) and On-Target (ON) cases. By contrastively visualizing the contextual word representations (CWRs) of these cases with teacher forcing, we show that 1) the CWRs of different languages are effectively distributed in separate regions when the sentence and ID are matched (ON setting), and 2) if the sentence and ID are unmatched (OFF setting), the CWRs of different languages are chaotically distributed. Our analyses suggest that although they work well in ideal ON settings, language IDs become fragile and lose their navigation ability when faced with off-target tokens, which commonly occur during inference but are rare in training scenarios. In response, we employ unlikelihood tuning on the negative (OFF) samples to minimize their probability such that the language IDs can discriminate between the on- and off-target tokens during training. Experiments spanning 40 ZST directions show that our method reduces the off-target ratio by 48.0% on average, leading to a 9.1 BLEU improvement with only 0.3% extra tuning cost.
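To make the idea of unlikelihood tuning on negative (OFF) samples concrete, the following is a minimal sketch and not the implementation used in the experiments. It assumes the standard token-level unlikelihood objective, which penalizes the probability p assigned to an off-target token with -log(1 - p), combined with the usual negative log-likelihood on ON samples. The function names, the mixing weight `alpha`, and the toy logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def likelihood_loss(step_logits, token_ids):
    """Mean -log p(token): the usual objective on matched (ON) samples."""
    losses = [-math.log(softmax(logits)[tok])
              for logits, tok in zip(step_logits, token_ids)]
    return sum(losses) / len(losses)


def unlikelihood_loss(step_logits, token_ids, eps=1e-8):
    """Mean -log(1 - p(token)): pushes probability away from OFF tokens."""
    losses = [-math.log(max(1.0 - softmax(logits)[tok], eps))
              for logits, tok in zip(step_logits, token_ids)]
    return sum(losses) / len(losses)


def combined_loss(on_logits, on_tokens, off_logits, off_tokens, alpha=1.0):
    """ON likelihood plus alpha-weighted OFF unlikelihood (alpha is assumed)."""
    return (likelihood_loss(on_logits, on_tokens)
            + alpha * unlikelihood_loss(off_logits, off_tokens))


if __name__ == "__main__":
    # Toy two-token vocabulary: index 0 = target-language token,
    # index 1 = off-target token. One decoding step per sample.
    on_logits, on_tokens = [[2.0, 0.0]], [0]
    off_logits, off_tokens = [[2.0, 0.0]], [1]
    print(combined_loss(on_logits, on_tokens, off_logits, off_tokens))
```

In this sketch the unlikelihood term grows as the model places more probability on the off-target token under a mismatched language ID, so minimizing it teaches the model to distinguish on-target from off-target continuations. How the OFF samples are constructed and how the two terms are weighted are details the abstract does not specify.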