Simultaneous machine translation (SiMT) is a challenging task that requires starting translation before the full source sentence is available. The prefix-to-prefix framework is often applied to SiMT: the model learns to predict target tokens using only a partial source prefix. However, because word order differs between languages, misaligned prefix pairs cause SiMT models to suffer from serious hallucination problems, i.e., target outputs that are unfaithful to the source input. Such problems not only produce target tokens that are not supported by the source prefix, but also hinder the model from generating the correct translation as more source words are received. In this work, we propose a Confidence-Based Simultaneous Machine Translation (CBSiMT) framework, which uses model confidence to detect hallucinated tokens and mitigates their negative impact through weighted prefix-to-prefix training. Specifically, token-level and sentence-level weights are computed from model confidence and applied to the loss function. The token-level weight explicitly quantifies the faithfulness of each generated target token, and the sentence-level weight reduces the disturbance that sentence pairs with severe word order differences cause to the model. Experimental results on the MuST-C English-to-Chinese and WMT15 German-to-English SiMT tasks show that our method consistently improves translation quality at most latency regimes, with gains of up to 2 BLEU points at low latency.
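To make the idea of confidence-weighted training concrete, the following is a minimal sketch and not the CBSiMT formulation itself, since the abstract does not give the weight definitions. It assumes, purely for illustration, that the token-level weight is the probability the model assigns to the reference token (treated as a constant) and that the sentence-level weight is the mean of those probabilities. The function names `plain_loss` and `weighted_prefix_loss` are hypothetical.

```python
import math

def plain_loss(batch):
    """Unweighted mean negative log-likelihood over all target tokens.

    batch: list of sentences; each sentence is a list of probabilities in
    (0, 1] that the model assigns to the reference target tokens when it
    only sees a source prefix.
    """
    losses = [-math.log(p) for probs in batch for p in probs]
    return sum(losses) / len(losses)

def weighted_prefix_loss(batch):
    """Confidence-weighted negative log-likelihood (illustrative only)."""
    total, norm = 0.0, 0.0
    for probs in batch:
        # Assumed sentence-level weight: mean confidence over the sentence,
        # so pairs the model finds hard to align contribute less.
        sent_w = sum(probs) / len(probs)
        for p in probs:
            # Assumed token-level weight: the confidence itself, treated as
            # a constant, so low-confidence tokens are down-weighted.
            tok_w = p
            total += sent_w * tok_w * -math.log(p)
            norm += sent_w * tok_w
    return total / norm

if __name__ == "__main__":
    # The second token has low confidence, as a token unsupported by the
    # source prefix might.
    batch = [[0.9, 0.1], [0.8, 0.7]]
    print(plain_loss(batch), weighted_prefix_loss(batch))
```

Under these assumptions, a low-confidence token contributes less to the weighted loss than to the plain one, which is the qualitative effect the weighting is meant to have on target tokens that the source prefix does not support.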