How can end-to-end speech translation (ST) be improved by leveraging text machine translation (MT) data? Among existing techniques, multi-task learning is an effective way to share knowledge between ST and MT, in which additional MT data helps the model learn the source-to-target mapping. However, because speech and text differ, a gap always remains between ST and MT. In this paper, we first aim to understand this modality gap in terms of target-side representation differences, and link it to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except in some difficult cases, but keeps growing during inference due to a cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground-truth words and self-generated words with a varying probability. Furthermore, we introduce token-level adaptive training, which assigns different training weights to target tokens to handle difficult cases with large modality gaps. Experiments and analysis show that our approach effectively bridges the modality gap and achieves promising results in all eight directions of the MuST-C dataset.
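The two ingredients named above, a target-side context sampled between ground-truth and self-generated words, and a token-weighted regularizer on the ST and MT output predictions, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not state which divergence is used or how token weights are computed, so the symmetric KL divergence, the linear weighting rule, and every name here (`mix_context`, `symmetric_kl`, `weighted_regularizer`, `p_gold`, `base`, `scale`) are assumptions made for illustration only.

```python
import math
import random


def mix_context(gold, predicted, p_gold, rng):
    """Scheduled sampling: keep each gold token with probability p_gold,
    otherwise substitute the model's own prediction at that position."""
    return [g if rng.random() < p_gold else p for g, p in zip(gold, predicted)]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Per-token gap between two output distributions.
    Assumption: the abstract does not name the divergence."""
    return 0.5 * (kl(p, q) + kl(q, p))


def weighted_regularizer(st_probs, mt_probs, base=1.0, scale=1.0):
    """Token-weighted cross-modal regularization term.

    Hypothetical weighting rule: a token's weight grows linearly with its
    gap, so tokens where ST and MT disagree most contribute more.
    """
    gaps = [symmetric_kl(s, m) for s, m in zip(st_probs, mt_probs)]
    weights = [base + scale * g for g in gaps]
    loss = sum(w * g for w, g in zip(weights, gaps)) / len(gaps)
    return loss, gaps, weights


# Toy usage with made-up tokens and a 3-word vocabulary.
rng = random.Random(0)
gold = ["wir", "sehen", "uns", "morgen"]
pred = ["wir", "sehen", "sich", "morgen"]
context = mix_context(gold, pred, p_gold=0.5, rng=rng)

st = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # ST predictions for two tokens
mt = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]  # MT predictions for the same tokens
loss, gaps, weights = weighted_regularizer(st, mt)
```

In this toy setup, `p_gold` plays the role of the varying sampling probability: at 1.0 the context is pure ground truth (teacher forcing), and lowering it exposes the model to more of its own predictions. The first token, on which ST and MT agree, has zero gap and the base weight, while the second gets a larger weight.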