The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed for reducing this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design. We provide a quantitative comparison with the more common cross-entropy loss, showing that pre-training with CTC consistently achieves better final ST accuracy. Nevertheless, CTC is only a partial solution and thus, in our second contribution, we propose a novel pre-training method combining CTC and optimal transport to further reduce this gap. Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. Finally, our method can also be applied on top of these multi-task systems, leading to further improvements for these models.
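To make the idea of pulling the two encoders' outputs together in the Wasserstein space more concrete, the following is a minimal sketch of an optimal-transport distance between a sequence of speech features and a sequence of text features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the Wasserstein distance is computed, so the sketch assumes entropy-regularized optimal transport solved with Sinkhorn iterations, uniform mass on every frame and token, and a squared Euclidean cost. The function name `sinkhorn_ot_loss` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn_ot_loss(speech, text, eps=0.05, n_iters=200):
    """Entropy-regularized OT cost between two feature sequences.

    speech: (n, d) array of acoustic-encoder outputs (assumed layout).
    text:   (m, d) array of text-encoder outputs (assumed layout).
    Returns the transport cost and the (n, m) transport plan.
    """
    n, m = speech.shape[0], text.shape[0]

    # Pairwise squared Euclidean cost between every frame and every token.
    diff = speech[:, None, :] - text[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)

    # Rescale the cost to [0, 1] so that exp(-cost / eps) does not underflow.
    scale = cost.max()
    if scale == 0.0:
        scale = 1.0
    kernel = np.exp(-(cost / scale) / eps)

    # Assumption: uniform mass on each speech frame and each text token.
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)

    # Sinkhorn iterations: alternately rescale rows and columns of the kernel.
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)

    plan = u[:, None] * kernel * v[None, :]
    # Report the cost in the original (unscaled) units.
    return float(np.sum(plan * cost)), plan


# Usage: 5 speech frames and 3 text tokens in a 4-dimensional feature space.
rng = np.random.default_rng(0)
speech_feats = rng.normal(size=(5, 4))
text_feats = rng.normal(size=(3, 4))
loss, plan = sinkhorn_ot_loss(speech_feats, text_feats)
print(loss, plan.shape)
```

Because the transport plan handles sequences of different lengths, this kind of loss does not require speech frames and text tokens to be aligned one to one. In a training setup one would typically compute it in an automatic-differentiation framework so that gradients reach both encoders; how it is weighted against the CTC loss is not stated in the abstract.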