The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed to reduce this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design. We provide a quantitative comparison with the more common cross-entropy loss, showing that pre-training with CTC consistently achieves better final ST accuracy. Nevertheless, CTC is only a partial solution and thus, in our second contribution, we propose a novel pre-training method combining CTC and optimal transport to further reduce this gap. Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. Finally, our method can also be applied on top of these multi-task systems, leading to further improvements for these models. Code and pre-trained models are available at https://github.com/formiel/fairseq.
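To make "close to each other in the Wasserstein space" concrete, here is a minimal sketch of an optimal transport cost between a sequence of speech-encoder outputs and a shorter sequence of text-encoder outputs. It is not the authors' implementation (see the linked repository for that). The sketch assumes uniform mass on every position, a squared Euclidean cost, and entropic regularization solved with Sinkhorn iterations. The function names, `eps`, and `n_iters` are illustrative choices, and the CTC term of the pre-training objective is omitted.

```python
import math


def sq_euclidean(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sinkhorn_distance(speech, text, eps=1.0, n_iters=200):
    """Entropy-regularized OT cost between two sequences of vectors.

    Assumptions (not taken from the paper): uniform weights on both
    sequences, squared Euclidean ground cost, fixed iteration count.
    Very small `eps` relative to the costs can underflow the kernel.
    """
    n, m = len(speech), len(text)
    a = [1.0 / n] * n  # uniform mass on speech positions
    b = [1.0 / m] * m  # uniform mass on text positions

    cost = [[sq_euclidean(s, t) for t in text] for s in speech]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]

    # Sinkhorn scaling vectors, updated alternately to match both marginals.
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]

    # Transport plan and its total cost.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


if __name__ == "__main__":
    # Toy 2-D "representations": four speech frames, two text tokens.
    speech = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    text = [[0.1, 0.0], [2.9, 0.0]]
    print(sinkhorn_distance(speech, text))
```

Because the transport plan couples positions softly, the two sequences do not need the same length, which suits speech inputs that are typically much longer than their transcripts. In a training setup this quantity would be computed on differentiable tensors and minimized together with the other loss terms.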