End-to-end speech translation (ST) is the task of translating speech signals in a source language into text in a target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performance is limited by the modality gap between speech and text. In this paper, we propose Cross-modal Mixup via Optimal Transport (CMOT) to overcome the modality gap. We find the alignment between speech and text sequences via optimal transport and then use this alignment to mix up the sequences from the two modalities at the token level. Experiments on the MuST-C ST benchmark demonstrate that CMOT achieves an average BLEU of 30.0 across 8 translation directions, outperforming previous methods. Further analysis shows that CMOT can adaptively find the alignment between modalities, which helps alleviate the modality gap between speech and text. Code is publicly available at https://github.com/ictnlp/CMOT.
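The two steps named above (align speech and text with optimal transport, then mix the sequences token by token) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is in the linked repository. The function names `sinkhorn_plan` and `cross_modal_mixup`, the parameters `reg` and `keep_speech_prob`, the uniform mass on every position, the Euclidean cost, the entropy-regularized Sinkhorn solver, and the argmax rule for picking each frame's aligned token are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Approximate an optimal transport plan with Sinkhorn iterations.

    cost: (n, m) cost matrix; a: (n,) source mass; b: (m,) target mass.
    Returns an (n, m) plan whose rows sum to a and columns (nearly) to b.
    """
    K = np.exp(-cost / reg)          # Gibbs kernel of the regularized problem
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]


def cross_modal_mixup(speech, text, keep_speech_prob=0.5, reg=0.1, rng=None):
    """Mix a speech sequence with its OT-aligned text embeddings.

    speech: (n, d) speech features; text: (m, d) text token embeddings.
    Returns (mixed, align, plan), where mixed has shape (n, d) and
    align[i] is the text token index assigned to speech frame i.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = len(speech), len(text)

    # Pairwise Euclidean distances, scaled to [0, 1] for numerical stability.
    cost = np.linalg.norm(speech[:, None, :] - text[None, :, :], axis=-1)
    cost = cost / max(cost.max(), 1e-12)

    # Assumption: every frame and every token carries equal mass.
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    plan = sinkhorn_plan(cost, a, b, reg=reg)

    # Assumption: align each frame to the token receiving most of its mass.
    align = plan.argmax(axis=1)

    # Per position, keep the speech frame or swap in the aligned text token.
    keep = rng.random(n) < keep_speech_prob
    mixed = np.where(keep[:, None], speech, text[align])
    return mixed, align, plan
```

Calling `cross_modal_mixup(speech, text)` on two arrays that share the same feature dimension returns a sequence of the same length as `speech`, in which each position holds either the original speech frame or the text embedding it was aligned to. A dedicated solver (for example from an optimal transport library) could replace the hand-written Sinkhorn loop.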