Multilingual speech-to-text translation is an active research area, and a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, a multilingual speech transformer encoder trained using multi-modal (speech-text) semantic knowledge distillation, we achieve significantly better cross-lingual task knowledge transfer than the baseline XLS-R, a multilingual speech transformer encoder trained via self-supervised learning. We demonstrate the effectiveness of our approach on two popular datasets, namely CoVoST-2 and Europarl. On the 21 translation tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU points over the baselines. In the zero-shot translation scenario, we achieve average gains of 18.8 and 11.9 BLEU points on unseen medium- and low-resource languages, respectively. We make similar observations on the Europarl speech translation benchmark.
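
The encoder initialization described above can be illustrated with a minimal, framework-agnostic sketch. It is not the authors' implementation: parameters are modeled as plain Python dictionaries, and the parameter names, the `encoder.` prefix, and the helper functions `init_params` and `warm_start_encoder` are all hypothetical. The point is only that the translation model's encoder weights are overwritten with those of a pretrained speech encoder (SAMU-XLS-R or XLS-R in the paper), while the decoder keeps its own initialization.

```python
import random


def init_params(names, size=4, seed=0):
    """Return a toy parameter dictionary: name -> list of random floats."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-0.1, 0.1) for _ in range(size)] for name in names}


def warm_start_encoder(model_params, pretrained_encoder_params, prefix="encoder."):
    """Copy pretrained encoder weights into a translation model's parameters.

    Parameters whose names do not start with `prefix` (for example the decoder)
    keep their original values. The input dictionaries are not modified.
    """
    merged = dict(model_params)
    for name, values in pretrained_encoder_params.items():
        key = prefix + name
        if key not in merged:
            raise KeyError(f"unexpected pretrained parameter: {name}")
        merged[key] = list(values)
    return merged


# Hypothetical parameter names, chosen only for illustration.
pretrained = init_params(["layer0.weight", "layer1.weight"], seed=1)
model = init_params(
    ["encoder.layer0.weight", "encoder.layer1.weight", "decoder.layer0.weight"],
    seed=2,
)
warm = warm_start_encoder(model, pretrained)
```

After the call, `warm` holds the pretrained values under the `encoder.` keys and the untouched decoder values. In a real system the same step would typically be a checkpoint load in the chosen deep learning framework, followed by fine-tuning on the translation data.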