Joint speech-language training is challenging because it demands large amounts of training data and GPU resources, and because of the modality gap between speech and language. We present ComSL, a speech-language model built on a composite architecture of public pretrained speech-only and language-only models and optimized in a data-efficient manner for spoken language tasks. In particular, we propose incorporating cross-modality learning into transfer learning and conducting both simultaneously for downstream tasks in a multi-task learning manner. Our approach proves effective on end-to-end speech-to-text translation, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech-to-English-text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.