Recent work in end-to-end speech-to-text translation (ST) has proposed multi-tasking methods with soft parameter sharing, which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose an ST/MT multi-tasking framework with hard parameter sharing, in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage that converts speech and text inputs into two discrete token sequences of similar length; this allows models to process both modalities indiscriminately using only a joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU without any external MT data. Further, we show that this framework incorporates external MT data, yielding +0.8 BLEU, and also improves transfer learning from pre-trained textual models, yielding +1.8 BLEU.
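To make the joint-vocabulary idea concrete, the following is a minimal sketch and not the pipeline used in this work. It assumes that speech has already been quantized into integer unit IDs (for example, by clustering frame-level features) and that text has already been split into subword tokens; neither step is specified in the abstract. The helper names (`collapse_repeats`, `build_joint_vocab`, `encode_speech`, `encode_text`) are hypothetical, and merging repeated units is only one assumed way of bringing the speech sequence closer in length to the text sequence.

```python
from itertools import groupby


def collapse_repeats(units):
    """Merge runs of identical consecutive units (assumed length-reduction step)."""
    return [u for u, _ in groupby(units)]


def build_joint_vocab(speech_seqs, text_seqs):
    """Build one symbol table covering special, speech-unit, and text symbols."""
    symbols = ["<pad>", "<sos>", "<eos>"]
    # Speech units get a distinct surface form so they never collide with text.
    symbols += [f"<unit_{u}>" for u in sorted({u for s in speech_seqs for u in s})]
    symbols += sorted({t for s in text_seqs for t in s})
    return {sym: i for i, sym in enumerate(symbols)}


def encode_speech(units, vocab):
    """Map a (repeat-collapsed) unit sequence to joint-vocabulary IDs."""
    return [vocab[f"<unit_{u}>"] for u in collapse_repeats(units)]


def encode_text(tokens, vocab):
    """Map subword tokens to joint-vocabulary IDs."""
    return [vocab[t] for t in tokens]


# Toy inputs: frame-level speech units and subword tokens (illustrative only).
speech_units = [7, 7, 7, 3, 3, 12, 12, 12, 12, 7]
text_tokens = ["hel", "lo", "world"]

vocab = build_joint_vocab([speech_units], [text_tokens])
speech_ids = encode_speech(speech_units, vocab)
text_ids = encode_text(text_tokens, vocab)

# Both ID sequences index the same table, so a single embedding layer and a
# single set of model parameters can consume either modality.
print(speech_ids, text_ids)
```

Because both sequences are drawn from one ID space, a model with fully shared parameters can be trained on ST and MT examples without modality-specific input layers, which is the property that hard parameter sharing relies on.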