End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because, in contrast to the conventional cascade approach, it can utilize full acoustic information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and to output unnatural sentences. To overcome this drawback, we propose, for the first time, to integrate a pre-trained language model (LM), which is highly capable of generating natural sentences, into the E2E SSum decoder via transfer learning. In addition, to reduce the gap between the independently pre-trained encoder and decoder, we also propose to transfer the baseline E2E SSum encoder instead of the commonly used automatic speech recognition encoder. Experimental results show that the proposed model outperforms the baseline and data-augmented models.