End-to-end speech summarization (E2E SSum) is a technique for generating summary sentences directly from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, because collecting a large amount of paired data (i.e., speech and summary) is difficult, the available training data is usually insufficient to train a robust E2E SSum system. In this paper, we present two novel methods that leverage a large amount of external text summarization data for E2E SSum training. The first uses a text-to-speech (TTS) system to generate synthesized speech, which is paired with the text summary for E2E SSum training. The second is a TTS-free method that feeds a phoneme sequence, instead of synthesized speech, directly into the E2E SSum model. Experiments show that our proposed TTS- and phoneme-based methods improve several metrics on the How2 dataset. In particular, our best system outperforms the previous state-of-the-art system by a large margin, with a METEOR score improvement of more than 6 points. To the best of our knowledge, this is the first work to use external language resources for E2E SSum. Moreover, we report a detailed analysis of the How2 dataset to confirm the validity of our proposed E2E SSum system.