Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are tailored to one or two specific tasks and do not generalize to a wide range of speech-text tasks. In addition, existing speech-text pre-training methods do not exploit the contextual information within a dialog to enrich utterance representations. In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first speech-text dialog pre-training model. Concretely, to account for the temporal nature of the speech modality, we design a novel temporal position prediction task to capture speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to the speech-text dialog pre-training scenario. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
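The two pre-training tasks described above can be illustrated with a minimal sketch. This is not the SPECTRA implementation: the names `WordSpan`, `temporal_targets`, `temporal_l1_loss`, and `response_selection_example` are hypothetical, and normalizing times by utterance duration, using an L1 loss, and sampling negatives uniformly from a corpus are all assumptions made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class WordSpan:
    """A word with its start and end time (in seconds) in the waveform."""
    word: str
    start: float
    end: float


def temporal_targets(spans, duration):
    """Build per-word (start, end) regression targets.

    Assumption: times are normalized by the utterance duration so that
    targets fall in [0, 1]; the abstract does not specify this.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [(s.start / duration, s.end / duration) for s in spans]


def temporal_l1_loss(predicted, targets):
    """Mean absolute error over all predicted start and end positions.

    Assumption: an L1 objective is used here purely as an example.
    """
    if len(predicted) != len(targets) or not targets:
        raise ValueError("predicted and targets must be non-empty and equal length")
    total = sum(
        abs(ps - ts) + abs(pe - te)
        for (ps, pe), (ts, te) in zip(predicted, targets)
    )
    return total / (2 * len(targets))


def response_selection_example(dialog, corpus, rng, negative_prob=0.5):
    """Turn a multi-turn dialog into one (context, response, label) example.

    The context is every turn except the last. With probability
    `negative_prob` the true response is swapped for a random utterance
    from `corpus` (label 0); otherwise it is kept (label 1).
    """
    if len(dialog) < 2:
        raise ValueError("dialog needs at least one context turn and a response")
    context, response = dialog[:-1], dialog[-1]
    if rng.random() < negative_prob:
        candidates = [u for u in corpus if u != response]
        if not candidates:
            raise ValueError("corpus has no utterance that differs from the response")
        return context, rng.choice(candidates), 0
    return context, response, 1
```

In a speech-text setting each turn would carry both a waveform and a transcript; plain strings stand in for turns here to keep the sketch self-contained.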