In spoken dialogue, two turns with identical wording can call for different responses when they are spoken in different styles. Speaking style, which carries paralinguistic and prosodic information, is the most significant difference between the text and speech modalities. Text-only LLMs used to model spoken dialogue cannot adapt their responses to the speaking style of the current turn. In this paper, we focus on enabling LLMs to listen to speaking styles and respond properly. Our goal is to teach the LLM that "even if the sentences are identical, if they are spoken in different styles, their corresponding responses might be different". Since no suitable dataset exists for this goal, we collect a speech-to-speech dataset, StyleTalk, with the following desired characteristic: when two current utterances have the same content but are spoken in different styles, their responses differ. To teach LLMs to understand and respond properly to speaking styles, we propose the Spoken-LLM framework, which models both the linguistic content and the speaking style. We train Spoken-LLM on the StyleTalk dataset and devise a two-stage training pipeline to help it better learn speaking styles. In extensive experiments, Spoken-LLM outperforms text-only baselines and prior speech LLM methods.