Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which is essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning. We use the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate that the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
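
The task ordering described above (current sentiment, then response sentiment, then response text) can be illustrated with a minimal sketch of how one dialogue turn might be flattened into a single sequence for an autoregressive model. This is an illustration under stated assumptions, not the authors' implementation: the marker strings, the `serialize_example` helper, the `<speech>` placeholder standing in for an injected speech embedding, and the sentiment label strings are all hypothetical, since the abstract does not specify the prompt format.

```python
from typing import List, Optional, Tuple

# Hypothetical marker tokens (assumption: the abstract does not give the real format).
CUR_SENT = "<current_sentiment>"
RESP_SENT = "<response_sentiment>"
RESP_TEXT = "<response_text>"
SPEECH = "<speech>"  # placeholder for the position of a speech embedding


def serialize_example(
    history: List[Tuple[str, str]],
    current_text: str,
    current_sentiment: Optional[str] = None,
    response_sentiment: Optional[str] = None,
    response_text: Optional[str] = None,
) -> str:
    """Flatten one turn into a single string in the serialized task order.

    `history` holds (text, sentiment) pairs for earlier turns. Targets are
    appended in order and serialization stops at the first missing one, so the
    returned string ends at the point where the model should continue. Each
    later task is therefore conditioned on the earlier ones.
    """
    parts = [f"{text} {CUR_SENT} {sentiment}" for text, sentiment in history]
    parts.append(f"{SPEECH} {current_text} {CUR_SENT}")

    # (target value, marker that introduces the NEXT task)
    ordered_targets = (
        (current_sentiment, RESP_SENT),
        (response_sentiment, RESP_TEXT),
        (response_text, None),
    )
    for value, next_marker in ordered_targets:
        if value is None:
            break  # the model is expected to generate from here
        parts.append(value)
        if next_marker is not None:
            parts.append(next_marker)
    return " ".join(parts)


if __name__ == "__main__":
    # Training-style sequence with all three targets filled in.
    print(serialize_example(
        history=[("how are you", "neutral")],
        current_text="i just got a new job",
        current_sentiment="positive",
        response_sentiment="positive",
        response_text="congratulations",
    ))
    # Inference-style prompt: the model would first predict the current sentiment.
    print(serialize_example(history=[], current_text="i just got a new job"))
```

Calling the helper with progressively more targets filled in mimics step-by-step decoding: the prompt first ends at the current-sentiment marker, then at the response-sentiment marker, and finally at the response-text marker, which is one simple way to realize the autoregressive conditioning between tasks.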
