As speech becomes an increasingly common modality for interacting with large language models (LLMs), it is increasingly desirable to develop systems in which LLMs can take users' emotions and speaking styles into account when generating their responses. In this work, we study the potential of an LLM to understand these aspects of speech without fine-tuning its weights. To do this, we utilize an end-to-end system with a speech encoder; the encoder is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt in which the speaker's emotion has also been specified. We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech and convey it effectively to the LLM, even when the LLM remains completely frozen. We also explore training on additional emotion- and style-related response alignment tasks, finding that they further increase the amount of paralinguistic information explicitly captured in the speech tokens. Experiments demonstrate that our system produces higher-quality and more empathetic responses to expressive speech prompts than several baselines.
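The response-alignment idea above can be viewed as distribution matching: the frozen LLM's next-token distributions given the speech-encoder tokens (student side) are pushed toward its distributions given the matching emotion-annotated text prompt (teacher side), with gradients flowing only into the encoder. The sketch below is illustrative, not the paper's implementation: it uses a toy per-position KL objective over explicit probability lists, and the function names and distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions,
    given as lists of probabilities over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def alignment_loss(teacher_dists, student_dists):
    """Average per-position KL between the frozen LLM's next-token
    distributions for the annotated text prompt (teacher) and for the
    speech-encoder prompt (student). In training, minimizing this loss
    would update only the speech encoder's parameters."""
    assert len(teacher_dists) == len(student_dists)
    total = sum(kl_divergence(p, q)
                for p, q in zip(teacher_dists, student_dists))
    return total / len(teacher_dists)

# Toy example: two response positions over a 3-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
loss = alignment_loss(teacher, student)
```

The loss is zero exactly when the speech-prompted and text-prompted distributions coincide at every position, which is the sense in which the encoder's tokens "convey" the emotion to an otherwise unmodified LLM.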