The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, its development likely involved significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. BLSP-Emo utilizes existing automatic speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment of the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and in conversations.