BLSP-Emo: Towards Empathetic Large Speech-Language Models

The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.

