This paper introduces EmpathyEar, a pioneering open-source, avatar-based multimodal empathetic chatbot that fills the gap left by traditional text-only empathetic response generation (ERG) systems. Leveraging advances in large language models, combined with multimodal encoders and generators, EmpathyEar accepts user inputs in any combination of text, audio, and vision, and produces multimodal empathetic responses, offering users not only textual replies but also digital avatars with talking faces and synchronized speech. A series of emotion-aware instruction-tuning steps is performed to equip the system with comprehensive emotional understanding and generation capabilities. In this way, EmpathyEar provides responses that achieve deeper emotional resonance, closely emulating human-like empathy. The system paves the way for the next generation of emotional intelligence, and we open-source the code for public access.