Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics, but ignore the significance of video-derived behavior information. Unlike text and audio inputs, videos rich in facial expressions, body language, and posture provide emotion-trigger signals that enable models to make more accurate emotion predictions. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) that incorporates the speaker's behaviors, including subtle facial micro-expressions, body language, and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction tuning strategy to extend the model to the conversation scenario for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms state-of-the-art methods on two benchmark datasets, and we also provide a detailed discussion of the significance of video-derived behavior information in MERC.