Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics, but ignore the significance of video-derived behavior information. Unlike text and audio inputs, videos rich in facial expressions, body language, and posture provide emotion-trigger signals that enable models to make more accurate emotion predictions. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) that incorporates the speaker's behaviors, including subtle facial micro-expressions, body language, and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction tuning strategy to extend the model to the conversation scenario for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms state-of-the-art methods on two benchmark datasets, and we also provide a detailed discussion of the significance of video-derived behavior information in MERC.