Accurate emotion perception is crucial for applications such as human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) struggle to integrate audio and to recognize subtle facial micro-expressions. To address these limitations, we introduce the MERR dataset, which contains 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotion recognition and emotion reasoning. Extensive evaluations show that Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023-SEMI challenge, and the highest unweighted average recall (UAR, 45.59) and weighted average recall (WAR, 59.37) in zero-shot evaluations on the DFEW dataset.
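For readers unfamiliar with the two recall metrics reported on DFEW, the following is a minimal sketch of how UAR and WAR are commonly computed, not the authors' evaluation code. It assumes the usual definitions: UAR is the mean of per-class recalls, so every emotion class counts equally, and WAR weights each class's recall by its sample count, which reduces to overall accuracy. The function name `uar_war` and the toy labels are hypothetical, and the function returns fractions in [0, 1] rather than the percentages quoted in the abstract.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) as fractions in [0, 1].

    UAR: unweighted mean of per-class recall (each class counts equally).
    WAR: recall weighted by class support, i.e. overall accuracy.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Number of samples per ground-truth class.
    support = Counter(y_true)
    # Number of correctly predicted samples per ground-truth class.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    per_class_recall = [correct[c] / support[c] for c in support]
    uar = sum(per_class_recall) / len(per_class_recall)
    war = sum(correct.values()) / len(y_true)
    return uar, war


# Toy, made-up labels for illustration only (not taken from DFEW).
y_true = ["happy", "happy", "happy", "sad", "angry", "angry"]
y_pred = ["happy", "happy", "sad", "sad", "angry", "happy"]
uar, war = uar_war(y_true, y_pred)
print(f"UAR = {uar:.4f}, WAR = {war:.4f}")
```

On imbalanced emotion datasets the two numbers can diverge noticeably: a model that favors frequent classes raises WAR while leaving UAR low, which is why both are typically reported together.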