Sentiment analysis and emotion recognition are crucial for applications such as human-computer interaction and depression detection. Traditional unimodal methods often fail to capture the complexity of emotional expression, in which cues from different modalities can conflict. Current Multimodal Large Language Models (MLLMs) also face challenges in detecting subtle facial expressions and in handling a wide range of emotion-related tasks. To tackle these issues, we propose M2SE, a Multistage Multitask Sentiment and Emotion Instruction Tuning Strategy for general-purpose MLLMs. It trains models in stages on a mixture of tasks: multimodal sentiment analysis, emotion recognition, facial expression recognition, emotion reason inference, and emotion cause-pair extraction. We also introduce the Emotion Multitask dataset (EMT), a custom dataset covering all five tasks. Our model, Emotion Universe (EmoVerse), is built on a basic MLLM framework without architectural modifications, yet it achieves substantial improvements across these tasks when trained with the M2SE strategy. Extensive experiments demonstrate that EmoVerse outperforms existing methods, achieving state-of-the-art results on sentiment and emotion tasks. These results highlight the effectiveness of M2SE in enhancing multimodal emotion perception. The dataset and code are available at https://github.com/xiaoyaoxinyi/M2SE.
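For a concrete picture of what multistage multitask instruction data could look like, the Python sketch below builds per-task instruction records and filters them into a staged curriculum. The five task names come from the abstract; everything else (the record schema, the prompt wording, and the perception-then-reasoning stage split) is an illustrative assumption, not the actual EMT schema or the M2SE curriculum.

```python
# Minimal sketch of multistage multitask instruction data in the spirit of
# M2SE. Task names follow the abstract; the record schema, prompts, and
# two-stage split are hypothetical, for illustration only.

# Hypothetical instruction prompts, one per task.
PROMPTS = {
    "multimodal_sentiment_analysis":
        "Given the image and text, classify the overall sentiment.",
    "emotion_recognition":
        "Identify the emotion expressed by the speaker.",
    "facial_expression_recognition":
        "Describe the facial expression shown in the image.",
    "emotion_reason_inference":
        "Explain why the speaker feels this emotion.",
    "emotion_cause_pair_extraction":
        "Extract the (emotion, cause) pairs present in the input.",
}

def build_record(task: str, image_path: str, text: str, label: str) -> dict:
    """Wrap one example as an instruction-tuning record (assumed schema)."""
    return {
        "task": task,
        "image": image_path,
        "instruction": PROMPTS[task],
        "input": text,
        "output": label,
    }

def stage_tasks(stage: int) -> set:
    """Hypothetical curriculum: stage 1 covers perception-level tasks,
    stage 2 adds the reasoning-oriented tasks on top of them."""
    perception = {
        "multimodal_sentiment_analysis",
        "emotion_recognition",
        "facial_expression_recognition",
    }
    reasoning = {"emotion_reason_inference", "emotion_cause_pair_extraction"}
    return perception if stage == 1 else perception | reasoning

def stage_mixture(stage: int, pool: list) -> list:
    """Filter a pooled multitask dataset to the tasks active in a stage."""
    active = stage_tasks(stage)
    return [record for record in pool if record["task"] in active]

# Usage: build a tiny pool and take the stage-1 slice.
pool = [
    build_record("emotion_recognition", "clip_001.jpg",
                 "I can't believe we won!", "joy"),
    build_record("emotion_cause_pair_extraction", "clip_002.jpg",
                 "She cried after reading the letter.",
                 "(sadness, reading the letter)"),
]
print(len(stage_mixture(1, pool)))  # -> 1: the reasoning task is held out of stage 1
```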