This paper presents our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition. Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples, addressing the challenge of limited labeled data. To enhance multimodal fusion while mitigating modality-specific noise, we introduce Conv-Attention, a lightweight and efficient hybrid framework. Extensive experiments validate the effectiveness of our approach. In the MER-NOISE track, our system achieves a state-of-the-art weighted average F-score of 85.30%, surpassing the second- and third-place teams by 1.47% and 1.65%, respectively. In the MER-OV track, using Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall over GPT-4V, securing the highest score among all participating large multimodal models. The code and model for Emotion-LLaMA are available at https://github.com/ZebangCheng/Emotion-LLaMA.
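For readers unfamiliar with the metric reported for MER-NOISE, the following is a minimal sketch of a weighted average F-score, assuming the common definition in which each class's F1 is weighted by its share of the ground-truth labels. The function name `weighted_f1` and the toy labels are illustrative assumptions and are not taken from the challenge's evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of ground-truth samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is >= n >= 1 here.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


if __name__ == "__main__":
    # Toy labels, purely for illustration.
    truth = ["happy", "happy", "sad", "angry"]
    preds = ["happy", "sad", "sad", "angry"]
    print(f"weighted F1 = {weighted_f1(truth, preds):.4f}")
```

Because each class contributes in proportion to its support, frequent emotions dominate the score, which is worth keeping in mind when comparing systems on imbalanced label distributions.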