This paper presents our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition. Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples, addressing the challenge of limited labeled data. To enhance multimodal fusion while mitigating modality-specific noise, we introduce Conv-Attention, a lightweight and efficient hybrid framework. Extensive experimentation validates the effectiveness of our approach. In the MER-NOISE track, our system achieves a state-of-the-art weighted average F-score of 85.30%, surpassing the second- and third-place teams by 1.47% and 1.65%, respectively. For the MER-OV track, our use of Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall compared to GPT-4V, securing the highest score among all participating large multimodal models. The code and model for Emotion-LLaMA are available at https://github.com/ZebangCheng/Emotion-LLaMA.