In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) of MER2024. First, to enhance the performance of the feature extractors on emotion classification tasks, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism that leverages the robustness of Hubert-large and is highly effective at fusing both inter-channel and intra-channel information. Third, to improve model accuracy, we iteratively apply self-supervised learning, assigning pseudo-labels to high-confidence unlabeled samples and adding them to the training set. Finally, through black-box probing, we discovered an imbalance in the data distribution between the training and test sets, so we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
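The iterative pseudo-labeling step can be sketched as a standard self-training loop: train on the labeled set, predict on the unlabeled pool, promote predictions above a confidence threshold to pseudo-labels, and retrain until no confident samples remain. The toy nearest-centroid classifier, the 0.9 threshold, and the 1-D features below are illustrative assumptions, not the authors' actual models or hyperparameters.

```python
import math

def train_centroids(X, y):
    """Toy stand-in for model training: per-class mean of 1-D features."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: sum(v) / len(v) for c, v in groups.items()}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence via softmax over negative distances."""
    scores = {c: -abs(x - m) for c, m in centroids.items()}
    z = sum(math.exp(s) for s in scores.values())
    label = max(scores, key=scores.get)
    return label, math.exp(scores[label]) / z

def self_training(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=5):
    """Iteratively promote high-confidence unlabeled samples to pseudo-labels."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = train_centroids(X_lab, y_lab)
        preds = [(x, *predict_with_confidence(model, x)) for x in pool]
        keep = [(x, lab) for x, lab, conf in preds if conf >= threshold]
        if not keep:  # no sample clears the confidence bar -> stop
            break
        for x, lab in keep:
            X_lab.append(x)
            y_lab.append(lab)
        kept_xs = {x for x, _ in keep}
        pool = [x for x in pool if x not in kept_xs]
    return train_centroids(X_lab, y_lab), len(pool)
```

Samples far from any class centroid (low confidence) stay in the pool, which mirrors the rationale in the abstract: only high-confidence predictions are trusted as pseudo-labels, limiting error accumulation across rounds.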