HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is hitting a bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale pre-training on vast unlabeled audio-visual data to advance AVER. Following prior work in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike these methods, which focus exclusively on top-layer representations and neglect explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of the learned representations. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, which indicates that HiCMAE is a powerful audio-visual emotion representation learner. Code and models will be publicly available at https://github.com/sunlicai/HiCMAE.
