HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

From arXiv: 19 pages, 8 figures, 15 tables. The full abstract is shown in the PDF file. Code and models will be publicly available at https://github.com/sunlicai/HiCMAE

Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is reaching a bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose the Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale pre-training on vast unlabeled audio-visual data to advance AVER. Following prior work in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike prior methods, which focus exclusively on top-layer representations and neglect explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of the learned representations. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, which indicates that HiCMAE is a powerful audio-visual emotion representation learner. Code and models will be publicly available at https://github.com/sunlicai/HiCMAE.
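
To make the two forms of self-supervision named in the abstract more concrete, here is a minimal NumPy sketch of a masked-reconstruction loss and a symmetric InfoNCE contrastive loss on toy data. It is a generic illustration, not the HiCMAE implementation: the function names, the 0.75 mask ratio, the 0.07 temperature, and the equal weighting of the two losses are all assumptions made for this example, and the hierarchical three-pronged strategy is not modeled here.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared reconstruction error computed over masked tokens only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def info_nce(audio, video, temperature=0.07):
    """Symmetric InfoNCE: row i of `audio` and row i of `video` form a positive pair."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # pairwise cosine similarities, scaled

    def cross_entropy_diag(z):
        # Numerically stable log-softmax per row; the target is the diagonal entry.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return float(0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)))

rng = np.random.default_rng(0)

# Masked data modeling on toy tokens (hypothetical sizes).
tokens = rng.normal(size=(16, 8))      # 16 tokens, feature dimension 8
mask = random_mask(16, 0.75, rng)      # assumed mask ratio
pred = np.zeros_like(tokens)           # stand-in for a decoder's output
recon_loss = masked_mse(pred, tokens, mask)

# Contrastive learning on toy clip-level embeddings (4 paired clips).
audio_emb = rng.normal(size=(4, 8))
video_emb = audio_emb + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
contrast_loss = info_nce(audio_emb, video_emb)

total_loss = recon_loss + contrast_loss  # equal weighting is an assumption
```

In this sketch the reconstruction term only scores tokens the encoder never saw, while the contrastive term pulls paired audio and video embeddings together and pushes unpaired ones apart.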
