While fully supervised models have been shown to be effective for audiovisual speech emotion recognition (SER), the limited availability of labeled data remains a major challenge in the field. To address this issue, self-supervised learning approaches, such as masked autoencoders (MAEs), have gained popularity as potential solutions. In this paper, we propose the VQ-MAE-AV model, a vector quantized MAE specifically designed for audiovisual speech self-supervised representation learning. Unlike existing multimodal MAEs that process raw audiovisual speech data, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned by two pre-trained vector quantized variational autoencoders. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms state-of-the-art audiovisual SER methods.
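To make the idea of masking discrete representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each modality's pre-trained quantizer can be reduced to a nearest-neighbor lookup in a codebook, and that masking replaces a random subset of token indices with a placeholder. The codebooks, frame values, `MASK_ID`, the mask ratio, and the function names `quantize` and `mask_tokens` are all hypothetical and chosen only for illustration.

```python
import random

MASK_ID = -1  # hypothetical placeholder index for a masked token


def quantize(frames, codebook):
    """Map each continuous frame vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of token indices with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions.
    Uniform random masking is an assumption made for this sketch.
    """
    n_mask = int(round(mask_ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for pos in masked_positions:
        corrupted[pos] = MASK_ID
    return corrupted, masked_positions


# Toy stand-ins for the codebooks of the two pre-trained quantizers (assumed values).
audio_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
visual_codebook = [[0.0], [5.0]]

# Toy continuous features for four time steps in each modality.
audio_frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]
visual_frames = [[0.3], [4.6], [5.2], [-0.4]]

audio_tokens = quantize(audio_frames, audio_codebook)
visual_tokens = quantize(visual_frames, visual_codebook)

rng = random.Random(0)
masked_audio, audio_positions = mask_tokens(audio_tokens, 0.5, rng)
masked_visual, visual_positions = mask_tokens(visual_tokens, 0.5, rng)

# In a masked-autoencoder setup, the original indices at the masked positions
# would serve as prediction targets for the model.
audio_targets = [audio_tokens[p] for p in audio_positions]
visual_targets = [visual_tokens[p] for p in visual_positions]
```

Under these assumptions, a model trained on such data would receive `masked_audio` and `masked_visual` as input and be asked to recover `audio_targets` and `visual_targets`, so the learning signal comes from the discrete indices rather than from raw waveforms or pixels.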