This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) to Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstruction, V-JEPAs learn by predicting the embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to discard information irrelevant to video understanding, such as the color of a pixel region in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
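The central idea — predicting the embeddings of masked regions from unmasked ones, with the loss computed in embedding space rather than pixel space — can be illustrated with a minimal NumPy sketch. All names and shapes here are hypothetical stand-ins (the real V-JEPA encoders are large video transformers, and the target encoder is an exponential-moving-average copy of the context encoder); this is a conceptual sketch of the training signal, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the V-JEPA context and target encoders.
# Here they are simple linear maps purely to show where the loss lives.
D_in, D_emb = 32, 16
context_encoder = rng.normal(size=(D_in, D_emb))
target_encoder = context_encoder.copy()  # EMA copy in the real architecture

# A toy "video": 10 patch tokens, three of which are masked out.
tokens = rng.normal(size=(10, D_in))
mask = np.zeros(10, dtype=bool)
mask[[2, 5, 7]] = True

# Visible patches go through the context encoder; masked patches through
# the target encoder (which receives no gradient during training).
ctx = tokens[~mask] @ context_encoder   # (7, D_emb) visible embeddings
tgt = tokens[mask] @ target_encoder     # (3, D_emb) prediction targets

# A shallow predictor maps the pooled context to each masked embedding.
predictor = rng.normal(size=(D_emb, D_emb)) * 0.1
pooled = ctx.mean(axis=0)                            # (D_emb,)
pred = np.tile(pooled @ predictor, (mask.sum(), 1))  # (3, D_emb)

# Key point: the L2 loss compares embeddings, never reconstructed pixels,
# so low-level appearance (e.g. background color) need not be modeled.
loss = np.mean((pred - tgt) ** 2)
print(f"embedding-space L2 loss: {loss:.4f}")
```

Because the objective never touches pixel values, the encoder is free to ignore appearance details that a reconstruction loss would force it to model, which is the property the abstract argues benefits FER.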