A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

This paper addresses the expression (EXPR) recognition challenge of the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. The task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the visual features extracted at each scale are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. The features of the two modalities are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. In experiments on the ABAW dataset, the two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, outperforming the official baselines.
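To make the Stage II pipeline concrete, the following is a minimal sketch of gated fusion followed by inference-time temporal smoothing. It is an illustration under stated assumptions, not the implementation used in this work: the abstract does not specify the form of the gate or of the smoothing, so the sketch assumes a per-dimension sigmoid gate over the concatenated features and a centered moving average over per-frame class probabilities. It further assumes that the visual and audio features have already been projected to the same dimension. The names gated_fusion, smooth_probs, gate_weights, gate_bias, and window are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, audio, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    Assumed form (not taken from the paper):
        g     = sigmoid(W [visual; audio] + b)
        fused = g * visual + (1 - g) * audio
    gate_weights holds d rows of length 2d; gate_bias has length d.
    """
    concat = visual + audio  # list concatenation: [visual; audio]
    fused = []
    for i, (v, a) in enumerate(zip(visual, audio)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        fused.append(g * v + (1.0 - g) * a)
    return fused


def smooth_probs(frame_probs, window=5):
    """Centered moving average over per-frame class probabilities.

    The window is truncated at the start and end of the sequence.
    """
    half = window // 2
    n = len(frame_probs)
    smoothed = []
    for t in range(n):
        span = frame_probs[max(0, t - half):min(n, t + half + 1)]
        num_classes = len(frame_probs[t])
        smoothed.append(
            [sum(p[c] for p in span) / len(span) for c in range(num_classes)]
        )
    return smoothed


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy usage: a single-frame flicker in a two-class sequence is smoothed away.
probs = [[0.9, 0.1], [0.9, 0.1], [0.2, 0.8], [0.9, 0.1], [0.9, 0.1]]
labels = [argmax(p) for p in smooth_probs(probs, window=3)]
```

In this toy sequence the middle frame is predicted as class 1 before smoothing and as class 0 afterwards, which is the kind of frame-to-frame instability that inference-time smoothing is meant to suppress. With zero gate weights and zero bias the gate equals 0.5 in every dimension, so the fusion reduces to a plain average of the two modalities.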
