Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often deliver unsatisfactory performance, largely because far fewer training samples are available than for SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate the exchange of emotional information. Unfortunately, vanilla multi-task learning leads to negative transfer in our study. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting knowledge shared between static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65\%, 58.44\%, and 76.68\%, respectively. Additionally, a systematic correlation analysis between the SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
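To make the MoAE idea concrete, the following is a minimal PyTorch sketch of a mixture-of-adapter-experts layer that mixes shared bottleneck adapters via a learned router and adds one task-specific adapter per task. The expert count, bottleneck size, routing scheme, and all names (\texttt{AdapterExpert}, \texttt{MoAE}, \texttt{task\_id}) are illustrative assumptions, not the paper's released implementation.

\begin{verbatim}
# Illustrative Mixture of Adapter Experts (MoAE) sketch; hyperparameters
# and routing details are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class AdapterExpert(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class MoAE(nn.Module):
    """Mixes shared adapter experts with one task-specific expert per task.

    A linear router yields soft weights over the shared experts for every
    token; the task id (e.g. 0 = SFER, 1 = DFER) picks the task expert.
    """

    def __init__(self, dim: int, num_shared: int = 4, num_tasks: int = 2):
        super().__init__()
        self.shared = nn.ModuleList([AdapterExpert(dim) for _ in range(num_shared)])
        self.task_specific = nn.ModuleList([AdapterExpert(dim) for _ in range(num_tasks)])
        self.router = nn.Linear(dim, num_shared)

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (batch, tokens, dim) hidden states from a shared ViT block
        weights = torch.softmax(self.router(x), dim=-1)             # (B, N, E)
        shared_out = torch.stack([e(x) for e in self.shared], dim=-1)  # (B, N, D, E)
        mixed = (shared_out * weights.unsqueeze(2)).sum(dim=-1)     # (B, N, D)
        # Residual update: shared mixture plus the task-specific adapter
        return x + mixed + self.task_specific[task_id](x)
\end{verbatim}

Under this reading, the shared experts capture emotion cues common to facial images and videos, while the per-task adapter keeps task-specific updates separate, which is one plausible way the module could mitigate the negative transfer observed with vanilla multi-task learning.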