In-the-wild dynamic facial expression recognition (DFER) faces a significant challenge in recognizing emotion-related expressions, which are often temporally and spatially diluted by emotion-irrelevant expressions and global context, respectively. Most prior DFER methods model tightly coupled spatiotemporal representations that may incorporate weakly relevant features, leading to information redundancy and emotion-irrelevant context bias. Several DFER methods have highlighted the significance of dynamic information, but they extract dynamic features in an explicit manner that relies on overly strong prior knowledge. In this paper, we propose a novel Implicit Facial Dynamics Disentanglement framework (IFDD). By extending the wavelet lifting scheme into a fully learnable framework, IFDD disentangles emotion-related dynamic information from emotion-irrelevant global context in an implicit manner, i.e., without explicit operations or external guidance. The disentanglement process of IFDD consists of two stages: an Inter-frame Static-dynamic Splitting Module (ISSM) for rough disentanglement estimation, and a Lifting-based Aggregation-Disentanglement Module (LADM) for further refinement. Specifically, ISSM exploits inter-frame correlation to generate content-aware splitting indexes on the fly. We preliminarily use these indexes to split frame features into two groups, one with greater global similarity and the other with more unique dynamic features. Subsequently, LADM first aggregates the two groups of features via an updater to obtain fine-grained global context features, and then disentangles emotion-related facial dynamic features from the global context via a predictor. Extensive experiments on in-the-wild datasets demonstrate that IFDD outperforms prior supervised DFER methods, achieving higher recognition accuracy with comparable efficiency.
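The split-update-predict pipeline that IFDD generalizes can be sketched with the classic (non-learnable) lifting scheme. The code below is an illustrative assumption, not the authors' implementation: it uses a fixed even/odd split in place of ISSM's learned content-aware indexes, a simple averaging updater, and a copy predictor, purely to show how an updater produces a global-context signal and a predictor leaves a dynamic residual.

```python
import numpy as np

def lazy_split(frames):
    # ISSM generates content-aware splitting indexes on the fly;
    # a fixed even/odd split is used here only as a stand-in assumption.
    return frames[0::2], frames[1::2]

def lift(frames):
    even, odd = lazy_split(frames)
    # Updater first (as in LADM): aggregate both groups into a
    # fine-grained global-context signal (here, a simple average).
    context = 0.5 * (even + odd)   # plays the role of emotion-irrelevant context
    # Predictor second: subtract the predicted global part, leaving a
    # residual that plays the role of emotion-related facial dynamics.
    detail = odd - context
    return context, detail

# Toy sequence of four 2-dimensional frame features.
frames = np.array([[1.0, 2.0], [1.1, 2.1], [1.0, 2.2], [1.2, 2.0]])
context, detail = lift(frames)
```

A useful property of this structure is invertibility: the original groups can be recovered exactly (`odd = context + detail`, `even = 2 * context - odd`), so disentangling dynamics this way loses no information even when the split, updater, and predictor are replaced by learned modules.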