Facial action unit (AU) detection is challenging because manual annotations are scarce. Recent self-supervised approaches address this problem by learning meaningful AU representations from large amounts of unlabeled data. However, most existing self-supervised AU detection methods use only global facial features and do not fully explore AU-specific properties such as locality and relevance. In this paper, we propose a novel self-supervised framework for AU detection based on region and relation learning. In particular, an AU-related attention map guides the model to focus on AU-specific regions, which enhances the integrity of local AU features. Meanwhile, an improved Optimal Transport (OT) algorithm is introduced to exploit the correlations among AUs. In addition, a Swin Transformer is used to model the long-distance dependencies within each AU region during feature learning. Evaluation results on BP4D and DISFA demonstrate that the proposed method is comparable or even superior to state-of-the-art self-supervised learning methods and supervised AU detection methods.