Facial action unit (AU) detection, which aims to classify the AUs present in a facial image, has long suffered from insufficient AU annotations. In this paper, we mitigate this data-scarcity issue by learning AU representations from a large number of unlabelled facial videos in a contrastive learning paradigm. We formulate the self-supervised AU representation learning signals as two-fold: (1) the AU representation should be frame-wisely discriminative within a short video clip; (2) facial frames sampled from different identities but showing analogous facial AUs should have consistent AU representations. To achieve these goals, we propose to contrastively learn the AU representation within a video clip and devise a cross-identity reconstruction mechanism to learn person-independent representations. Specifically, we adopt a margin-based temporal contrastive learning paradigm to perceive the temporal AU coherence and evolution characteristics within a clip of consecutive input facial frames. Moreover, the cross-identity reconstruction mechanism pushes faces from different identities but with analogous AUs close together in the latent embedding space. Experimental results on three public AU datasets demonstrate that the learned AU representation is discriminative for AU detection. Our method outperforms other contrastive learning methods and significantly closes the performance gap between self-supervised and supervised AU detection approaches.