Representation learning and feature disentanglement have recently attracted considerable research interest in facial expression recognition (FER). The ubiquitous ambiguity of emotion labels is detrimental to methods based on conventional supervised representation learning. Meanwhile, directly learning the mapping from a facial expression image to an emotion label lacks explicit supervision signals for facial details. In this paper, we propose a novel FER model, called Poker Face Vision Transformer (PF-ViT), that separates and recognizes the disturbance-agnostic emotion in a static facial image by generating its corresponding poker face, without the need for paired images. Inspired by the Facial Action Coding System, we regard an expressive face as the combined result of a set of facial muscle movements on one's poker face (i.e., emotionless face). The proposed PF-ViT leverages vanilla Vision Transformers, which are first pre-trained as Masked Autoencoders on a large facial expression dataset without emotion labels, yielding excellent representations. It consists of five main components: 1) an encoder that maps the facial expression to a complete representation, 2) a separator that decomposes the representation into an emotional component and an orthogonal residue, 3) a generator that reconstructs the expressive face and synthesizes the poker face, 4) a discriminator that distinguishes the fake faces produced by the generator and is trained adversarially against the encoder and generator, and 5) a classification head that recognizes the emotion. Quantitative and qualitative results demonstrate the effectiveness of our method, which outperforms state-of-the-art methods on four popular FER test sets.
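The abstract does not say how the separator is implemented, so the following is only a minimal sketch of the underlying idea: splitting a representation into an emotional component and a residue orthogonal to it. It assumes a fixed linear projection onto a hypothetical "emotion subspace"; the names `orthonormalize`, `separate`, and `emotion_basis` are illustrative and not taken from the paper. In PF-ViT the separator is presumably a learned module operating on Vision Transformer representations.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the parts of w that lie along the basis vectors found so far.
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip (near-)linearly dependent directions
            basis.append([wi / norm for wi in w])
    return basis


def separate(z, emotion_basis):
    """Split z into (emotional, residue).

    `emotional` is the projection of z onto span(emotion_basis);
    `residue` is what remains and is orthogonal to that span.
    Assumes `emotion_basis` is orthonormal.
    """
    emotional = [0.0] * len(z)
    for b in emotion_basis:
        c = dot(z, b)
        emotional = [e + c * bi for e, bi in zip(emotional, b)]
    residue = [zi - ei for zi, ei in zip(z, emotional)]
    return emotional, residue


# Toy example: an 8-dimensional representation and a 3-dimensional
# hypothetical emotion subspace (random here; learned in a real model).
random.seed(0)
dim = 8
raw_directions = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
emotion_basis = orthonormalize(raw_directions)

z = [random.gauss(0, 1) for _ in range(dim)]
emotional, residue = separate(z, emotion_basis)

# The two parts are orthogonal and sum back to the original representation.
print("orthogonal:", abs(dot(emotional, residue)) < 1e-9)
print("reconstructs z:", all(abs(e + r - zi) < 1e-9
                             for e, r, zi in zip(emotional, residue, z)))
```

In this picture, the classification head would consume only `emotional`, while the generator would use `residue` alone to synthesize the poker face and both parts together to reconstruct the expressive face. That pairing is an interpretation of the component list above, not a detail confirmed by the abstract.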