Facial expression recognition (FER) remains challenging because expressions are inherently ambiguous. The resulting noisy labels significantly degrade performance in real-world scenarios. To address this issue, we present a new FER model, Landmark-Aware Net (LA-Net), which leverages facial landmarks to mitigate the impact of label noise from two perspectives. First, LA-Net uses landmark information to suppress uncertainty in the expression space and constructs a label distribution for each sample by neighborhood aggregation, which in turn improves the quality of training supervision. Second, the model incorporates landmark information into expression representations through a dedicated expression-landmark contrastive loss, which makes the expression feature extractor less susceptible to label noise. Our method can be integrated with any deep neural network for better training supervision without introducing extra inference cost. We conduct extensive experiments on both in-the-wild datasets and synthetically noised datasets and demonstrate that LA-Net achieves state-of-the-art performance.
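To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the distance measure, the neighbor weighting, or the exact form of the contrastive loss, so everything here is an assumption. The function names `aggregate_label_distributions` and `expression_landmark_contrastive_loss`, the parameters `k` and `tau`, the softmax weighting over Euclidean landmark distances, and the InfoNCE-style loss over cosine similarities are all hypothetical choices made for illustration.

```python
import math


def _softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _dist(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _cosine(a, b):
    # Cosine similarity; assumes neither vector is all zeros.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def aggregate_label_distributions(landmarks, labels, num_classes, k=2, tau=1.0):
    """Build a soft label for each sample from its k nearest neighbors
    in landmark space (assumed weighting: softmax of -distance / tau)."""
    targets = []
    for i, lm in enumerate(landmarks):
        # Rank all other samples by landmark distance and keep the k closest.
        neighbors = sorted(
            (_dist(lm, other), j) for j, other in enumerate(landmarks) if j != i
        )[:k]
        # Keep the sample itself (distance 0) so its own label still counts.
        members = [(0.0, i)] + neighbors
        weights = _softmax([-d / tau for d, _ in members])
        dist = [0.0] * num_classes
        for w, (_, j) in zip(weights, members):
            dist[labels[j]] += w
        targets.append(dist)
    return targets


def expression_landmark_contrastive_loss(expr_feats, lm_feats, tau=0.1):
    """InfoNCE-style loss (an assumed form): the expression feature of
    sample i should be closest to the landmark feature of sample i."""
    losses = []
    for i, e in enumerate(expr_feats):
        logits = [_cosine(e, l) / tau for l in lm_feats]
        probs = _softmax(logits)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


# Toy data: sample 2 sits next to samples 0 and 1 in landmark space
# but carries a different (possibly noisy) label.
landmarks = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
labels = [0, 0, 1, 1]
targets = aggregate_label_distributions(landmarks, labels, num_classes=2, k=2)
```

In this toy setting, the soft label of sample 2 shifts toward the class of its landmark neighbors, which is the intuition behind using neighborhood aggregation to soften noisy supervision. Both functions only shape the training signal, which is consistent with the abstract's claim that no extra cost is introduced at inference time.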