Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues

Although state-of-the-art classifiers for facial expression recognition (FER) can achieve high accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (\aus) from a codebook with facial regions to visually interpret expressions. In this paper, we follow the same expert steps. We propose a new learning strategy that explicitly incorporates \au cues into classifier training, enabling the training of deep interpretable models. During training, this \au codebook is used, together with the expression label of the input image and its facial landmarks, to construct an \au heatmap that indicates the most discriminative image regions of interest with respect to the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER by constraining the spatial layer features of the classifier to be correlated with \au heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable, layer-wise visual attention correlated with \au maps, simulating the expert decision process. Our strategy relies only on the image-level expression class label for supervision, without additional manual annotations. It is generic and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, \rafdb and \affectnet, shows that the proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.
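
To make the training signal described above concrete, the following is a minimal, dependency-free sketch, not the authors' implementation. It assumes a toy codebook (`TOY_AU_CODEBOOK`, hypothetical) that maps each expression to landmark indices, builds the heatmap from Gaussian bumps at those landmarks, and uses one minus cosine similarity as the alignment term. The Gaussian construction, the cosine measure, the weight `lam`, and all function names are illustrative assumptions; the abstract only states that layer features are constrained to be correlated with the \au heatmap through a composite loss.

```python
import math

# Hypothetical toy codebook: expression -> indices of landmarks near its action units.
# These entries are illustrative assumptions, not the codebook used in the paper.
TOY_AU_CODEBOOK = {
    "happiness": [0, 1],  # e.g. two mouth-corner landmarks
    "surprise": [2, 3],   # e.g. two brow landmarks
}


def au_heatmap(label, landmarks, height, width, sigma=1.5):
    """Build a heatmap in [0, 1] with a Gaussian bump at each AU-related landmark.

    `landmarks` is a list of (x, y) pixel coordinates; only those selected by the
    codebook entry for `label` contribute to the map.
    """
    centers = [landmarks[i] for i in TOY_AU_CODEBOOK[label]]
    heat = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            heat[y][x] = max(
                math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                for cx, cy in centers
            )
    return heat


def cosine_similarity(map_a, map_b):
    """Cosine similarity between two 2D maps of equal shape, flattened to vectors."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    dot = sum(p * q for p, q in zip(flat_a, flat_b))
    norm = math.sqrt(sum(p * p for p in flat_a)) * math.sqrt(sum(q * q for q in flat_b))
    return dot / norm if norm > 0 else 0.0


def cross_entropy(logits, target_index):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def composite_loss(logits, target_index, attention, heatmap, lam=1.0):
    """Classification loss plus an alignment penalty (assumed form: 1 - cosine)."""
    alignment = 1.0 - cosine_similarity(attention, heatmap)
    return cross_entropy(logits, target_index) + lam * alignment
```

In this sketch, `attention` stands in for a spatial attention map derived from one layer of the classifier and resized to the heatmap resolution; how that map is obtained from the features is left open here. An attention map that matches the heatmap adds no penalty, while one that focuses on other regions increases the loss.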
