Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues

While state-of-the-art facial expression recognition (FER) classifiers achieve a high level of accuracy, they lack interpretability, an important aspect for end-users. To recognize basic facial expressions, experts resort to a codebook that associates a set of spatial action units with each facial expression. In this paper, we follow in the experts' footsteps and propose a learning strategy that explicitly incorporates spatial action unit (AU) cues into the classifier's training to build a deep interpretable model. In particular, using this AU codebook, the expression label of the input image, and facial landmarks, a single AU heatmap is built to indicate the most discriminative regions of interest in the image with respect to the facial expression. We leverage this valuable spatial cue to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with the AU map. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the experts' decision process. This is achieved using only the image class expression as supervision, without any extra manual annotations. Moreover, our method is generic: it can be applied to any CNN- or transformer-based deep classifier without architectural changes and without adding significant training time. Our extensive evaluation on two public benchmark datasets, RAFDB and AFFECTNET, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on class activation mapping (CAM) methods, and we show that our training technique improves CAM interpretability.
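
The training signal described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the `AU_CODEBOOK` entries and landmark indices are placeholders, the heatmap is a sum of Gaussians around the selected landmarks, layer attention is the channel-wise mean of the features, and the "correlation" constraint is modeled as one minus the cosine similarity, weighted by a hypothetical factor `lam`.

```python
import numpy as np

# Hypothetical codebook: expression -> indices of landmarks near its action units.
# These indices are placeholders, not the mapping used in the paper.
AU_CODEBOOK = {
    "happiness": [0, 1],
    "surprise": [2, 3, 4],
}

def build_au_heatmap(landmarks, expression, size=(7, 7), sigma=1.0):
    """Sum Gaussians centred on the landmarks tied to `expression`; scale to [0, 1]."""
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros(size, dtype=np.float64)
    for idx in AU_CODEBOOK[expression]:
        x, y = landmarks[idx]
        heatmap += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return heatmap / heatmap.max()

def layer_attention(features):
    """Collapse (C, H, W) features to an (H, W) attention map (assumed: channel mean)."""
    return features.mean(axis=0)

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two maps after flattening."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single sample."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def composite_loss(logits, label, features, heatmap, lam=1.0):
    """Classification loss plus a penalty for attention that disagrees with the AU map."""
    alignment_penalty = 1.0 - cosine_similarity(layer_attention(features), heatmap)
    return cross_entropy(logits, label) + lam * alignment_penalty

# Toy usage: five (x, y) landmarks on a 7x7 grid matching the feature resolution.
landmarks = np.array([[2.0, 4.0], [4.0, 4.0], [1.0, 1.0], [5.0, 1.0], [3.0, 5.0]])
heatmap = build_au_heatmap(landmarks, "happiness")
features = np.stack([heatmap, heatmap])  # attention that matches the AU map exactly
loss = composite_loss(np.array([0.0, 0.0]), 0, features, heatmap)
```

In this sketch the heatmap is built directly at the feature resolution; with a real classifier it would typically be resized to each layer's spatial size, and the penalty would be computed with a differentiable framework so that it can be backpropagated alongside the classification loss.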
