Facial expression recognition (FER) is a challenging topic in artificial intelligence. Recently, many researchers have attempted to apply the Vision Transformer (ViT) to the FER task. However, ViT cannot fully exploit the emotional features extracted from raw images, and it requires substantial computing resources. To overcome these problems, we propose a quaternion orthogonal transformer (QOT) for FER. First, to reduce redundancy among the features extracted by a pre-trained ResNet-50, we use an orthogonal loss to decompose and compact these features into three sets of orthogonal sub-features. Second, the three sets of orthogonal sub-features are integrated into a quaternion matrix, which preserves the correlations between the different orthogonal components. Finally, we develop a quaternion vision transformer (Q-ViT) for feature classification. Q-ViT replaces the original operations in ViT with quaternion operations, which improves the final accuracy with fewer parameters. Experimental results on three in-the-wild FER datasets show that the proposed QOT outperforms several state-of-the-art models while reducing the computational cost.
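The three ingredients named above (an orthogonality penalty, packing three sub-features into quaternions, and quaternion-valued layers) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not give the exact form of the orthogonal loss, so the squared-cosine penalty below is an assumption, as is placing the three sub-features in the imaginary parts with a zero real part. All function names (`hamilton`, `orthogonal_penalty`, `to_quaternions`, `quaternion_linear`, `param_counts`) are hypothetical.

```python
import math


def hamilton(p, q):
    """Hamilton product of two quaternions given as (r, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def orthogonal_penalty(subfeatures):
    """Assumed loss: sum of squared cosine similarities over all pairs.

    The value is 0 when the sub-feature vectors are mutually orthogonal.
    """
    total = 0.0
    for a in range(len(subfeatures)):
        for b in range(a + 1, len(subfeatures)):
            u, v = subfeatures[a], subfeatures[b]
            norm = math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)) + 1e-12
            cos = _dot(u, v) / norm
            total += cos * cos
    return total


def to_quaternions(f1, f2, f3):
    """Assumed packing: three sub-features become the i, j, k parts."""
    return [(0.0, x, y, z) for x, y, z in zip(f1, f2, f3)]


def quaternion_linear(weights, inputs):
    """Map n input quaternions to m output quaternions.

    `weights` holds m rows of n quaternions; each output is the sum of
    Hamilton products between one weight row and the inputs.
    """
    outputs = []
    for row in weights:
        acc = (0.0, 0.0, 0.0, 0.0)
        for w, x in zip(row, inputs):
            prod = hamilton(w, x)
            acc = tuple(s + t for s, t in zip(acc, prod))
        outputs.append(acc)
    return outputs


def param_counts(n_in, n_out):
    """Real parameters of a quaternion layer vs. a real layer of equal width."""
    quaternion_params = 4 * n_in * n_out      # 4 reals per quaternion weight
    real_params = (4 * n_in) * (4 * n_out)    # dense layer on 4n real inputs
    return quaternion_params, real_params
```

Under these assumptions, a quaternion layer with `n_in` inputs and `n_out` outputs stores a quarter of the weights of a real-valued layer operating on the same `4 * n_in` real numbers, because the Hamilton product reuses each weight's four components across all four output parts. This weight sharing is the general mechanism by which quaternion networks save parameters; the specific savings reported for Q-ViT depend on the paper's architecture.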