Facial expression recognition (FER) is an important task in computer vision with practical applications in areas such as human-computer interaction, education, healthcare, and online monitoring. Three issues are especially prevalent in this challenging task: inter-class similarity, intra-class discrepancy, and scale sensitivity. Existing works typically address some of these issues, but none addresses all three in a unified framework. In this paper, we propose a two-stream Pyramid crOss-fuSion TransformER network (POSTER) that aims to solve all three issues holistically. Specifically, we design a transformer-based cross-fusion method that enables effective collaboration between facial landmark features and image features, directing attention to salient facial regions. Furthermore, POSTER employs a pyramid structure to promote scale invariance. Extensive experimental results demonstrate that POSTER achieves new state-of-the-art results on RAF-DB (92.05%), FERPlus (91.62%), and AffectNet with 7 classes (67.31%) and 8 classes (63.34%). The code is available at https://github.com/zczcwh/POSTER.
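To make the two ideas above concrete, the following is a minimal sketch of cross-fusion between a landmark stream and an image stream, repeated over a small token pyramid. It is an illustration under stated assumptions, not the released POSTER implementation: we assume that cross-fusion can be approximated by plain scaled dot-product cross-attention with a residual connection and no learned projections, and that pyramid levels can be imitated by average-pooling tokens. All function names (`cross_attention`, `cross_fuse`, `pool_tokens`, `pyramid_cross_fusion`) and the pooling factors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query token attends over the tokens of the *other* stream.

    Assumption: no learned Q/K/V projections; the tokens themselves
    serve as queries, keys, and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def cross_fuse(landmark_tokens, image_tokens):
    """Let each stream query the other, then add a residual connection."""
    lm_att = cross_attention(landmark_tokens, image_tokens)
    img_att = cross_attention(image_tokens, landmark_tokens)
    lm_out = [[x + y for x, y in zip(t, a)] for t, a in zip(landmark_tokens, lm_att)]
    img_out = [[x + y for x, y in zip(t, a)] for t, a in zip(image_tokens, img_att)]
    return lm_out, img_out


def pool_tokens(tokens, factor):
    """Average non-overlapping groups of `factor` tokens (a coarser scale)."""
    groups = (tokens[i:i + factor] for i in range(0, len(tokens), factor))
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


def pyramid_cross_fusion(landmark_tokens, image_tokens, factors=(1, 2, 4)):
    """Cross-fuse at several token scales and concatenate the pooled results.

    Assumption: the pooling factors and the mean-pooled readout are
    illustrative choices only.
    """
    features = []
    for factor in factors:
        lm, img = cross_fuse(
            pool_tokens(landmark_tokens, factor),
            pool_tokens(image_tokens, factor),
        )
        merged = lm + img  # tokens of both streams at this scale
        features.extend(sum(col) / len(merged) for col in zip(*merged))
    return features


if __name__ == "__main__":
    # Toy inputs: 4 tokens of dimension 2 per stream (values are arbitrary).
    landmarks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    image = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5], [0.8, 0.7]]
    print(pyramid_cross_fusion(landmarks, image))
```

With three pooling factors and two-dimensional tokens, the returned feature vector has six entries, one pooled vector per scale. A real model would add learned projections, multiple heads, and a classification head on top of this representation.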