Facial expression recognition (FER) has received considerable attention in computer vision, particularly in "in-the-wild" environments such as human-computer interaction. However, FER images contain uncertainties such as occlusion, low resolution, pose variation, illumination variation, and subjectivity, including expressions that do not match the target label. Consequently, a single noisy image yields little trustworthy information, which can significantly degrade the performance of the FER task. To address this issue, we propose a batch transformer (BT), built on the proposed class batch attention (CBA) module, which prevents overfitting to noisy data and extracts trustworthy information by training on features reflected across several images in a batch rather than on the information from a single image. We also propose multi-level attention (MLA), which prevents overfitting to specific features by capturing the correlations between feature levels. Combining these components, we present the batch transformer network (BTN). Experimental results on various FER benchmark datasets show that the proposed BTN consistently outperforms the state of the art, demonstrating the promise of the proposed BTN for FER.
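The core idea of attending across images in a batch, rather than within a single image, can be illustrated with a minimal NumPy sketch. This is a hedged illustration only: the function name `batch_attention`, the identity query/key/value projections, and all dimensions are assumptions for exposition, not the paper's actual CBA module, which presumably uses learned projections and class-aware weighting.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def batch_attention(features):
    """Sketch of attention over the batch dimension: each image's
    feature vector attends to the features of every image in the
    batch, so a noisy sample can borrow information from cleaner
    samples instead of relying on its own (untrustworthy) content."""
    B, D = features.shape
    # identity projections for simplicity; a real module learns Q/K/V
    Q = K = V = features
    scores = Q @ K.T / np.sqrt(D)       # (B, B) image-to-image affinities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (B, D) batch-refined features

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))    # batch of 8 feature vectors
refined = batch_attention(feats)
print(refined.shape)                    # (8, 16)
```

The output keeps the batch shape: each refined vector is a convex combination of all features in the batch, which is the mechanism the abstract appeals to for suppressing per-image noise.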