The emergence of ConvNeXt and its variants has reaffirmed the conceptual and structural suitability of CNN-based models for vision tasks, re-establishing them as key players in image classification in general, and in facial expression recognition (FER) in particular. In this paper, we propose a new set of models that build on these advancements by incorporating a new attention mechanism that combines Triplet attention with Squeeze-and-Excitation (TripSE) in four different variants. We demonstrate the effectiveness of these variants by applying them to the ResNet18, DenseNet, and ConvNeXt architectures to validate their versatility and impact. Our study shows that incorporating a TripSE block into these CNN models boosts their performance, particularly for the ConvNeXt architecture, underscoring its utility. We evaluate the proposed mechanisms and associated models on four datasets, namely CIFAR100, ImageNet, FER2013, and AffectNet, where ConvNeXt with TripSE achieves state-of-the-art results with an accuracy of \textbf{78.27\%} on the popular FER2013 dataset, a new record for this benchmark.
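The abstract does not specify how the Triplet and Squeeze-and-Excitation components are fused, so the following is only a minimal illustrative sketch of one plausible combination: a standard SE gate (global-average-pool, two projections, sigmoid) applied along each of the three axis pairings of Triplet attention, with the branch outputs averaged. All function names, the reduction ratio, and the branch-averaging choice are assumptions for illustration, not the paper's actual design; NumPy stands in for a deep-learning framework.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """SE gate on a (C, H, W) tensor: squeeze by global average pooling,
    excite with two projections (ReLU then sigmoid), rescale channels."""
    z = x.mean(axis=(1, 2))                 # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)             # excitation, ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # sigmoid gate in (0, 1): (C,)
    return x * g[:, None, None]

def trip_se(x, params):
    """Hypothetical TripSE variant: rotate the tensor so each axis in turn
    plays the channel role (as in Triplet attention's three branches),
    apply an SE gate per branch, rotate back, and average the branches."""
    branches = []
    for axes, (w1, w2) in zip([(0, 1, 2), (1, 0, 2), (2, 1, 0)], params):
        xr = np.transpose(x, axes)           # rotate: new leading axis = "channels"
        yr = squeeze_excite(xr, w1, w2)      # gate along that axis
        branches.append(np.transpose(yr, np.argsort(axes)))  # inverse rotation
    return sum(branches) / 3.0
```

Because every SE gate lies in (0, 1), each branch only attenuates activations, so the averaged output is elementwise bounded by the input in magnitude, a useful sanity check when wiring such a block into a residual stage.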