Understanding emotions and expressions is a task of interest across multiple disciplines, especially for improving user experiences. Contrary to common perception, it has been shown that emotions are not discrete entities but instead exist along a continuum. People understand discrete emotions differently due to a variety of factors, including cultural background, individual experiences, and cognitive biases. Therefore, most approaches to expression understanding, particularly those relying on discrete categories, are inherently biased. In this paper, we present an in-depth comparative analysis of two common datasets (AffectNet and EMOTIC) equipped with the components of the circumplex model of affect. Further, we propose a model for facial expression prediction tailored to lightweight applications. Using a small-scale MaxViT-based architecture, we evaluate the impact of training with discrete expression category labels alongside continuous valence and arousal labels. We show that considering valence and arousal in addition to discrete category labels significantly improves expression inference. The proposed model outperforms the current state of the art on AffectNet, establishing it as the best-performing model for inferring valence and arousal, achieving a 7% lower RMSE. Training scripts and trained weights to reproduce our results are available at: https://github.com/wagner-niklas/CAGE_expression_inference.
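To make the joint-training idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation (which lives in the linked repository): it attaches a combined classification-plus-regression head to a small MaxViT backbone from the timm library. The backbone variant, the tanh bounding of the regression outputs, and the loss weight `alpha` are illustrative assumptions.

```python
# Minimal sketch of joint discrete + continuous expression training,
# assuming a timm MaxViT backbone. Hyperparameters are placeholders,
# not the values used in the paper.
import timm
import torch
import torch.nn as nn

NUM_CLASSES = 8  # AffectNet's discrete expression categories


class ExpressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Small MaxViT backbone; the exact variant is an assumption.
        # num_classes=0 makes timm return pooled features.
        self.backbone = timm.create_model(
            "maxvit_tiny_tf_224", pretrained=True, num_classes=0
        )
        # Joint head: class logits plus continuous valence and arousal.
        self.head = nn.Linear(self.backbone.num_features, NUM_CLASSES + 2)

    def forward(self, x):
        out = self.head(self.backbone(x))
        logits, va = out[:, :NUM_CLASSES], out[:, NUM_CLASSES:]
        return logits, torch.tanh(va)  # valence/arousal lie in [-1, 1]


def joint_loss(logits, va_pred, labels, va_true, alpha=1.0):
    # Cross-entropy for the discrete categories, MSE for valence/arousal;
    # alpha balances the two terms (its value here is a placeholder).
    ce = nn.functional.cross_entropy(logits, labels)
    mse = nn.functional.mse_loss(va_pred, va_true)
    return ce + alpha * mse
```

The design choice this illustrates is that both label types supervise a single shared representation, so the continuous valence/arousal signal can regularize the discrete-category head, which is consistent with the paper's finding that adding valence and arousal improves expression inference.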