Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address new and unseen emotions in dynamic in-the-wild FER, we propose a novel vision-language model that uses sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to learn rich latent representations suited to zero-shot classification. We evaluate the model trained on sample-level descriptions by zero-shot classification on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements over baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted Average Recall and 5% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. In particular, we achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia symptom severity estimation, which is comparable to human experts' agreement. The code is publicly available at: https://github.com/NickyFot/EmoCLIP.
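To make the evaluation protocol concrete, the following is a minimal sketch of CLIP-style zero-shot classification together with the two reported metrics. It is an illustration under stated assumptions, not the released implementation: the two-dimensional embeddings are hand-made toy values standing in for real video and text encoder outputs, and the helper names (`cosine`, `zero_shot_predict`, `war_uar`) are hypothetical. Weighted Average Recall (WAR) is taken here as overall accuracy and Unweighted Average Recall (UAR) as the mean of per-class recalls, which is the usual reading of these terms.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_predict(video_emb, class_embs):
    """Pick the class whose text embedding is closest to the video embedding."""
    return max(class_embs, key=lambda label: cosine(video_emb, class_embs[label]))


def war_uar(y_true, y_pred):
    """Return (WAR, UAR): overall accuracy and mean per-class recall."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy text embeddings, one per class (assumed values, not model outputs).
class_embs = {"happy": [1.0, 0.0], "sad": [0.0, 1.0], "angry": [-1.0, 0.0]}

# Toy video embeddings with ground-truth labels; classes are imbalanced.
samples = [
    ([0.9, 0.1], "happy"),
    ([0.8, 0.3], "happy"),
    ([0.7, 0.1], "happy"),
    ([1.0, 0.2], "happy"),
    ([0.2, 0.9], "sad"),
    ([0.6, 0.5], "sad"),    # closer to "happy", so it is misclassified
    ([-0.7, 0.2], "angry"),
]

y_true = [label for _, label in samples]
y_pred = [zero_shot_predict(emb, class_embs) for emb, _ in samples]
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.3f} UAR={uar:.3f}")  # WAR=0.857 UAR=0.833
```

Because the toy classes are imbalanced, the single error in the minority "sad" class lowers UAR more than WAR, which is why the two metrics are reported side by side.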