Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address new and unseen emotions in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to learn rich latent representations for zero-shot classification. We evaluate the model trained on sample-level descriptions by zero-shot classification on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements over baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in Weighted Average Recall (WAR) and 5% in Unweighted Average Recall (UAR) on several datasets. Furthermore, we evaluate the representations learned from sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia symptom severity estimation, which is comparable to the agreement between human experts. The code is publicly available at: https://github.com/NickyFot/EmoCLIP.
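To make the evaluation protocol concrete, the following is a minimal sketch of zero-shot classification by embedding similarity, together with the two reported metrics. It is not the EmoCLIP implementation: the function names and the two-dimensional toy embeddings are hypothetical stand-ins for the outputs of a video encoder and a text encoder, and cosine similarity is assumed as the matching score, as in CLIP-style models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_predict(video_emb, class_text_embs):
    """Return the index of the class description most similar to the video."""
    sims = [cosine(video_emb, t) for t in class_text_embs]
    return max(range(len(sims)), key=sims.__getitem__)

def weighted_average_recall(y_true, y_pred):
    """WAR: per-class recall weighted by class support (equals accuracy)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_average_recall(y_true, y_pred):
    """UAR: plain mean of per-class recall, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical toy data: one text embedding per class description,
# one video embedding per test sample (real embeddings are far larger).
class_text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5], [1.0, 0.2]]
y_true = [0, 1, 2, 2]

y_pred = [zero_shot_predict(v, class_text_embs) for v in video_embs]
war = weighted_average_recall(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
print(y_pred, war, uar)
```

In this toy run the last sample is misclassified, so WAR is 0.75 while UAR is about 0.83. The two metrics diverge whenever classes are imbalanced, which is why both are reported for in-the-wild FER datasets.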