Recently, GPT-4 with Vision (GPT-4V) has demonstrated remarkable visual capabilities across various tasks, but its performance in emotion recognition has not been fully evaluated. To bridge this gap, we present quantitative evaluation results of GPT-4V on 19 benchmark datasets covering 5 tasks: visual sentiment analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition. We refer to these tasks collectively as "Generalized Emotion Recognition (GER)". Through experimental analysis, we observe that GPT-4V generally outperforms supervised systems in visual sentiment analysis, highlighting its powerful visual understanding capabilities. Meanwhile, GPT-4V shows the ability to integrate multimodal clues and exploit temporal information, both of which are critical for emotion recognition. Despite these achievements, GPT-4V is primarily tailored for general-purpose domains and cannot recognize micro-expressions, which require specialized knowledge. To the best of our knowledge, this paper provides the first quantitative assessment of GPT-4V on the GER tasks, offering valuable insights to researchers in this field. It can also serve as a zero-shot benchmark for subsequent research. Our code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.