Recently, GPT-4 with Vision (GPT-4V) has shown remarkable performance across various multimodal tasks. However, its efficacy in emotion recognition remains unclear. This paper quantitatively evaluates GPT-4V's capabilities in multimodal emotion understanding, encompassing tasks such as facial emotion recognition, visual sentiment analysis, micro-expression recognition, dynamic facial emotion recognition, and multimodal emotion recognition. Our experiments show that GPT-4V exhibits impressive multimodal and temporal understanding capabilities, even surpassing supervised systems on some tasks. Nevertheless, GPT-4V is currently tailored for general domains and performs poorly on micro-expression recognition, which requires specialized expertise. The main purpose of this paper is to present quantitative results for GPT-4V on emotion understanding and to establish a zero-shot benchmark for future research. Code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.