Why Do Vision Language Models Struggle To Recognize Human Emotions?

Understanding emotions is a fundamental ability for intelligent systems that interact with humans. Vision-language models (VLMs) have made tremendous progress on many visual tasks in the last few years, which makes them a promising candidate for emotion understanding. Surprisingly, however, even the most sophisticated contemporary VLMs struggle to recognize human emotions and to outperform even specialized vision-only classifiers. In this paper we ask "Why do VLMs struggle to recognize human emotions?" and observe that dynamic facial expression recognition (DFER), an inherently continuous and dynamic task, exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing the models to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that avoid favoring common concepts. Second, temporal information is critical for understanding emotions, yet VLMs cannot represent it over dense frame sequences because they are limited by context size and by the number of tokens that fit in memory. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that exploits the information in the "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
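
The mismatch between sparse temporal sampling and micro-expression duration can be illustrated with simple arithmetic. The sketch below is only an illustration under stated assumptions and is not the sampling procedure of any particular VLM: it assumes frames are taken at the centers of equal-length segments, and the clip length (8 s), frame budgets (8 and 32), and event timing (onset 2.6 s, length 0.4 s) are hypothetical values chosen for the example. Only the 0.25-0.5 second range comes from the text above.

```python
from typing import List


def uniform_sample_times(duration_s: float, num_frames: int) -> List[float]:
    """Timestamps of frames taken at the centers of equal-length segments.

    Assumption: uniform, center-of-segment sampling (illustrative only).
    """
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    step = duration_s / num_frames
    return [(i + 0.5) * step for i in range(num_frames)]


def frames_hitting_event(sample_times: List[float],
                         onset_s: float,
                         length_s: float) -> int:
    """Number of sampled frames that fall inside [onset, onset + length]."""
    end_s = onset_s + length_s
    return sum(1 for t in sample_times if onset_s <= t <= end_s)


def sampling_gap(duration_s: float, num_frames: int) -> float:
    """Time between adjacent sampled frames.

    Any event shorter than this gap can fall entirely between two samples.
    """
    return duration_s / num_frames


# Hypothetical clip: 8 seconds long, micro-expression from 2.6 s to 3.0 s.
CLIP_S = 8.0
ONSET_S, LENGTH_S = 2.6, 0.4

sparse = uniform_sample_times(CLIP_S, 8)    # hypothetical sparse budget
dense = uniform_sample_times(CLIP_S, 32)    # hypothetical denser budget

sparse_hits = frames_hitting_event(sparse, ONSET_S, LENGTH_S)
dense_hits = frames_hitting_event(dense, ONSET_S, LENGTH_S)
```

With these assumed numbers, the 8-frame schedule has a 1-second gap between samples and captures no frame of the 0.4-second event, while the 32-frame schedule (0.25-second gap) captures two. Denser sampling, however, multiplies the number of visual tokens, which is the context and memory constraint described above.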
