Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To mitigate both, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.