Affective computing seeks to support the holistic development of artificial intelligence by enabling machines to engage with human emotion. Recent foundation models, particularly large language models (LLMs), have been trained and evaluated on emotion-related tasks, typically using supervised learning with discrete emotion labels. Such evaluations largely focus on surface phenomena, such as recognizing expressed or evoked emotions, leaving open whether these systems reason about emotion in cognitively meaningful ways. Here we ask whether LLMs can reason about emotions through underlying cognitive dimensions rather than labels alone. Drawing on cognitive appraisal theory, we introduce CoRE, a large-scale benchmark designed to probe the implicit cognitive structures LLMs use when interpreting emotionally charged situations. We assess alignment with human appraisal patterns, internal consistency, cross-model generalization, and robustness to contextual variation. We find that LLMs capture systematic relations between cognitive appraisals and emotions but show misalignment with human judgments and instability across contexts.