Large Language Models (LLMs) have demonstrated remarkable abilities across numerous disciplines, primarily assessed through tasks in language generation, knowledge utilization, and complex reasoning. However, their alignment with human emotions and values, which is critical for real-world applications, has not been systematically evaluated. Here, we assessed LLMs' Emotional Intelligence (EI), encompassing emotion recognition, interpretation, and understanding, which is necessary for effective communication and social interaction. Specifically, we first developed a novel psychometric assessment focusing on Emotion Understanding (EU), a core component of EI, suitable for both humans and LLMs. This test requires evaluating complex emotions (e.g., surprised, joyful, puzzled, proud) in realistic scenarios (e.g., despite feeling that he had underperformed, John surprisingly achieved a top score). With a reference frame constructed from over 500 adults, we tested a variety of mainstream LLMs. Most achieved above-average emotional quotient (EQ) scores, with GPT-4 reaching an EQ of 117 and exceeding 89% of human participants. Interestingly, a multivariate pattern analysis revealed that some LLMs apparently did not rely on human-like mechanisms to achieve human-level performance, as their representational patterns were qualitatively distinct from those of humans. In addition, we discussed the impact of factors such as model size, training method, and architecture on LLMs' EQ. In summary, our study presents one of the first psychometric evaluations of the human-like characteristics of LLMs, which may shed light on the future development of LLMs aiming for both high intellectual and emotional intelligence. Project website: https://emotional-intelligence.github.io/
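To make the two reported quantities concrete, the sketch below shows one way a raw test score could be turned into a standardized EQ and into a "percentage of participants exceeded" figure relative to a human reference sample. It is only an illustration under stated assumptions: the abstract does not specify the scaling, so the mean-100, standard-deviation-15 convention (borrowed from deviation IQ scores), the function names `standardized_score` and `percent_exceeded`, and the toy reference values are all hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def standardized_score(raw, reference, center=100.0, spread=15.0):
    """Rescale a raw score against a reference sample.

    Assumption: a deviation-score convention (mean 100, SD 15),
    which the abstract does not confirm.
    """
    mu = mean(reference)
    sigma = stdev(reference)  # sample standard deviation of the reference group
    return center + spread * (raw - mu) / sigma


def percent_exceeded(raw, reference):
    """Percentage of reference participants scoring strictly below `raw`."""
    below = sum(1 for r in reference if r < raw)
    return 100.0 * below / len(reference)


if __name__ == "__main__":
    # Made-up raw scores standing in for a human reference sample.
    toy_reference = [40.0, 50.0, 60.0, 70.0]
    model_raw = 65.0  # hypothetical raw score for one model
    print(standardized_score(model_raw, toy_reference))
    print(percent_exceeded(model_raw, toy_reference))
```

The percentage here is computed empirically from the reference sample rather than from a normal curve, so a standardized score and its percentile need not match what a normal-distribution table would give.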