Large Language Models (LLMs) have demonstrated remarkable abilities across numerous disciplines, primarily assessed through tasks in language generation, knowledge utilization, and complex reasoning. However, their alignment with human emotions and values, which is critical for real-world applications, has not been systematically evaluated. Here, we assessed LLMs' Emotional Intelligence (EI), encompassing emotion recognition, interpretation, and understanding, which is necessary for effective communication and social interactions. Specifically, we first developed a novel psychometric assessment focusing on Emotion Understanding (EU), a core component of EI, suitable for both humans and LLMs. This test requires evaluating complex emotions (e.g., surprised, joyful, puzzled, proud) in realistic scenarios (e.g., despite feeling that he had underperformed, John surprisingly achieved a top score). With a reference frame constructed from over 500 adults, we tested a variety of mainstream LLMs. Most achieved above-average EQ scores, with GPT-4 exceeding 89% of human participants with an EQ of 117. Interestingly, a multivariate pattern analysis revealed that some LLMs apparently did not rely on human-like mechanisms to achieve human-level performance, as their representational patterns were qualitatively distinct from those of humans. In addition, we discussed the impact of factors such as model size, training method, and architecture on LLMs' EQ. In summary, our study presents one of the first psychometric evaluations of the human-like characteristics of LLMs, which may shed light on the future development of LLMs aiming for both high intellectual and emotional intelligence. Project website: https://emotional-intelligence.github.io/