Large Language Models (LLMs) have recently advanced rapidly and shown remarkable performance on a variety of downstream tasks. Led by powerful models such as ChatGPT and Claude, LLMs are revolutionizing how users engage with software, serving not merely as tools but as intelligent assistants. Consequently, evaluating LLMs' anthropomorphic capabilities is becoming increasingly important. Drawing on emotion appraisal theory from psychology, we propose to evaluate the empathy ability of LLMs, i.e., how their feelings change when they are presented with specific situations. After a careful and comprehensive survey, we collect a dataset of over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. Using the human evaluation results as a reference, we evaluate five LLMs, covering both commercial and open-source models of varying sizes and including the latest iterations, such as GPT-4 and LLaMA 2. The results show that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short of aligning with the emotional behaviors of human beings and cannot establish connections between similar situations. Our collected dataset of situations, the human evaluation results, and the code of our testing framework, dubbed EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench. We aspire to contribute to the advancement of LLMs toward better alignment with the emotional behaviors of human beings, thereby enhancing their utility and applicability as intelligent assistants.