Large Language Models (LLMs) have recently advanced considerably, showing remarkable performance on a variety of downstream tasks. Led by powerful models such as ChatGPT and Claude, LLMs are changing how users engage with software, acting not as mere tools but as intelligent assistants. Consequently, evaluating LLMs' anthropomorphic capabilities is becoming increasingly important. Drawing on emotion appraisal theory from psychology, we propose to evaluate the empathy ability of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset of over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. Using the human evaluation results as a reference, we evaluate five LLMs, covering both commercial and open-source models, a range of model sizes, and the latest iterations such as GPT-4 and LLaMA 2. The results show that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short of aligning with the emotional behaviors of human beings and cannot establish connections between similar situations. Our collected dataset of situations, the human evaluation results, and the code of our testing framework, dubbed EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench. We aspire to contribute to the advancement of LLMs toward better alignment with the emotional behaviors of human beings, thereby enhancing their utility and applicability as intelligent assistants.