With the emergence of large language models (LLMs), investigating whether they can surpass humans in areas such as emotion recognition and empathetic responding has become a focal point of research. This paper presents a comprehensive study exploring the empathetic responding capabilities of four state-of-the-art LLMs (GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8x7B-Instruct) in comparison to a human baseline. We engaged 1,000 participants in a between-subjects user study, assessing the empathetic quality of responses generated by humans and the four LLMs to 2,000 emotional dialogue prompts meticulously selected to cover a broad spectrum of 32 distinct positive and negative emotions. Our findings reveal that the empathetic responding capability of LLMs is statistically significantly superior to that of humans. GPT-4 emerged as the most empathetic, with an approximately 31% increase in responses rated as "Good" compared to the human benchmark. It was followed by LLaMA-2, Mixtral-8x7B, and Gemini-Pro, which showed increases of approximately 24%, 21%, and 10% in "Good" ratings, respectively. We further analyzed the response ratings at a finer granularity and discovered that some LLMs are significantly better at responding to specific emotions compared to others. The proposed evaluation framework offers a scalable and adaptable approach for assessing the empathy of new LLMs, without requiring future research to repeat this study.