Emotional Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the rapid development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a dedicated role-playing model, ESC-Role, which behaves more like a confused help-seeker than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We then perform comprehensive human annotation on the interactive multi-turn dialogues produced by the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but they still lag behind human performance. Moreover, to automate scoring for future ESC models, we develop ESC-RANK, trained on the annotated data, which surpasses the scoring performance of GPT-4 by 35 points. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.