Emotional Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the rapid development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a dedicated role-playing model, ESC-Role, which behaves more like a confused help-seeker than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We then perform comprehensive human annotation on the interactive multi-turn dialogues produced by the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but they still lag behind human performance. Moreover, to automate scoring for future ESC models, we develop ESC-RANK, trained on the annotated data, which surpasses the scoring performance of GPT-4 by 35 points. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.