The advent of large language models (LLMs) has recently revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) have attracted considerable attention for their ability to engage users emotionally. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was constructed in three stages: initial dialogue extraction with GPT-4, rigorous human-led quality control, and enrichment with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach comprising thirteen targeted metrics across four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. The source code, data source, and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.
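To give a sense of scale, the dataset figures quoted in the abstract can be turned into rough per-dialogue and per-character averages. The snippet below is only an illustrative sketch: the three counts come from the abstract, while the constant names and the `per_unit` helper are hypothetical and not part of the CharacterEval code base. The averages say nothing about how examples are actually distributed across characters.

```python
# Counts quoted in the abstract.
NUM_DIALOGUES = 1_785    # multi-turn role-playing dialogues
NUM_EXAMPLES = 23_020    # examples drawn from those dialogues
NUM_CHARACTERS = 77      # characters from Chinese novels and scripts


def per_unit(total: int, units: int) -> float:
    """Hypothetical helper: mean of `total` over `units`, rounded to one decimal."""
    return round(total / units, 1)


examples_per_dialogue = per_unit(NUM_EXAMPLES, NUM_DIALOGUES)
dialogues_per_character = per_unit(NUM_DIALOGUES, NUM_CHARACTERS)
examples_per_character = per_unit(NUM_EXAMPLES, NUM_CHARACTERS)

print(f"examples per dialogue:   {examples_per_dialogue}")
print(f"dialogues per character: {dialogues_per_character}")
print(f"examples per character:  {examples_per_character}")
```

On average, then, each dialogue contributes roughly 13 examples and each character appears in roughly 23 dialogues.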