The rapid evolution of large language models (LLMs) necessitates effective benchmarks for evaluating their role knowledge, which is essential for establishing connections with the real world and providing more immersive interactions. This paper introduces RoleEval, a bilingual benchmark designed to assess how well models memorize, use, and reason over role knowledge. RoleEval comprises RoleEval-Global (internationally recognized characters) and RoleEval-Chinese (characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions on 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. The questions cover both basic knowledge and multi-hop reasoning, and systematically probe aspects of each character such as personal information, relationships, abilities, and experiences. To maintain high standards, we apply a hybrid quality-check process that combines automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative. We evaluate a range of open-source and proprietary LLMs on RoleEval under both zero-shot and few-shot settings. Notably, while GPT-4 outperforms other models on RoleEval-Global, Chinese LLMs excel on RoleEval-Chinese, highlighting significant differences in knowledge distribution. We expect that RoleEval will highlight the significance of assessing role knowledge for foundation models across various languages and cultural settings.