Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that an agent's proficiency at the individual level does not imply proficiency at the group level. Moreover, the behavior of individual agents may drift under the influence of other agents within the group. Experimental results on RoleInteract confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/RoleInteract.