Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that an agent that excels at the individual level is not necessarily proficient at the group level. Moreover, the behavior of an individual agent may drift under the influence of other agents within the group. Experimental results on RoleInteract confirm its value as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/RoleInteract.