Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents that excel at the individual level are not necessarily proficient at the group level. Moreover, the behavior of an individual agent may drift under the influence of other agents within the group. Experimental results on RoleInteract confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/RoleInteract.