Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce SocialBench, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents excelling at the individual level do not necessarily demonstrate proficiency at the group level. Moreover, the behavior of individuals may drift under the influence of other agents within the group. Experimental results on SocialBench confirm its significance as a testbed for assessing the social interaction abilities of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/SocialBench.