The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks face limitations such as poor generalizability, implicit and inaccurate judgments, and excessive context length. For example, HPD incorporates manually scored character relationships into the context and asks LLMs to rank response coherence, while SocialBench embeds LLM-generated character profiles into the context of multiple-choice tasks to assess character preferences. To address these issues, we propose an automatic, scalable, and generalizable evaluation paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph, exploit the RPA's inherent hallucination tendency by prompting it to interact across roles, employ ChatGPT for stance detection, and define relationship hallucination together with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore the factors that influence these metrics and discuss the trade-off between relationship hallucination and factuality.
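To make the pipeline concrete, below is a minimal sketch of the evaluation loop implied by the abstract: relations are extracted from a knowledge graph, the RPA is prompted to interact across roles, and an LLM judge performs stance detection on each relation claim. All names here (`extract_relations`, `role_play`, `detect_stance`, the `"contradict"` label, and the ratio-style metric) are hypothetical placeholders, since the abstract does not specify an implementation or the exact metric definitions.

```python
# Hypothetical sketch of the relationship-hallucination evaluation loop.
# The RPA and judge objects stand in for any LLM-backed role-playing agent
# and any stance-detection model (e.g., ChatGPT); their method names are
# assumptions, not the paper's actual API.

def extract_relations(knowledge_graph):
    """Yield (subject, relation, object) triples from a general knowledge graph."""
    for subj, rel, obj in knowledge_graph:
        yield subj, rel, obj

def evaluate_relationship_hallucination(knowledge_graph, rpa, judge):
    """Estimate how often the RPA asserts relations that contradict the graph."""
    hallucinated, total = 0, 0
    for subj, rel, obj in extract_relations(knowledge_graph):
        # Prompt the RPA, playing `subj`, to interact with `obj`; its
        # hallucination tendency surfaces unsupported relation claims.
        dialogue = rpa.role_play(speaker=subj, interlocutor=obj)
        # Use an LLM judge for stance detection on the relation claim.
        stance = judge.detect_stance(dialogue, claim=(subj, rel, obj))
        total += 1
        if stance == "contradict":
            hallucinated += 1
    # One plausible metric: the fraction of relations the RPA hallucinates.
    return hallucinated / total if total else 0.0
```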