FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. As the first benchmark builder in the RP area designed for adaptable assessment, it enables evaluation of arbitrary characters across diverse scenarios and prompt formats. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and a preliminary separability analysis support our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, scores on established characters are consistently higher than scores on synthesized ones, and reasoning capabilities further amplify this disparity. Interestingly, we observe that increasing model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability across all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
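
To make the notion of a Pareto frontier between RP performance and reliability concrete, the following minimal sketch shows one way such a frontier could be computed from per-model scores. It is an illustration only and not the paper's procedure: the `ModelScore` fields, the use of a hallucination rate as the reliability measure, and the placeholder model names and numbers are all assumptions, not reported results.

```python
from typing import List, NamedTuple


class ModelScore(NamedTuple):
    # All fields are hypothetical; the abstract does not define a score format.
    name: str
    rp_performance: float      # higher is better
    hallucination_rate: float  # lower is better (assumed proxy for reliability)


def dominates(a: ModelScore, b: ModelScore) -> bool:
    """Return True if `a` is at least as good as `b` on both axes
    and strictly better on at least one."""
    no_worse = (a.rp_performance >= b.rp_performance
                and a.hallucination_rate <= b.hallucination_rate)
    strictly_better = (a.rp_performance > b.rp_performance
                       or a.hallucination_rate < b.hallucination_rate)
    return no_worse and strictly_better


def pareto_frontier(scores: List[ModelScore]) -> List[ModelScore]:
    """Keep only the models that no other model dominates,
    ordered by increasing RP performance."""
    frontier = [s for s in scores
                if not any(dominates(other, s) for other in scores)]
    return sorted(frontier, key=lambda s: s.rp_performance)


# Placeholder values for illustration only, not measurements from the paper.
scores = [
    ModelScore("model_a", 0.60, 0.10),
    ModelScore("model_b", 0.70, 0.20),
    ModelScore("model_c", 0.65, 0.25),  # dominated by model_b on both axes
    ModelScore("model_d", 0.80, 0.30),
]
frontier_names = [s.name for s in pareto_frontier(scores)]
print(frontier_names)
```

In this toy data, moving along the frontier toward higher RP performance also raises the hallucination rate, which is the shape of trade-off the abstract describes; a model that lies off the frontier is worse than some other model on both axes.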
