Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and offer no analysis of the underlying capability requirements. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short in capabilities such as syntactic style and memory recall. Consequently, the average performance of LLMs remains considerably below the human baseline.