Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack a systematic evaluation framework, and offer little analysis of the capabilities such simulation requires. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short in capabilities such as syntactic style and memory recall. Consequently, the average performance of LLMs remains considerably below the human baseline.