Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with a large-scale, interview-grounded evaluation framework for personality simulation. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts spanning 1,000 public personalities, averaging 11.5 hours of interview content per person. Our multi-dimensional framework comprises four complementary metrics that measure content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model's parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronology-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings offer actionable guidance for advancing personality simulation research.
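To make the retrieval-versus-chronological trade-off concrete, here is a minimal sketch, not the authors' implementation: all names are hypothetical, the bag-of-words retriever is a toy stand-in for whatever retriever the paper actually uses, and the data is invented. It contrasts selecting the interview exchanges most similar to an incoming question against packing exchanges in interview order.

```python
from collections import Counter
import math

def _vec(text):
    # Toy bag-of-words vector over lowercase tokens (stand-in for a real retriever).
    return Counter(text.lower().split())

def _cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieval_context(qa_pairs, question, k=3):
    # Retrieval-augmented: pick the k interview exchanges most similar to the
    # incoming question; per the abstract, this favors personality style and
    # response quality.
    q = _vec(question)
    ranked = sorted(qa_pairs,
                    key=lambda p: _cosine(q, _vec(p["q"] + " " + p["a"])),
                    reverse=True)
    return ranked[:k]

def chronological_context(qa_pairs, k=3):
    # Chronological: keep exchanges in interview order regardless of the question;
    # per the abstract, this better preserves factual consistency and knowledge
    # retention.
    return sorted(qa_pairs, key=lambda p: p["date"])[:k]

# Invented example data for illustration only.
qa_pairs = [
    {"date": "2015-03-01", "q": "Where did you grow up?", "a": "A small coastal town."},
    {"date": "2018-07-12", "q": "What drives your work?", "a": "Curiosity, mostly."},
    {"date": "2021-11-05", "q": "How do you handle criticism?", "a": "I try to listen first."},
]

question = "What motivates you in your work?"
print(retrieval_context(qa_pairs, question, k=1))      # most similar exchange
print(chronological_context(qa_pairs, k=1))            # earliest exchange
```

Either selection would then be serialized into the simulation prompt; the sketch only illustrates why the two strategies surface different evidence for the same question.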