Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants' real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.
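
The recovery metric described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes recovery is summarised as the mean per-trait Pearson correlation between participants' measured scores and the scores an independent scorer infers from the narratives, and that the "fraction of the human ceiling" is that mean divided by a test-retest reliability value. The function names, trait labels, toy scores, and the ceiling value are all hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def round_trip_recovery(true_scores, recovered_scores, ceiling):
    """Return per-trait r, the mean r, and the mean r as a fraction of `ceiling`.

    true_scores / recovered_scores: dict mapping trait name -> list of scores,
    one entry per participant, in the same participant order.
    ceiling: assumed human test-retest reliability (hypothetical input).
    """
    per_trait = {
        trait: pearson_r(true_scores[trait], recovered_scores[trait])
        for trait in true_scores
    }
    mean_r = sum(per_trait.values()) / len(per_trait)
    return per_trait, mean_r, mean_r / ceiling


# Toy data for five hypothetical participants (not from the study).
true_scores = {
    "extraversion": [1.0, 2.0, 3.0, 4.0, 5.0],
    "neuroticism": [1.0, 2.0, 3.0, 4.0, 5.0],
}
recovered_scores = {
    "extraversion": [1.5, 2.5, 3.5, 4.5, 5.5],  # shifted but perfectly ordered
    "neuroticism": [1.0, 3.0, 2.0, 5.0, 4.0],   # two adjacent pairs swapped
}

# The ceiling below is an illustrative placeholder, not a reported value.
per_trait, mean_r, fraction_of_ceiling = round_trip_recovery(
    true_scores, recovered_scores, ceiling=0.95
)
print(per_trait, mean_r, fraction_of_ceiling)
```

Note that a constant offset in the recovered scores, as in the toy extraversion data, leaves r unchanged. Correlation alone therefore cannot reveal systematic scorer bias, which is presumably why the bias decomposition is reported as a separate analysis.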
