Machine learning can predict human behavior well when substantial structured data and well-defined outcomes are available, but these models are typically limited to specific outcomes and cannot readily be applied to new domains. We test whether large language models (LLMs) can support a more general-purpose approach by building person-specific simulations (i.e., "generative agents") grounded in self-report data. Using data from a diverse national sample of 1,052 Americans, we build agents from (i) two-hour, semi-structured interviews (elicited using the American Voices Project interview schedule), (ii) structured surveys (the General Social Survey and Big Five personality inventory), or (iii) both sources combined. On held-out General Social Survey items, agent accuracy reached 83% (interview only), 82% (surveys only), and 86% (combined) of participants' two-week test-retest consistency, compared with 74% for agents prompted only with individuals' demographics. Agents predicted personality traits and behaviors in experiments with similar accuracy, and reduced disparities in accuracy across racial and ideological groups relative to demographics-only baselines. Together, these results show that LLM agents grounded in rich qualitative or quantitative self-report data can support general-purpose simulation of individuals across outcomes, without requiring task-specific training data.
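The percentages above are reported relative to participants' own two-week test-retest consistency rather than as raw agreement. A minimal sketch of one way to read that normalization follows, assuming it is the agent's raw agreement with a participant's first-wave answers divided by the participant's own agreement between waves. The function names and the example responses are hypothetical illustrations, not taken from the study.

```python
from typing import Sequence


def raw_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of items on which two response sequences agree."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def normalized_accuracy(
    agent: Sequence[str],
    wave1: Sequence[str],
    wave2: Sequence[str],
) -> float:
    """Agent accuracy as a share of the participant's own consistency.

    Assumption: normalization divides the agent's agreement with the
    first-wave answers by the participant's wave-2 agreement with wave 1.
    """
    agent_acc = raw_accuracy(agent, wave1)
    self_consistency = raw_accuracy(wave2, wave1)
    if self_consistency == 0:
        raise ValueError("participant consistency is zero; ratio undefined")
    return agent_acc / self_consistency


# Hypothetical responses to five survey items (illustration only).
wave1 = ["agree", "disagree", "agree", "neutral", "agree"]
wave2 = ["agree", "disagree", "neutral", "neutral", "agree"]
agent = ["agree", "disagree", "agree", "agree", "disagree"]

score = normalized_accuracy(agent, wave1, wave2)
print(f"normalized accuracy: {score:.2f}")
```

Under this reading, a normalized score of 1.0 would mean the agent reproduces a participant's answers as well as the participant does two weeks later, which is why the ratio can be more informative than raw agreement when people's own answers drift over time.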