The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. A central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising, given that inferring underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We use two kinds of persona descriptions: generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We find that responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, whereas the data from both LLMs using specific demographic profiles show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.