Artificial Intelligence in Psychology Research

From arXiv. 28 pages, 2 visualizations (1 table and 1 figure). The preregistered OSF database is available at https://osf.io/dzp8t/?view_only=45fff3953884443d81b628cdd5d50f7a

Large Language Models have grown vastly in capability. One potential application of such AI systems is to support data collection in the social sciences, where perfect experimental control is currently infeasible and collecting large, representative datasets is generally expensive. In this paper, we re-replicate 14 studies from the Many Labs 2 replication project (Klein et al., 2018) with OpenAI's text-davinci-003 model, colloquially known as GPT3.5. For the 10 studies that we could analyse, we collected a total of 10,136 responses, each obtained by running GPT3.5 with the corresponding study's survey supplied as text input. We find that our GPT3.5-based sample replicates 30% of the original results as well as 30% of the Many Labs 2 results, although the two sets of replicated studies do not fully coincide: we replicate some original findings that Many Labs 2 did not, and vice versa. We also find that, unlike the corresponding human subjects, GPT3.5 answered some survey questions with extreme homogeneity (zero variation across different runs' responses), raising concerns that a hypothetical AI-led future may in certain ways be subject to a diminished diversity of thought. Overall, while our results suggest that Large Language Model psychology studies are feasible, their findings should not be assumed to generalise straightforwardly to the human case. Nevertheless, AI-based data collection may eventually become a viable and economically relevant method in the empirical social sciences, which makes understanding its capabilities and applications important.
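To make the two headline quantities concrete, here is a minimal Python sketch of how a replication rate and the "zero variation" check on repeated model runs could be computed. It is an illustration under stated assumptions, not the authors' analysis pipeline: the function names `response_summary` and `replication_rate`, the item names, and all example data are hypothetical.

```python
from collections import Counter


def response_summary(responses):
    """Summarise repeated answers to a single survey item.

    `responses` is a non-empty list of answers, one per model run.
    """
    counts = Counter(responses)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "n": len(responses),
        "distinct": len(counts),
        "modal_answer": modal_answer,
        "modal_share": modal_count / len(responses),
        # True when every run gave the same answer (zero variation).
        "homogeneous": len(counts) == 1,
    }


def replication_rate(outcomes):
    """Share of analysed studies whose result was replicated.

    `outcomes` is a non-empty list of booleans, one per study.
    """
    return sum(outcomes) / len(outcomes)


# Hypothetical data, invented for illustration only.
runs = {
    "item_a": ["yes"] * 5,                       # identical answer in every run
    "item_b": ["yes", "no", "yes", "no", "no"],  # answers vary across runs
}
summaries = {item: response_summary(answers) for item, answers in runs.items()}

# Three replicated studies out of ten analysed corresponds to the 30% figure.
rate = replication_rate([True] * 3 + [False] * 7)

for item, summary in summaries.items():
    print(item, summary)
print(f"replication rate: {rate:.0%}")
```

An item flagged as `homogeneous` has no variance to analyse, so it needs to be identified before any inferential test is run on that item.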
