We introduce a new type of test, called a Turing Experiment (TE), for evaluating how well a language model, such as GPT-3, can simulate different aspects of human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We present TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use by comparing how well different language models reproduce classic economic, psycholinguistic, and social psychology experiments: the Ultimatum Game, Garden Path Sentences, the Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models.