Surveys have recently gained popularity as a tool to study large language models. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, a tendency to choose survey responses labeled with the letter "A". Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. As a result, in contrast to conjectures from prior work, survey-derived alignment measures often permit a simple explanation: models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform for any survey under consideration.
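The debiasing idea described above can be illustrated with a minimal simulation. This sketch is not the paper's actual evaluation pipeline; it replaces a real language model with a hypothetical `biased_model_choice` function that prefers the label "A" regardless of content, and shows how randomizing the answer order before each query lets aggregation over answer *contents* (rather than labels) expose the underlying response distribution:

```python
import random
from collections import Counter

def biased_model_choice(labels, bias_label="A", bias_strength=0.7):
    # Hypothetical stand-in for an LLM with a labeling bias: with
    # probability bias_strength it picks the option labeled "A",
    # regardless of what content that label is attached to.
    if random.random() < bias_strength:
        return bias_label
    return random.choice(labels)

def survey(options, n_trials=10_000, randomize=False):
    # Ask the same multiple-choice question n_trials times and count
    # how often each answer *content* (not label) is selected.
    labels = [chr(ord("A") + i) for i in range(len(options))]
    counts = Counter()
    for _ in range(n_trials):
        order = list(options)
        if randomize:
            random.shuffle(order)  # randomized answer ordering
        picked_label = biased_model_choice(labels)
        counts[order[labels.index(picked_label)]] += 1
    return counts

random.seed(0)
options = ["agree", "neutral", "disagree"]
fixed = survey(options, randomize=False)    # label bias inflates the first option
shuffled = survey(options, randomize=True)  # counts over contents become near-uniform
```

With a fixed ordering, the content that happens to sit under label "A" absorbs the bias; with randomized ordering, the bias spreads evenly across contents, and a model with no genuine preference yields the near-uniform distribution the abstract describes.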