Surveys have recently gained popularity as a tool to study large language models. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, a tendency to choose survey responses labeled with the letter "A". Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. As a result, in contrast to conjectures from prior work, survey-derived alignment measures often permit a simple explanation: models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform for any survey under consideration.
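The debiasing idea described above can be illustrated with a minimal simulation. This sketch is not the paper's actual evaluation pipeline; it replaces a real language model with a hypothetical `biased_model_choice` function that prefers the label "A" regardless of content, and shows how randomizing the answer order before each query lets aggregation over answer *contents* (rather than labels) expose the underlying response distribution:

```python
import random
from collections import Counter

def biased_model_choice(labels, bias_label="A", bias_strength=0.7):
    # Hypothetical stand-in for an LLM with a labeling bias: with
    # probability bias_strength it picks the option labeled "A",
    # regardless of what content that label is attached to.
    if random.random() < bias_strength:
        return bias_label
    return random.choice(labels)

def survey(options, n_trials=10_000, randomize=False):
    # Ask the same multiple-choice question n_trials times and count
    # how often each answer *content* (not label) is selected.
    labels = [chr(ord("A") + i) for i in range(len(options))]
    counts = Counter()
    for _ in range(n_trials):
        order = list(options)
        if randomize:
            random.shuffle(order)  # randomized answer ordering
        picked_label = biased_model_choice(labels)
        counts[order[labels.index(picked_label)]] += 1
    return counts

random.seed(0)
options = ["agree", "neutral", "disagree"]
fixed = survey(options, randomize=False)    # label bias inflates the first option
shuffled = survey(options, randomize=True)  # counts over contents become near-uniform
```

With a fixed ordering, the content that happens to sit under label "A" absorbs the bias; with randomized ordering, the bias spreads evenly across contents, and a model with no genuine preference yields the near-uniform distribution the abstract describes.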