Many in-silico simulations of human survey responses with large language models (LLMs) focus on generating closed-ended survey responses, whereas LLMs are typically trained to generate open-ended text. Previous research has used a wide range of methods for generating closed-ended survey responses with LLMs, and no standard practice has yet emerged. In this paper, we systematically investigate the impact that various Survey Response Generation Methods have on predicted survey responses. We present the results of 32 million simulated survey responses across 8 Survey Response Generation Methods, 4 political attitude surveys, and 10 open-weight language models. We find significant differences between the Survey Response Generation Methods in both individual-level and subpopulation-level alignment. Our results show that Restricted Generation Methods perform best overall and that reasoning output does not consistently improve alignment. Our work underlines the substantial impact that Survey Response Generation Methods have on simulated survey responses, and we offer practical recommendations for applying Survey Response Generation Methods.