How can researchers identify beliefs that large language models (LLMs) hide? As LLMs become more sophisticated and the prevalence of alignment faking increases, combined with their growing integration into high-stakes decision-making, responding to this challenge has become critical. This paper proposes that a list experiment, a simple method widely used in the social sciences, can be applied to study the hidden beliefs of LLMs. List experiments were originally developed to circumvent social desirability bias in human respondents, which closely parallels alignment faking in LLMs. The paper implements a list experiment on models developed by Anthropic, Google, and OpenAI and finds hidden approval of mass surveillance across all models, as well as some approval of torture, discrimination, and first nuclear strike. Importantly, a placebo treatment produces a null result, validating the method. The paper then compares list experiments with direct questioning and discusses the utility of the approach.
翻译:研究人员如何识别大型语言模型(LLM)所隐藏的信念?随着LLM日益复杂化、对齐伪装现象的普遍化,以及它们在高风险决策中日益深入的应用,应对这一挑战已变得至关重要。本文提出,列表实验这一社会科学领域广泛使用的简单方法,可应用于研究LLM的隐藏信念。列表实验最初是为规避人类受访者的社会期望偏差而设计的,这与LLM中的对齐伪装现象高度相似。本文在Anthropic、Google和OpenAI开发的模型上实施了列表实验,发现所有模型均隐藏着对大规模监控的认可,以及对酷刑、歧视和首次核打击的某种程度的认可。重要的是,安慰剂处理产生了零结果,验证了该方法的有效性。本文随后将列表实验与直接提问法进行了比较,并讨论了该方法的实用价值。