Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.
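To make the sample-size tradeoff concrete, the following is a minimal sketch and not the procedure developed in this work. It assumes a set of calibration questions for which the human population mean is known, builds a normal-approximation interval from the first k simulated responses, and picks the largest candidate k whose empirical coverage across those questions reaches 1 - alpha. The function names (`mean_ci`, `empirical_coverage`, `select_sample_size`), the selection rule, and the synthetic data in the demo are all illustrative assumptions.

```python
import math
import random


def mean_ci(samples, z=1.96):
    """Normal-approximation confidence interval for the mean of `samples`."""
    n = len(samples)
    m = sum(samples) / n
    # Unbiased sample variance; a single sample gives a degenerate interval.
    var = sum((x - m) ** 2 for x in samples) / (n - 1) if n > 1 else 0.0
    half = z * math.sqrt(var / n)
    return m - half, m + half


def empirical_coverage(sim_by_question, human_means, k, z=1.96):
    """Fraction of calibration questions whose (assumed known) human mean
    lies inside the interval built from the first k simulated responses."""
    hits = 0
    for sims, target in zip(sim_by_question, human_means):
        lo, hi = mean_ci(sims[:k], z)
        hits += lo <= target <= hi
    return hits / len(human_means)


def select_sample_size(sim_by_question, human_means, candidates, alpha=0.05, z=1.96):
    """Hypothetical rule: largest candidate k with empirical coverage of at
    least 1 - alpha; falls back to the smallest candidate if none qualifies."""
    ok = [k for k in candidates
          if empirical_coverage(sim_by_question, human_means, k, z) >= 1 - alpha]
    return max(ok) if ok else min(candidates)


if __name__ == "__main__":
    # Synthetic illustration only: each question has a human mean `mu`, and the
    # simulated responses are centered at `mu + bias` to mimic misalignment.
    rng = random.Random(0)
    human_means, sims = [], []
    for _ in range(200):
        mu = rng.uniform(-1, 1)
        bias = rng.gauss(0, 0.2)  # assumed misalignment scale
        human_means.append(mu)
        sims.append([rng.gauss(mu + bias, 1.0) for _ in range(400)])
    candidates = [5, 10, 25, 50, 100, 200, 400]
    for k in candidates:
        print(k, round(empirical_coverage(sims, human_means, k), 3))
    print("selected k:", select_sample_size(sims, human_means, candidates, alpha=0.1))
```

In this toy setup, larger k shrinks the interval around the simulated mean rather than the human mean, so coverage of the human mean tends to fall as k grows. That is the behavior a selected sample size is meant to guard against.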