Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.
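The trade-off in the number of simulated responses can be illustrated with a minimal sketch. It is not the paper's procedure: the selection rule (take the largest sample size whose intervals cover the human value on at least a 1 - α fraction of calibration questions), the Hoeffding interval, the binary responses, the synthetic data, and every function name below are assumptions made for illustration. The human means are also treated as known, which real survey data would only approximate.

```python
import math
import random


def hoeffding_interval(samples, alpha=0.05):
    """Hoeffding confidence interval for the mean of samples in [0, 1]."""
    k = len(samples)
    mean = sum(samples) / k
    half = math.sqrt(math.log(2 / alpha) / (2 * k))
    return mean - half, mean + half


def select_sample_size(sim_responses, human_means, alpha=0.05):
    """Hypothetical rule: return the largest k such that intervals built from
    the first k simulated responses cover the human mean on at least a
    1 - alpha fraction of calibration questions (0 if no k qualifies)."""
    max_k = min(len(s) for s in sim_responses)
    best_k = 0
    for k in range(1, max_k + 1):
        covered = 0
        for sims, theta in zip(sim_responses, human_means):
            lo, hi = hoeffding_interval(sims[:k], alpha)
            covered += lo <= theta <= hi
        if covered / len(human_means) >= 1 - alpha:
            best_k = k
    return best_k


def simulate_llm(human_means, bias, n_sim, rng):
    """Synthetic stand-in for an LLM: binary responses whose mean is shifted
    from the human mean by a fixed bias (an assumed misalignment model)."""
    return [
        [1 if rng.random() < theta + bias else 0 for _ in range(n_sim)]
        for theta in human_means
    ]


rng = random.Random(0)
human_means = [rng.uniform(0.3, 0.7) for _ in range(100)]  # calibration questions
aligned = simulate_llm(human_means, bias=0.0, n_sim=300, rng=rng)
biased = simulate_llm(human_means, bias=0.15, n_sim=300, rng=rng)

k_aligned = select_sample_size(aligned, human_means)
k_biased = select_sample_size(biased, human_means)
print("selected k (aligned simulator):", k_aligned)
print("selected k (biased simulator): ", k_biased)
```

In this toy setting the biased simulator should be assigned a much smaller sample size than the aligned one, since its intervals stop covering the human means once they become narrower than the bias. This mirrors the reading of the selected sample size as a measure of simulation fidelity.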