Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes recovers the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified, and we provide diagnostics that can falsify surrogacy on historical experiments together with a bound on the worst-case bias from limited overlap. We further show that the stochasticity inherent to LLMs introduces both bias and variance, but using an average of multiple draws as the surrogate mitigates both. We illustrate the methods and theory in simulations and an application to A/B tests on Upworthy headlines. A central takeaway from our work is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and never verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.
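The claim that averaging several LLM draws reduces both bias and variance can be illustrated with a toy simulation. This is a minimal sketch, not the paper's estimator or data. Everything in it is an assumption made for illustration: the data-generating process (a latent score `mu` that carries the whole treatment effect, so surrogacy holds by construction), the linear calibration fitted with `np.polyfit`, and the names `simulate`, `calibrated_ate`, `delta`, and `sigma`. In this toy model, noise in a single LLM draw attenuates the calibration slope by the factor Var(mu) / (Var(mu) + sigma^2 / K), so the calibrated effect estimate is biased toward zero when K is small.

```python
import numpy as np


def simulate(n, k, delta, sigma, rng):
    """Toy experiment (assumed model): treatment W, human outcome Y, K-draw LLM surrogate S."""
    w = rng.integers(0, 2, size=n)                    # random 0/1 treatment assignment
    mu = rng.normal(size=n) + delta * w               # latent score; treatment shifts it by delta
    y = 2.0 * mu + rng.normal(scale=0.5, size=n)      # human outcome depends on W only through mu
    draws = mu[:, None] + rng.normal(scale=sigma, size=(n, k))  # K noisy LLM draws per unit
    s = draws.mean(axis=1)                            # surrogate = average of the K draws
    return w, y, s


def calibrated_ate(k, rng, n=20_000, delta=0.3, sigma=1.0):
    """Calibrate S to Y on a 'historical' experiment, then estimate the ATE on a new one."""
    # Historical experiment: both human and LLM outcomes are observed.
    _, y_hist, s_hist = simulate(n, k, delta, sigma, rng)
    slope, intercept = np.polyfit(s_hist, y_hist, deg=1)  # linear calibration of Y on S

    # New experiment: only LLM outcomes are used; human outcomes are discarded.
    w_new, _, s_new = simulate(n, k, delta, sigma, rng)
    y_hat = intercept + slope * s_new
    return y_hat[w_new == 1].mean() - y_hat[w_new == 0].mean()


rng = np.random.default_rng(0)
true_ate = 2.0 * 0.3  # human outcome slope (2.0) times the treatment shift delta (0.3)
estimates = {k: calibrated_ate(k, rng) for k in (1, 5, 50)}
for k, est in estimates.items():
    print(f"K={k:>2}: calibrated estimate={est:.3f}  (true ATE={true_ate:.3f})")
```

Under these assumed parameters the single-draw estimate is attenuated to roughly half of the true effect, and the estimate moves toward the true value as K grows. The sketch says nothing about settings where surrogacy or comparability fail, which is where the abstract notes that the effect is only partially identified.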