Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes recovers the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified, and we provide diagnostics that can falsify surrogacy on historical experiments together with a bound on the worst-case bias from limited overlap. We further show that the stochasticity inherent to LLMs introduces both bias and variance, but using an average of multiple draws as the surrogate mitigates both. We illustrate the methods and theory in simulations and an application to A/B tests on Upworthy headlines. A central takeaway from our work is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and never verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.
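The claim that averaging multiple LLM draws mitigates both bias and variance can be illustrated with a toy simulation. The sketch below is not the paper's estimator. It assumes a hypothetical linear model in which each LLM draw equals a latent signal plus independent noise, and the human outcome is regressed on the surrogate (a classical measurement-error setting). The function name `simulate_slope` and all parameter values are illustrative assumptions.

```python
import random
import statistics


def simulate_slope(k_draws, n_items=4000, noise_sd=1.0, seed=0):
    """OLS slope of a human outcome on the mean of k noisy LLM draws.

    Assumed toy model (not taken from the paper):
      latent signal   z ~ N(0, 1)
      human outcome   y = z + N(0, 0.1^2)
      each LLM draw     = z + N(0, noise_sd^2)
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_items):
        z = rng.gauss(0.0, 1.0)
        y = z + rng.gauss(0.0, 0.1)
        # Surrogate: average of k independent stochastic LLM draws.
        draws = [z + rng.gauss(0.0, noise_sd) for _ in range(k_draws)]
        xs.append(statistics.fmean(draws))
        ys.append(y)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(simulate_slope(k), 3))
```

Under this assumed model the calibration slope converges to 1 / (1 + noise_sd^2 / k), so a single draw attenuates the slope to about one half while larger k moves it toward 1. The noise variance of the averaged surrogate also shrinks as noise_sd^2 / k.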

