Large Language Models (LLMs) have demonstrated impressive potential to simulate human behavior. Using a causal inference framework, we empirically and theoretically analyze the challenges of conducting LLM-simulated experiments and explore potential solutions. In the context of demand estimation, we show that variations in the treatment included in the prompt (e.g., the price of the focal product) can cause variations in unspecified confounding factors (e.g., competitors' prices, historical prices, outside temperature), introducing endogeneity and yielding implausibly flat demand curves. We propose a theoretical framework suggesting that this endogeneity issue generalizes to other contexts and will not be fully resolved merely by improving the training data. Unlike real experiments, in which researchers assign pre-existing units to conditions, LLMs simulate units based on the entire prompt, which includes the description of the treatment. Therefore, because of associations in the training data, the characteristics of the individuals and environments simulated by the LLM can be affected by the treatment assignment. We explore two potential solutions. The first specifies all contextual variables that affect both treatment and outcome, which we demonstrate to be challenging for a general-purpose LLM. The second explicitly specifies the source of treatment variation in the prompt given to the LLM (e.g., by informing the LLM that the store is running an experiment). Although this approach only allows the estimation of a conditional average treatment effect that depends on the specific experimental design, it provides valuable directional results for exploratory analysis.
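The endogeneity mechanism described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's method or data: it assumes a hypothetical linear demand model, made-up coefficients `b` (own-price) and `c` (competitor-price), and an invented rule in which the unspecified competitor price drifts with the focal price, standing in for associations an LLM might have absorbed from training data.

```python
import random


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


def estimated_price_slope(confounded, n=20000, seed=0):
    """Regress simulated quantity on focal price and return the slope.

    All numbers here are illustrative assumptions, not estimates from the paper.
    """
    rng = random.Random(seed)
    b, c = 2.0, 1.5  # hypothetical own-price and competitor-price coefficients
    prices, quantities = [], []
    for _ in range(n):
        price = rng.uniform(1.0, 5.0)  # treatment stated in the prompt
        if confounded:
            # Unspecified context moves with the treatment (assumed association).
            competitor_price = price + rng.gauss(0.0, 0.3)
        else:
            # Randomized benchmark: context is independent of the treatment.
            competitor_price = rng.uniform(1.0, 5.0)
        quantity = 10.0 - b * price + c * competitor_price + rng.gauss(0.0, 0.5)
        prices.append(price)
        quantities.append(quantity)
    return ols_slope(prices, quantities)


slope_randomized = estimated_price_slope(confounded=False)
slope_confounded = estimated_price_slope(confounded=True)
print(f"randomized slope: {slope_randomized:.2f}")
print(f"confounded slope: {slope_confounded:.2f}")
```

In this toy setup the randomized slope lands near `-b`, while the confounded slope lands near `-b + c`, which is a much flatter demand curve. The point is only directional: when an unspecified factor co-moves with the treatment, the naive estimate absorbs that factor's effect.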