Knowing the effect of an intervention is critical for human decision-making, but current approaches to causal effect estimation rely on manual data collection and structuring, regardless of the causal assumptions. This increases both the cost and the time-to-completion of studies. We show how large, diverse observational text data can be mined with large language models (LLMs) to produce inexpensive causal effect estimates under appropriate causal assumptions. We introduce NATURAL, a novel family of causal effect estimators built with LLMs that operate over datasets of unstructured text. Our estimators use LLM conditional distributions (over variables of interest, given the text data) to assist in the computation of classical estimators of causal effect. We overcome a number of technical challenges to realize this idea, such as automating data curation and using LLMs to impute missing information. We prepare six observational datasets (two synthetic and four real), each paired with corresponding ground truth in the form of randomized trials, which we use to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. Our results suggest that unstructured text data is a rich source of causal effect information, and that NATURAL is a first step towards an automated pipeline to tap this resource.
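To make the idea of a "classical estimator" fed by text-derived variables concrete, the following is a minimal sketch of a self-normalized inverse propensity weighting (IPW) estimate of an average treatment effect. It is an illustration under stated assumptions, not the NATURAL estimators themselves: the `Record` type, its fields, and the `ipw_ate` function are hypothetical names, and the sketch simply assumes that a treatment indicator, an outcome, and a propensity score have already been obtained for each text record (for instance from LLM conditional distributions), without showing how.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Record:
    """One text-derived observation (hypothetical structure for illustration)."""
    treatment: int     # 1 = treated, 0 = control; assumed already extracted from text
    outcome: float     # outcome value; assumed already extracted from text
    propensity: float  # assumed estimate of P(treatment = 1 | covariates)


def ipw_ate(records: Sequence[Record], eps: float = 1e-6) -> float:
    """Self-normalized (Hajek) IPW estimate of the average treatment effect."""
    treated_sum = treated_weight = 0.0
    control_sum = control_weight = 0.0
    for r in records:
        # Clip the propensity away from 0 and 1 to avoid division by zero.
        e = min(max(r.propensity, eps), 1.0 - eps)
        if r.treatment == 1:
            w = 1.0 / e
            treated_sum += w * r.outcome
            treated_weight += w
        else:
            w = 1.0 / (1.0 - e)
            control_sum += w * r.outcome
            control_weight += w
    if treated_weight == 0.0 or control_weight == 0.0:
        raise ValueError("need at least one treated and one control record")
    # Difference of weighted outcome means between the two arms.
    return treated_sum / treated_weight - control_sum / control_weight
```

With constant propensities the estimate reduces to a plain difference in mean outcomes between arms; the weights matter only when treatment assignment depends on covariates, which is the situation expected in observational text data.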