Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data, allowing valid comparisons to the ground-truth RCT. Using synthetic data, we show that our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite-data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite-data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations with text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
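To make the subsampling idea concrete, the following is a minimal sketch, not the released implementation. It assumes a binary treatment T, a single binary confounder C, and a researcher-chosen target distribution P*(T | C). The acceptance rule used here (keep a unit with probability P*(T_i | C_i) / (P(T_i) M), where M bounds the ratio) is the standard rejection sampling construction and is an assumption about the algorithm's form. The function name `rct_rejection_sample`, the synthetic data, and the target probabilities 0.8 and 0.2 are all hypothetical.

```python
import numpy as np

def rct_rejection_sample(t, c, target_p_t1_given_c, rng):
    """Return a boolean mask of RCT units kept in the confounded subsample.

    Illustrative sketch: accept unit i with probability
    P*(T_i | C_i) / (P(T_i) * M), where M bounds the ratio from above.
    """
    t = np.asarray(t)
    c = np.asarray(c)
    # Marginal treatment probability in the RCT (T is randomized, so it
    # does not depend on C).
    p_t1 = t.mean()
    p_t = np.where(t == 1, p_t1, 1.0 - p_t1)
    # Researcher-chosen target P*(T = t_i | C = c_i) for each unit.
    p1 = target_p_t1_given_c(c)
    p_star = np.where(t == 1, p1, 1.0 - p1)
    ratio = p_star / p_t
    # M >= max ratio keeps every acceptance probability in [0, 1].
    m = ratio.max()
    u = rng.uniform(size=t.shape[0])
    return u < ratio / m

def diff_in_means(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

# Synthetic "RCT" (hypothetical data): true average treatment effect is 2.
rng = np.random.default_rng(0)
n = 100_000
c = rng.binomial(1, 0.5, size=n)
t = rng.binomial(1, 0.5, size=n)          # randomized treatment
y = 2.0 * t + 3.0 * c + rng.normal(size=n)

# Induce confounding: treated units become more likely when C = 1.
mask = rct_rejection_sample(t, c, lambda c: np.where(c == 1, 0.8, 0.2), rng)
t_s, c_s, y_s = t[mask], c[mask], y[mask]

# The unadjusted estimate is biased on the confounded subsample ...
naive = diff_in_means(y_s, t_s)
# ... while backdoor adjustment over C targets the RCT effect.
adjusted = sum(
    (c_s == v).mean() * diff_in_means(y_s[c_s == v], t_s[c_s == v])
    for v in (0, 1)
)
print(f"RCT: {diff_in_means(y, t):.2f}  naive: {naive:.2f}  adjusted: {adjusted:.2f}")
```

In this toy setup the subsample keeps the marginal distribution of C unchanged, so an estimator that adjusts for C can be compared directly against the RCT's difference in means, while the unadjusted estimate drifts away from it. With text or other high-dimensional covariates, C would first have to be derived from those covariates; that step is outside this sketch.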