RCT Rejection Sampling for Causal Estimation Evaluation

Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as those arising in text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data, allowing valid comparisons to the ground-truth RCT. Using synthetic data, we show that our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite-data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite-data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
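
To make the subsampling idea concrete, the following is a minimal sketch of how rejection sampling could turn an RCT into a confounded observational sample. It is an illustration under stated assumptions, not necessarily the exact algorithm proposed in this work: the acceptance rule (keep a unit with probability proportional to P*(T | C) / P(T)), the function names `rct_rejection_sample` and `backdoor_ate`, the single binary covariate, and the synthetic data-generating process are all hypothetical choices made for the example.

```python
import numpy as np


def rct_rejection_sample(t, c, target_propensity, rng):
    """Return indices of RCT units kept in the confounded subsample.

    Assumed acceptance rule (illustrative): keep unit i with probability
    P*(T_i | C_i) / (P(T_i) * M), where P* is a chosen target propensity,
    P(T) is the marginal treatment rate in the RCT, and M is the largest
    value of the ratio P*(T_i | C_i) / P(T_i) over all units.
    """
    t = np.asarray(t)
    c = np.asarray(c)
    p_t1 = t.mean()                       # marginal P(T=1) in the RCT
    p_star_t1 = target_propensity(c)      # desired P*(T=1 | C)
    # Probability of each unit's observed treatment under P* and under the RCT.
    p_star = np.where(t == 1, p_star_t1, 1.0 - p_star_t1)
    p_rct = np.where(t == 1, p_t1, 1.0 - p_t1)
    ratio = p_star / p_rct
    m = ratio.max()
    u = rng.uniform(size=t.shape[0])
    return np.flatnonzero(u < ratio / m)


def backdoor_ate(y, t, c):
    """Oracle-style adjustment for a discrete C: sum_c P(c) * (E[Y|T=1,c] - E[Y|T=0,c])."""
    estimate = 0.0
    for value in np.unique(c):
        mask = c == value
        diff = y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()
        estimate += mask.mean() * diff
    return estimate


def demo(n=200_000, seed=0):
    """Synthetic RCT (hypothetical data) with a true average effect of 2.0."""
    rng = np.random.default_rng(seed)
    c = rng.binomial(1, 0.5, size=n)      # covariate, independent of T in the RCT
    t = rng.binomial(1, 0.5, size=n)      # randomized treatment
    y = 2.0 * t + 3.0 * c + rng.normal(size=n)

    rct_ate = y[t == 1].mean() - y[t == 0].mean()

    # Target propensity that makes C a confounder in the subsample.
    idx = rct_rejection_sample(t, c, lambda cc: np.where(cc == 1, 0.8, 0.2), rng)
    ys, ts, cs = y[idx], t[idx], c[idx]

    naive = ys[ts == 1].mean() - ys[ts == 0].mean()   # ignores confounding
    adjusted = backdoor_ate(ys, ts, cs)               # adjusts for C
    return rct_ate, naive, adjusted, len(idx)


if __name__ == "__main__":
    rct_ate, naive, adjusted, kept = demo()
    print(f"RCT ATE: {rct_ate:.3f}, naive: {naive:.3f}, adjusted: {adjusted:.3f}, kept: {kept}")
```

In this sketch the naive difference in means on the subsample is biased because treatment now depends on C, while the adjusted estimate should land close to the RCT average effect. That is the kind of comparison against ground truth that the evaluation strategy relies on.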
