RCT Rejection Sampling for Causal Estimation Evaluation

Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data, allowing valid comparisons to the ground-truth RCT. Using synthetic data, we show that our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite-data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite-data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
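
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way the idea of subsampling an RCT into a confounded dataset could look. It assumes a binary treatment, a single binary covariate, and a researcher-chosen target propensity P*(T=1 | C); the function names (`rct_rejection_sample`, `backdoor_ate`, `target_propensity`), the synthetic data, and the specific acceptance rule are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def rct_rejection_sample(c, t, y, target_propensity, rng):
    """Keep RCT rows so that T depends on C among the kept rows.

    target_propensity(c) is a hypothetical, user-chosen P*(T=1 | C=c).
    Each row is accepted with probability P*(t|c) / (P(t) * M), where
    P(t) is the marginal treatment rate in the RCT and M bounds the ratio.
    """
    p_t1 = t.mean()                                  # marginal P(T=1) in the RCT
    p_marg = np.where(t == 1, p_t1, 1.0 - p_t1)      # P(T = t_i)
    p_star1 = target_propensity(c)
    p_star = np.where(t == 1, p_star1, 1.0 - p_star1)  # P*(T = t_i | c_i)
    ratio = p_star / p_marg
    m = ratio.max()                                  # ensures acceptance prob <= 1
    keep = rng.uniform(size=len(t)) < ratio / m
    return c[keep], t[keep], y[keep]

def backdoor_ate(c, t, y):
    """Stratified (backdoor-adjusted) ATE estimate for a discrete covariate."""
    ate = 0.0
    for v in np.unique(c):
        s = c == v
        ate += s.mean() * (y[s & (t == 1)].mean() - y[s & (t == 0)].mean())
    return ate

# Synthetic "RCT" (assumed data-generating process): T is randomized,
# the true average treatment effect is 1.0, and C also affects Y.
rng = np.random.default_rng(0)
n = 50_000
c = rng.integers(0, 2, size=n)
t = rng.integers(0, 2, size=n)
y = 1.0 * t + 2.0 * c + rng.normal(size=n)

# Target propensity that makes treatment more likely when C = 1.
target_propensity = lambda c: np.where(c == 1, 0.8, 0.2)
c_s, t_s, y_s = rct_rejection_sample(c, t, y, target_propensity, rng)

naive = y_s[t_s == 1].mean() - y_s[t_s == 0].mean()  # confounded comparison
adjusted = backdoor_ate(c_s, t_s, y_s)               # adjusts for C
print(f"kept {len(t_s)} of {n} rows; naive={naive:.2f}, adjusted={adjusted:.2f}")
```

Under these assumptions, the unadjusted difference in means on the subsample should be pushed well away from the true effect of 1.0, because treated rows over-represent C = 1, while the stratified estimate should stay close to it. This mirrors the evaluation logic described above: an estimator that adjusts correctly can be checked against the RCT's average effect.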
