Causal inference on the average treatment effect (ATE) from non-probability samples, such as electronic health records (EHR), is complicated by sample selection bias and high-dimensional covariates. Addressing both requires a selection model in addition to the treatment and outcome models that causal inference typically relies on. This paper considers integrating a large non-probability sample with an external probability sample from a designed survey, allowing for moderately high-dimensional confounders and variables that influence selection. In contrast to the two-step approach that separates variable selection from debiased estimation, we propose a one-step plug-in doubly robust (DR) estimator of the ATE. We construct a novel penalized estimating equation by minimizing the squared asymptotic bias of the DR estimator. Because the variability from estimating the nuisance parameters can then be ignored, our approach facilitates ATE inference in high-dimensional settings, a property that is not guaranteed under conventional likelihood approaches with non-differentiable L1-type penalties. We provide a consistent variance estimator for the DR estimator. Simulation studies demonstrate the double robustness of our estimator under misspecification of either the outcome model or the selection and treatment models, as well as the validity of statistical inference under penalized estimation. We apply our method to integrate EHR data from the Michigan Genomics Initiative with an external probability sample.
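To make the notion of double robustness concrete, the following is a minimal sketch of the textbook augmented inverse probability weighting (AIPW) estimator of the ATE. It is not the penalized one-step estimator proposed here: it has no selection model, no survey weights, and no penalization, and the nuisance values are supplied directly rather than estimated. The function name `aipw_ate`, the simulated data, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np


def aipw_ate(y, a, e_hat, m1_hat, m0_hat):
    """Textbook AIPW point estimate of the ATE with a plug-in standard error.

    y      : outcomes
    a      : binary treatment indicators (0/1)
    e_hat  : propensity scores P(A = 1 | X), strictly between 0 and 1
    m1_hat : outcome-model predictions under treatment
    m0_hat : outcome-model predictions under control
    """
    # Outcome-model contrast plus inverse-probability-weighted residuals.
    phi = (m1_hat - m0_hat
           + a * (y - m1_hat) / e_hat
           - (1 - a) * (y - m0_hat) / (1 - e_hat))
    # Standard error from the empirical variance of the per-unit terms.
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(y))


# Hypothetical simulated data with a true ATE of 2.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-0.5 * x))
a = rng.binomial(1, e_true)
y = 1.0 + 2.0 * a + x + rng.normal(size=n)

# Case 1: outcome model correct, propensity model wrong (constant 0.5).
ate1, se1 = aipw_ate(y, a, np.full(n, 0.5), 3.0 + x, 1.0 + x)

# Case 2: propensity model correct, outcome model wrong (all zeros).
ate2, se2 = aipw_ate(y, a, e_true, np.zeros(n), np.zeros(n))

print(f"outcome model correct:    ATE = {ate1:.3f} (SE {se1:.3f})")
print(f"propensity model correct: ATE = {ate2:.3f} (SE {se2:.3f})")
```

In both cases the estimate should land close to the true value of 2, because only one of the two nuisance models needs to be correct. The plug-in standard error treats the nuisance values as fixed, which is the sense in which their estimation variability is ignored in this simplified setting.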