In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects. Specifically, we consider two estimands as prototype cases: (a) the average treatment effect and (b) the quantile treatment effect. The SS setting is characterized by two available data sets: (i) a labeled data set of size $n$, providing observations for a response, a set of high-dimensional covariates, and a binary treatment indicator; and (ii) an unlabeled data set of size $N$, much larger than $n$, in which the response is not observed. Using these two data sets, we develop a family of SS estimators that are guaranteed to be (1) more robust and (2) more efficient than their supervised counterparts based on the labeled data set only. Beyond the 'standard' double robustness results (in terms of consistency), which supervised methods can achieve as well, we further establish $\sqrt{n}$-consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. This improvement in robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semi-parametrically efficient as long as all the nuisance functions are correctly specified. Moreover, as an illustration of the nuisance estimators, we consider inverse-probability-weighting type kernel smoothing estimators involving unknown covariate transformation mechanisms, and establish novel results on their uniform convergence rates in high-dimensional scenarios, which should be of independent interest. Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency.
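To make the role of the unlabeled data concrete, the following is a minimal sketch of one common way to combine the two data sets when estimating the mean potential outcome $E[Y(1)]$, one of the two components of the average treatment effect. It is an illustration under stated assumptions, not necessarily the exact construction developed in the article: the outcome-model term is averaged over all $n + N$ covariate observations, while the inverse-probability-weighted residual correction uses the labeled data only. The function names `ss_mean_treated` and `simulate`, the data-generating model, and the use of the true propensity score in place of an estimate are all hypothetical choices made for the example.

```python
import numpy as np


def ss_mean_treated(y, t, pi_lab, m_lab, m_unlab):
    """Illustrative semi-supervised estimate of E[Y(1)].

    y, t     : response and binary treatment on the n labeled units
    pi_lab   : propensity scores P(T=1 | X) on the labeled units
    m_lab    : outcome-model values m(X) on the labeled units
    m_unlab  : outcome-model values m(X) on the N unlabeled units
    """
    n, N = len(y), len(m_unlab)
    # Plug-in term: average the outcome model over all n + N covariates.
    plug_in = (np.sum(m_lab) + np.sum(m_unlab)) / (n + N)
    # Correction term: inverse-probability-weighted residuals, labeled data only.
    correction = np.mean(t / pi_lab * (y - m_lab))
    return plug_in + correction


def simulate(n=2000, N=50000, seed=0, misspecify_outcome=False):
    """Toy simulation (hypothetical model) in which the true E[Y(1)] is 1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n + N)
    # Assumption: the propensity score is known; in practice it would be
    # estimated, possibly from all n + N treatment/covariate pairs.
    pi = 1.0 / (1.0 + np.exp(-0.5 * x))
    t = rng.binomial(1, pi)
    y1 = 1.0 + 2.0 * x + rng.normal(size=n + N)  # potential outcome Y(1)
    # Outcome model: either the true regression function or a wrong constant.
    m = np.zeros(n + N) if misspecify_outcome else 1.0 + 2.0 * x
    lab, unlab = slice(0, n), slice(n, None)
    # Only treated units contribute to the correction, so Y(1) is zeroed
    # out wherever T = 0; the response of unlabeled units is never used.
    y_obs = t[lab] * y1[lab]
    return ss_mean_treated(y_obs, t[lab], pi[lab], m[lab], m[unlab])


if __name__ == "__main__":
    print("outcome model correct:     ", simulate())
    print("outcome model misspecified:", simulate(misspecify_outcome=True))
```

In this toy setting the estimate should stay close to the true value 1 even when the outcome model is replaced by a wrong constant, provided the propensity score is correct, which mirrors the robustness property described above. When the outcome model is correct, the plug-in term benefits from the much larger sample of size $n + N$.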