In many real-world causal inference applications, the primary outcomes (labels) are partially missing, especially when they are expensive or difficult to collect. If the missingness depends on covariates (i.e., the outcomes are not missing completely at random), analyses based on the fully observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method that efficiently incorporates surrogates into the analysis. The method uses both labeled and unlabeled data and does not suffer from the selection bias described above. Importantly, we establish the asymptotic normality of the proposed estimator and show that its variance can be lower than that of methods that use only labeled data. Extensive simulations show that the proposed method performs well empirically.