In many real-world causal inference applications, the primary outcomes (labels) are partially missing, especially when they are expensive or difficult to collect. If the missingness depends on covariates (i.e., outcomes are not missing completely at random), analyses based on the fully observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this setting. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method that efficiently incorporates surrogates into the analysis, uses both labeled and unlabeled data, and does not suffer from the selection bias described above. Importantly, we establish the asymptotic normality of the proposed estimator and show possible improvements in variance compared with methods that use only labeled data. Extensive simulations show that our method enjoys appealing empirical performance.
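The selection bias mentioned above can be illustrated with a small simulation. The sketch below is not the paper's estimator: it uses no surrogates and no treatment variable, and it assumes a hypothetical data-generating process in which the labeling probability `pi` is a known logistic function of a single covariate. It only shows that a complete-case mean can be biased when missingness depends on covariates, and that weighting labeled samples by the inverse labeling probability (a standard correction, swapped in here for illustration) reduces that bias.

```python
import numpy as np


def simulate(n=50_000, seed=0):
    """Hypothetical setup: outcome y depends on covariate x, and so does labeling."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                # covariate
    y = 2.0 * x + rng.normal(size=n)      # primary outcome, true mean is 0
    pi = 1.0 / (1.0 + np.exp(-x))         # assumed known labeling probability
    r = rng.random(n) < pi                # r[i] is True when y[i] is observed
    return x, y, pi, r


def complete_case_mean(y, r):
    """Average the outcome over labeled samples only."""
    return y[r].mean()


def ipw_mean(y, r, pi):
    """Normalized inverse-probability-weighted mean over labeled samples.

    Unlabeled samples receive weight 0, so their (unobserved) outcomes
    do not contribute to the estimate.
    """
    w = r / pi
    return np.sum(w * y) / np.sum(w)


x, y, pi, r = simulate()
cc_estimate = complete_case_mean(y, r)
ipw_estimate = ipw_mean(y, r, pi)
print(f"complete-case mean: {cc_estimate:.3f}")
print(f"IPW mean:           {ipw_estimate:.3f}")
```

In this setup, samples with larger `x` are more likely to be labeled, so the complete-case mean is pulled upward, while the weighted mean should land close to the true value of 0. In practice the labeling probability would have to be estimated, which is one motivation for doubly robust constructions.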