Estimating the causal effects of interventions is crucial to policy and decision-making, yet outcome data are often missing or subject to non-standard measurement error. Ground-truth outcomes can sometimes be obtained through costly data annotation or follow-up, but budget constraints typically allow only a fraction of the dataset to be labeled. We address this challenge by optimizing which data points are sampled for outcome information, with the goal of improving the efficiency of average treatment effect estimation with missing outcomes. We derive a closed-form solution for the optimal batch sampling probability by minimizing the asymptotic variance of a doubly robust estimator for causal inference with missing outcomes. Motivated by our street outreach partners, we extend the framework to costly annotations of unstructured data, such as text or images in healthcare and social services. Across simulated and real-world datasets, including one of outreach interventions in homelessness services, our approach achieves substantially lower mean-squared error and recovers the augmented inverse probability weighting (AIPW) estimate with fewer labels than existing baselines. In practice, we show that our method can match the confidence intervals obtained with 361 random samples using only 90 optimized samples, saving about 75% of the labeling budget.
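To make the estimator referred to in the abstract concrete, the following is a minimal sketch of a standard doubly robust (AIPW-style) estimate of the average treatment effect when outcomes are labeled only for a sampled subset. It is not the authors' implementation and does not reproduce their closed-form optimal sampling probability: the function names `aipw_missing_outcomes` and `simulate`, the toy data-generating process, the uniform labeling probability, and the hand-written outcome models are all assumptions made for illustration.

```python
import numpy as np


def aipw_missing_outcomes(y, t, r, e, pi, mu0, mu1):
    """Doubly robust ATE estimate when outcomes are observed only where r == 1.

    y        : outcomes (may be NaN where r == 0)
    t        : binary treatment indicator
    r        : binary labeling indicator (1 = outcome was annotated)
    e        : propensity score P(T=1 | X)
    pi       : labeling probability P(R=1 | X, T), assumed known by design
    mu0, mu1 : outcome-model predictions under control / treatment
    """
    y = np.where(r == 1, y, 0.0)          # unlabeled outcomes never enter the estimate
    w1 = r * t / (pi * e)                 # weight for labeled treated units
    w0 = r * (1 - t) / (pi * (1 - e))     # weight for labeled control units
    psi = (mu1 - mu0) + w1 * (y - mu1) - w0 * (y - mu0)
    ate = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(psi.size)  # plug-in standard error
    return ate, se


def simulate(n=50_000, label_rate=0.25, seed=0):
    """Toy data (hypothetical) with a true ATE of 1.0 and randomly labeled outcomes."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    e = 0.2 + 0.6 / (1.0 + np.exp(-x))    # propensity bounded in [0.2, 0.8]
    t = rng.binomial(1, e)
    y_full = 1.0 * t + x + rng.normal(size=n)
    pi = np.full(n, label_rate)           # uniform labeling, NOT an optimized design
    r = rng.binomial(1, pi)
    y = np.where(r == 1, y_full, np.nan)  # hide outcomes that were not labeled
    mu0 = 0.8 * x                         # deliberately imperfect outcome models
    mu1 = 0.8 * x + 0.9
    return aipw_missing_outcomes(y, t, r, e, pi, mu0, mu1)


ate, se = simulate()
print(f"ATE estimate: {ate:.3f} (SE {se:.3f})")
```

In this sketch, a non-uniform labeling design would only change the `pi` array and the way `r` is drawn from it. The estimate stays consistent as long as `pi` holds the probabilities actually used to select units for labeling, which is why the choice of those probabilities can be tuned for variance alone.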