Missing data is a systemic problem in practice and introduces noise and bias when estimating treatment effects, which makes treatment effect estimation from data with missingness particularly difficult. A key reason is that standard missingness assumptions become insufficient in the presence of an additional variable, the treatment, besides the input (e.g. an individual) and the label (e.g. an outcome). The treatment variable adds complexity regarding why some variables are missing, which previous work has not fully explored. We introduce mixed confounded missingness (MCM), a new missingness mechanism in which some missingness determines treatment selection and other missingness is determined by treatment selection. Under MCM, we show that naively imputing all data leads to poorly performing treatment effect models, because imputation effectively removes information needed for unbiased estimates. However, performing no imputation at all also leads to biased estimates, because missingness determined by treatment introduces bias in the covariates. Our solution is selective imputation, which uses insights from MCM to determine precisely which variables should be imputed and which should not. We empirically demonstrate that various learners benefit from selective imputation compared to other solutions for missing data. Our experiments cover both average treatment effects and conditional average treatment effects.
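To make the idea of selective imputation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the split of covariates is already known: the hypothetical argument `impute_cols` lists the columns whose missingness is determined by treatment (these are filled in), while all other columns keep their missing entries so that the missingness pattern that may have driven treatment selection stays visible to a downstream learner. Column-mean imputation is used purely as a simple stand-in for whatever imputer one would actually choose.

```python
import numpy as np


def selective_impute(X, impute_cols):
    """Mean-impute only the columns listed in `impute_cols`.

    Assumption (for illustration only): `impute_cols` holds the indices of
    covariates whose missingness is caused by treatment. Every other column
    is returned unchanged, NaNs included, so its missingness pattern is
    preserved.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for j in impute_cols:
        col = X[:, j]
        observed = ~np.isnan(col)
        if observed.any():
            # Fill missing entries with the mean of the observed entries.
            out[~observed, j] = col[observed].mean()
    return out


# Toy covariate matrix with missing entries in every column.
X = np.array([
    [1.0, np.nan, 0.5],
    [np.nan, 2.0, 1.5],
    [3.0, 4.0, np.nan],
])

# Hypothetical split: only column 1 has treatment-induced missingness.
X_sel = selective_impute(X, impute_cols=[1])
```

Here column 1 ends up fully observed, while the NaNs in columns 0 and 2 are left in place. A learner that cannot handle NaNs directly would additionally need an explicit missingness indicator for those columns; that step is omitted from this sketch.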