Accurate imputation of missing data is critical to downstream machine learning performance. We formulate missing data imputation as a risk minimisation problem, a formulation that exposes a covariate shift between the observed and unobserved data distributions. Popular imputation methods do not account for the bias this covariate shift induces, which leads to suboptimal performance. In this paper, we derive theoretically valid importance weights that correct for the induced distributional bias. Furthermore, we propose a novel imputation algorithm that jointly estimates the importance weights and the imputation models, enabling bias correction throughout the imputation process. Empirical results across benchmark datasets show reductions in root mean squared error and Wasserstein distance of up to 7% and 20%, respectively, compared to otherwise identical unweighted methods.
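The idea of reweighting the imputation risk can be illustrated with a small sketch. This is not the algorithm proposed in the paper, whose details the abstract does not give. It is a generic importance-weighting example under stated assumptions: one fully observed covariate x, one column y with missing entries, missingness that depends only on x, a logistic model for the missingness probability, and a deliberately misspecified linear imputation model. All function names (`fit_logistic`, `importance_weights`, `weighted_linear_fit`, `demo`) are hypothetical. The weight used is the standard density ratio w(x) = p_mis(x) / p_obs(x) = [P(m=1|x) / P(m=0|x)] * [P(m=0) / P(m=1)].

```python
import math
import random


def fit_logistic(xs, ms, steps=300, lr=0.5):
    """Fit P(m=1 | x) = sigmoid(a + b*x) by batch gradient ascent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, m in zip(xs, ms):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += m - p
            grad_b += (m - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def importance_weights(xs_obs, a, b, prior_mis):
    """Density ratio p_mis(x) / p_obs(x) under the logistic model (assumption).

    The odds P(m=1|x) / P(m=0|x) equal exp(a + b*x); the second factor
    rescales by the marginal odds P(m=0) / P(m=1).
    """
    scale = (1.0 - prior_mis) / prior_mis
    return [math.exp(a + b * x) * scale for x in xs_obs]


def weighted_linear_fit(xs, ys, ws):
    """Weighted least squares for y ~ intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return my - slope * mx, slope


def rmse(model, rows):
    """Root mean squared error of a (intercept, slope) model on (x, y) rows."""
    c, s = model
    return math.sqrt(sum((c + s * x - y) ** 2 for x, y in rows) / len(rows))


def demo(seed=0, n=1000):
    """Synthetic comparison of unweighted and importance-weighted imputation."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.1) for x in xs]
    # y is more likely to be missing when x is large (illustrative mechanism).
    ms = [1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * x)) else 0 for x in xs]

    a, b = fit_logistic(xs, ms)
    obs = [(x, y) for x, y, m in zip(xs, ys, ms) if m == 0]
    mis = [(x, y) for x, y, m in zip(xs, ys, ms) if m == 1]
    x_obs = [x for x, _ in obs]
    y_obs = [y for _, y in obs]

    ws = importance_weights(x_obs, a, b, len(mis) / n)
    plain = weighted_linear_fit(x_obs, y_obs, [1.0] * len(obs))
    weighted = weighted_linear_fit(x_obs, y_obs, ws)
    # True y is known here only because the data are synthetic.
    return rmse(plain, mis), rmse(weighted, mis)


if __name__ == "__main__":
    rmse_unweighted, rmse_weighted = demo()
    print(f"unweighted RMSE on missing rows: {rmse_unweighted:.3f}")
    print(f"weighted RMSE on missing rows:   {rmse_weighted:.3f}")
```

In this toy setting both models are fit on the observed rows only and evaluated on the rows whose y was masked. The weighted fit emphasises observed points that resemble the missing ones, which is the mechanism the abstract describes. The sketch estimates the weights once, before fitting, whereas the abstract describes estimating weights and imputation models jointly.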