Conditional expectation with regularization for missing data imputation

Missing data occur frequently in datasets across domains such as medicine, sports, and finance. To enable proper and reliable analysis of such data, the missing values are often imputed, and the imputation method should achieve a low root mean square error (RMSE) between the imputed and the true values. Some critical applications additionally require that the imputation method be scalable and that the logic behind the imputation be explainable, which is especially difficult for complex methods such as those based on deep learning. Based on these considerations, we propose a new algorithm named "conditional Distribution-based Imputation of Missing Values with Regularization" (DIMV). DIMV determines the conditional distribution of a feature that has missing entries, using the information from the fully observed features as a basis. As the experiments in the paper illustrate, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods; (ii) is fast and scalable; (iii) is explainable, since the imputation can be read as coefficients in a regression model, which allows reliable and trustworthy analysis and makes it a suitable choice for critical domains where understanding is important, such as medicine and finance; (iv) can provide an approximate confidence region for the missing values in a given sample; (v) is suitable for both small- and large-scale data; (vi) in many scenarios does not require as many parameters as deep learning approaches; (vii) handles multicollinearity in imputation effectively; and (viii) is robust to violations of the normality assumption on which its theoretical grounds rely.
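
To make the idea of a regularized conditional expectation concrete, the following is a minimal sketch and not the DIMV implementation itself. It assumes jointly Gaussian features, estimates the mean and covariance from complete rows only, conditions each row on whatever entries are observed in it, and adds a ridge term alpha to the observed-block covariance. The function name `conditional_mean_impute`, the parameter `alpha`, and these estimation choices are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np

def conditional_mean_impute(X, alpha=0.1):
    """Fill NaNs with a ridge-regularized Gaussian conditional mean (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)

    # Assumption: mean and covariance are estimated from fully observed rows only.
    complete = X[~mask.any(axis=1)]
    mu = complete.mean(axis=0)
    sigma = np.cov(complete, rowvar=False)

    X_imp = X.copy()
    for i in np.where(mask.any(axis=1))[0]:
        m = mask[i]   # missing features in this row
        o = ~m        # observed features in this row
        if not o.any():
            # Nothing to condition on: fall back to the marginal mean.
            X_imp[i, m] = mu[m]
            continue
        # Regularized observed block: Sigma_oo + alpha * I
        S_oo = sigma[np.ix_(o, o)] + alpha * np.eye(int(o.sum()))
        S_mo = sigma[np.ix_(m, o)]
        # Regression coefficients S_mo @ inv(S_oo), computed via a linear solve.
        coef = np.linalg.solve(S_oo, S_mo.T).T
        # Conditional mean: mu_m + coef @ (x_o - mu_o)
        X_imp[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return X_imp
```

In this sketch, each row of `coef` holds the regression coefficients that map the observed features to one imputed feature, which is the sense in which such an imputation can be read as a regression model. The ridge term keeps the linear solve well conditioned when the observed features are strongly collinear.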
