Missing data occur frequently in datasets across many domains, such as medicine, sports, and finance. To enable proper and reliable analysis of such data, the missing values are often imputed, and the imputation method should yield a low root mean square error (RMSE) between the imputed and the true values. In addition, some critical applications require the imputation method to be scalable and the logic behind the imputation to be explainable, which is especially difficult for complex methods such as those based on deep learning. Based on these considerations, we propose a new algorithm named "conditional Distribution-based Imputation of Missing Values with Regularization" (DIMV). DIMV determines the conditional distribution of a feature that has missing entries, using the information from the fully observed features as a basis. As illustrated via experiments in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods; (ii) is fast and scalable; (iii) is explainable, since its imputations can be read as coefficients in a regression model, which allows reliable and trustworthy analysis and makes it a suitable choice for critical domains where understanding is important, such as medicine and finance; (iv) can provide an approximate confidence region for the missing values in a given sample; (v) is suitable for both small- and large-scale data; (vi) in many scenarios, does not require as many parameters as deep learning approaches; (vii) handles multicollinearity in imputation effectively; and (viii) is robust to violations of the normality assumption on which its theoretical grounds rely.
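To make the idea of imputing from a regularized conditional distribution concrete, the following is a minimal sketch and not the DIMV implementation itself. It assumes joint normality, estimates means and covariances from the rows where the target feature is observed, and applies a ridge-style penalty `alpha` to the covariance of the fully observed features. The function name `conditional_gaussian_impute`, the parameter `alpha`, and the way the moments are estimated are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def conditional_gaussian_impute(X, target, alpha=0.1):
    """Illustrative sketch (hypothetical helper, not the authors' code).

    Imputes NaNs in column `target` with the ridge-regularized Gaussian
    conditional mean given the fully observed columns of X.
    Returns (X_imputed, coef, cond_std).
    """
    X = np.asarray(X, dtype=float)
    # Predictors: every other column that has no missing entries.
    observed = [j for j in range(X.shape[1])
                if j != target and not np.isnan(X[:, j]).any()]
    miss = np.isnan(X[:, target])
    complete = ~miss

    # Means: predictors from all rows, target from rows where it is observed.
    mu_o = X[:, observed].mean(axis=0)
    mu_t = X[complete, target].mean()

    # Second moments estimated on the rows where the target is observed
    # (an assumption of this sketch).
    Zo = X[complete][:, observed] - mu_o
    zt = X[complete, target] - mu_t
    n = complete.sum()
    S_oo = Zo.T @ Zo / n
    S_ot = Zo.T @ zt / n
    S_tt = zt @ zt / n

    # Regression-style coefficients: (S_oo + alpha * I)^{-1} S_ot.
    # The ridge term keeps the solve stable when predictors are collinear.
    coef = np.linalg.solve(S_oo + alpha * np.eye(len(observed)), S_ot)

    # Conditional mean fills the missing entries.
    X_imp = X.copy()
    X_imp[miss, target] = mu_t + (X[miss][:, observed] - mu_o) @ coef

    # Conditional standard deviation, usable for an approximate interval
    # such as mean +/- 1.96 * cond_std under the normality assumption.
    cond_var = max(S_tt - S_ot @ coef, 0.0)
    return X_imp, coef, np.sqrt(cond_var)
```

In this sketch the returned `coef` vector plays the role of the regression coefficients that make the imputation interpretable, and `cond_std` gives a rough per-feature spread from which an approximate confidence interval could be formed. Both are only as reliable as the normality assumption and the choice of `alpha`.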