Numerical data imputation algorithms replace missing values with estimates so that incomplete data sets can still be used. Current imputation methods seek to minimize the error between the unobserved ground truth and the imputed values. However, this strategy can create artifacts that lead to poor imputation in the presence of multimodal or complex distributions. To tackle this problem, we introduce the $k$NN$\times$KDE algorithm: a data imputation method combining nearest neighbor estimation ($k$NN) and density estimation with Gaussian kernels (KDE). We compare our method with previous data imputation methods on artificial and real-world data, under different missing-data scenarios and various missing rates, and show that it copes with complex original data structure, yields lower imputation errors, and provides probabilistic estimates with higher likelihood than current methods. We release the code as open source for the community: https://github.com/DeltaFloflo/knnxkde
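To make the idea of combining nearest neighbors with Gaussian kernels concrete, the following is a minimal sketch and not the released implementation. It assumes that missing entries are encoded as NaN, that at least one row is fully observed, and that neighbors are chosen by Euclidean distance on the observed columns. The function name `knn_kde_impute_samples` and the parameters `k`, `bandwidth`, and `n_samples` are hypothetical and chosen only for illustration. Instead of returning a single point estimate, it draws samples from a Gaussian kernel mixture centered on the neighbors' values, so a multimodal conditional distribution is not averaged into one value.

```python
import numpy as np

def knn_kde_impute_samples(X, k=5, bandwidth=0.1, n_samples=100, seed=0):
    """Illustrative sketch: return {(row, col): samples} for each NaN cell of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Donor rows: rows without any missing value (assumed to exist).
    donors = X[~missing.any(axis=1)]
    samples = {}
    for i in np.flatnonzero(missing.any(axis=1)):
        observed = ~missing[i]
        # Euclidean distance to each donor, using the observed columns only.
        dist = np.linalg.norm(donors[:, observed] - X[i, observed], axis=1)
        neighbors = donors[np.argsort(dist)[:k]]
        for j in np.flatnonzero(missing[i]):
            # Sampling from a Gaussian KDE: pick a neighbor value uniformly
            # at random as the kernel center, then add Gaussian noise.
            centers = rng.choice(neighbors[:, j], size=n_samples)
            noise = bandwidth * rng.standard_normal(n_samples)
            samples[(int(i), int(j))] = centers + noise
    return samples

# Usage: the last row misses its second feature; its neighbors lie near 0.
X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1], [0.0, np.nan]])
draws = knn_kde_impute_samples(X, k=2, bandwidth=0.01)
print(draws[(4, 1)].mean())
```

Returning samples rather than a single number is a design choice of this sketch: a point estimate such as the mean or median can be computed from the draws afterwards, while the draws themselves keep the shape of the estimated distribution.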