This paper addresses missing-data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation-Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties compared with other popular approaches, such as those based on k-nearest neighbors or on multiple imputation by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data are contaminated by outliers or follow non-Gaussian distributions. To overcome this issue, a new EM algorithm that can handle missing data is investigated for mixtures of elliptical distributions. This paper shows that, under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which may differ from one sample to another), this problem reduces to the estimation of a mixture of Angular Gaussian distributions. In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is competitive with other classical imputation methods.
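For orientation, the classical Gaussian baseline mentioned above fills missing entries with their conditional expectation given the observed entries. The following is a minimal sketch of that single-component step only, not of the proposed elliptical-mixture algorithm. The function name `conditional_mean_impute`, the NaN encoding of missing values, and the toy parameters are assumptions made for illustration.

```python
import numpy as np

def conditional_mean_impute(x, mean, cov):
    """Fill NaN entries of x with E[x_missing | x_observed] under N(mean, cov).

    Illustrative helper (hypothetical name): one Gaussian component only,
    with known mean and covariance. Missing values are encoded as NaN.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)

    miss = np.isnan(x)
    obs = ~miss
    if not miss.any():
        return x.copy()        # nothing to impute
    if not obs.any():
        return mean.copy()     # no information: fall back to the mean

    cov_oo = cov[np.ix_(obs, obs)]    # covariance of the observed block
    cov_mo = cov[np.ix_(miss, obs)]   # cross-covariance missing/observed

    filled = x.copy()
    # Gaussian conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
    filled[miss] = mean[miss] + cov_mo @ np.linalg.solve(cov_oo, x[obs] - mean[obs])
    return filled

# Toy usage with assumed parameters: two correlated variables, second one missing.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
x_filled = conditional_mean_impute(np.array([1.0, np.nan]), mean, cov)
```

In an EM algorithm for a Gaussian mixture, a step of this kind would be applied per component and weighted by the component responsibilities. The approach described in the abstract instead relies on a conditional distribution that is a multivariate $t$-distribution, which this sketch does not implement.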