Model-based unsupervised learning, like any learning task, stalls as soon as missing data occur. This is even more true when the missing data are informative, that is, missing not at random (MNAR). In this paper, we propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data. To do so, we introduce a mixture model for different types of data (continuous, count, categorical and mixed) that jointly models the data distribution and the MNAR mechanism, while remaining vigilant about the relative degrees of freedom of each. We discuss several MNAR models in which the cause of the missingness can depend both on the values of the missing variables themselves and on the class membership. However, we focus on a specific MNAR model, called MNARz, in which the missingness depends only on the class membership. We first underline its ease of estimation by showing that statistical inference can be carried out on the data matrix concatenated with the missing mask, under what then amounts to a standard missing at random (MAR) mechanism. Consequently, we propose to perform clustering using the Expectation Maximization algorithm, specially developed for this simplified reinterpretation. Finally, we assess the numerical performance of the proposed methods on synthetic data and on the real medical registry TraumaBase.
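
To make the MNARz reinterpretation concrete, the following is a minimal illustrative sketch and not the implementation used in the paper. It assumes continuous data, Gaussian components with diagonal covariance, and a missingness probability that depends only on the class and the variable; the function name `em_mnarz_diag_gmm` and all parameter names are hypothetical. Each mask column is treated as an extra Bernoulli variable appended to the data, and the Gaussian part of the likelihood is evaluated on the observed entries only.

```python
import numpy as np


def em_mnarz_diag_gmm(Y, K, n_iter=100, seed=0):
    """EM sketch for a diagonal Gaussian mixture with class-dependent missingness.

    Y is an (n, d) array in which np.nan marks a missing entry.
    Assumption (illustrative): P(entry j missing | class k) = rho[k, j].
    Returns proportions, means, variances, rho, and responsibilities.
    """
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    C = np.isnan(Y).astype(float)          # missing mask, 1 = missing
    O = 1.0 - C                            # observed indicator
    Y0 = np.where(np.isnan(Y), 0.0, Y)     # zero-filled; always weighted by O
    eps = 1e-6                             # numerical floor
    T = rng.dirichlet(np.ones(K), size=n)  # random soft initialisation, (n, K)

    for _ in range(n_iter):
        # M-step: weighted estimates using observed entries only.
        Nk = T.sum(axis=0) + eps
        pi = Nk / Nk.sum()
        W = T.T @ O + eps                  # (K, d) weighted observed counts
        mu = (T.T @ Y0) / W
        diff = Y0[:, None, :] - mu[None, :, :]               # (n, K, d)
        var = np.einsum("ik,ij,ikj->kj", T, O, diff**2) / W + eps
        rho = np.clip((T.T @ C) / Nk[:, None], eps, 1 - eps)

        # E-step: Gaussian terms on observed entries plus Bernoulli mask terms.
        log_norm = -0.5 * (np.log(2 * np.pi * var)[None] + diff**2 / var[None])
        log_r = (np.log(pi)[None, :]
                 + np.einsum("ij,ikj->ik", O, log_norm)
                 + C @ np.log(rho).T
                 + O @ np.log1p(-rho).T)
        log_r -= log_r.max(axis=1, keepdims=True)            # stabilise exp
        T = np.exp(log_r)
        T /= T.sum(axis=1, keepdims=True)

    return pi, mu, var, rho, T
```

Because the mask enters the E-step through the Bernoulli terms, two classes with similar observed values but different missingness rates can still be told apart. The diagonal covariance is a design choice made here only to keep the marginalisation over missing entries trivial; the random initialisation is likewise a simplification, and several restarts would normally be compared.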