Model-based unsupervised learning, like any learning task, stalls as soon as missing data occur. This is all the more true when the missing data are informative, that is, missing not at random (MNAR). In this paper, we propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data. To do so, we introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism, remaining vigilant to the degrees of freedom of each. Eight different MNAR models, which depend on the class membership and/or on the values of the missing variables themselves, are proposed. For a particular type of MNAR model, in which the missingness depends on the class membership, we show that statistical inference can be carried out on the data matrix concatenated with the missing mask under a MAR mechanism instead; this specifically underlines the versatility of the studied MNAR models. Then, we establish sufficient conditions for identifiability of the parameters of both the data distribution and the mechanism. Regardless of the type of data and the mechanism, we propose to perform clustering using EM or stochastic EM algorithms specially developed for the purpose. Finally, we assess the numerical performance of the proposed methods on synthetic data and on the real medical registry TraumaBase.
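To illustrate the idea of running inference on the data matrix concatenated with the missing mask, the following is a minimal sketch and not the algorithm developed in the paper. It assumes a two-cluster Gaussian mixture with diagonal covariance, missingness whose probability depends only on the class membership, and a Bernoulli model for each column of the mask within each class. Under these assumptions the missing entries simply drop out of the Gaussian term, as under MAR, while the mask columns contribute their own Bernoulli term. The function names `simulate` and `em_concatenated`, and all parameter values, are hypothetical and chosen for illustration only.

```python
import numpy as np


def simulate(n=600, seed=0):
    """Two Gaussian clusters; the missingness rate depends only on the class."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                 # true class labels
    means = np.array([[0.0, 0.0], [3.0, 3.0]])     # illustrative cluster means
    x = means[z] + rng.normal(size=(n, 2))
    miss_prob = np.array([0.1, 0.5])               # class-specific missingness rate
    mask = rng.random((n, 2)) < miss_prob[z][:, None]   # True where missing
    return np.where(mask, np.nan, x), mask, z


def em_concatenated(x_nan, mask, k=2, n_iter=100, seed=0):
    """EM on [data, mask]: diagonal Gaussians on observed entries,
    class-specific Bernoulli distributions on the mask columns."""
    rng = np.random.default_rng(seed)
    n, d = x_nan.shape
    obs = (~mask).astype(float)          # 1.0 where the entry is observed
    m = mask.astype(float)               # 1.0 where the entry is missing
    x0 = np.where(mask, 0.0, x_nan)      # zero-fill so missing entries add nothing
    resp = rng.dirichlet(np.ones(k), size=n)   # random soft initialisation
    for _ in range(n_iter):
        # M-step: weighted estimates that use observed entries only.
        nk = np.maximum(resp.sum(axis=0), 1e-8)
        pi = nk / nk.sum()
        w = np.maximum(resp.T @ obs, 1e-8)       # weighted observed counts, (k, d)
        mu = (resp.T @ x0) / w
        var = np.maximum((resp.T @ x0**2) / w - mu**2, 1e-6)
        rho = np.clip((resp.T @ m) / nk[:, None], 1e-6, 1 - 1e-6)
        # E-step: Gaussian term on observed entries plus Bernoulli term on the mask.
        diff = x0[:, None, :] - mu[None, :, :]
        log_gauss = -0.5 * (np.log(2 * np.pi * var)[None] + diff**2 / var[None])
        log_gauss = (log_gauss * obs[:, None, :]).sum(axis=2)
        log_bern = m @ np.log(rho).T + obs @ np.log1p(-rho).T
        log_p = np.log(pi)[None, :] + log_gauss + log_bern
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp, pi, mu, var, rho


x_nan, mask, z = simulate()
resp, pi, mu, var, rho = em_concatenated(x_nan, mask)
labels = resp.argmax(axis=1)
# Cluster labels are only defined up to a permutation of the two classes.
accuracy = max(np.mean(labels == z), np.mean(labels != z))
print("clustering accuracy:", accuracy)
print("estimated missingness rates per class and variable:")
print(rho)
```

In this sketch a row with every entry missing is still assigned to a cluster, because the mask itself carries information about the class. The diagonal covariance is a simplifying choice that keeps the M-step in closed form without imputing missing values.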