Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying the relevant variables that define heterogeneous subgroups and handling data that are missing not at random (MNAR), a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, which limits their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We show that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy enhances the capability and efficiency of model-based clustering, advancing methods for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated on both synthetic and real-world transcriptomic datasets.