High-throughput gene expression data are high-dimensional, with complex inter-gene dependence and pronounced biological heterogeneity across samples, which poses major challenges for unsupervised clustering and disease subtype discovery. We introduce a module-structured mixture factor model that combines finite mixture modelling with low-rank latent factor representations defined at the gene-module level. By explicitly modelling gene modules in both the mean and the covariance structure, the proposed framework decomposes expression variability into four components: global gene-specific effects, cluster-specific module-level shifts, latent dependence within modules, and gene-specific residual noise. We develop an Expectation--Conditional Maximisation (ECM) algorithm for parameter estimation, allowing stable and scalable inference in high-dimensional transcriptomic settings. Applied to a large clinical transcriptomic dataset covering two autoimmune diseases, the framework enables interpretable, unsupervised identification of disease-associated molecular subtypes and phenotypic heterogeneity.
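The four-part decomposition described in the abstract can be illustrated with a small simulation. The sketch below is only one plausible reading of that decomposition, not the authors' specification: the function name `simulate_module_mixture`, the equal mixing proportions, the Gaussian distributions, and the block-structured loading matrix are all assumptions made for illustration.

```python
import numpy as np

def simulate_module_mixture(n=200, genes_per_module=(5, 4, 6),
                            n_clusters=3, q=1, seed=0):
    """Draw samples from a toy module-structured mixture factor model.

    Hypothetical parametrisation (an assumption, not taken from the paper):
        x_i | z_i = k  =  mu + shift_k[module(gene)] + loadings @ f_i + e_i
    """
    rng = np.random.default_rng(seed)
    # Module index of every gene, e.g. [0,0,0,0,0,1,1,1,1,2,...]
    module_of_gene = np.repeat(np.arange(len(genes_per_module)), genes_per_module)
    p, n_modules = module_of_gene.size, len(genes_per_module)

    mu = rng.normal(0.0, 1.0, size=p)                       # global gene-specific effects
    shifts = rng.normal(0.0, 2.0, size=(n_clusters, n_modules))  # cluster-specific module shifts
    psi = rng.uniform(0.2, 0.5, size=p)                     # gene-specific residual variances

    # Block-structured loadings: a gene loads only on the q factors of its own module,
    # so latent dependence exists within modules but not between them.
    loadings = np.zeros((p, n_modules * q))
    for m in range(n_modules):
        rows = module_of_gene == m
        loadings[rows, m * q:(m + 1) * q] = rng.normal(size=(int(rows.sum()), q))

    z = rng.integers(0, n_clusters, size=n)                 # cluster labels, equal proportions
    factors = rng.standard_normal((n, n_modules * q))       # latent module factors
    noise = rng.standard_normal((n, p)) * np.sqrt(psi)      # residual noise

    means = mu + shifts[z][:, module_of_gene]               # (n, p) cluster-dependent means
    X = means + factors @ loadings.T + noise
    implied_cov = loadings @ loadings.T + np.diag(psi)      # within-cluster covariance
    return X, z, module_of_gene, implied_cov

X, z, module_of_gene, implied_cov = simulate_module_mixture()
```

Under these assumptions the within-cluster covariance is block-diagonal by module plus a diagonal noise term, which is what makes the number of free parameters grow far more slowly than in an unstructured covariance model.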