A Bayesian Finite Mixture Model Approach for Mixed-type Data Clustering and Variable Selection with Censored Biomarkers

Clustering mixed-type data to uncover clinically meaningful subgroups within heterogeneous patient populations remains a major challenge in biomedical research. Most existing clustering methods impose restrictive assumptions such as local independence, fail to accommodate censored biomarkers, or cannot quantify variable importance. We propose a Bayesian finite mixture model (BFMM) clustering framework that addresses these limitations. BFMM flexibly models both continuous and categorical variables, incorporates three covariance structures to capture cluster-specific dependencies among continuous features, and handles censored observations through likelihood-based imputation. To facilitate feature prioritization, BFMM uses spike-and-slab priors to estimate variable importance on a continuous 0-1 scale. Simulation studies demonstrate that BFMM outperforms existing methods in clustering accuracy, particularly in the presence of strong within-cluster correlation or censored variables, and reliably distinguishes informative features from noise under varying conditions. We applied BFMM to two real-world datasets: (1) the SENECA cohort, which integrates electronic health records from patients with sepsis; and (2) the EDEN randomized trial of patients with acute lung injury. In both settings, BFMM identified clinically interpretable phenotypes and revealed variable-specific contributions to subgroup differentiation. In the EDEN trial, it also uncovered evidence of treatment heterogeneity. These findings validate BFMM as an effective, interpretable, and practically useful clustering tool for complex biomedical datasets.
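To make the idea of likelihood-based imputation for censored biomarkers concrete, the following is a minimal sketch rather than the BFMM implementation. It assumes a single Gaussian biomarker that is left-censored at a known detection limit, with the cluster-specific mean and standard deviation taken as given (as they would be within one iteration of a sampler). The function name `impute_left_censored` and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist


def impute_left_censored(mu, sigma, detection_limit, rng):
    """Draw one value from Normal(mu, sigma) truncated to values at or below
    the detection limit, using inverse-CDF sampling.

    Assumption: the censored biomarker is Gaussian within its cluster; the
    abstract does not state the distributional form BFMM uses.
    """
    dist = NormalDist(mu, sigma)
    upper = dist.cdf(detection_limit)  # P(X <= detection_limit)
    # u is uniform on [0, upper); clip away from 0 and 1 so inv_cdf is defined.
    # If upper is itself below the clip value, the draw is only approximate.
    u = min(max(rng.random() * upper, 1e-12), 1 - 1e-12)
    return dist.inv_cdf(u)


# Hypothetical cluster parameters and detection limit.
rng = random.Random(0)
draws = [impute_left_censored(2.0, 1.0, 1.5, rng) for _ in range(1000)]
```

Every draw falls at or below the detection limit, so an imputed value stays consistent with what was actually observed (namely, that the measurement was below the limit) while still reflecting the assumed cluster-specific distribution.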
