We propose a novel method for adaptive clustering with Dirichlet process mixture models (DPMMs) using collapsed variational inference (VI), incorporating weakly informative priors on the DP concentration parameter α and the base distribution G0. We illustrate the importance of the G0 covariance structure and prior choice by considering different parameterisations of the data covariance matrix. On high-dimensional Gaussian simulations, our model converges substantially faster than a state-of-the-art MCMC slice sampler. We further evaluate performance on Negative Binomial simulations and conduct sensitivity analyses to assess robustness under realistic data conditions. Applied to a publicly available leukemia transcriptomic data set comprising 72 samples and expression levels for 2,194 genes, the method recovers every known sub-type while also identifying additional gene expression-based sub-clusters with meaningful biological interpretation.