Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growing availability of high-dimensional categorical data, including 'omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of efficiency, while maintaining high accuracy. Furthermore, VICatMix performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We also demonstrate VICatMix's utility in integrative cluster analysis with different 'omics datasets, enabling the discovery of novel subtypes. \textbf{Availability:} VICatMix is freely available as an R package, incorporating C++ for faster computation, at https://github.com/j-ackierao/VICatMix.
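For orientation, a finite mixture model for categorical data of the kind described above is commonly written as follows. The notation here ($N$ samples, $P$ variables, $K$ components, $L_j$ categories for variable $j$, mixture weights $\pi_k$, and category probabilities $\phi_{kjl}$) is an assumption made for illustration and is not taken from the abstract; the exact priors and variable-selection indicators used in VICatMix may differ.

\[
p(x_n \mid \pi, \phi) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{P} \prod_{l=1}^{L_j} \phi_{kjl}^{\,\mathbb{I}(x_{nj} = l)},
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \sum_{l=1}^{L_j} \phi_{kjl} = 1 .
\]

In a variational Bayesian treatment of such a model, Dirichlet priors are typically placed on $\pi$ and on each $\phi_{kj}$, and a factorised approximation to the posterior is fitted by maximising the evidence lower bound (ELBO). Because that objective is non-convex, different initialisations can converge to different local optima, which is the problem the summarisation and model averaging step is intended to mitigate.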