Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growth in availability of high-dimensional categorical data, including 'omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of efficiency, while maintaining high accuracy. VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We further demonstrate VICatMix's utility in integrative cluster analysis with different 'omics datasets, enabling the discovery of novel subtypes. \textbf{Availability:} VICatMix is freely available as an R package, incorporating C++ for faster computation, at \url{https://github.com/j-ackierao/VICatMix}.