Healthcare data from patient or population cohorts are often characterized by sparsity, high missingness, and relatively small sample sizes. In addition, being able to quantify uncertainty is often important in a medical context. To address these analytical requirements, we propose a deep generative Bayesian model for multinomial count data. We develop a collapsed Gibbs sampling procedure that takes advantage of a series of augmentation relations, inspired by the Zhou–Cong–Chen model. We visualise the model's ability to identify coherent substructures in the data using a dataset of handwritten digits. We then apply it to a large experimental dataset of DNA mutations in cancer and show that we can identify biologically meaningful clusters of mutational signatures in a fully data-driven way.