We analyze large, multi-dimensional, sparse count data sets, using unsupervised grouping to gain insight into genetic data. We group genes and biological pathways based on patients' variants to find common risk factors for four common types of cancer (breast, lung, prostate, and colorectal) and for autism spectrum disorder. To accomplish this, we extend latent Dirichlet allocation to multiple dimensions and design distinct methods for hierarchical topic modeling. We find that our conditional hierarchical Bayesian Tucker decomposition models are more coherent than baseline models.