High-dimensional data analysis typically focuses on low-dimensional structure, often to aid interpretation and computational efficiency. Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data by representing variables as nodes and dependencies as edges. Inference is often focused on individual edges in the latent graph. Nonetheless, there is increasing interest in determining more complex structures, such as communities of nodes, for reasons that include more effective information retrieval and better interpretability. In this work, we propose a multilayer graphical model in which we first cluster nodes and then, at the second layer, investigate the relationships among groups of nodes. Specifically, nodes are partitioned into "supernodes" with a data-coherent size-biased tessellation prior, which combines ideas from Bayesian nonparametrics and Voronoi tessellations. This construct also accounts for dependence among nodes within supernodes. At the second layer, the dependence structure among supernodes is modelled through a Gaussian graphical model, where the focus of inference is on "superedges". We provide theoretical justification for our modelling choices. We design tailored Markov chain Monte Carlo schemes, which also enable parallel computations. We demonstrate the effectiveness of our approach for large-scale structure learning in simulations and a transcriptomics application.
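To make the two-layer idea concrete, the following is a minimal toy sketch and not the model proposed in this work. It assumes a deterministic nearest-centre (Voronoi-style) assignment in place of the size-biased tessellation prior, hand-picked centre nodes, block means as supernode summaries, and plug-in partial correlations in place of posterior inference on superedges. All names (`voronoi_partition`, `partial_correlations`, `centres`) are hypothetical.

```python
import numpy as np

def voronoi_partition(dist, centres):
    """Assign each node to its nearest centre node (a Voronoi-style cell)."""
    return np.argmin(dist[:, centres], axis=1)

def partial_correlations(samples):
    """Partial correlations computed from the inverse sample covariance."""
    precision = np.linalg.inv(np.cov(samples, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
n, p, k = 500, 12, 3

# Toy data (assumption): three independent latent factors,
# each driving a block of four observed nodes.
factors = rng.normal(size=(n, k))
true_labels = np.repeat(np.arange(k), p // k)
X = factors[:, true_labels] + 0.5 * rng.normal(size=(n, p))

# Layer 1: correlation-based distance, then nearest-centre assignment.
dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
centres = np.array([0, 4, 8])  # hypothetical centre nodes, fixed by hand
labels = voronoi_partition(dist, centres)

# Layer 2: one summary variable per supernode, then partial correlations
# among the summaries as a crude stand-in for superedge strength.
summaries = np.column_stack(
    [X[:, labels == c].mean(axis=1) for c in range(k)]
)
superedge_strength = partial_correlations(summaries)
```

In a Gaussian graphical model, a zero off-diagonal entry of the precision matrix corresponds to conditional independence between the two variables given the rest, which is why the second step reads edge strength from the inverse covariance. A fully Bayesian treatment would instead place priors on both the partition and the supernode-level graph and explore them by Markov chain Monte Carlo.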