High-dimensional and heterogeneous count data are collected in various applied fields. In this paper, we look closely at high-resolution sequencing data on the microbiome, which have enabled researchers to study the genomes of entire microbial communities. Revealing the underlying interactions between these communities is vital for learning how microbes influence human health. To perform structural learning from multivariate count data such as these, we develop a novel Gaussian copula graphical model with two key elements. First, we employ parametric regression to characterize the marginal distributions. This step is crucial for accommodating the impact of external covariates; neglecting this adjustment could distort the inference of the underlying network of dependencies. Second, we advance a Bayesian structure learning framework, based on a computationally efficient search algorithm that is suited to high dimensionality. The approach returns simultaneous inference of the marginal effects and of the dependence structure, including graph uncertainty estimates. A simulation study and a real data analysis of microbiome data highlight the applicability of the proposed approach in inferring networks from multivariate count data in general, and its relevance to microbiome analyses in particular. The proposed method is implemented in the R package BDgraph.
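To make the role of the covariate-adjusted marginals concrete, the following is a minimal sketch and not the implementation in BDgraph. It assumes Poisson regression marginals with a single binary covariate, maps each count to a latent Gaussian score through the midpoint of its CDF interval, and compares the raw correlation of two taxa with the correlation of their adjusted latent scores. The Poisson choice, the midpoint shortcut, the two-step (rather than simultaneous) estimation, the toy data, and all function names are assumptions made for illustration only; the Bayesian graph search itself is not reproduced here.

```python
import math
from statistics import NormalDist


def fit_poisson_regression(x, y, iters=100, tol=1e-10):
    """Newton-Raphson fit of log E[y] = b0 + b1 * x; returns (b0, b1)."""
    b0 = math.log(max(sum(y) / len(y), 1e-8))
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information (2 x 2), inverted in closed form.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for j in range(1, k + 1):
        term *= mu / j
        total += term
    return min(total, 1.0)


def latent_scores(x, y):
    """Map counts to Gaussian scores using covariate-adjusted Poisson margins.

    Simplification (an assumption of this sketch): each count is replaced by
    the normal quantile of the midpoint of [F(y - 1), F(y)].
    """
    b0, b1 = fit_poisson_regression(x, y)
    std_normal = NormalDist()
    scores = []
    for xi, yi in zip(x, y):
        mu = math.exp(b0 + b1 * xi)
        u = 0.5 * (poisson_cdf(yi - 1, mu) + poisson_cdf(yi, mu))
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf in its domain
        scores.append(std_normal.inv_cdf(u))
    return scores


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Toy data (hypothetical): a binary covariate raises both taxa, while
# within each covariate group the two taxa move in opposite directions.
group = [0, 0, 0, 0, 1, 1, 1, 1]
taxon_a = [1, 3, 2, 4, 9, 11, 10, 12]
taxon_b = [3, 1, 4, 2, 11, 9, 12, 10]

b0_a, b1_a = fit_poisson_regression(group, taxon_a)
raw_corr = pearson(taxon_a, taxon_b)
adj_corr = pearson(latent_scores(group, taxon_a), latent_scores(group, taxon_b))

print("raw correlation of counts:", round(raw_corr, 3))
print("correlation of adjusted latent scores:", round(adj_corr, 3))
```

In this toy setting the raw counts are strongly positively correlated only because both taxa respond to the covariate. Once the margins are adjusted, the latent scores show the negative within-group association instead, which is the kind of distortion the marginal regression step is meant to prevent.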