The Dirichlet-multinomial (DM) distribution plays a fundamental role in the development and application of modern statistical methodology. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research, because they accommodate both the compositional structure of the data and overdispersion. A major limitation of the DM distribution is that it cannot handle the excess zeros typically found in practice, which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection in high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome data set compare the performance of the proposed method with that of existing approaches. We provide an accompanying R package with a user-friendly vignette for applying our method to other data sets.
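As a rough illustration of why excess zeros matter, the sketch below simulates counts from a generic zero-inflated Dirichlet-multinomial construction. It is not the model specified in the paper: the function name `sample_zidm`, the per-category structural-zero probability `zero_prob`, and the choice to build the Dirichlet from normalized gamma draws restricted to "at-risk" categories are all assumptions made for this example.

```python
import numpy as np


def sample_zidm(n_samples, depth, alpha, zero_prob, rng):
    """Draw counts from an illustrative zero-inflated Dirichlet-multinomial.

    Hypothetical construction (an assumption, not the paper's specification):
    each category is a structural zero with probability `zero_prob`; the
    remaining categories get Dirichlet weights via normalized gamma draws.
    """
    alpha = np.asarray(alpha, dtype=float)
    n_categories = alpha.size
    counts = np.zeros((n_samples, n_categories), dtype=int)
    for i in range(n_samples):
        # at_risk[j] is False when category j is a structural zero in sample i
        at_risk = rng.random(n_categories) >= zero_prob
        if not at_risk.any():
            # keep at least one category so the composition is well defined
            at_risk[rng.integers(n_categories)] = True
        # Dirichlet weights as normalized gammas, zeroed outside the at-risk set
        gammas = np.where(at_risk, rng.gamma(shape=alpha, scale=1.0), 0.0)
        probs = gammas / gammas.sum()
        counts[i] = rng.multinomial(depth, probs)
    return counts


rng = np.random.default_rng(0)
alpha = np.full(10, 2.0)
plain = sample_zidm(200, 1000, alpha, zero_prob=0.0, rng=rng)
inflated = sample_zidm(200, 1000, alpha, zero_prob=0.5, rng=rng)
# Compare the fraction of zero counts with and without zero inflation
print((plain == 0).mean(), (inflated == 0).mean())
```

With `zero_prob=0.0` the sketch reduces to an ordinary DM draw, so comparing the two zero fractions gives a sense of how many zeros a standard DM would leave unexplained under these assumed settings.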