Count-compositional data arise in many fields, including high-throughput sequencing experiments, ecological surveys, and palaeoclimate studies, where a common and important goal is to understand how covariates relate to the observed compositions. Existing methods often fail to simultaneously address the key challenges inherent in such data, namely overdispersion, an excess of zeros, cross-sample heterogeneity, and complex covariate effects. To address these challenges, we propose two novel Bayesian models based on ensembles of regression trees. Specifically, we leverage the recently introduced zero-and-$N$-inflated multinomial distribution and assign independent nonparametric Bayesian additive regression tree (BART) priors to both the compositional and the structural zero probability components of the model, so that covariate effects are captured flexibly. We further extend this model by adding latent random effects to capture overdispersion and more general dependence structures among the categories. We develop an efficient inferential algorithm that combines recent data augmentation schemes with established BART sampling routines. We evaluate the proposed models in simulation studies and illustrate their applicability through a case study in palaeoclimate modelling.