In ecology, describing species composition and biodiversity requires statistical methods that estimate features of interest in unobserved samples from an observed one. Over the last decade, the Bayesian nonparametrics literature has thoroughly investigated the case where data arise from a homogeneous population. In this work, we propose a novel framework for heterogeneous populations, specifically for scenarios where data arise from two areas. This setting significantly increases the mathematical complexity of the problem and, as a consequence, has received limited attention in the literature. While early approaches rely on computational methods, we provide a distributional theory for the in-sample analysis of any observed sample and enable out-of-sample prediction of the number of unseen distinct and shared species in additional samples of arbitrary sizes. The latter also extends existing frequentist estimators, which deal solely with one-step-ahead prediction. Furthermore, our results can be applied to sample size determination in sampling problems aimed at detecting distinct and shared species. We illustrate our results on a real-world dataset concerning a population of ants in the city of Trieste.