In ecology, describing species composition and biodiversity calls for statistical methods that estimate features of interest in unobserved samples based on an observed one. In the last decade, the Bayesian nonparametrics literature has thoroughly investigated the case where data arise from a homogeneous population. In this work, we propose a novel framework to address heterogeneous populations, specifically dealing with scenarios where data arise from two areas. This setting significantly increases the mathematical complexity of the problem and, as a consequence, it has received limited attention in the literature. While earlier approaches rely on computational methods, we provide a distributional theory for the in-sample analysis of any observed sample and enable out-of-sample prediction of the number of unseen distinct and shared species in additional samples of arbitrary sizes. The latter also extends the frequentist estimators, which deal solely with one-step-ahead prediction. Furthermore, our results can be applied to sample size determination in sampling problems aimed at detecting distinct and shared species. We illustrate our results on a real-world dataset concerning a population of ants in the city of Trieste.