Species-sampling problems (SSPs) form a broad class of statistical problems that call for estimating (discrete) functionals of the unknown species composition of an unobservable population. A common feature of SSPs is their invariance with respect to species labelling, which is at the core of the Bayesian nonparametric (BNP) approach to SSPs under the popular Pitman-Yor process (PYP) prior. In this paper, we develop a BNP approach to SSPs that are not "invariant" to species labelling, in the sense that an ordering or ranking is assigned to species' labels. Inspired by the population genetics literature on age-ordered allele compositions, we study the following SSP with ordering: given an observable sample from an unknown population of individuals belonging to species (alleles), with species' labels ordered according to weights (ages), estimate the frequencies of the first $r$ order species' labels in an enlarged sample obtained by including additional unobservable samples. Relying on an ordered PYP prior, we obtain an explicit posterior distribution of the first $r$ order frequencies, with estimates that are easy to implement and computationally efficient. We apply our approach to the analysis of genetic variation, showing its effectiveness in estimating the frequency of the oldest allele, and then discuss other potential applications.