Species-sampling problems (SSPs) form a broad class of statistical problems that call for estimating (discrete) functionals of the unknown species composition of an unobservable population. A common feature of SSPs is their invariance with respect to species labeling, which is at the core of the Bayesian nonparametric (BNP) approach to SSPs under the popular Pitman-Yor process (PYP) prior. In this paper, we develop a BNP approach to SSPs that are not invariant to species labeling, in the sense that an ordering or ranking is assigned to the species' labels. Inspired by the population genetics literature on compositions of age-ordered alleles, we study the following SSP with ordering: given an observable sample from an unknown population of individuals belonging to species (alleles), with species' labels ordered according to weights (ages), estimate the frequencies of the first $r$ ordered species' labels in an enlarged sample obtained by including additional unobservable samples. By relying on an ordered PYP prior, we obtain an explicit posterior distribution of the first $r$ ordered frequencies, with estimates that are easy to implement and computationally efficient. We apply our approach to the analysis of genetic variation, showing its effectiveness in estimating the frequency of the oldest allele, and we discuss other potential applications.