Probability proportional to size (PPS) sampling schemes with a target sample size aim to produce a sample comprising a specified number $n$ of items while ensuring that each item in the population appears in the sample with a probability proportional to its specified "weight" (also called its "size"). These two objectives, however, cannot always be achieved simultaneously. Existing PPS schemes prioritize control of the sample size, violating the PPS property if necessary. We provide a new PPS scheme that allows a different trade-off: our method enforces the PPS property at all times while ensuring that the sample size never exceeds the target value $n$. The sample size is exactly equal to $n$ if possible, and otherwise has maximal expected value and minimal variance. Thus we bound the sample size, thereby avoiding storage overflows and helping to control the time required for analytics over the sample, while allowing the user complete control over the sample contents. The method is both simple to implement and efficient, being a one-pass streaming algorithm with an amortized processing time of $O(1)$ per item.
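The conflict between the two objectives can be seen with a little arithmetic. A fixed-size PPS sample of $n$ items would need inclusion probabilities $n w_i / W$, where $W$ is the total weight, and these exceed 1 whenever a single item carries more than a $1/n$ share of $W$. The sketch below is an offline illustration only, not the one-pass streaming algorithm the abstract describes. The function names are hypothetical, the weights are assumed to be positive integers, and the scaling constant `rho` is our reading of the stated constraints (strict proportionality, every probability at most 1, expected size at most $n$).

```python
from fractions import Fraction

def ideal_pps_probabilities(weights, n):
    """Inclusion probabilities n * w_i / W that a fixed-size PPS sample would need."""
    total = sum(weights)
    return [Fraction(n * w, total) for w in weights]

def bounded_pps_probabilities(weights, n):
    """Largest strictly proportional probabilities with each value <= 1
    and expected sample size <= n (assumed reading of the constraints)."""
    total = sum(weights)
    # rho is the proportionality constant: capped by n / W so the expected
    # size stays <= n, and by 1 / w_max so no probability exceeds 1.
    rho = min(Fraction(n, total), Fraction(1, max(weights)))
    return [rho * w for w in weights]

weights = [1, 1, 1, 7]   # one heavy item holds 70% of the total weight
n = 2                    # target sample size

ideal = ideal_pps_probabilities(weights, n)
bounded = bounded_pps_probabilities(weights, n)
expected_size = sum(bounded)

print("ideal:", [str(p) for p in ideal])      # the heavy item would need 7/5 > 1
print("bounded:", [str(p) for p in bounded])  # still proportional, all <= 1
print("expected size:", expected_size)        # below the target n
```

With these weights, exact PPS and a sample size of exactly 2 cannot hold together. Keeping the probabilities proportional forces the expected sample size below the target, which is the trade-off the abstract describes.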