Let $S$ be a finite set and let $X_1,\ldots,X_n$ be an i.i.d. uniform sample from $S$. To estimate the size $|S|$ without further structure, one can wait for repeats and use the birthday problem; this requires a sample size of order $|S|^{1/2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the sample maximum blown up by the factor $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed, covering estimation of the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression-style predictors. A general theorem gives non-parametric, finite-$n$ error bounds in all cases.
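As a rough illustration of the two extremes described above (not of the paper's refined estimators), the following minimal Python sketch may help. The function names `collision_estimate` and `max_estimate` are hypothetical. The collision-count form is a standard moment-matching version of the birthday idea and is an assumption here, not something taken from the paper: among $n$ uniform draws, the expected number of coinciding pairs is $\binom{n}{2}/|S|$.

```python
import random
from collections import Counter


def collision_estimate(sample):
    """Birthday-style estimate of |S| from repeated values (assumed form).

    Each of the n(n-1)/2 pairs coincides with probability 1/|S|, so
    matching the observed number of coinciding pairs to its expectation
    gives |S| ~ n(n-1) / (2 * collisions).
    """
    n = len(sample)
    counts = Counter(sample)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        return None  # no repeats yet: the sample is too small to estimate
    return n * (n - 1) / (2 * collisions)


def max_estimate(sample):
    """Estimate |S| when S = {1, ..., |S|}: sample maximum times n/(n-1)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    return max(sample) * n / (n - 1)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is reproducible
    true_size = 10_000      # hypothetical population size for the demo
    draws = [rng.randint(1, true_size) for _ in range(500)]
    print("collision estimate:", collision_estimate(draws))
    print("max estimate:      ", max_estimate(draws))
```

The collision estimate returns `None` until a repeat appears, which reflects the $|S|^{1/2}$ sample-size requirement; the maximum-based estimate is usable for any $n \ge 2$ but relies on the assumed structure $S=\{1,\ldots,|S|\}$.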