We consider the problem of testing the fit of a sample of items from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence for $p \leq 2$. When the number of samples $n$ and the number of categories $N$ go to infinity while $\epsilon$ goes to zero, the minimax risk $R_\epsilon^*$ in testing based on the sample's histogram (number of absent categories, singletons, collisions, ...) is asymptotic to $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$, where $\Phi(x)$ is the standard normal CDF. This characterization allows comparing the many tests previously proposed for this problem at the level of constants rather than only the rate of convergence of their risks. The minimax test relies mostly on collisions when $n/N$ is small, and otherwise behaves like the chi-squared test. Empirical studies over a range of problem parameters show that this asymptotic expression for the risk is accurate in finite samples and that our test is significantly better than the chi-squared test or a test that uses only collisions. Our analysis relies on the asymptotic normality of histogram ordinates, the equivalence between the minimax setting and a Bayesian setting, and the characterization of the least favorable prior by reducing a multi-dimensional optimization problem to a one-dimensional problem.
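To make the risk expression concrete, the following is a minimal sketch that evaluates $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$ for given $n$, $N$, $\epsilon$, and $p$, and computes the histogram ordinates of a sample (the number of categories observed exactly $m$ times). The function names `asymptotic_minimax_risk` and `histogram_ordinates` are hypothetical and chosen for illustration only; the sketch evaluates the asymptotic formula as stated and is not the minimax test itself.

```python
import math
from collections import Counter


def normal_cdf(x):
    """Standard normal CDF, expressed through the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def asymptotic_minimax_risk(n, N, eps, p):
    """Evaluate 2 * Phi(-n * N^(2 - 2/p) * eps^2 / sqrt(8 N)).

    n: number of samples, N: number of categories,
    eps: radius of the removed l_p ball, p: norm index with 0 < p <= 2.
    """
    if not 0 < p <= 2:
        raise ValueError("the formula is stated for p <= 2 (p > 0 assumed here)")
    signal = n * N ** (2.0 - 2.0 / p) * eps ** 2 / math.sqrt(8.0 * N)
    return 2.0 * normal_cdf(-signal)


def histogram_ordinates(sample, N):
    """Map m -> number of categories that appear exactly m times in the sample.

    m = 0 counts absent categories and m = 1 counts singletons.
    Categories are assumed to be hashable labels drawn from N possible values.
    """
    per_category = Counter(sample)              # occurrences of each category
    ordinates = Counter(per_category.values())  # categories per occurrence count
    ordinates[0] = N - len(per_category)        # categories never observed
    return dict(ordinates)


if __name__ == "__main__":
    print(asymptotic_minimax_risk(n=1000, N=100, eps=0.01, p=2))
    print(histogram_ordinates([0, 0, 1, 3], N=5))
```

With $\epsilon = 0$ the expression equals 1, the trivial risk of a test that cannot separate the hypotheses, and it decreases toward 0 as $n N^{2-2/p} \epsilon^2/\sqrt{8N}$ grows.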