We consider the problem of testing the fit of a discrete sample of items from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence for $p \leq 2$. We provide a sharp characterization of the asymptotic minimax risk when $\epsilon \to 0$ as the number of samples and the number of dimensions go to infinity, for tests based on the histogram of occurrences (the number of absent categories, singletons, collisions, ...). For example, for $p=1$ and in the limit of a small expected number of samples $n$ compared to the number of categories $N$ (the so-called "sub-linear" regime), the minimax risk $R^*_\epsilon$ is asymptotically $2 \bar{\Phi}\left(n \epsilon^2/\sqrt{8N}\right)$, where $\bar{\Phi}(x)$ is the normal survival function. Empirical studies over a range of problem parameters show that this estimate is accurate in finite samples, and that our test significantly outperforms the chi-squared test and a test that uses only collisions. Our analysis is based on the asymptotic normality of histogram ordinates, the equivalence between the minimax setting and a Bayesian one, and the reduction of a multi-dimensional optimization problem to a one-dimensional problem.
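As a rough illustration of the two objects named above, the following Python sketch computes the histogram of occurrences of a sample and evaluates the stated $p=1$ sub-linear approximation $2 \bar{\Phi}\left(n \epsilon^2/\sqrt{8N}\right)$. The function names and the toy sample are hypothetical and chosen only for this example; the sketch does not implement the proposed test itself, and it assumes $\bar{\Phi}$ is the survival function of the standard normal distribution.

```python
import math
from collections import Counter


def occurrence_histogram(sample, num_categories):
    """Map each multiplicity m to the number of categories observed exactly m times.

    Key 0 counts absent categories, key 1 singletons, key 2 pairwise collisions, etc.
    (Hypothetical helper; categories are assumed to be hashable labels.)
    """
    counts = Counter(sample)            # occurrences per observed category
    hist = Counter(counts.values())     # how many categories share each multiplicity
    hist[0] = num_categories - len(counts)  # categories that never appeared
    return dict(hist)


def normal_survival(x):
    """Standard normal survival function, P(Z > x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def asymptotic_minimax_risk_l1(n, N, eps):
    """Evaluate 2 * Phi_bar(n * eps^2 / sqrt(8 N)), the p = 1 sub-linear formula quoted above.

    n is the expected number of samples, N the number of categories,
    and eps the radius of the removed l_1 ball.
    """
    return 2.0 * normal_survival(n * eps ** 2 / math.sqrt(8.0 * N))


# Toy usage: 4 draws over 5 categories; categories 2 and 4 are absent.
toy_hist = occurrence_histogram([0, 0, 1, 3], num_categories=5)
risk = asymptotic_minimax_risk_l1(n=1000, N=10000, eps=0.5)
```

At $\epsilon = 0$ the formula evaluates to 1, and it decreases as $n \epsilon^2 / \sqrt{N}$ grows, which matches the intuition that alternatives farther from uniform are easier to detect.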