We consider the problem of testing the fit of a discrete sample of items from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence, for $p \leq 2$. We deliver a sharp characterization of the asymptotic minimax risk when $\epsilon \to 0$ as the number of samples and the number of dimensions go to infinity, for tests based on the histogram of occurrences (number of absent categories, singletons, collisions, ...). For example, for $p=1$ and in the limit where the expected number of samples $n$ is small compared to the number of categories $N$ (the "sub-linear" regime), the minimax risk $R^*_\epsilon$ asymptotes to $2 \bar{\Phi}\left(n \epsilon^2/\sqrt{8N}\right)$, where $\bar{\Phi}(x)$ is the normal survival function. Empirical studies over a range of problem parameters show that this estimate is accurate in finite samples, and that our test is significantly better than the chi-squared test or a test that only uses collisions. Our analysis is based on the asymptotic normality of histogram ordinates, the equivalence between the minimax setting and a Bayesian one, and the reduction of a multi-dimensional optimization problem to a one-dimensional problem.
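As a rough illustration of the two quantities named in the abstract, the following Python sketch evaluates the quoted $p=1$ sub-linear-regime expression $2 \bar{\Phi}\left(n \epsilon^2/\sqrt{8N}\right)$ and computes the histogram of occurrences of a sample. The function names, and the convention that categories are labeled $0, \dots, N-1$, are assumptions made here for illustration; this is not the test statistic or the implementation from the paper.

```python
import math
from collections import Counter


def normal_sf(x):
    """Standard normal survival function, P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def asymptotic_risk_l1(n, N, eps):
    """Evaluate the quoted p = 1 limit 2 * sf(n * eps^2 / sqrt(8 N)).

    n: expected number of samples, N: number of categories,
    eps: radius of the removed l_1 ball. (Hypothetical helper name.)
    """
    return 2.0 * normal_sf(n * eps ** 2 / math.sqrt(8.0 * N))


def occurrence_histogram(sample, N):
    """Map m -> number of categories observed exactly m times.

    Assumes every item of `sample` is an integer label in range(N).
    Key 0 counts absent categories, key 1 singletons, key 2 pairs, etc.
    """
    per_category = Counter(sample)            # label -> number of occurrences
    hist = Counter(per_category.values())     # m -> number of labels seen m times
    hist[0] = N - len(per_category)           # categories never observed
    return dict(hist)


if __name__ == "__main__":
    # Sub-linear example: n much smaller than N.
    print(asymptotic_risk_l1(n=10_000, N=1_000_000, eps=0.5))
    print(occurrence_histogram([0, 0, 1, 3], N=5))
```

At $\epsilon = 0$ the expression evaluates to 1, the risk of a trivial test, and it decreases as $n \epsilon^2 / \sqrt{N}$ grows.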