Estimating the underlying distribution from \textit{iid} samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of the probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing estimator for the missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Applications of special cases such as the order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating the distance of the missing distribution from uniformity and partially estimating the missing distribution. Minimax estimation is studied for the order-$\alpha$ missing mass for integer values of $\alpha$, and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$, and specific results are derived for the order-$\alpha$ missing mass and the missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to yield right tail bounds that are better than those obtained from sub-Gaussian concentration.
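The definitions above can be illustrated with a minimal sketch, assuming a small known distribution and a hand-picked sample. The function names (`missing_g_mass`, `good_turing_missing_mass`), the toy alphabet, and the use of the natural logarithm for the Shannon term are assumptions made for illustration; they are not taken from the article. The Good-Turing estimate is taken in its classical form, the fraction of samples whose letter appears exactly once.

```python
import math
from collections import Counter


def missing_g_mass(probs, samples, g):
    """Sum of g(Pr(x)) over the letters x that never appear in `samples`."""
    seen = set(samples)
    return sum(g(p) for x, p in probs.items() if x not in seen)


def good_turing_missing_mass(samples):
    """Classical Good-Turing estimate: (letters seen exactly once) / n."""
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)


def shannon_term(p):
    """g(p) = -p log p with natural log (assumed base); g(0) = 0 by convention."""
    return 0.0 if p == 0 else -p * math.log(p)


# Hypothetical toy distribution and sample; letters "c" and "d" are missing.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
samples = ["a", "b", "a", "a"]

missing_mass = missing_g_mass(probs, samples, lambda p: p)         # g(p) = p
order2_mass = missing_g_mass(probs, samples, lambda p: p ** 2)     # g(p) = p^2
missing_entropy = missing_g_mass(probs, samples, shannon_term)     # g(p) = -p log p
gt_estimate = good_turing_missing_mass(samples)

print(missing_mass, order2_mass, missing_entropy, gt_estimate)
```

In this toy case the true missing mass and the Good-Turing estimate happen to coincide; that is a property of the chosen sample, not a general guarantee. Unlike the Good-Turing estimate, the missing $g$-mass computed here uses the true probabilities, so it is the target quantity rather than an estimator of it.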