We study the problem of estimating the stationary mass -- also called the unigram mass -- that is missing from a single trajectory of a discrete-time, ergodic Markov chain. This problem has several applications -- for example, estimating the stationary missing mass is critical for accurately smoothing probability estimates in sequence models. While the classical Good--Turing estimator from the 1950s has appealing properties for i.i.d. data, it is known to be biased in the Markovian setting, and other heuristic estimators do not come equipped with guarantees. Operating in the general setting in which the size of the state space may be much larger than the length $n$ of the trajectory, we develop a linear-runtime estimator called Windowed Good--Turing (WingIt) and show that its risk decays as $\widetilde{O}(\mathsf{T_{mix}}/n)$, where $\mathsf{T_{mix}}$ denotes the mixing time of the chain in total variation distance. Notably, this rate is independent of the size of the state space and minimax-optimal up to a logarithmic factor in $n / \mathsf{T_{mix}}$. We also present an upper bound on the variance of the missing mass random variable, which may be of independent interest. We extend our estimator to approximate the stationary mass placed on elements occurring with small frequency in the trajectory. Finally, we demonstrate the efficacy of our estimators both in simulations on canonical chains and on sequences constructed from natural language text.
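For orientation, the classical Good--Turing baseline mentioned above can be sketched in a few lines. This is a minimal illustration of the i.i.d. estimator (the fraction of the sample made up of symbols seen exactly once), not the WingIt estimator, whose construction is not given in this abstract. The function names, the toy trajectory, and the stationary distribution `pi` are all hypothetical, chosen only to make the example self-contained.

```python
from collections import Counter


def good_turing_missing_mass(sequence):
    """Classical Good-Turing estimate of the missing mass.

    Returns (number of distinct symbols seen exactly once) / n.
    This is the i.i.d. baseline, not a Markov-aware estimator.
    """
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def true_missing_mass(sequence, stationary):
    """Stationary mass of the states that never appear in the sequence.

    `stationary` is an assumed dict mapping state -> stationary probability;
    in practice this distribution is unknown, which is why an estimator is needed.
    """
    seen = set(sequence)
    return sum(p for state, p in stationary.items() if state not in seen)


# Toy data (illustrative only): a short trajectory over five states.
trajectory = ["a", "a", "b", "a", "c", "c", "d"]
pi = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1}

estimate = good_turing_missing_mass(trajectory)  # "b" and "d" occur once
actual = true_missing_mass(trajectory, pi)       # only "e" is unseen

print(f"Good-Turing estimate: {estimate:.4f}")
print(f"True missing mass:    {actual:.4f}")
```

On a trajectory with strong temporal dependence, such as a chain that lingers in each state, the singleton count is no longer a reliable proxy for unseen mass. That is the bias in the Markovian setting that the abstract refers to.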