How good is Good-Turing for Markov samples?

The Good-Turing (GT) estimator for the missing mass (i.e., the total probability of symbols that do not appear) in $n$ samples is the number of symbols that appeared exactly once, divided by $n$. For i.i.d. samples, the bias and squared-error risk of the GT estimator can be shown to fall as $1/n$ by bounding the expected error uniformly over all symbols. In this work, we study convergence of the GT estimator for the missing stationary mass (i.e., the total stationary probability of missing symbols) of Markov samples on an alphabet $\mathcal{X}$ with stationary distribution $[\pi_x:x \in \mathcal{X}]$ and transition probability matrix (t.p.m.) $P$. The problem matters because GT is widely used in applications with temporal dependencies, such as language models that assign probabilities to word sequences, which are modelled as Markov chains. We show that convergence of GT depends on convergence of $(P^{\sim x})^n$, where $P^{\sim x}$ is $P$ with the $x$-th column zeroed out. This, in turn, depends on the Perron eigenvalue $\lambda^{\sim x}$ of $P^{\sim x}$ and on its relationship with $\pi_x$, uniformly over $x$. For randomly generated t.p.m.s and for t.p.m.s derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-$x$ relationships between $\lambda^{\sim x}$ and $\pi_x$. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.m.s having spectral gap $\beta$, we show minimax rate upper and lower bounds of $1/(n\beta^5)$ and $1/(n\beta)$, respectively, for the estimation of stationary missing mass. This theoretical result extends the $1/n$ minimax rate for i.i.d. samples (rank-1 t.p.m.s) to rank-2 Markov chains, and is the first such minimax rate result for the missing mass of Markov samples.
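
To make the quantities above concrete, the following is a minimal illustrative sketch, not the authors' code. It computes the GT estimate, the true missing stationary mass, and the Perron eigenvalue $\lambda^{\sim x}$ of $P^{\sim x}$ for a small chain. All function names, the alphabet size, the sample length, and the random dense t.p.m. are assumptions made for illustration; NumPy is assumed to be available.

```python
import numpy as np


def good_turing_missing_mass(samples):
    """GT estimate: (number of symbols seen exactly once) / n."""
    samples = np.asarray(samples)
    _, counts = np.unique(samples, return_counts=True)
    return np.sum(counts == 1) / samples.size


def stationary_distribution(P):
    """Left eigenvector of P for its largest eigenvalue (1), normalised."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return v / v.sum()


def missing_stationary_mass(samples, pi):
    """Total stationary probability of symbols absent from the sample."""
    seen = np.zeros(pi.size, dtype=bool)
    seen[np.asarray(samples)] = True
    return pi[~seen].sum()


def perron_eigenvalue_without(P, x):
    """Perron eigenvalue of P^{~x}, i.e. P with its x-th column zeroed.

    For a nonnegative matrix the spectral radius is itself an eigenvalue,
    so the largest eigenvalue modulus is the Perron eigenvalue.
    """
    P_x = P.copy()
    P_x[:, x] = 0.0
    return np.max(np.abs(np.linalg.eigvals(P_x)))


def simulate_chain(P, n, rng):
    """Draw n Markov samples, starting from the stationary distribution."""
    k = P.shape[0]
    states = np.empty(n, dtype=int)
    states[0] = rng.choice(k, p=stationary_distribution(P))
    for t in range(1, n):
        states[t] = rng.choice(k, p=P[states[t - 1]])
    return states


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, n = 200, 300  # hypothetical alphabet size and sample length
    P = rng.random((k, k))
    P /= P.sum(axis=1, keepdims=True)  # make each row a distribution
    pi = stationary_distribution(P)
    samples = simulate_chain(P, n, rng)
    print("GT estimate:            ", good_turing_missing_mass(samples))
    print("missing stationary mass:", missing_stationary_mass(samples, pi))
    print("lambda^{~0}, 1 - pi_0:  ", perron_eigenvalue_without(P, 0), 1 - pi[0])
```

As a sanity check on the last function: in the i.i.d. (rank-1) case every row of $P$ equals $\pi$, so $P^{\sim x}$ is itself rank-1 and $\lambda^{\sim x} = 1 - \pi_x$ exactly. This is the simplest instance of a uniform-over-$x$ relationship between $\lambda^{\sim x}$ and $\pi_x$.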
