Empirical Lossless Compression Bound of a Data Sequence

We consider the lossless compression bound of any individual data sequence. If we fit the data by a parametric model, the entropy quantity $nH({\hat \theta}_n)$ obtained by plugging in the maximum likelihood estimate underestimates the bound, where $n$ is the number of words. Shtarkov showed that the normalized maximum likelihood (NML) distribution, or equivalently its code length, is optimal in a minimax sense for any parametric family. Using local asymptotic normality, we show that the NML code length for the exponential families is $nH(\hat \theta_n) +\frac{d}{2}\log \, \frac{n}{2\pi} +\log \int_{\Theta} |I(\theta)|^{1/2}\, d\theta+o(1)$, where $d$ is the model dimension or dictionary size, and $|I(\theta)|$ is the determinant of the Fisher information matrix. We also demonstrate that sequentially predicting the optimal code length for the next word via a Bayesian mechanism leads to the mixture code, whose pathwise length is given by $nH({\hat \theta}_n) +\frac{d}{2}\log \, \frac{n}{2\pi} +\log \frac{|\, I({\hat \theta}_n)|^{1/2}}{w({\hat \theta}_n)}+o(1) $, where $w(\theta)$ is a prior. The asymptotics apply not only to discrete symbols but also to continuous data, provided that the code length for the former is replaced by the description length for the latter. We illustrate the analytical result by calculating compression bounds of protein-encoding DNA sequences under different parsing models. Typically, the highest compression is achieved when the parsing is in phase with the amino acid codons. In contrast, the compression rates of pseudo-random sequences are larger than 1 regardless of the parsing model. These model-based results are consistent with the assertion of Kolmogorov complexity theory that random sequences are incompressible. The empirical lossless compression bound is particularly accurate when the dictionary size is relatively large.
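As a rough illustration of the NML formula above, the following Python sketch evaluates it for one specific case: words treated as independent draws from a multinomial over a dictionary of $k$ words. Several choices here are assumptions made for the example and are not taken from the paper: the function names `parse_words` and `nml_bound_bits` are hypothetical, $d$ is taken as $k-1$ free parameters, the $o(1)$ term is dropped, and the Fisher information integral uses the standard closed form for the multinomial, $\int_{\Theta} |I(\theta)|^{1/2}\, d\theta = \pi^{k/2}/\Gamma(k/2)$.

```python
import math
from collections import Counter


def parse_words(seq, word_len=3, phase=0):
    """Split seq into non-overlapping words of length word_len, starting at offset phase.

    Trailing symbols that do not fill a complete word are discarded.
    """
    end = len(seq) - (len(seq) - phase) % word_len
    return [seq[i:i + word_len] for i in range(phase, end, word_len)]


def nml_bound_bits(words, dictionary_size):
    """Approximate NML code length in bits for an i.i.d. multinomial word model.

    Assumption: d = dictionary_size - 1 free parameters, and the o(1) term is ignored.
    """
    n = len(words)
    counts = Counter(words)
    # n * H(theta_hat) in nats: plug-in entropy at the maximum likelihood estimate.
    n_entropy = -sum(c * math.log(c / n) for c in counts.values())
    k = dictionary_size
    d = k - 1
    # log of the integral of |I(theta)|^{1/2} over the simplex: (k/2) log(pi) - log Gamma(k/2).
    log_fisher_integral = 0.5 * k * math.log(math.pi) - math.lgamma(0.5 * k)
    nats = n_entropy + 0.5 * d * math.log(n / (2 * math.pi)) + log_fisher_integral
    return nats / math.log(2)  # convert nats to bits


# Usage: single-nucleotide parsing of a perfectly balanced toy sequence.
words = parse_words("ACGT" * 250, word_len=1)
bound = nml_bound_bits(words, dictionary_size=4)
print(f"plug-in entropy: 2000.00 bits, NML bound: {bound:.2f} bits")
```

In this toy case the plug-in entropy is exactly 2 bits per nucleotide, and the two correction terms push the bound slightly above it, in line with the statement that $nH({\hat \theta}_n)$ underestimates the bound. Calling `parse_words` with `word_len=3` and `phase` set to 0, 1, or 2 gives one simple way to compare parsing models on a coding sequence; with a 64-word dictionary the approximation presumably needs a much longer sequence before the dropped $o(1)$ term becomes negligible.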
