The Sample Complexity of Lossless Data Compression

A new framework is introduced for examining and evaluating the fundamental limits of lossless data compression, one that emphasizes genuinely non-asymptotic results. The {\em sample complexity} of compressing a given source is defined as the smallest blocklength at which it is possible to compress that source at a specified rate and to within a specified excess-rate probability. This formulation parallels corresponding developments in statistics and computer science, and it facilitates the use of existing results on the sample complexity of various hypothesis testing problems. For arbitrary sources, the sample complexity of general variable-length compressors is shown to be tightly coupled with the sample complexity of prefix-free codes and fixed-length codes. For memoryless sources, it is shown that the sample complexity is characterized not by the source entropy, but by its Rényi entropy of order~$1/2$. Non-asymptotic bounds on the sample complexity are obtained, with explicit constants. Generalizations to Markov sources are established, showing that the sample complexity is determined by the source's Rényi entropy rate of order~$1/2$. Finally, bounds on the sample complexity of universal data compression are developed for arbitrary families of memoryless sources. There, the sample complexity is characterized by the minimum Rényi divergence of order~$1/2$ between elements of the family and the uniform distribution. The connection of this problem with identity testing, and with the associated separation rates, is also explored.
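
For orientation, the two order-$1/2$ quantities named above can be written out using their standard textbook definitions. This is a reference sketch only: the symbols $P$ (a distribution on a finite alphabet $A$) and $U$ (the uniform distribution on $A$) are notation assumed here, not taken from the abstract, and the base of the logarithm is left unspecified.
\[
H_{1/2}(P) \;=\; 2\log \sum_{x\in A} \sqrt{P(x)},
\qquad
D_{1/2}(P\,\|\,U) \;=\; -2\log \sum_{x\in A} \sqrt{P(x)\,U(x)} \;=\; \log|A| \;-\; H_{1/2}(P).
\]
Under these definitions, $H_{1/2}(P)$ is never smaller than the Shannon entropy $H(P)$, with equality exactly when $P$ is uniform on its support, and a small divergence from the uniform distribution corresponds to a source whose $H_{1/2}(P)$ is close to $\log|A|$.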
