On nonlinear compression costs: when Shannon meets Rényi

Shannon entropy is the shortest average codeword length a lossless compressor can achieve when encoding i.i.d. symbols. However, there are cases in which the objective is to minimize the \textit{exponential} average codeword length, i.e., when the cost of encoding/decoding scales exponentially with the length of the codewords. The optimum is reached by all strategies that map each symbol $x_i$, generated with probability $p_i$, into a codeword of length $\ell^{(q)}_D(i)=-\log_D\frac{p_i^q}{\sum_{j=1}^Np_j^q}$. This leads to the minimum exponential average codeword length, which equals the R\'enyi, rather than the Shannon, entropy of the source distribution. We generalize the established Arithmetic Coding (AC) compressor to this framework. We show analytically that our generalized algorithm achieves an exponential average length arbitrarily close to the R\'enyi entropy, provided the symbols to encode are i.i.d. We then apply our algorithm to both simulated (i.i.d. generated) and real (a piece of Wikipedia text) datasets. As expected, the application to i.i.d. data confirms our analytical results. We also find that, when applied to the real dataset (composed of highly correlated symbols), our algorithm still significantly reduces the exponential average codeword length with respect to the classical `Shannonian' one. Moreover, we provide a further justification for the use of the exponential average: we show that minimizing the exponential average length makes it possible to minimize the probability that codewords exceed a given threshold length. This relation relies on the connection between the exponential average and the cumulant generating function of the source distribution, which is in turn related to the probability of large deviations. We test and confirm these results, again on both simulated and real datasets.
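As a numerical illustration of the length formula above, the following minimal sketch computes the lengths $\ell^{(q)}_D(i)$ for a toy distribution and compares the resulting exponential average with the R\'enyi entropy of order $q$. Several things here are assumptions not stated in the abstract: the exponential average is taken in Campbell's form $\frac{1}{t}\log_D\sum_i p_i D^{t\ell_i}$, the cost parameter is linked to $q$ by $t=(1-q)/q$, lengths are treated as real numbers (no integer rounding), and the distribution `p` and all function names are purely illustrative.

```python
import math

def escort_lengths(p, q, D=2):
    """Ideal (real-valued) lengths l_i = -log_D(p_i^q / sum_j p_j^q)."""
    s = sum(pi ** q for pi in p)
    return [-math.log(pi ** q / s, D) for pi in p]

def exponential_average(p, lengths, t, D=2):
    """Campbell-style exponential average (assumed form):
    (1/t) * log_D( sum_i p_i * D^(t * l_i) )."""
    total = sum(pi * D ** (t * li) for pi, li in zip(p, lengths))
    return math.log(total, D) / t

def renyi_entropy(p, q, D=2):
    """Renyi entropy of order q (q != 1), in base-D units."""
    return math.log(sum(pi ** q for pi in p), D) / (1 - q)

# Toy source distribution (illustrative, not from the paper).
p = [0.5, 0.25, 0.125, 0.125]
q = 0.7
t = (1 - q) / q  # assumed relation between cost parameter t and order q

opt_lengths = escort_lengths(p, q)
shannon_lengths = [-math.log(pi, 2) for pi in p]

L_opt = exponential_average(p, opt_lengths, t)
L_shannon = exponential_average(p, shannon_lengths, t)
H_q = renyi_entropy(p, q)

# Kraft sum of the ideal lengths; it equals 1 because they come from
# a normalised (escort) distribution.
kraft_sum = sum(2 ** (-l) for l in opt_lengths)

print(f"exponential average, escort lengths : {L_opt:.6f}")
print(f"exponential average, Shannon lengths: {L_shannon:.6f}")
print(f"Renyi entropy of order q            : {H_q:.6f}")
```

Under these assumptions the escort-based lengths reproduce the R\'enyi entropy exactly, while the Shannon lengths $-\log_D p_i$ give a strictly larger exponential average for this non-uniform source. An actual compressor must additionally deal with integer codeword lengths or, as in arithmetic coding, with sequence-level encoding, which this sketch does not attempt.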
