Lossy data compression lies at the heart of modern communication and storage systems. Shannon's rate-distortion theory gives the fundamental limit on how far a source can be compressed at a given fidelity, but that limit is attained only as the block length tends to infinity, a regime never realized in practice. We present a self-contained tutorial on rate-distortion theory for the simplest non-trivial source: a Bernoulli$(p)$ sequence with Hamming distortion. We derive the classical rate-distortion function $R(D) = H(p) - H(D)$, valid for $0 \le D \le \min(p, 1-p)$ with $H(\cdot)$ the binary entropy function, from first principles, illustrate its computation via the Blahut--Arimoto algorithm, and then develop the finite-blocklength refinements that characterize how the minimum achievable rate approaches the Shannon limit as the block length $n$ grows. The central quantity in this refinement is the \emph{rate-distortion dispersion} $V(D)$, which governs the $O(1/\sqrt{n})$ rate penalty incurred at finite block lengths. All theoretical developments are supported by numerical examples and figures generated by the accompanying Python scripts.
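As a quick illustration of the closed form $R(D) = H(p) - H(D)$ and of the Blahut--Arimoto iteration mentioned above, the following is a minimal sketch rather than the paper's accompanying scripts. The function names, the slope parameter `beta`, and the iteration count are assumptions chosen for this example. For a fixed `beta > 0` the iteration converges to one point $(D, R)$ on the rate-distortion curve, which can then be compared with the closed form.

```python
import numpy as np


def binary_entropy(x):
    """Binary entropy H(x) in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)


def rate_distortion_closed_form(p, D):
    """R(D) = H(p) - H(D) for 0 <= D < min(p, 1-p), and 0 beyond that."""
    if D >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)


def blahut_arimoto_point(p, beta, n_iter=500):
    """Return one (D, R) point for a Bernoulli(p) source, Hamming distortion.

    beta > 0 is the (hypothetical, illustrative) slope parameter in nats;
    larger beta gives smaller distortion and higher rate.
    """
    px = np.array([1 - p, p])                # source distribution
    d = np.array([[0.0, 1.0], [1.0, 0.0]])   # Hamming distortion matrix d[x, y]
    q = np.array([0.5, 0.5])                 # initial reproduction marginal
    for _ in range(n_iter):
        # Test channel Q(y|x) proportional to q(y) * exp(-beta * d(x, y))
        Q = q[None, :] * np.exp(-beta * d)
        Q /= Q.sum(axis=1, keepdims=True)
        # Reproduction marginal induced by the current test channel
        q = px @ Q
    # Q and q now form a consistent (channel, marginal) pair
    D = float(np.sum(px[:, None] * Q * d))
    R = float(np.sum(px[:, None] * Q * np.log2(Q / q[None, :])))
    return D, R


if __name__ == "__main__":
    p, beta = 0.3, 2.0
    D, R = blahut_arimoto_point(p, beta)
    print("Blahut-Arimoto:  D =", D, " R =", R)
    print("Closed form:     R =", rate_distortion_closed_form(p, D))
```

Sweeping `beta` over a range of positive values traces out the curve point by point. For this source the distortion reached at slope `beta` is $1/(1 + e^{\beta})$, provided that value stays below $\min(p, 1-p)$.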