It is well known in lossless data compression that probabilistic next-symbol prediction can be used to compress sequences of symbols. Deep neural networks capture rich dependencies in data, offering a powerful means of estimating these probabilities and hence an avenue toward more effective compression algorithms. However, compressor and decompressor must produce exactly matching predictions; even small discrepancies arising from non-determinism (common with learned models due to differences in hardware, software, or computation order) can trigger cascading decoding failures. In this paper, we formalize the problem of prediction mismatch in model-driven compression and introduce Probability Matching Interval Coding (PMATIC), a model-agnostic algorithm that tolerates bounded prediction mismatch with low overhead. PMATIC operates directly on the predicted probabilities, making it a drop-in replacement for the arithmetic encoder in model-driven compression tools. We prove correctness and performance bounds for PMATIC and validate them on text data. These results confirm that, when paired with an advanced prediction model, PMATIC is robust to prediction mismatch while achieving compression rates that outperform standard modern compression tools.
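The failure mode motivating this work can be illustrated with a toy floating-point arithmetic coder. This is a minimal sketch, not the paper's PMATIC algorithm; the two-symbol alphabet and the `exact`/`drift` model functions are illustrative assumptions. Encoder and decoder narrow an interval in [0, 1) using per-position symbol probabilities, so correct decoding requires the decoder to reproduce the encoder's predictions exactly:

```python
# Toy arithmetic coder over a two-symbol alphabet (hypothetical sketch,
# not the paper's PMATIC): encoder and decoder narrow an interval in
# [0, 1) using per-position symbol probabilities from a shared model.

def encode(seq, model, alphabet="ab"):
    low, high = 0.0, 1.0
    for i, s in enumerate(seq):
        p = model(i)          # predicted distribution at position i
        width = high - low
        c = 0.0               # cumulative probability before symbol s
        for sym in alphabet:
            if sym == s:
                high = low + width * (c + p[sym])
                low = low + width * c
                break
            c += p[sym]
    return (low + high) / 2   # any number inside the final interval

def decode(code, n, model, alphabet="ab"):
    out, low, high = [], 0.0, 1.0
    for i in range(n):
        p = model(i)          # must match the encoder's prediction exactly
        width = high - low
        c = 0.0
        for sym in alphabet:
            upper = low + width * (c + p[sym])
            if code < upper:  # code falls in this symbol's sub-interval
                out.append(sym)
                low, high = low + width * c, upper
                break
            c += p[sym]
    return "".join(out)

exact = lambda i: {"a": 0.9, "b": 0.1}
drift = lambda i: {"a": 0.9 - 1e-3, "b": 0.1 + 1e-3}  # tiny mismatch

code = encode("aabab", exact)
assert decode(code, 5, exact) == "aabab"  # matching model: exact round trip
# Decoding with `drift` instead can pick the wrong sub-interval as soon as
# the code point lands on the other side of a shifted boundary, and every
# later symbol is then decoded against a corrupted interval.
```

Because each decoding step conditions on the interval produced by all previous steps, a single boundary shifted by a mismatched probability corrupts every subsequent symbol, which is the cascading failure the abstract describes and the bounded-mismatch tolerance PMATIC is designed to provide.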