Despite the tremendous success of diffusion generative models in text-to-image generation, replicating this success in the domain of image compression has proven difficult. In this paper, we demonstrate that diffusion can significantly improve perceptual quality at a given bit-rate, outperforming the state-of-the-art approaches PO-ELIC and HiFiC as measured by FID score. This is achieved using a simple but theoretically motivated two-stage approach consisting of an autoencoder targeting MSE followed by a further score-based decoder. However, as we will show, implementation details matter and the optimal design decisions can differ greatly from those of typical text-to-image models.