This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, in which an image is mapped into a latent space for entropy coding and then mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned, and uses this variable to store information about the image. The remaining ``texture'' latent variables that characterize the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Extensive experiments on five datasets with sixteen image quality assessment metrics show that our approach yields the strongest reported FID scores while remaining competitive with state-of-the-art models on several SIM-based reference metrics.