Image codecs are typically optimized to trade off bitrate against distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to decode with iterative diffusion models, instead of the feed-forward decoders trained with MSE or LPIPS distortions that most neural codecs use. In addition to conditioning the model on a vector-quantized image representation, we also condition on a global textual image description to provide additional context. We dub our model PerCo for 'perceptual compression', and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate a 512×768 Kodak image is encoded in less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that its visual quality is less dependent on the bitrate than that of previous methods.
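To make the quoted rates concrete, the following minimal sketch converts a rate in bits per pixel into a payload size in bytes for a 512×768 image. The helper name `bytes_at_bitrate` is hypothetical and not part of PerCo. The sketch treats 0.003 and 0.1 as exact rates, whereas the rates quoted above are presumably rounded, so the result only checks that the "less than 153 bytes" figure is of the right size.

```python
def bytes_at_bitrate(width: int, height: int, bpp: float) -> float:
    """Return the payload size in bytes of an image coded at `bpp` bits per pixel.

    Hypothetical helper for illustration only: size = pixels * bpp / 8.
    """
    return width * height * bpp / 8


# Kodak images are 512x768 pixels, i.e. 393,216 pixels in total.
kodak_low = bytes_at_bitrate(512, 768, 0.003)   # lowest rate quoted in the abstract
kodak_high = bytes_at_bitrate(512, 768, 0.1)    # highest rate quoted in the abstract

print(f"{kodak_low:.1f} bytes at 0.003 bpp")
print(f"{kodak_high:.1f} bytes at 0.1 bpp")
```

Under these assumptions the lowest rate corresponds to roughly 147 bytes per image, consistent with the bound of 153 bytes stated above, while the highest rate corresponds to roughly 4.9 kB.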