Recent advances in text-guided image compression have shown great potential for enhancing the perceptual quality of reconstructed images. These methods, however, tend to suffer from significantly degraded pixel-wise fidelity, which limits their practicality. To fill this gap, we develop a new text-guided image compression algorithm that achieves both high perceptual and pixel-wise fidelity. In particular, we propose a compression framework that leverages text information mainly through text-adaptive encoding and training with a joint image-text loss. By doing so, we avoid decoding based on text-guided generative models, which are known for high generative diversity, and effectively utilize the semantic information of text at a global level. Experimental results on various datasets show that our method achieves high pixel-level and perceptual quality with either human- or machine-generated captions. In particular, our method outperforms all baselines in terms of LPIPS, with room for further improvement when more carefully generated captions are used.
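
As a rough illustration of what training with a joint image-text loss could look like, the sketch below adds a text-alignment term to a standard rate-distortion objective. The function names, the cosine-distance form of the text term, and the weights `lam` and `beta` are assumptions made for illustration and are not the formulation used in this work; the embeddings are assumed to come from some pretrained image-text encoder.

```python
import math


def mse(original, reconstruction):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def joint_loss(rate_bpp, original, reconstruction, image_emb, text_emb,
               lam=0.01, beta=0.1):
    """Hypothetical joint objective: rate + lam * distortion + beta * text term.

    rate_bpp:  estimated bitrate of the compressed representation
    image_emb: embedding of the reconstructed image (assumed precomputed)
    text_emb:  embedding of the caption (assumed precomputed)
    """
    distortion = mse(original, reconstruction)        # pixel-wise fidelity
    semantic = cosine_distance(image_emb, text_emb)   # image-text agreement
    return rate_bpp + lam * distortion + beta * semantic
```

In this sketch the text only shapes the training objective, so no text-conditioned generative decoder is needed at reconstruction time, which is in the spirit of the design described above.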