The explosion of data has resulted in more and more associated text being transmitted along with images. Inspired by distributed source coding, many works utilize image side information to enhance image compression. However, existing methods generally do not consider using text as side information to enhance perceptual compression of images, even though the benefits of multimodal synergy have been widely demonstrated in research. This raises the following question: how can we effectively transfer text-level semantic dependencies, which are available only at the decoder, to help image compression? In this work, we propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff. Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features. The block predicts a semantic mask that guides the learned text-adaptive affine transformation at the pixel level. Furthermore, we design a text-conditional generative adversarial network to improve the perceptual quality of reconstructed images. Extensive experiments involving four datasets and ten image quality assessment metrics demonstrate that the proposed approach achieves superior results in terms of rate-perception tradeoff and semantic distortion.
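The mask-guided, text-adaptive affine transformation can be illustrated with a minimal pure-Python sketch. It is not the implementation behind the abstract: the function names, the parameter dictionary, the 1x1 mask projection, and the blending rule `(1 - m) * x + m * (gamma * x + beta)` are all assumptions chosen for illustration, and `text_emb` is a hypothetical stand-in for a CLIP text embedding.

```python
import math


def linear(weights, bias, vec):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def predict_mask(feat, mask_w, mask_b):
    """Per-pixel mask in (0, 1): sigmoid of a 1x1 projection over channels.

    Assumption: the mask is predicted from the image features alone.
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    mask = []
    for i in range(height):
        row = []
        for j in range(width):
            z = sum(mask_w[c] * feat[c][i][j] for c in range(channels)) + mask_b
            row.append(1.0 / (1.0 + math.exp(-z)))
        mask.append(row)
    return mask


def text_guided_modulation(feat, text_emb, params):
    """Blend each pixel with its text-conditioned affine transform.

    feat:     image features as nested lists, shape [C][H][W]
    text_emb: text embedding (hypothetical stand-in for a CLIP output)
    params:   hypothetical weights for the gamma, beta, and mask layers
    """
    # Per-channel scale and shift, both predicted from the text embedding.
    gamma = linear(params["gamma_w"], params["gamma_b"], text_emb)
    beta = linear(params["beta_w"], params["beta_b"], text_emb)
    # The mask decides, pixel by pixel, how strongly the text acts.
    mask = predict_mask(feat, params["mask_w"], params["mask_b"])
    out = []
    for c, plane in enumerate(feat):
        out.append([
            [(1.0 - m) * x + m * (gamma[c] * x + beta[c])
             for x, m in zip(row, mask_row)]
            for row, mask_row in zip(plane, mask)
        ])
    return out, mask


# Toy example: 2 channels, a 2x2 feature map, a 3-dimensional text embedding.
feat = [[[0.5, -1.0], [2.0, 0.0]],
        [[1.0, 1.0], [-0.5, 3.0]]]
text_emb = [0.2, -0.1, 0.4]
params = {
    "gamma_w": [[0.5, 0.0, 1.0], [0.0, -1.0, 0.3]], "gamma_b": [1.0, 1.0],
    "beta_w": [[0.1, 0.2, 0.0], [0.0, 0.0, 0.5]], "beta_b": [0.0, 0.0],
    "mask_w": [0.3, -0.2], "mask_b": 0.0,
}
out, mask = text_guided_modulation(feat, text_emb, params)
```

In this sketch, pixels whose mask value is near 0 pass through almost unchanged, while pixels near 1 receive the full text-conditioned scale and shift. That is one simple way to realize pixel-level guidance; a trained model would learn all of these weights jointly with the codec.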