Learned image compression methods that rely on the image modality alone have demonstrated powerful encoding and decoding capabilities in the past few years, but they suffer from blur and severe semantic loss at extremely low bitrates. To address this issue, we propose a multimodal machine learning method for text-guided image compression, in which the semantic information of the text serves as a prior that guides image compression and improves compression performance. We study the role of the text description in different components of the codec and demonstrate its effectiveness. In addition, we adopt an image-text attention module and an image-request complement module to better fuse image and text features, and we propose an improved multimodal semantic-consistent loss to produce semantically complete reconstructions. Extensive experiments, including a user study, show that our method obtains visually pleasing results at extremely low bitrates and achieves comparable or even better performance than state-of-the-art methods, even though those methods operate at 2x to 4x our bitrate.
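The abstract does not specify how the image-text attention module is built. As a rough illustration of the general idea of fusing image and text features through attention, the sketch below uses generic scaled dot-product cross-attention, with image positions as queries and text tokens as keys and values. It is not the authors' module: the function name `image_text_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and all dimensions are assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def image_text_attention(image_feats, text_feats, w_q, w_k, w_v):
    """Hypothetical cross-attention fusion (illustrative, not the paper's module).

    image_feats: (N, d) array, one row per spatial position of the image features.
    text_feats:  (T, d) array, one row per text token embedding.
    w_q, w_k, w_v: (d, d) projection matrices (assumed; learned in practice).
    Returns the fused image features (N, d) and the attention weights (N, T).
    """
    q = image_feats @ w_q                     # queries from image positions
    k = text_feats @ w_k                      # keys from text tokens
    v = text_feats @ w_v                      # values from text tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, T) scaled similarities
    weights = softmax(scores, axis=-1)        # each image position attends over tokens
    fused = image_feats + weights @ v         # residual add of text-derived context
    return fused, weights


# Toy inputs: 16 image positions, 5 text tokens, feature width 8 (all arbitrary).
rng = np.random.default_rng(0)
n_pos, n_tok, dim = 16, 5, 8
image_feats = rng.standard_normal((n_pos, dim))
text_feats = rng.standard_normal((n_tok, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, weights = image_text_attention(image_feats, text_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` is a distribution over text tokens, so every image position receives a weighted mix of text information. In a real codec the projections would be trained jointly with the rest of the network rather than sampled at random.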