In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, and are thus replacing humans as receivers of visual information in various application scenarios. In this paper, we propose, for the first time, a variable-bitrate image compression framework consisting of a pre-editing module and an end-to-end codec, which achieves promising rate-accuracy performance across different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, designed around their representation and discrimination capabilities via token-level distortion and rank losses. The pre-editing module and the variable-bitrate end-to-end image codec are jointly trained with losses defined on the semantic tokens of the large model, which introduces enhanced generalization capability across diverse data and tasks. Experimental results demonstrate that the proposed framework achieves substantially better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments on multi-modal tasks confirm the robustness and generalization capability of the proposed framework.
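To make the optimization objective concrete, the sketch below illustrates, under our own simplifying assumptions, how a token-level distortion term and a hinge-style ranking term could be combined with a rate term into one joint loss. The function names, weights, and the margin value are hypothetical stand-ins, not the paper's actual formulation; the real framework operates on LVLM semantic token embeddings within end-to-end training.

```python
def token_distortion(tokens_orig, tokens_coded):
    # Token-level distortion: mean squared error between the LVLM's
    # semantic token embeddings for the original and the coded image
    # (embeddings flattened to lists of floats for this sketch).
    return sum((a - b) ** 2 for a, b in zip(tokens_orig, tokens_coded)) / len(tokens_orig)

def token_rank_loss(sim_pos, sim_neg, margin=0.1):
    # Hinge-style ranking term: the coded image's tokens should remain
    # more similar to the matched content (sim_pos) than to a
    # mismatched one (sim_neg) by at least `margin`.
    return max(0.0, margin - (sim_pos - sim_neg))

def joint_loss(tokens_orig, tokens_coded, sim_pos, sim_neg, rate,
               lam=0.01, alpha=1.0, beta=1.0):
    # Joint rate-accuracy objective: rate + distortion + rank.
    # The trade-off weights lam, alpha, beta are illustrative only.
    return (lam * rate
            + alpha * token_distortion(tokens_orig, tokens_coded)
            + beta * token_rank_loss(sim_pos, sim_neg))
```

In this sketch, lowering the bitrate (`rate`) trades off against preserving the token embeddings and their discriminative ordering, which mirrors the rate-accuracy trade-off the framework optimizes.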