We present a new image compression paradigm that achieves ``intelligent coding for machines'' by leveraging the common-sense knowledge of Large Multimodal Models (LMMs). We are motivated by the evidence that large language/multimodal models are powerful general-purpose semantic predictors for understanding the real world. Unlike traditional image compression, which is typically optimized for human eyes, the image coding for machines (ICM) framework we focus on requires the compressed bitstream to better serve diverse downstream intelligent analysis tasks. To this end, we employ an LMM to \textcolor{red}{tell the codec what to compress}: 1) we first utilize the powerful semantic understanding capability of LMMs w.r.t.\ object grounding, identification, and importance ranking via prompting, to disentangle image content before compression; and 2) based on these semantic priors, we then encode and transmit the objects of the image in order within a structured bitstream. In this way, diverse vision benchmarks, including image classification, object detection, and instance segmentation, can be well supported by such a semantically structured bitstream. We dub our method ``\textit{SDComp}'', for ``\textit{S}emantically \textit{D}isentangled \textit{Comp}ression'', and compare it with state-of-the-art codecs on a wide variety of vision tasks. The SDComp codec delivers more flexible reconstruction, assured decoded visual quality, and more generic and satisfactory support for intelligent tasks.
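The two-stage pipeline above (LMM-driven disentanglement, then importance-ordered encoding) can be sketched in miniature as follows. This is a minimal illustration, not the authors' implementation: the helpers \texttt{query\_lmm} and \texttt{encode\_region} are hypothetical stand-ins for an actual LMM prompting interface and a learned region codec.

```python
# Hedged sketch of the SDComp idea: an LMM supplies semantic priors
# (object grounding, labels, importance), and the codec encodes regions
# in importance order into a structured bitstream.
# NOTE: query_lmm and encode_region are hypothetical stand-ins.

def query_lmm(image):
    """Stand-in for prompting an LMM for object grounding,
    identification, and importance ranking.
    Returns (label, bounding_box, importance) triples."""
    return [
        ("background", (0, 0, 64, 64), 0.1),
        ("person", (10, 12, 30, 48), 0.9),
        ("dog", (40, 20, 60, 50), 0.7),
    ]

def encode_region(label, box):
    """Stand-in for encoding one disentangled region; here we simply
    emit a tagged placeholder payload instead of real compressed bits."""
    return f"[{label}:{box}]".encode()

def sdcomp_encode(image):
    # 1) Disentangle image content using LMM semantic priors.
    objects = query_lmm(image)
    # 2) Encode objects in descending importance, so a downstream task
    #    can decode only the leading, most task-relevant chunks.
    ordered = sorted(objects, key=lambda o: o[2], reverse=True)
    return [encode_region(label, box) for label, box, _ in ordered]

bitstream = sdcomp_encode(image=None)
```

A downstream detector could then stop decoding after the first few chunks, which is the sense in which the structured bitstream "supports" multiple tasks at different bit budgets.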