In this paper, we propose a post-training quantization framework for large vision-language models (LVLMs) to enable efficient multi-modal inference. Conventional quantization methods sequentially search for layer-wise rounding functions by minimizing activation discretization errors, which fails to yield an optimal quantization strategy because cross-layer dependency is ignored. In contrast, we mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into the search for the optimal quantization strategy at low search cost. Specifically, we observe a strong correlation between activation entropy and the cross-layer dependency with respect to output discretization errors. We therefore employ entropy as a proxy to partition blocks optimally, aiming for a satisfactory trade-off between discretization errors and search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency, enabling a fine-grained decomposition of the search space so that the search cost is further reduced without harming quantization accuracy. Experimental results demonstrate that our method reduces memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.
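The entropy-as-proxy idea can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: it computes a histogram-based Shannon entropy per layer, then greedily groups consecutive layers into blocks while their entropy stays close to the block's first layer, treating a large entropy gap as a proxy for weak cross-layer dependency. The function names, the bin count, and the grouping threshold are all hypothetical choices for illustration.

```python
import numpy as np

def activation_entropy(acts, num_bins=256):
    # Histogram-based Shannon entropy (in bits) of a layer's activations.
    hist, _ = np.histogram(acts, bins=num_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

def partition_blocks(entropies, threshold=0.5):
    # Greedily group consecutive layers into one block while the entropy
    # gap to the block's first layer stays below `threshold`; a large gap
    # is taken (as an assumption) to signal weak cross-layer dependency,
    # so a new block is started there.
    blocks, current = [], [0]
    for i in range(1, len(entropies)):
        if abs(entropies[i] - entropies[current[0]]) < threshold:
            current.append(i)
        else:
            blocks.append(current)
            current = [i]
    blocks.append(current)
    return blocks
```

Under this sketch, the quantization strategy would then be searched jointly within each block (capturing intra-block dependency) but independently across blocks (keeping the search cost low).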