This paper presents the first study to explore the potential of parameter quantization for multimodal large language models to alleviate the significant resource constraints encountered during vision-language instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. This method is grounded in two key innovations: (1) the learning of group-wise scale factors for quantized LLM weights to mitigate the quantization error arising from activation outliers and achieve more effective vision-language instruction tuning; (2) a multimodal warmup that progressively integrates linguistic and multimodal training samples, thereby preventing the quantized model from overfitting to multimodal data while ensuring stable adaptation of the multimodal large language model to downstream vision-language tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while enabling up to a 1.4× reduction in VL tuning time and GPU consumption. Our code is released at https://github.com/xjjxmu/QSLAW.
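The first innovation, group-wise scale factors for quantized weights, can be illustrated with a minimal simulated-quantization sketch. This is an assumed NumPy toy implementation, not the paper's actual code: the function name, the per-group learnable `scale` vector, and the symmetric 4-bit rounding scheme are all illustrative choices; in practice such scales would be trained while the quantized weights stay frozen.

```python
import numpy as np

def quantize_groupwise(w, scale, group_size=128, n_bits=4):
    """Simulated group-wise weight quantization with a learnable
    per-group scale (illustrative sketch, not the QSLAW codebase).

    Each group of `group_size` weights shares one quantization step;
    the learnable `scale` rescales weights before rounding, letting
    training shift quantization error away from salient channels
    affected by activation outliers.
    """
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for signed 4-bit
    w_groups = w.reshape(-1, group_size)
    # per-group step size from the group's absolute maximum
    step = np.abs(w_groups).max(axis=1, keepdims=True) / qmax
    # scale, round, clip to the signed integer range, then dequantize
    q = np.clip(np.round(w_groups * scale[:, None] / step), -qmax - 1, qmax)
    return (q * step / scale[:, None]).reshape(w.shape)

# usage: identity scales reproduce plain round-to-nearest quantization;
# QSLAW would instead learn these scales end-to-end during VL tuning
w = np.random.randn(256).astype(np.float32)
scale = np.ones(256 // 128, dtype=np.float32)
w_hat = quantize_groupwise(w, scale)
```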
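The second innovation, the multimodal warmup, can be sketched as a sampler whose probability of drawing a multimodal example grows over a warmup phase. The abstract only states that linguistic and multimodal samples are "progressively integrated"; the linear ramp, function name, and pool-based interface below are assumptions for illustration.

```python
import random

def multimodal_warmup_sampler(step, total_warmup_steps, lang_pool, mm_pool):
    """Draw one training sample under a multimodal warmup schedule
    (illustrative sketch; the linear ramp is an assumption).

    Early in training, samples come mostly from the language-only pool,
    stabilizing the quantized LLM; the multimodal share ramps up to 1,
    avoiding early overfitting to multimodal data.
    """
    p_mm = min(1.0, step / total_warmup_steps)   # multimodal probability
    pool = mm_pool if random.random() < p_mm else lang_pool
    return random.choice(pool)
```

At step 0 the sampler always returns language-only data, and once `step` reaches `total_warmup_steps` it always returns multimodal data, with a linear blend in between.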