Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) deliver strong performance but incur prohibitive GPU memory costs, making compression essential. Among post-training quantization (PTQ) methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet it degrades notably on MoE-MLLMs because of two overlooked biases in expert importance estimation. (1) At the cross-modal level, vision tokens far outnumber text tokens and therefore dominate expert selection frequency, masking experts that are critical to the text modality. (2) At the intra-vision level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts that are critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs. MODE decomposes expert selection frequency by modality, filters redundant vision tokens to obtain a denoised visual frequency, and evaluates quantization sensitivity per modality as a signal complementary to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation that assigns per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16 (3-bit weights, 16-bit activations), with larger gains at the extreme 2-bit setting.
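To make the pipeline concrete, the following is a minimal toy sketch of its three steps: per-modality expert frequency with redundant vision tokens dropped, a combined importance score, and budget-constrained bit-width assignment. The abstract does not specify the redundancy filter, how frequency and sensitivity are combined, or the ILP objective, so everything here is an assumption for illustration: the `keep` flag, the weight `alpha`, the error proxy `score * 4**(-bits)`, and all function names are hypothetical, and exhaustive search stands in for an ILP solver on a problem small enough to enumerate.

```python
from itertools import product


def modality_frequencies(routed, num_experts):
    """Per-modality expert selection frequency.

    `routed` holds (modality, expert_id, keep) tuples; keep=False marks a
    vision token treated as redundant (the filter itself is assumed given).
    """
    counts = {"text": [0] * num_experts, "vision": [0] * num_experts}
    for modality, expert, keep in routed:
        if modality == "vision" and not keep:
            continue  # drop redundant vision tokens before counting
        counts[modality][expert] += 1
    freqs = {}
    for modality, c in counts.items():
        total = sum(c) or 1  # avoid division by zero for an empty modality
        freqs[modality] = [x / total for x in c]
    return freqs


def importance(freqs, sensitivity, alpha=0.5):
    """Hypothetical score: weighted sum of frequency * sensitivity per modality."""
    n = len(freqs["text"])
    return [
        alpha * freqs["text"][e] * sensitivity["text"][e]
        + (1 - alpha) * freqs["vision"][e] * sensitivity["vision"][e]
        for e in range(n)
    ]


def assign_bits(scores, bit_options=(2, 3, 4), avg_budget=3.0):
    """Minimize sum(score * 4**-bits) subject to an average bit budget.

    Brute force over all assignments; only feasible for a handful of experts.
    """
    n = len(scores)
    best, best_cost = None, float("inf")
    for bits in product(bit_options, repeat=n):
        if sum(bits) > avg_budget * n:
            continue  # violates the bit budget
        cost = sum(s * 4.0 ** (-b) for s, b in zip(scores, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return list(best)


# Toy routing trace: 2 text tokens, 4 informative and 10 redundant vision tokens.
routed = (
    [("text", 0, True)] * 2
    + [("vision", 1, True)] * 3
    + [("vision", 2, False)] * 10
    + [("vision", 3, True)]
)
freqs = modality_frequencies(routed, num_experts=4)
uniform = {"text": [1.0] * 4, "vision": [1.0] * 4}  # placeholder sensitivities
scores = importance(freqs, uniform)
bits = assign_bits(scores)
print(freqs, scores, bits)
```

In this toy trace, expert 2 receives most of the raw token traffic, but all of it comes from redundant vision tokens, so it ends up with the lowest bit-width, while expert 0, which is selected only by the two text tokens, receives the highest. A pooled frequency count would rank them the other way around.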