Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use small, English-only calibration sets; however, their impact on multilingual models remains underexplored. We systematically evaluate eight calibration settings (five single-language and three multilingual mixes) with two quantizers (GPTQ, AWQ) on data from 10 languages. Our findings reveal a consistent trend: non-English and multilingual calibration sets significantly reduce perplexity relative to English-only baselines. Specifically, we observe notable average perplexity reductions across both quantizers on Llama3.1 8B and Qwen2.5 7B, with multilingual mixes achieving the largest overall reductions of up to 3.52 points in perplexity. Furthermore, our analysis indicates that tailoring the calibration set to the evaluation language yields the largest improvements for individual languages, underscoring the importance of linguistic alignment. We also identify specific failure cases where certain language-quantizer combinations degrade performance, which we trace to differences in activation range distributions across languages. These results highlight that static, one-size-fits-all calibration is suboptimal and that tailoring calibration data, in both language and diversity, plays a crucial role in robustly quantizing multilingual LLMs.