Quantization techniques are widely used to speed up inference and ease deployment of large language models. While a broad body of work examines the impact of quantized LLMs on English tasks, none has examined the effect of quantization across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on their performance across languages and at varying model scales. Using automatic benchmarks, LLM-as-a-Judge methods, and human evaluation, we find that (1) the harmful effects of quantization are apparent in human evaluation, and automatic metrics severely underestimate the detriment: a 1.7% average drop across automatic tasks in Japanese corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks such as mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.
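As context for the class of methods studied here, the following is a minimal sketch of symmetric absmax int8 weight quantization in Python. The function names and the specific scheme are illustrative assumptions for exposition only, not the exact quantization setups evaluated in this work.

```python
# Minimal sketch of symmetric absmax int8 weight quantization
# (illustrative only; the schemes evaluated in the paper may differ).
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a float weight tensor to int8 with a single absmax scale."""
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 values and scale."""
    return q.astype(np.float32) * scale

# Usage: measure the reconstruction error quantization introduces.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean absolute quantization error: {err:.5f}")
```

The reconstruction error introduced at this step is what propagates into the downstream, per-language performance differences measured in the paper.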