For consumer use of locally deployed LLMs, the GGUF format and k-quantization are invaluable tools: they preserve much of the original model's performance while shrinking it to sizes deployable on consumer-grade hardware. The number of bits allocated to each weight is reduced according to how important that weight is believed to be during inference. This importance is estimated with an 'importance matrix', computed from a relatively small text document meant to be representative of the LLM's typical use cases. In the vast majority of quants available online, this calibration text is written primarily in English. It was therefore an open question whether English-task performance is preserved at the expense of multilingual performance, and whether multilingual performance can be retained with alternative importance matrices. This article investigates these hypotheses by quantizing Llama 3.3 70B with importance matrices derived from texts in three languages (English, Norwegian, and Malayalam) and evaluating the resulting models on the MixEval dataset in both English and Norwegian. All experiments yielded non-significant results, indicating that current quantization practices do not disproportionately harm multilingual performance.
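To make the mechanism concrete, here is a toy Python sketch of importance-weighted quantization. It is not llama.cpp's actual k-quant algorithm (which operates on blocks with learned scales and minima); it only illustrates the core idea: per-channel importance values derived from calibration activations re-weight the quantization error, so the quantizer spends its limited precision where the calibration text says it matters. All function names and the activation-generation setup are illustrative assumptions.

```python
import random

random.seed(0)

def importance_matrix(calib_acts):
    # One importance value per channel: mean squared activation observed
    # during calibration (the core idea behind an importance matrix).
    n = len(calib_acts)
    return [sum(row[j] ** 2 for row in calib_acts) / n
            for j in range(len(calib_acts[0]))]

def quantize(weights, scale, bits=4):
    # Uniform symmetric quantizer with 2**bits levels.
    half = 2 ** (bits - 1)
    return [max(-half, min(half - 1, round(w / scale))) * scale
            for w in weights]

def weighted_mse(w, wq, imp):
    # Quantization error, weighted channel-by-channel by importance.
    return sum(i * (a - b) ** 2 for a, b, i in zip(w, wq, imp))

def pick_scale(weights, imp, bits=4, steps=100):
    # Grid-search the scale that minimizes importance-weighted error.
    wmax = max(abs(x) for x in weights)
    half = 2 ** (bits - 1)
    best_s, best_e = None, float("inf")
    for k in range(1, steps + 1):
        s = (wmax / (half - 1)) * k / steps
        e = weighted_mse(weights, quantize(weights, s, bits), imp)
        if e < best_e:
            best_s, best_e = s, e
    return best_s, best_e

# One row of toy weights; calibration activations where a few channels
# dominate (analogous to an English-heavy calibration text).
weights = [random.gauss(0, 1) for _ in range(64)]
calib = [[random.gauss(0, 3 if j < 8 else 0.5) for j in range(64)]
         for _ in range(32)]
imp = importance_matrix(calib)

# Compare: scale chosen with the imatrix vs. one chosen importance-blind.
s_imp, e_imp = pick_scale(weights, imp)
s_uni, _ = pick_scale(weights, [1.0] * 64)
e_naive = weighted_mse(weights, quantize(weights, s_uni), imp)
print(f"imatrix-aware error {e_imp:.4f} <= naive error {e_naive:.4f}")
```

Because the importance-aware search evaluates the same candidate scales as the importance-blind one, its weighted error can never be worse; the open question the article tests is whether this calibration-dependence measurably skews quality across languages.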