Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization's effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To close this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, combined with interpretability-driven analyses on two tasks: knowledge memorization and latent multi-hop reasoning. We show that quantization typically causes information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly pronounced in the smaller models of each architectural family. However, models quantized at lower bit precision do not consistently perform worse, and quantization may occasionally even enhance FKR. We find that BitsAndBytes best preserves the FKR of the original full-precision model. Despite variability across models and methods, quantization causes only modest performance degradation overall and remains an effective compression strategy.
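As context for the experimental setup described above, the sketch below shows one way an LLM can be quantized to a reduced bit width with BitsAndBytes through the Hugging Face Transformers API and then probed with a simple factual prompt. The model name, 4-bit NF4 configuration, and cloze-style probe are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical model choice; the paper evaluates several model families and sizes.
model_name = "meta-llama/Llama-2-7b-hf"

# 4-bit NF4 quantization via bitsandbytes, one example of a reduced bit width.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Probe factual knowledge recall with a simple cloze-style prompt.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Comparing such completions (and, as in the paper, internal representations) between the full-precision and quantized models is one way to measure how much FKR is preserved after quantization.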