Pruning and quantization form the foundation of model compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated state-of-the-art performance in a post-training setting. These techniques rely on calibration data, a small set of unlabeled examples, to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study of the effect of calibration data on LLM performance. We evaluate a variety of pruning and quantization methods, tasks, models, and datasets. Surprisingly, we find substantial variations in downstream task performance, in contrast to existing work that suggests greater robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.
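To make the role of calibration data concrete, the following is a minimal sketch of how activations from a few unlabeled examples can steer a post-training pruning decision. It is not a method taken from this paper: the scoring rule (weight magnitude times the L2 norm of the matching input activation) is an assumed, illustrative activation-aware criterion, and the layer `W`, the sets `calib_a` and `calib_b`, and the helpers `column_norms` and `prune_mask` are all hypothetical toy names and values.

```python
import math

def column_norms(calib_inputs):
    """L2 norm of each input feature, taken across the calibration examples."""
    n_features = len(calib_inputs[0])
    return [math.sqrt(sum(x[j] ** 2 for x in calib_inputs))
            for j in range(n_features)]

def prune_mask(weights, calib_inputs, sparsity=0.5):
    """Return a keep-mask per output row, dropping the lowest-scoring weights.

    Score = |weight| * activation norm of the matching input feature.
    This criterion is an assumption made for illustration only.
    """
    norms = column_norms(calib_inputs)
    mask = []
    for row in weights:
        scores = [abs(w) * n for w, n in zip(row, norms)]
        n_drop = int(len(row) * sparsity)
        # Indices of the n_drop smallest scores in this row.
        drop = set(sorted(range(len(row)), key=lambda j: scores[j])[:n_drop])
        mask.append([j not in drop for j in range(len(row))])
    return mask

# Hypothetical toy layer: 2 outputs, 4 inputs.
W = [[0.9, 0.1, 0.5, 0.4],
     [0.2, 0.8, 0.3, 0.6]]

# Two hypothetical calibration sets that excite different input features.
calib_a = [[1.0, 0.1, 0.1, 1.0], [0.9, 0.2, 0.0, 1.1]]
calib_b = [[0.1, 1.0, 1.0, 0.1], [0.0, 1.2, 0.9, 0.2]]

mask_a = prune_mask(W, calib_a)
mask_b = prune_mask(W, calib_b)
print(mask_a)
print(mask_b)
```

In this toy setting the same weights are pruned differently depending on which calibration set produced the activations, which is the kind of sensitivity the study sets out to measure on real models.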