The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reductions in computational and memory footprints while preserving model accuracy. Although each method is effective on its own, the interplay between them remains an open question. In this paper, we investigate the interaction between these two methods and assess whether their combination impacts final model accuracy. We mathematically prove that applying sparsity before quantization is the optimal sequence for these operations, minimizing error in computation. Our empirical studies across a wide range of models, including the OPT and Llama model families (125M-8B) and ViT, corroborate these theoretical findings. In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal; their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation. Our findings extend to the efficient deployment of large models on resource-limited compute platforms and to reducing serving cost, offering insights into best practices for applying these compression methods to maximize efficacy without compromising accuracy.
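The ordering claim in the abstract can be illustrated with a small numerical sketch. It is a toy setup, not the paper's proof or experimental protocol. It assumes magnitude-based unstructured pruning at 50% sparsity, symmetric per-tensor 4-bit quantization with an absolute-maximum scale, and Frobenius-norm error measured against the dense weights. The helper names `magnitude_prune` and `fake_quantize` are hypothetical and chosen only for this example.

```python
import numpy as np


def magnitude_prune(w, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    out = w.flatten()  # flatten() returns a copy, so `w` is left untouched
    # Indices of the k smallest-magnitude entries (order among them is arbitrary).
    idx = np.argpartition(np.abs(out), k - 1)[:k]
    out[idx] = 0.0
    return out.reshape(w.shape)


def fake_quantize(w, bits):
    """Symmetric per-tensor quantize-dequantize with an absmax scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.max(np.abs(w))
    if absmax == 0:
        return w.copy()
    scale = absmax / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))  # stand-in for a dense weight matrix

# Order 1: sparsity first, then quantization.
s_then_q = fake_quantize(magnitude_prune(w, 0.5), bits=4)
# Order 2: quantization first, then sparsity.
q_then_s = magnitude_prune(fake_quantize(w, bits=4), 0.5)

# Frobenius-norm error of each compressed tensor relative to the dense weights.
err_sq = np.linalg.norm(w - s_then_q)
err_qs = np.linalg.norm(w - q_then_s)
print(f"sparsity -> quantization error: {err_sq:.4f}")
print(f"quantization -> sparsity error: {err_qs:.4f}")
```

In this toy setting the two orders share the same quantization scale, because pruning never removes the largest-magnitude weight. The only difference is which entries get zeroed. Pruning first zeroes the entries that are smallest in the original tensor, whereas pruning after quantization has to break ties among weights that have collapsed onto the same quantization level. The sparsity-first error therefore cannot exceed the quantization-first error here. Other pruning criteria, quantization formats, or error metrics may behave differently.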