Deep neural networks, despite their remarkable success across diverse tasks, demand substantial resources: computation, GPU memory, bandwidth, storage, and energy. Network quantization, a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into finite sets of integer values. However, current quantization methods are often complex and fragile, requiring extensive task-specific hyperparameter tuning, where even a single misconfiguration can degrade model performance and limit generality across models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves speed, accuracy, simplicity, and generality in network quantization. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate the information loss incurred by quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, its parameters admit a closed-form solution, which allows us to improve accuracy effortlessly in under 2 minutes. Extensive experiments across vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. Overall, QwT offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, providing new insights for the design of future quantization paradigms.
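To make the "closed-form solution" concrete, the following is a minimal sketch of how a linear compensation layer could be fitted by ridge-regularized least squares on a handful of calibration features. This is an illustrative reconstruction, not the paper's reference implementation: the function name fit_compensation, the argument names X, Y_fp, Y_q, and the ridge parameter are all assumptions introduced here.

```python
import numpy as np

def fit_compensation(X, Y_fp, Y_q, ridge=1e-4):
    """Closed-form (least-squares) fit of a linear compensation layer.

    Illustrative sketch only; names and the ridge term are assumptions.
    X    : (n, d_in)  inputs to a quantized block, gathered from a few
           calibration batches
    Y_fp : (n, d_out) full-precision block outputs on the same inputs
    Y_q  : (n, d_out) quantized block outputs on the same inputs
    Returns (W, b) such that Y_q + X @ W + b approximates Y_fp.
    """
    # The target is the quantization residual the linear layer must absorb.
    R = Y_fp - Y_q
    # Append a constant column so the bias b is fitted jointly with W.
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    # Ridge-regularized normal equations: (Xa^T Xa + lam*I) Wb = Xa^T R.
    A = Xa.T @ Xa + ridge * np.eye(Xa.shape[1])
    Wb = np.linalg.solve(A, Xa.T @ R)
    return Wb[:-1], Wb[-1]  # W: (d_in, d_out), b: (d_out,)
```

Because the fit reduces to solving one small linear system per block, no gradient-based fine-tuning is needed, which is consistent with the abstract's claim that accuracy can be recovered in minutes.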