Post-training quantization is widely employed to reduce the computational demands of neural networks. Typically, individual substructures, such as layers or blocks of layers, are quantized with the objective of minimizing quantization errors in their pre-activations by fine-tuning the corresponding weights. Deriving this local objective from the global objective of minimizing task loss involves two key simplifications: assuming substructures are mutually independent, and ignoring knowledge of subsequent substructures as well as the task loss itself. In this work, we assess the effects of these simplifications on weight-only quantization of large language models. We introduce two multi-block fine-tuning strategies and compare them against the baseline of fine-tuning single transformer blocks. The first captures correlations of weights across blocks by jointly optimizing multiple quantized blocks. The second incorporates knowledge of subsequent blocks by minimizing the error in downstream pre-activations rather than focusing solely on the quantized block. Our findings indicate that the effectiveness of these methods depends on the specific network model: they have no impact on some models but yield significant benefits for others.
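
To make the three objectives concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the toy Block module, the mean-squared-error loss, and the names loss_single, loss_joint, and loss_downstream are illustrative assumptions, and which copies (quantized or full-precision) serve as downstream context in the second strategy is a design choice we leave open.

import torch
import torch.nn as nn

torch.manual_seed(0)

class Block(nn.Module):
    # Toy stand-in for a transformer block (hypothetical; real blocks
    # contain attention and MLP sub-layers).
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

def run(blocks, x):
    # Propagate an input through a sequence of blocks.
    for b in blocks:
        x = b(x)
    return x

dim, n_blocks = 16, 4
fp_blocks = nn.ModuleList(Block(dim) for _ in range(n_blocks))
# "Quantized" copies; in practice these hold quantized weights plus
# whatever parameters are being fine-tuned.
q_blocks = nn.ModuleList(Block(dim) for _ in range(n_blocks))
x = torch.randn(8, dim)  # calibration activations entering block i

# Baseline: fine-tune block i alone to match its full-precision output.
i = 1
loss_single = ((q_blocks[i](x) - fp_blocks[i](x)) ** 2).mean()

# Strategy 1: jointly optimize a window of k consecutive quantized
# blocks against the full-precision output of the same window, so
# cross-block weight correlations enter the objective.
k = 2
loss_joint = ((run(q_blocks[i:i + k], x) -
               run(fp_blocks[i:i + k], x)) ** 2).mean()

# Strategy 2: fine-tune block i only, but measure the error in the
# pre-activations of downstream blocks. Here the quantized copies
# provide the downstream context and are frozen; only block i trains.
for p in q_blocks.parameters():
    p.requires_grad_(False)
for p in q_blocks[i].parameters():
    p.requires_grad_(True)

loss_downstream = ((run(q_blocks[i:i + k], x) -
                    run(fp_blocks[i:i + k], x)) ** 2).mean()

print(loss_single.item(), loss_joint.item(), loss_downstream.item())

Note that loss_joint and loss_downstream share the same error term; the strategies differ only in which parameters receive gradients, which is exactly the distinction between jointly optimizing multiple blocks and merely incorporating knowledge of subsequent blocks.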