Weight quantisation is an essential technique for efficient training and deployment of modern deep learning models. However, the catalogue of quantisation formats is large, and formats are often chosen empirically. In this paper, we propose a framework for the systematic design and analysis of quantisation formats. By connecting the question of format design with classical quantisation theory, we show that the strong practical performance of popular formats stems from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem for which entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they too exploit variable-length encoding. Finally, by using the relationship between the Fisher information and the KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when applied to large language models.
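As a rough numerical illustration of the claim that variable-length codes spend fewer bits per parameter than fixed-length formats at the same squared error, the sketch below quantises Gaussian-distributed stand-in weights with a plain uniform 4-bit grid and compares the empirical entropy of the code indices (the average rate an ideal variable-length code would achieve) against the fixed 4-bit cost. All choices here (the Gaussian model, the ±4σ clipping range, the uniform grid) are illustrative assumptions, not the formats developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)  # stand-in weight tensor, assumed zero-mean Gaussian

# Uniform 4-bit quantiser over [-4, 4] (approx. +/- 4 sigma); values outside are clipped.
levels = 16
lo, hi = -4.0, 4.0
step = (hi - lo) / levels
idx = np.clip(np.floor((w - lo) / step), 0, levels - 1).astype(int)
q = lo + (idx + 0.5) * step  # reconstruct at bin midpoints

# Squared quantisation error (the proxy objective discussed in the abstract).
mse = np.mean((w - q) ** 2)

# Empirical entropy of the code indices: the average bits/parameter an ideal
# variable-length (e.g. Huffman/arithmetic) code would need. Because the
# Gaussian concentrates mass in the central bins, this falls below 4 bits.
p = np.bincount(idx, minlength=levels) / idx.size
entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))

print(f"MSE: {mse:.5f}, code entropy: {entropy:.2f} bits (fixed-length: 4 bits)")
```

For a standard Gaussian with this grid, the index entropy comes out near 3 bits, i.e. roughly one bit per parameter below the fixed-length cost at identical squared error, which is the gap that entropy-constrained quantiser designs exploit.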