With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile phones and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can be a bottleneck in practical settings that require frequent model updates and repeated hyperparameter tuning. As a cost-effective alternative, learning-free PTQ schemes have been proposed. Their performance, however, is limited because they cannot account for the inter-layer dependencies within the attention module, a defining feature of Transformers. In this paper, we thus propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-by-layer for efficiency while targeting attention-wise reconstruction to account for cross-layer dependencies. Through extensive experiments on various language models and a complexity analysis, we demonstrate that aespa is both accurate and efficient in quantizing Transformer models.