Recent studies have revealed that, during inference on generative AI models such as transformers, the importance of different weights exhibits substantial context-dependent variation. This naturally suggests a promising opportunity: adaptively configuring weight quantization to improve generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support for variable-precision arithmetic in modern GPUs and AI accelerators, little prior research has studied how variable weight quantization could be exploited to proportionally improve AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to let CXL memory controllers play an active role in supporting and exploiting runtime-configurable weight quantization. Using the transformer as a representative generative AI model, we carried out experiments that demonstrate the effectiveness of the proposed design solution.
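The context-dependent, variable-precision weight quantization described above can be sketched in a few lines. This is a minimal illustration only, not the paper's actual design: the function names (`quantize_group`, `adaptive_quantize`), the group size, the bit-width choices, and the importance-to-precision policy are all assumptions made for demonstration.

```python
import numpy as np

def quantize_group(w, bits):
    # Symmetric uniform quantization of one weight group to `bits` bits,
    # returned in dequantized form so the approximation error is visible.
    qmax = 2 ** (bits - 1) - 1
    absmax = np.max(np.abs(w))
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def adaptive_quantize(weights, importance, group_size=4, bit_choices=(2, 4, 8)):
    # Hypothetical policy: split weights into fixed-size groups and give
    # more-important groups (by mean importance score) higher precision.
    out = np.empty_like(weights)
    groups = weights.reshape(-1, group_size)
    imp = importance.reshape(-1, group_size).mean(axis=1)
    thresholds = np.quantile(imp, [1 / 3, 2 / 3])  # tercile-based assignment
    out_groups = out.reshape(-1, group_size)
    for i, g in enumerate(groups):
        bits = bit_choices[np.searchsorted(thresholds, imp[i])]
        out_groups[i] = quantize_group(g, bits)
    return out
```

A runtime-configurable memory controller could then fetch each group at its assigned width, so that lower-precision groups cost proportionally less bandwidth and energy.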