Recent studies have revealed that, during inference on generative AI models such as transformers, the importance of different weights varies substantially with context. This suggests a promising opportunity: adaptively configuring weight quantization to improve generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support for variable-precision arithmetic in modern GPUs and AI accelerators, little prior research has studied how variable weight quantization could be exploited to proportionally improve memory access speed and energy efficiency for AI models. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers to play an active role in supporting and exploiting runtime-configurable weight quantization. Using the transformer as a representative generative AI model, we carried out experiments that demonstrate the effectiveness of the proposed design solution.