Transformer-based large language models (LLMs) have achieved great success as model size has grown. LLM size grows by $240\times$ every two years, which outpaces hardware progress and makes model inference increasingly costly. Model quantization is a promising approach to mitigating the widening gap between LLM size and hardware capacity. However, the existence of outliers (values with significantly large magnitudes) in LLMs makes existing quantization methods less effective. Prior outlier-aware quantization schemes adopt sparsity encoding techniques to separate outliers from normal values, a process that requires global coordination (e.g., a global sparsity coordination list). This incurs complex encoding/decoding hardware logic and an extra orchestration controller for the computation between outlier and normal values. As such, these schemes are not hardware-efficient and achieve only sub-optimal quantization benefits. We propose OliVe, an algorithm/architecture co-designed solution that adopts outlier-victim pair (OVP) quantization and handles outlier values locally with low hardware overhead and high performance gains. The key insight of OliVe is that outliers are important while the normal values next to them are not. Thus, those normal values (called victims) can be sacrificed to accommodate outliers. This enables a memory-aligned OVP encoding scheme, which can be efficiently integrated into existing hardware accelerators such as systolic arrays and tensor cores. As a result, an OliVe-based accelerator surpasses the existing outlier-aware accelerator, GOBO, with a 4.5$\times$ speedup and a 4.0$\times$ energy reduction, while also achieving superior model accuracy.
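The outlier-victim pair idea described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the authors' encoding: the abstract does not specify OliVe's bit-level formats, so the function names (`quantize_uniform`, `ovp_quantize`), the `threshold` parameter, and the grid sizes (`normal_levels`, `outlier_step_factor`, `outlier_levels`) are all assumptions chosen for illustration. It only shows the local decision the abstract describes: values are handled in adjacent pairs, and when a pair contains an outlier, its neighbor (the victim) is zeroed so the outlier can be kept on a coarser, wider-range grid. The output has the same length as the input, which mirrors the memory-aligned property.

```python
from typing import List


def quantize_uniform(x: float, step: float, max_level: int) -> float:
    """Round x to the nearest multiple of `step`, clamped to +/- max_level steps."""
    level = round(x / step)
    level = max(-max_level, min(max_level, level))
    return level * step


def ovp_quantize(
    values: List[float],
    threshold: float,
    normal_levels: int = 7,            # hypothetical: roughly a signed 4-bit grid
    outlier_step_factor: float = 8.0,  # hypothetical: outlier grid is 8x coarser
    outlier_levels: int = 15,          # hypothetical: wider range for outliers
) -> List[float]:
    """Simulated (fake) quantization of `values` in adjacent pairs.

    A value counts as an outlier when its magnitude exceeds `threshold`.
    In a pair that contains an outlier, the other value (the victim) is
    set to zero and the outlier is quantized on the coarser grid.
    """
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")

    normal_step = threshold / normal_levels
    outlier_step = normal_step * outlier_step_factor

    out: List[float] = []
    for i in range(0, len(values), 2):
        a, b = values[i], values[i + 1]
        if abs(a) <= threshold and abs(b) <= threshold:
            # Normal-normal pair: both values use the fine grid.
            out.append(quantize_uniform(a, normal_step, normal_levels))
            out.append(quantize_uniform(b, normal_step, normal_levels))
        elif abs(a) >= abs(b):
            # `a` is the (larger) outlier; `b` is sacrificed as the victim.
            out.append(quantize_uniform(a, outlier_step, outlier_levels))
            out.append(0.0)
        else:
            # `b` is the (larger) outlier; `a` is sacrificed as the victim.
            out.append(0.0)
            out.append(quantize_uniform(b, outlier_step, outlier_levels))
    return out


if __name__ == "__main__":
    weights = [1.0, -2.0, 0.5, 20.0]
    print(ovp_quantize(weights, threshold=3.5))
```

In this sketch the decision for each pair depends only on the two values in that pair, so no global list of outlier positions is needed. When both values of a pair exceed the threshold, the sketch keeps the larger one and zeroes the other; this tie-breaking rule is also an assumption.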