Low-bit quantization is a promising technique for efficient transformer inference, as it reduces both computational and memory overhead. However, aggressive bitwidth reduction remains challenging: activation outliers cause severe accuracy degradation. Existing methods, such as outlier handling and group quantization, achieve high accuracy but incur substantial energy consumption. To address this, we propose SeVeDo, an energy-efficient SVD-based heterogeneous accelerator that structurally separates outlier-sensitive components into a high-precision low-rank path while executing the remaining computations in a low-bit residual datapath with group quantization. To further enhance efficiency, Hierarchical Group Quantization (HGQ) combines coarse-grained floating-point scaling with fine-grained shifting, effectively reducing dequantization cost. In addition, SVD-guided mixed precision (SVD-MP) statically allocates higher bitwidths to precision-sensitive components identified through low-rank decomposition, thereby minimizing floating-point operation cost. Experimental results show that SeVeDo achieves a peak energy efficiency of 13.8 TOPS/W, surpassing conventional designs, with 12.7 TOPS/W on the ViT-Base and 13.4 TOPS/W on the Llama2-7B benchmarks.
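The idea behind hierarchical group quantization can be illustrated with a minimal sketch: one floating-point scale per coarse group, plus a power-of-two shift per fine subgroup, so dequantization needs only cheap shifts at the fine granularity and a single FP multiply per group. This is an assumption-laden toy (group sizes, shift range, and the scale rule are invented here, not taken from the paper):

```python
# Illustrative sketch of hierarchical group quantization:
# coarse-grained FP scale + fine-grained power-of-two shift.
# Group/subgroup sizes and the shift rule are hypothetical choices.
import numpy as np

def hgq_quantize(x, group=64, subgroup=16, bits=4):
    """Quantize a 1-D tensor: one FP scale per group of `group` values,
    one power-of-two shift per subgroup of `subgroup` values."""
    qmax = 2 ** (bits - 1) - 1
    x = x.reshape(-1, group)
    # Coarse scale: map each group's max magnitude to qmax.
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    sub = x.reshape(x.shape[0], group // subgroup, subgroup)
    # Fine shift: extra power-of-two gain for low-magnitude subgroups,
    # chosen so shifted values still fit in the integer range.
    sub_max = np.maximum(np.abs(sub).max(axis=2, keepdims=True), 1e-12)
    shift = np.clip(np.floor(np.log2(scale[:, None] * qmax / sub_max)), 0, 3)
    q = np.clip(np.round(sub * 2.0 ** shift / scale[:, None]), -qmax - 1, qmax)
    return q, scale, shift

def hgq_dequantize(q, scale, shift, group=64):
    # Per-subgroup shift is a cheap bit-shift in hardware; the FP multiply
    # by `scale` happens only once per coarse group.
    return (q / 2.0 ** shift * scale[:, None]).reshape(-1, group).ravel()
```

Because the shift only adds resolution within a group, the round-trip error per element stays bounded by half the coarse quantization step.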