Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate the effectiveness of our approach, with up to a $13\times$ speedup in matrix multiplication compared to NVIDIA's CUTLASS. When integrated into LLMs, we achieve up to $6.7\times$ inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.