The increasing adoption of large language models (LLMs) on heterogeneous computing platforms poses significant challenges to achieving high inference efficiency. To address these efficiency bottlenecks across diverse platforms, this paper proposes Opt4GPTQ, a practical optimization method designed for 4-bit GPTQ-quantized LLM inference on heterogeneous AI accelerators. Built upon the vLLM serving system, Opt4GPTQ integrates three platform-level optimization strategies: Shared Memory Buffering Optimization (SMB-Opt), which caches frequently accessed data in shared memory and employs single-threaded writes; Vectorized Memory Loading Optimization (VML-Opt), which uses vectorized memory operations for efficient data loading; and Inline Assembly Optimization (ILA-Opt), which directly leverages hardware-native vector half-precision addition and fused multiply-accumulate instructions. Experimental results show that Opt4GPTQ effectively improves performance across various models while maintaining original model accuracy, achieving throughput gains of up to 84.42%. This work highlights the critical role of platform-level engineering in enabling efficient LLM inference on emerging architectures and provides a valuable methodology for future heterogeneous platform adaptation.
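To make the three strategies concrete, the following is a minimal CUDA-style kernel sketch, not the paper's actual implementation: the kernel name, buffer names, and the simplified symmetric 4-bit layout (no zero-points or quantization groups) are all illustrative assumptions. It shows a single-threaded shared-memory write (SMB-Opt), a packed 32-bit load carrying eight 4-bit weights (VML-Opt), and a hardware-native packed half-precision FMA issued through inline assembly (ILA-Opt).

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Illustrative sketch only: each block computes one output column of a
// dequantize-and-multiply step over symmetric 4-bit weights (zero point 8).
__global__ void opt4gptq_sketch(const uint32_t* __restrict__ qweight, // 8 x 4-bit weights per word
                                const half*     __restrict__ scales,  // one scale per column (assumed)
                                const half*     __restrict__ x,       // input activations
                                float*          __restrict__ y,       // output vector
                                int words_per_col)
{
    // SMB-Opt: cache the frequently reused column scale in shared memory,
    // written by a single thread to avoid redundant global loads and races.
    __shared__ half s_scale;
    if (threadIdx.x == 0) s_scale = scales[blockIdx.x];
    __syncthreads();

    const uint32_t* col = qweight + (size_t)blockIdx.x * words_per_col;
    half2 acc = __floats2half2_rn(0.0f, 0.0f);

    for (int w = threadIdx.x; w < words_per_col; w += blockDim.x) {
        // VML-Opt: one 32-bit load brings in eight packed 4-bit weights at once.
        uint32_t packed = col[w];
        const half2* xv = reinterpret_cast<const half2*>(x + w * 8);
        for (int i = 0; i < 4; ++i) {
            // Dequantize two adjacent 4-bit weights into a half2 pair.
            half2 wv = __halves2half2(
                __hmul(__int2half_rn((int)((packed >> (8 * i))     & 0xF) - 8), s_scale),
                __hmul(__int2half_rn((int)((packed >> (8 * i + 4)) & 0xF) - 8), s_scale));
            // ILA-Opt: hardware-native packed half-precision fused
            // multiply-accumulate via inline PTX assembly.
            asm volatile("fma.rn.f16x2 %0, %1, %2, %0;"
                         : "+r"(*reinterpret_cast<uint32_t*>(&acc))
                         : "r"(*reinterpret_cast<const uint32_t*>(&wv)),
                           "r"(*reinterpret_cast<const uint32_t*>(&xv[i])));
        }
    }
    // Simplified reduction of the per-thread half2 accumulator.
    atomicAdd(&y[blockIdx.x],
              __half2float(__low2half(acc)) + __half2float(__high2half(acc)));
}
```

A production GPTQ kernel would additionally handle zero-points, quantization groups, and a proper block-level reduction; the sketch keeps only the structure needed to locate where each of the three optimizations acts.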