As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains open whether speedups are also achievable in \emph{batched} settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. This paper resolves this question positively by describing the design of Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16--32 can be supported with close to maximum ($4\times$) quantization speedup, and larger batch sizes up to 64--128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to end-to-end LLM inference speedups (of up to $2.8\times$) when integrated with the popular vLLM serving engine. Finally, MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.