S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA introduces Unified Paging, which uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.
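To make the Unified Paging idea concrete, here is a minimal toy sketch of a single page pool shared by adapter weights and KV cache entries. It is not S-LoRA's implementation. The class `UnifiedPool`, its methods `alloc` and `release`, and the page granularity (one page per unit of adapter rank, one page per cached token) are assumptions made for illustration, and the abstract does not specify any of them.

```python
class UnifiedPool:
    """Toy page pool shared by adapter weights and KV cache entries.

    Hypothetical illustration only: names and page granularity are assumed.
    """

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # ids of pages not in use
        self.owner = {}                     # tag -> list of page ids held

    def alloc(self, tag, n_pages):
        """Reserve n_pages for tag; the pages need not be contiguous."""
        if tag in self.owner:
            raise KeyError(f"{tag} already holds pages")
        if n_pages > len(self.free):
            raise MemoryError("pool exhausted")
        pages = [self.free.pop() for _ in range(n_pages)]
        self.owner[tag] = pages
        return pages

    def release(self, tag):
        """Return all pages held by tag to the free list."""
        self.free.extend(self.owner.pop(tag))


pool = UnifiedPool(num_pages=64)
pool.alloc(("adapter", "a1"), 8)    # assumed: rank-8 adapter -> 8 pages
pool.alloc(("kv", "req0"), 20)      # assumed: 20 cached tokens -> 20 pages
pool.alloc(("adapter", "a2"), 32)   # assumed: rank-32 adapter -> 32 pages

# req0 finishes; its pages go back to the same pool the adapters draw from.
pool.release(("kv", "req0"))

# A 24-page request now fits by combining the 4 never-used pages with the
# 20 pages just released, even though the two groups are not adjacent.
req1_pages = pool.alloc(("kv", "req1"), 24)
```

Because both kinds of tensor draw fixed-size pages from one pool, space freed by a finished request can be reused by either an adapter or another request's KV cache, and an allocation never needs a contiguous region. That is the sense in which a paged, unified pool limits fragmentation in this sketch.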
