S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA introduces Unified Paging, which uses a single memory pool to manage both dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support for LoRA serving), S-LoRA can improve throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA
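
To make the Unified Paging idea concrete, the following is a minimal toy sketch in Python, not the S-LoRA implementation. It assumes one simple accounting model: the pool is split into fixed-size pages, an adapter of rank r occupies r pages, and each generated token's KV cache entry occupies one page. The class name `UnifiedPagePool` and its methods are hypothetical and chosen only for illustration. The point it shows is that adapter weights and KV cache entries draw from the same free list, so pages freed by one kind of tensor can be reused by the other without needing contiguous space.

```python
class UnifiedPagePool:
    """Toy pool of fixed-size pages shared by adapter weights and KV cache.

    Hypothetical illustration only: it tracks page ids, not real GPU memory.
    """

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # ids of unused pages
        self.owners = {}  # owner key -> list of page ids it holds

    def allocate(self, owner, num_pages):
        """Give `owner` any `num_pages` free pages (they need not be contiguous)."""
        if num_pages > len(self.free_pages):
            raise MemoryError("pool exhausted")
        pages = [self.free_pages.pop() for _ in range(num_pages)]
        self.owners.setdefault(owner, []).extend(pages)
        return pages

    def release(self, owner):
        """Return all pages held by `owner` to the free list."""
        pages = self.owners.pop(owner, [])
        self.free_pages.extend(pages)
        return len(pages)

    def load_adapter(self, name, rank):
        # Assumption: an adapter of rank r needs r pages.
        return self.allocate(("adapter", name), rank)

    def append_token(self, request_id):
        # Assumption: each new token's KV cache entry needs one page.
        return self.allocate(("kv", request_id), 1)


if __name__ == "__main__":
    pool = UnifiedPagePool(num_pages=16)
    pool.load_adapter("adapter-a", rank=8)
    pool.load_adapter("adapter-b", rank=4)
    for _ in range(3):
        pool.append_token("request-1")
    print("free pages:", len(pool.free_pages))
    pool.release(("adapter", "adapter-a"))  # evict an adapter no request uses
    print("free pages after eviction:", len(pool.free_pages))
```

In this sketch, evicting an adapter immediately makes its pages available for KV cache growth, or for a different adapter of another rank, which is the fragmentation-avoidance property the abstract attributes to a unified pool.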
