Pre-trained large language models (LLMs) often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe keeps the base model on GPUs and dynamically loads activated LoRA adapters from main memory. Because loading an adapter onto the GPU incurs a cold start that substantially delays token generation, CaraServe employs a CPU-assisted approach: it starts prefilling with the activated adapters on CPUs while they are still being loaded onto GPUs, and once loading completes, it switches to the GPUs for generative LoRA inference. CaraServe develops a highly optimized synchronization mechanism to efficiently coordinate LoRA computation on the CPU and GPU. Moreover, CaraServe employs a rank-aware scheduling algorithm to optimally schedule heterogeneous LoRA requests for maximum service-level objective (SLO) attainment. We have implemented CaraServe and evaluated it against state-of-the-art LoRA serving systems. Our results demonstrate that CaraServe can speed up average request serving by up to 1.4$\times$ and achieve an SLO attainment of up to 99%.
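As an illustration of why the adapter computation can run somewhere other than where the base model lives, the standard LoRA formulation adds a low-rank term to the base projection: for input $x$, base weight $W$, and adapter factors $A$ and $B$ of rank $r$, the output is $xW + s \cdot xAB$. The sketch below is a minimal pure-Python rendering of that formula, not CaraServe's implementation; the names `matmul`, `lora_delta`, `lora_forward`, and the `scale` parameter are hypothetical and chosen only for this example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product; adequate for a toy example only."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "shape mismatch"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def lora_delta(x: Matrix, lora_a: Matrix, lora_b: Matrix, scale: float = 1.0) -> Matrix:
    """Adapter-only term s * (x @ A) @ B.

    This term depends only on x and the adapter factors, so it can be
    computed independently of the base-model projection.
    """
    low_rank = matmul(matmul(x, lora_a), lora_b)
    return [[scale * v for v in row] for row in low_rank]


def lora_forward(x: Matrix, w_base: Matrix, lora_a: Matrix, lora_b: Matrix,
                 scale: float = 1.0) -> Matrix:
    """Base projection x @ W plus the adapter term."""
    base_out = matmul(x, w_base)
    delta = lora_delta(x, lora_a, lora_b, scale)
    return [[p + q for p, q in zip(rb, rd)] for rb, rd in zip(base_out, delta)]


# Toy usage: one token with hidden size 2 and a rank-1 adapter.
x = [[1.0, 2.0]]
w_base = [[1.0, 0.0], [0.0, 1.0]]   # identity stands in for the base weight
lora_a = [[1.0], [0.0]]             # shape (2, 1): projects down to rank 1
lora_b = [[0.5, 0.5]]               # shape (1, 2): projects back up
print(lora_forward(x, w_base, lora_a, lora_b))
```

Because `lora_delta` needs only the input and the adapter factors, the two terms can in principle be produced on different devices and summed afterwards, which is the property a CPU-assisted prefill relies on. How CaraServe synchronizes the two sides is not specified in the abstract and is not modeled here.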