The widespread adoption of LLMs has driven an exponential rise in their deployment, imposing substantial demands on inference clusters. These clusters must handle numerous concurrent queries for different LLM downstream tasks. To serve many tasks despite vast LLM parameter counts, methods like Low-Rank Adaptation (LoRA) enable task-specific fine-tuning while sharing most of the base LLM across tasks, allowing concurrent task serving with minimal memory overhead. However, existing LLM serving systems are inefficient in this setting: they overlook workload heterogeneity, impose high link bandwidth demands through frequent adapter loading, and suffer from head-of-line blocking in their schedulers. To address these challenges, we present Chameleon, a novel LLM serving system optimized for many-adapter environments, built on two core ideas: adapter caching and adapter-aware scheduling. First, Chameleon caches popular adapters in GPU memory, minimizing adapter loading times; importantly, it uses otherwise idle GPU memory, avoiding extra memory costs. Second, Chameleon uses non-preemptive multi-queue scheduling to efficiently account for workload heterogeneity, simultaneously preventing head-of-line blocking and starvation. We implement Chameleon on top of a state-of-the-art LLM serving platform and evaluate it with real-world production traces and open-source LLMs. Under high load, Chameleon reduces P99 and P50 time-to-first-token (TTFT) latency by 80.7% and 48.1%, respectively, while improving throughput by 1.5x compared to state-of-the-art baselines.
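The two core ideas can be illustrated with a minimal sketch. This is not the paper's implementation: the class and function names, the LRU eviction policy, and the prompt-length thresholds are all hypothetical, chosen only to show the shape of an adapter cache in spare GPU memory and of adapter-aware request routing into size-class queues.

```python
from collections import OrderedDict

class AdapterCache:
    """Illustrative LRU cache for LoRA adapters kept in spare GPU memory.

    Hypothetical sketch: Chameleon's actual policy and capacity management
    are more involved (e.g., popularity-driven, sized to idle GPU memory).
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # adapter_id -> adapter weights (placeholder)

    def get(self, adapter_id):
        if adapter_id in self.cache:
            self.cache.move_to_end(adapter_id)  # mark as recently used
            return self.cache[adapter_id]
        return None  # miss: caller must load the adapter over the link

    def put(self, adapter_id, weights):
        self.cache[adapter_id] = weights
        self.cache.move_to_end(adapter_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

def assign_queue(request, thresholds=(128, 1024)):
    """Route a request to a size-class queue so short requests are not
    stuck behind long ones (head-of-line blocking). Thresholds are
    illustrative, not taken from the paper."""
    short, medium = thresholds
    if request["prompt_len"] <= short:
        return 0  # short-request queue
    if request["prompt_len"] <= medium:
        return 1  # medium-request queue
    return 2      # long-request queue
```

In this sketch each queue would be drained non-preemptively, with the scheduler cycling across queues so that no size class starves while short requests still bypass long-running ones.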