The widespread adoption of LLMs has driven an exponential rise in their deployment, imposing substantial demands on inference clusters. These clusters must handle numerous concurrent queries for different LLM downstream tasks. To serve many tasks despite vast LLM parameter counts, methods like Low-Rank Adaptation (LoRA) enable task-specific fine-tuning while sharing most of the base LLM across tasks, allowing concurrent task serving with minimal memory overhead. However, existing LLM serving systems are inefficient in this setting: they overlook workload heterogeneity, impose high link bandwidth demands through frequent adapter loading, and suffer from head-of-line blocking in their schedulers. To address these challenges, we present Chameleon, a novel LLM serving system optimized for many-adapter environments, built on two core ideas: adapter caching and adapter-aware scheduling. First, Chameleon caches popular adapters in GPU memory, minimizing adapter loading times; importantly, it uses otherwise idle GPU memory, avoiding extra memory costs. Second, Chameleon uses non-preemptive multi-queue scheduling to efficiently account for workload heterogeneity, simultaneously preventing head-of-line blocking and starvation. We implement Chameleon on top of a state-of-the-art LLM serving platform and evaluate it with real-world production traces and open-source LLMs. Under high load, Chameleon reduces P99 and P50 time-to-first-token (TTFT) latency by 80.7% and 48.1%, respectively, while improving throughput by 1.5x compared to state-of-the-art baselines.
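The two core ideas can be illustrated with a minimal sketch. This is not the paper's implementation: the class and function names, the LRU eviction policy, and the prompt-length thresholds are all hypothetical, chosen only to show the shape of an adapter cache in spare GPU memory and of adapter-aware request routing into size-class queues.

```python
from collections import OrderedDict

class AdapterCache:
    """Illustrative LRU cache for LoRA adapters kept in spare GPU memory.

    Hypothetical sketch: Chameleon's actual policy and capacity management
    are more involved (e.g., popularity-driven, sized to idle GPU memory).
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # adapter_id -> adapter weights (placeholder)

    def get(self, adapter_id):
        if adapter_id in self.cache:
            self.cache.move_to_end(adapter_id)  # mark as recently used
            return self.cache[adapter_id]
        return None  # miss: caller must load the adapter over the link

    def put(self, adapter_id, weights):
        self.cache[adapter_id] = weights
        self.cache.move_to_end(adapter_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

def assign_queue(request, thresholds=(128, 1024)):
    """Route a request to a size-class queue so short requests are not
    stuck behind long ones (head-of-line blocking). Thresholds are
    illustrative, not taken from the paper."""
    short, medium = thresholds
    if request["prompt_len"] <= short:
        return 0  # short-request queue
    if request["prompt_len"] <= medium:
        return 1  # medium-request queue
    return 2      # long-request queue
```

In this sketch each queue would be drained non-preemptively, with the scheduler cycling across queues so that no size class starves while short requests still bypass long-running ones.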