Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\times$ over the best baseline, and delivers up to 2.39$\times$ higher goodput under scarce resource availability.
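To make the idea of jointly allocating heterogeneous GPUs across several models concrete, here is a minimal sketch. It is not Coral's formulation or solver. It is a toy brute-force search that picks how many replicas of each model to place on each GPU type, so that every model's throughput demand is met at minimum hourly cost without exceeding GPU availability. All names and numbers (`GPUS`, `THROUGHPUT`, `DEMAND`, `solve`) are hypothetical, and the per-replica serving strategy that the abstract also mentions is deliberately left out.

```python
from itertools import product

# Hypothetical inputs (not from the paper): hourly price and available count per GPU type.
GPUS = {"A": {"price": 1.0, "avail": 3}, "B": {"price": 2.5, "avail": 2}}
# Hypothetical throughput (requests/s) of one replica of a model on one GPU type.
THROUGHPUT = {("m1", "A"): 4, ("m1", "B"): 12, ("m2", "A"): 5, ("m2", "B"): 9}
# Hypothetical throughput demand (requests/s) per model.
DEMAND = {"m1": 12, "m2": 10}


def solve(gpus, throughput, demand):
    """Exhaustively search replica counts per (model, GPU type) for the cheapest feasible plan."""
    models = sorted(demand)
    types = sorted(gpus)
    keys = [(m, g) for m in models for g in types]
    # Each (model, GPU type) pair can use between 0 and all available GPUs of that type.
    ranges = [range(gpus[g]["avail"] + 1) for (_, g) in keys]

    best_cost, best_plan = None, None
    for counts in product(*ranges):
        plan = dict(zip(keys, counts))
        # Availability: replicas of all models on one GPU type share the same pool.
        if any(sum(plan[(m, g)] for m in models) > gpus[g]["avail"] for g in types):
            continue
        # Demand: each model's aggregate throughput must cover its demand.
        if any(
            sum(plan[(m, g)] * throughput[(m, g)] for g in types) < demand[m]
            for m in models
        ):
            continue
        cost = sum(n * gpus[g]["price"] for (_, g), n in plan.items())
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_cost, best_plan


cost, plan = solve(GPUS, THROUGHPUT, DEMAND)
print(cost, plan)
```

Exhaustive search is only workable at this toy scale, since the search space grows exponentially with the number of models and GPU types. That growth is presumably why the abstract emphasizes a decomposition that shortens online solve time.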
