The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver performance per dollar comparable to that of top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize, across all models, the resource allocation and serving strategy of each model replica. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\times$ relative to the best baseline and delivers up to 2.39$\times$ higher goodput under scarce resource availability.