Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Large language models (LLMs) are increasingly integrated into online services. However, a major challenge in deploying LLMs is their high cost, due primarily to the use of expensive GPU instances. To address this problem, we find that the significant heterogeneity of GPU types presents an opportunity to increase GPU cost efficiency and reduce deployment costs. The broad and growing GPU market creates a diverse option space with varying costs and hardware specifications. Within this space, we show that the relationship between GPU cost and performance is not linear, and we identify three key LLM service characteristics that significantly affect which GPU type is the most cost effective: request size, request rate, and latency service-level objective (SLO). We then present Mélange, a framework for navigating the diversity of GPUs and LLM service specifications to derive the most cost-efficient set of GPUs for a given LLM service. We frame GPU selection as a cost-aware bin-packing problem, where GPUs are bins with a capacity and a cost, and items are request slices defined by a request size and rate. Solving this problem yields the minimal-cost GPU allocation that adheres to a configurable latency SLO. Our evaluations across both real-world and synthetic datasets demonstrate that Mélange can reduce deployment costs by up to 77% compared to using only a single GPU type, highlighting the importance of heterogeneity-aware GPU provisioning decisions for LLM serving. Our source code is publicly available at https://github.com/tyler-griggs/melange-release.
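To make the bin-packing framing concrete, the following is a minimal toy sketch and not the Mélange solver itself. It assumes two hypothetical GPU types with made-up hourly prices, a made-up table of the maximum request rate each GPU sustains within the latency SLO for each request-size bucket, and that the slices assigned to one GPU type can share its instances freely. It then enumerates every assignment of request slices to GPU types and keeps the cheapest one. All names (`GPU_COST`, `MAX_RATE`, `cheapest_allocation`) and all numbers are illustrative assumptions.

```python
from itertools import product
from math import ceil

# Hypothetical hourly prices per GPU type (illustrative, not real quotes).
GPU_COST = {"small_gpu": 1.0, "large_gpu": 3.5}

# Hypothetical capacity: max requests/s one GPU sustains within the latency
# SLO, per request-size bucket. A real system would obtain this by profiling.
MAX_RATE = {
    "small_gpu": {"short": 8.0, "long": 1.0},
    "large_gpu": {"short": 16.0, "long": 6.0},
}


def cheapest_allocation(slices):
    """Brute-force search over slice-to-GPU-type assignments.

    slices: list of (size_bucket, request_rate) items to pack.
    Returns (hourly_cost, gpu_counts, assignment) for the cheapest option.
    """
    gpu_types = list(GPU_COST)
    best = None
    for assignment in product(gpu_types, repeat=len(slices)):
        # Load is the fraction of one GPU's capacity that a slice consumes.
        load = {g: 0.0 for g in gpu_types}
        for (bucket, rate), g in zip(slices, assignment):
            load[g] += rate / MAX_RATE[g][bucket]
        # Round up to whole GPUs; the epsilon guards against float noise.
        counts = {g: ceil(load[g] - 1e-9) for g in gpu_types}
        cost = sum(counts[g] * GPU_COST[g] for g in gpu_types)
        if best is None or cost < best[0]:
            best = (cost, counts, assignment)
    return best


# Two slices of short requests and two slices of long requests.
slices = [("short", 4.0), ("short", 4.0), ("long", 3.0), ("long", 3.0)]
cost, counts, assignment = cheapest_allocation(slices)
print(cost, counts, assignment)
```

With these assumed numbers, the search places the short-request slices on the small GPU and the long-request slices on the large GPU, for a total of 4.5 per hour, whereas serving everything on either single GPU type costs 7.0 per hour. Exhaustive enumeration grows exponentially with the number of slices, so it is only suitable for illustration; a practical solver would use an integer programming formulation or a similar optimization method.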
