Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy because they require expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. The landscape of GPU types is large and growing, and within these options, higher cost does not always translate to better performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and that different GPU types are most cost-efficient under different LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce Mélange, a GPU allocation framework that navigates these diverse LLM service characteristics and the heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin-packing problem in which GPUs are bins and slices of the service workload are items. The formulation's constraints account for a service's unique characteristics, making Mélange flexible enough to support diverse service settings and heterogeneity-aware enough to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, Mélange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.
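The cost-aware bin-packing idea from the abstract can be illustrated with a toy brute-force solver. This is only a sketch: the GPU names, hourly prices, serviceable request-rate capacities, and the exhaustive search below are hypothetical placeholders, not Mélange's measured values or its actual formulation.

```python
# Toy illustration of cost-aware GPU allocation as bin packing:
# pick counts of each GPU type (bins) whose combined capacity covers
# the service's request rate (items), at minimal total cost.
# All numbers here are made up for illustration.
from itertools import product

# Hypothetical per-hour price and serviceable request rate (req/s)
# for each GPU type under some fixed request size and SLO.
GPUS = {
    "A10G": {"cost": 1.0, "capacity": 3.0},
    "A100": {"cost": 3.7, "capacity": 14.0},
}

def min_cost_allocation(demand, gpus, max_count=8):
    """Brute-force the cheapest mix of GPU counts whose total
    capacity covers `demand` requests/s. Returns (cost, counts)."""
    best = None
    names = list(gpus)
    for counts in product(range(max_count + 1), repeat=len(names)):
        cap = sum(c * gpus[n]["capacity"] for n, c in zip(names, counts))
        if cap < demand:
            continue  # this mix cannot serve the workload
        cost = sum(c * gpus[n]["cost"] for n, c in zip(names, counts))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, counts)))
    return best

cost, alloc = min_cost_allocation(20.0, GPUS)
print(cost, alloc)  # → 5.7 {'A10G': 2, 'A100': 1}
```

With these made-up numbers, the cheapest feasible allocation is a heterogeneous mix (two A10Gs plus one A100, $5.7/hr), beating either homogeneous option (seven A10Gs at $7.0/hr or two A100s at $7.4/hr), mirroring the abstract's observation that mixed allocations are often cheapest. A real solver would replace the exhaustive search with an integer linear program and add SLO-derived constraints.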