Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy because they require expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types, and, among these options, higher cost does not always lead to higher performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and that different GPU types are most cost-efficient under different LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce M\'elange, a GPU allocation framework that navigates these diverse LLM service characteristics and the heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing M\'elange to be flexible enough to support diverse service settings and heterogeneity-aware enough to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, M\'elange reduces deployment costs by up to 77\% in conversational settings, 33\% in document-based settings, and 51\% in a mixed setting.
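To make the cost-aware allocation idea concrete, below is a minimal sketch of choosing a minimal-cost heterogeneous GPU mix. The GPU names, hourly costs, and sustainable request rates are hypothetical placeholders (real values depend on the model, request size, and SLO, and M\'elange's actual formulation is a bin packing problem over workload slices, not this exhaustive search):

```python
from itertools import product
from math import ceil

# Hypothetical GPU profiles: hourly cost ($) and sustainable request
# rate (req/s) under a fixed SLO. Placeholder numbers for illustration;
# real values come from benchmarking a specific model and workload.
GPU_TYPES = {
    "A10G": {"cost": 1.0, "rate": 3.0},
    "A100": {"cost": 3.7, "rate": 14.0},
}

def min_cost_allocation(demand_rps):
    """Exhaustively search GPU counts whose combined rate covers the
    demand, returning (cost, mix) for the cheapest feasible mix.
    A toy stand-in for a cost-aware bin packing solver."""
    names = list(GPU_TYPES)
    # Upper bound per type: enough of that type to cover demand alone.
    bounds = [ceil(demand_rps / GPU_TYPES[n]["rate"]) for n in names]
    best = None
    for counts in product(*(range(b + 1) for b in bounds)):
        rate = sum(c * GPU_TYPES[n]["rate"] for c, n in zip(counts, names))
        if rate < demand_rps:
            continue  # this mix cannot serve the load within the SLO
        cost = sum(c * GPU_TYPES[n]["cost"] for c, n in zip(counts, names))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, counts)))
    return best
```

With these placeholder profiles, a demand of 17 req/s is served most cheaply by a heterogeneous mix (one A10G plus one A100) rather than by multiples of either single type, illustrating why the cheapest allocation is often mixed.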