As Large Language Models (LLMs) rapidly grow in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, yet understanding which hardware will deliver on performance requirements remains challenging. In this work we present LLM-Pilot, a first-of-its-kind system for characterizing and predicting the performance of LLM inference services. LLM-Pilot benchmarks LLM inference services under a realistic workload across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot delivers on performance requirements 33% more frequently, whilst reducing costs by 60% on average.