Large language model (LLM) inference is coming to dominate artificial intelligence workloads, creating an urgent need for cost-efficient inference serving. Existing work focuses on single-worker optimization and does not consider cluster-level management of both inference queries and computing resources. However, placing requests and managing resources without regard to query features easily causes service level objective (SLO) violations or resource underutilization. Providers are then forced to allocate extra computing resources to guarantee user experience, which raises serving costs. In this paper, we introduce Aladdin, a scheduler that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts the minimal computing resources, and the corresponding serving-worker configuration, required to fulfill the SLOs of all queries. It then places the queries on the serving workers according to prefill and decode latency models of batched LLM inference, so as to maximize each worker's utilization. Results show that Aladdin reduces the serving cost of a single model by up to 71% at the same SLO level compared with the baselines, a saving that can amount to millions of dollars per year.
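The placement idea described in the abstract can be illustrated with a small sketch. It is not Aladdin's algorithm: the abstract does not specify the latency models, the SLO metrics, or the placement heuristic, so everything here is an assumption made for illustration. The sketch assumes linear prefill and decode latency models with made-up coefficients, two hypothetical SLO metrics (time to first token and time per output token), a known predicted output length per query, and a first-fit-decreasing heuristic that opens a new worker only when no existing worker can accept a query without violating an SLO.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    prompt_tokens: int   # input length, known at arrival
    output_tokens: int   # predicted output length (assumed to be available)


@dataclass
class Worker:
    queries: list = field(default_factory=list)


# Hypothetical latency models: the linear form and all coefficients are
# illustrative assumptions, not values taken from the paper.
def prefill_latency(batch):
    """Seconds to first token, linear in the total prompt tokens of the batch."""
    return 0.02 + 0.0002 * sum(q.prompt_tokens for q in batch)


def decode_latency(batch):
    """Seconds per output token, linear in batch size and total context length."""
    context = sum(q.prompt_tokens + q.output_tokens for q in batch)
    return 0.01 + 0.001 * len(batch) + 0.000002 * context


def meets_slo(batch, ttft_slo, tpot_slo):
    """True if the batch satisfies both (hypothetical) latency SLOs."""
    return prefill_latency(batch) <= ttft_slo and decode_latency(batch) <= tpot_slo


def place(queries, ttft_slo=0.5, tpot_slo=0.05):
    """First-fit-decreasing placement: reuse a worker when the SLOs still hold,
    otherwise open a new one. Does not check that a single query is feasible."""
    workers = []
    ordered = sorted(queries,
                     key=lambda q: q.prompt_tokens + q.output_tokens,
                     reverse=True)
    for q in ordered:
        for w in workers:
            if meets_slo(w.queries + [q], ttft_slo, tpot_slo):
                w.queries.append(q)
                break
        else:
            # No existing worker can take the query, so scale out by one worker.
            workers.append(Worker(queries=[q]))
    return workers


if __name__ == "__main__":
    stream = [Query(prompt_tokens=1000, output_tokens=200) for _ in range(10)]
    for i, w in enumerate(place(stream)):
        print(f"worker {i}: {len(w.queries)} queries")
```

Under these assumed coefficients, the number of workers returned by `place` serves as a rough estimate of the minimal resources needed for the stream, which mirrors the two roles the abstract attributes to the scheduler: sizing the cluster and packing queries onto workers.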