As large language models (LLMs) are increasingly deployed as service endpoints in systems, the surge in query volume creates significant scheduling challenges. Existing scheduling frameworks mainly target latency optimization while neglecting LLMs' varying capability to serve queries of different difficulty levels, which can lead to wasted computational resources. This paper addresses this challenge by proposing a capability-cost coordinated scheduling framework, ECCOS, for multi-LLM serving, which explicitly constrains response quality and workload to optimize LLM inference cost. Specifically, it introduces a two-stage scheduling pipeline comprising a multi-objective predictor and a constrained optimizer. The predictor estimates both model capabilities and computational costs through training-based and retrieval-based approaches, while the optimizer determines cost-optimal assignments under quality and workload constraints. The paper also introduces QAServe, a dataset of sample-wise response quality and cost collected by zero-shot prompting different LLMs on knowledge QA and mathematical reasoning tasks. Extensive experiments demonstrate that ECCOS improves success rates by 6.30% while reducing costs by 10.15% compared to existing methods, consuming less than 0.5% of the LLM response time. The code is available at: https://github.com/agiresearch/ECCOS.
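The constrained-optimization stage described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's actual algorithm): it assumes the predictor has already produced per-query quality and cost estimates, and greedily assigns each query to the cheapest model that meets a quality threshold while respecting per-model workload caps. All names (`quality`, `cost`, `capacity`, `q_min`) are illustrative.

```python
def schedule(quality, cost, capacity, q_min):
    """Greedy sketch of cost-optimal query-to-model assignment.

    quality[i][j]: predicted success probability of model j on query i
    cost[i][j]:    predicted serving cost of model j on query i
    capacity[j]:   maximum number of queries model j may take (workload cap)
    q_min:         minimum acceptable predicted quality

    Returns a list mapping each query index to a model index.
    """
    n, m = len(quality), len(capacity)
    load = [0] * m            # current workload per model
    assignment = []
    for i in range(n):
        # Models that satisfy both the quality and workload constraints.
        feasible = [j for j in range(m)
                    if quality[i][j] >= q_min and load[j] < capacity[j]]
        if feasible:
            # Among feasible models, take the cheapest one.
            j = min(feasible, key=lambda j: cost[i][j])
        else:
            # No model clears the quality bar: fall back to the
            # highest-quality model with spare capacity.
            open_models = [j for j in range(m) if load[j] < capacity[j]]
            j = max(open_models, key=lambda j: quality[i][j])
        load[j] += 1
        assignment.append(j)
    return assignment
```

A real implementation would solve this jointly (e.g. as an integer program over all queries) rather than greedily per query, but the sketch shows how the quality and workload constraints shape the cost-minimizing assignment.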