Smart Routing: Cost-Effective Multi-LLM Serving for Multi-Core AIOS

As large language models (LLMs) are increasingly deployed as service endpoints in systems, the surge in query volume creates significant scheduling challenges. Existing scheduling frameworks mainly target latency optimization and neglect the fact that LLMs differ in their ability to serve queries of different difficulty levels, which can waste computational resources. For example, simple queries can be safely handled by small, fast, and cheap LLMs, while complex and difficult queries need to be handled by large, slow, and expensive LLMs. This paper addresses this challenge by proposing ECCOS, an efficient capability-cost coordinated scheduling framework for multi-LLM serving, which explicitly constrains response quality and workload in order to optimize LLM inference cost. Specifically, it introduces a two-stage scheduling approach built on a multi-objective predictor and a constrained optimizer. The predictor estimates both model capabilities and computational costs through training-based and retrieval-based approaches, while the optimizer determines cost-optimal assignments under quality and workload constraints. The paper also introduces QAServe, a dataset of sample-wise response quality and costs collected by zero-shot prompting different LLMs on knowledge QA and mathematical reasoning tasks. Extensive experiments demonstrate that ECCOS improves success rates by 6.30% while reducing costs by 10.15% compared to existing methods, with a scheduling overhead of less than 0.5% of LLM response time. The code is available at https://github.com/agiresearch/ECCOS, and the proposed smart routing mechanism has been integrated into AIOS, the AI Agent Operating System, at https://github.com/agiresearch/AIOS.
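To make the second stage more concrete, the following is a minimal sketch of routing under a quality constraint and a per-model workload constraint. It is not the ECCOS optimizer: it uses a simple greedy rule in place of a real constrained optimization, and it assumes the per-query quality and cost predictions from the first stage are already available. The function name `route_queries`, the model names `small` and `large`, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def route_queries(
    pred_quality: List[Dict[str, float]],
    pred_cost: List[Dict[str, float]],
    capacity: Dict[str, int],
    min_quality: float,
) -> List[Optional[str]]:
    """Greedy stand-in for a constrained optimizer (illustrative only).

    Each query goes to the cheapest model whose predicted quality meets
    `min_quality` and which still has free capacity. If no model meets the
    threshold, the query falls back to the best available model; if no
    model has capacity left, the query is left unassigned (None).
    """
    remaining = dict(capacity)  # copy so the caller's dict is not mutated
    assignment: List[Optional[str]] = []
    for quality, cost in zip(pred_quality, pred_cost):
        # Models that still have room in their workload budget.
        available = [m for m in quality if remaining.get(m, 0) > 0]
        if not available:
            assignment.append(None)
            continue
        # Among available models, keep those meeting the quality constraint.
        feasible = [m for m in available if quality[m] >= min_quality]
        if feasible:
            choice = min(feasible, key=lambda m: cost[m])
        else:
            choice = max(available, key=lambda m: quality[m])
        remaining[choice] -= 1
        assignment.append(choice)
    return assignment


# Hypothetical predictions for three queries and two models.
pred_quality = [
    {"small": 0.90, "large": 0.95},
    {"small": 0.40, "large": 0.85},
    {"small": 0.80, "large": 0.90},
]
pred_cost = [
    {"small": 1.0, "large": 5.0},
    {"small": 1.2, "large": 6.0},
    {"small": 1.0, "large": 5.0},
]
result = route_queries(pred_quality, pred_cost, {"small": 1, "large": 2}, 0.7)
print(result)  # ['small', 'large', 'large']
```

In this toy run the third query would be good enough on the small model, but that model's capacity is already used up, so the query is sent to the large one. A greedy pass like this depends on query order and can miss the cost-optimal assignment, which is why a framework of the kind described above solves the assignment jointly rather than one query at a time.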
