Towards Cost-effective LLMs Routing with Batch Prompting

Large Language Model (LLM) serving systems must balance task performance against monetary cost. Two prominent optimization techniques have emerged independently: LLM routing, which directs each query to the most cost-effective model in a model pool, and batch prompting, which packs multiple queries into a single invocation to amortize the fixed cost of the shared system prompt. The two techniques are complementary: routing optimizes the model assignment dimension, while batching optimizes the query aggregation dimension, and together they reshape the trade-off between model utility and monetary cost. However, existing approaches explore only one side of this decision space. Motivated by empirical studies of how each technique affects utility and cost, this paper optimizes the two dimensions jointly. We formulate the Route with Batching Problem, which jointly determines the target model and batch size for each query under a total cost budget, and prove that it is NP-hard. To solve this problem, we propose RoBatch, a unified two-stage framework. In the modeling stage, RoBatch constructs a batch-aware proxy utility model that decomposes combinatorial utility estimation into two parts: utility estimation without batching, and recalibration of the model-specific utility degradation caused by batching. In the routing stage, RoBatch employs a greedy scheduling algorithm that progressively upgrades each query's assigned target model and batch size along the cost-utility Pareto frontier until the budget is exhausted. Extensive experiments on six benchmarks across two LLM families (Qwen3 and Gemma3) demonstrate that RoBatch consistently achieves a superior cost-performance Pareto frontier compared with LLM routing and batch prompting baselines.
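
The two stages described above can be illustrated with a minimal sketch. It is not the RoBatch implementation: the linear degradation penalty, the per-token price model, and every name below (`per_query_cost`, `proxy_utility`, `greedy_route`, the `price` and `degradation` fields) are assumptions made for illustration. The sketch also ignores the constraint that a batch must actually be filled with queries routed to the same model, and it uses a plain utility-per-cost greedy rule rather than the paper's Pareto-frontier procedure.

```python
from itertools import product


def per_query_cost(price, system_tokens, query_tokens, batch_size):
    """Cost of one query when the system prompt is shared by batch_size queries."""
    return price * (system_tokens / batch_size + query_tokens)


def proxy_utility(base_utility, degradation, batch_size):
    """Unbatched utility minus an assumed linear, model-specific batching penalty."""
    return max(0.0, base_utility - degradation * (batch_size - 1))


def greedy_route(queries, models, batch_sizes, system_tokens, budget):
    # Enumerate every (model, batch size) option for every query.
    options = []
    for q in queries:
        opts = []
        for name, b in product(models, batch_sizes):
            m = models[name]
            cost = per_query_cost(m["price"], system_tokens, q["tokens"], b)
            util = proxy_utility(q["base_utility"][name], m["degradation"], b)
            opts.append((cost, util, name, b))
        options.append(opts)

    # Start each query at its cheapest option.
    choice = [min(opts, key=lambda o: (o[0], -o[1])) for opts in options]
    spent = sum(c[0] for c in choice)
    if spent > budget:
        raise ValueError("budget too small even for the cheapest assignment")

    # Repeatedly apply the affordable upgrade with the best utility gain per unit cost.
    while True:
        best = None  # (ratio, query index, option)
        for i, opts in enumerate(options):
            cur_cost, cur_util = choice[i][0], choice[i][1]
            for o in opts:
                d_cost, d_util = o[0] - cur_cost, o[1] - cur_util
                if d_util <= 0 or spent + d_cost > budget:
                    continue
                ratio = float("inf") if d_cost <= 0 else d_util / d_cost
                if best is None or ratio > best[0]:
                    best = (ratio, i, o)
        if best is None:
            break  # no affordable upgrade improves utility
        _, i, o = best
        spent += o[0] - choice[i][0]
        choice[i] = o
    return choice, spent


# Hypothetical toy data: two models, two queries, a 100-token system prompt.
models = {
    "small": {"price": 1.0, "degradation": 0.05},
    "large": {"price": 10.0, "degradation": 0.01},
}
queries = [
    {"tokens": 10, "base_utility": {"small": 0.5, "large": 0.9}},
    {"tokens": 10, "base_utility": {"small": 0.8, "large": 0.85}},
]
choice, spent = greedy_route(queries, models, [1, 4], system_tokens=100, budget=500)
```

Each entry of `choice` is a `(cost, utility, model, batch_size)` tuple for the corresponding query. In this toy setting the budget goes to the query that gains the most from the larger model, while the other query stays on the small model unbatched.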

