As the range of applications for Large Language Models (LLMs) continues to grow, the need for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when performance must be balanced against cost. This limitation has led to the development of LLM routing systems, which combine the strengths of multiple models to overcome the constraints of any individual LLM. Yet the absence of a standardized benchmark for evaluating LLM routers hinders progress in this area. To bridge this gap, we present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing and use RouterBench to deliver a comparative analysis of various routing approaches, highlighting their potential and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.