Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. Smaller models suffice for routine queries, while complex tasks demand more capable models. Static model deployment, however, does not account for the complexity and domain of incoming queries, which leads to suboptimal performance and increased costs. Dynamic routing systems, which adaptively select models based on query characteristics, have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when routing decisions are made, what information they use, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis indicates that effective multi-LLM routing requires balancing competing objectives, and that the best routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while also reducing inference cost. Open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
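As a rough illustration of the cascading paradigm mentioned above, the following minimal sketch tries models from cheapest to most expensive and stops at the first answer whose confidence clears a threshold. It is not a method taken from the survey: the `Model` class, the `cascade` function, the toy models, the cost values, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Model:
    """Hypothetical wrapper around an LLM endpoint (illustrative only)."""
    name: str
    cost: float  # assumed cost per call, in arbitrary units
    generate: Callable[[str], Tuple[str, float]]  # returns (answer, confidence)


def cascade(query: str, models: List[Model], threshold: float = 0.8):
    """Query models from cheapest to most expensive.

    Stops at the first model whose self-reported confidence reaches
    `threshold`; otherwise falls through to the most expensive model.
    Returns (model name, answer, total cost spent).
    """
    if not models:
        raise ValueError("at least one model is required")
    spent = 0.0
    for model in sorted(models, key=lambda m: m.cost):
        answer, confidence = model.generate(query)
        spent += model.cost  # every attempted call is paid for
        if confidence >= threshold:
            break
    return model.name, answer, spent


# Toy stand-ins for real models: the small one is only confident on short queries.
def small_model(query: str) -> Tuple[str, float]:
    confidence = 0.9 if len(query.split()) <= 5 else 0.4
    return f"small:{query}", confidence


def large_model(query: str) -> Tuple[str, float]:
    return f"large:{query}", 0.95


MODELS = [
    Model("large", cost=10.0, generate=large_model),
    Model("small", cost=1.0, generate=small_model),
]

easy = cascade("What is 2+2?", MODELS)
hard = cascade("Prove that the sum of two even integers is always even", MODELS)
```

In this sketch the short query is answered by the small model alone, while the long query escalates and pays for both calls. That extra cost on escalation is the basic trade-off of a cascade; a router that picks a single model up front avoids it, but must predict difficulty before seeing any answer.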
