vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Xunzhuo Liu, Huamin Chen, Samzong Lu, Yossi Ovadia, Guohong Wen, Hao Wu, Zhengda Tan, Jintao Zhang, Senan Zedan, Yehudit Kerido, Liav Weiss, Haichen Zhang, Bishen Yu, Asaad Balum, Noa Limoy, Abdallah Samara, Baofa Fan, Brent Salisbury, Ryan Cook, Zhijie Wang, Qiping Pan, Rehan Khan, Avishek Goswami, Houston H. Zhang, Shuyi Wang, Ziang Tang, Fang Han, Zohaib Hassan, Jianqiao Zheng, Avinash Changrani, Xue Liu, Bowei He

From arXiv, Technical Report

As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, the problem of intelligent request routing, that is, selecting the right model for each query at inference time, has become a critical systems challenge. We present vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality (MoM) model deployments. The architecture follows two complementary Shannon-inspired views. In the information-theoretic regime, signal extraction reduces the entropy of "which model?" by distilling routing-relevant information from raw queries. In the Boolean-algebraic regime, the decision engine composes functionally complete routing policies from signal conditions. The central innovation is composable signal orchestration: thirteen heterogeneous signal types, spanning sub-millisecond heuristics and neural classifiers for semantics, safety, and modality, are composed through configurable Boolean decision rules into deployment-specific routing policies, so that fundamentally different scenarios (multi-cloud enterprise, privacy-regulated, cost-optimized) are expressed as different configurations over the same architecture. Matched decisions drive semantic model routing via thirteen selection algorithms, while per-decision plugin chains enforce safety constraints, including a three-stage HaluGate hallucination detection pipeline, and provide a lightweight episodic memory system with ReflectionGate for personalized multi-turn context. A typed neural-symbolic DSL specifies these routing policies and compiles them to multiple deployment targets, enabling configuration-first adaptation without code changes. Together, these components show that composable signal orchestration enables a single framework to serve diverse deployment scenarios with differentiated cost, privacy, and safety policies.
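The idea of composing signal conditions through Boolean decision rules can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the vLLM Semantic Router codebase: the signal names (`has_code`, `mentions_pii`, `is_long`), the tuple-based rule encoding, the `Decision` class, the `route` function, and the model names are all hypothetical, and the toy keyword heuristics stand in for the heuristic and neural signals described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# A signal maps a raw query to a Boolean condition.
# These toy heuristics are hypothetical stand-ins for real signals.
Signal = Callable[[str], bool]

SIGNALS: Dict[str, Signal] = {
    "has_code": lambda q: "def " in q or "```" in q,
    "mentions_pii": lambda q: "ssn" in q.lower() or "passport" in q.lower(),
    "is_long": lambda q: len(q.split()) > 50,
}

# A rule is either a signal name or a nested tuple:
# ("and", r1, r2, ...), ("or", r1, r2, ...), or ("not", r).
Rule = Union[str, Tuple]


def evaluate(rule: Rule, fired: Dict[str, bool]) -> bool:
    """Evaluate a Boolean rule against already-extracted signal values."""
    if isinstance(rule, str):
        return fired[rule]
    op, *args = rule
    if op == "not":
        return not evaluate(args[0], fired)
    if op == "and":
        return all(evaluate(a, fired) for a in args)
    if op == "or":
        return any(evaluate(a, fired) for a in args)
    raise ValueError(f"unknown operator: {op}")


@dataclass
class Decision:
    name: str
    rule: Rule
    model: str  # hypothetical model identifier


def route(query: str, decisions: List[Decision], default_model: str) -> str:
    """Extract all signals once, then return the first matching decision's model."""
    fired = {name: signal(query) for name, signal in SIGNALS.items()}
    for decision in decisions:
        if evaluate(decision.rule, fired):
            return decision.model
    return default_model


# Two deployment scenarios expressed as different configurations
# over the same signals and the same routing function.
PRIVACY_POLICY = [
    Decision("pii_stays_local", "mentions_pii", "local-model"),
    Decision("code_to_coder", ("and", "has_code", ("not", "mentions_pii")), "code-model"),
]
COST_POLICY = [
    Decision("hard_queries", ("or", "has_code", "is_long"), "large-model"),
]
```

Under these assumptions, the privacy-regulated and cost-optimized scenarios differ only in their decision lists, while signal extraction and rule evaluation are shared. Because the rule encoding supports `and`, `or`, and `not`, any Boolean function of the signals can be expressed, which is the sense in which such a rule language is functionally complete.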

