vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Xunzhuo Liu, Huamin Chen, Samzong Lu, Yossi Ovadia, Guohong Wen, Hao Wu, Zhengda Tan, Jintao Zhang, Senan Zedan, Yehudit Kerido, Liav Weiss, Haichen Zhang, Bishen Yu, Asaad Balum, Noa Limoy, Abdallah Samara, Baofa Fan, Brent Salisbury, Ryan Cook, Zhijie Wang, Qiping Pan, Rehan Khan, Avishek Goswami, Houston H. Zhang, Shuyi Wang, Ziang Tang, Fang Han, Zohaib Hassan, Jianqiao Zheng, Avinash Changrani, Xue Liu, Bowei He

From arXiv, Technical Report

As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, intelligent request routing (selecting the right model for each query at inference time) has become a critical systems challenge. We present vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality (MoM) model deployments. The central innovation is composable signal orchestration: the system extracts heterogeneous signal types from each request, ranging from sub-millisecond heuristic features (keyword patterns, language detection, context length, role-based authorization) to neural classifiers (domain, embedding similarity, factual grounding, modality), and composes them through configurable Boolean decision rules into deployment-specific routing policies. Different deployment scenarios (multi-cloud enterprise, privacy-regulated, cost-optimized, latency-sensitive) are expressed as different signal-decision configurations over the same architecture, without code changes. Matched decisions drive semantic model routing: over a dozen selection algorithms analyze request characteristics to find the best model cost-effectively, while per-decision plugin chains enforce privacy and safety constraints (jailbreak detection, PII filtering, hallucination detection via the three-stage HaluGate pipeline). The system provides OpenAI API support for stateful multi-turn conversations, multi-endpoint and multi-provider routing across heterogeneous backends (vLLM, OpenAI, Anthropic, Azure, Bedrock, Gemini, Vertex AI), and a pluggable authorization factory supporting multiple auth providers. Deployed in production as an Envoy external processor, the architecture demonstrates that composable signal orchestration enables a single routing framework to serve diverse deployment scenarios with differentiated cost, privacy, and safety policies.
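The idea of composing signals through Boolean decision rules can be illustrated with a minimal Python sketch. This is not the vLLM Semantic Router implementation or its configuration format: the signal names, the rule encoding, the `Decision` class, the priority field, and the model names are all hypothetical, and only cheap heuristic signals stand in for the full set of heuristic and neural signals described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# Hypothetical signal extractors: each maps the request text to a boolean.
# Real deployments would also include neural classifiers; those are omitted.
SIGNALS: Dict[str, Callable[[str], bool]] = {
    "keyword_code": lambda text: any(
        k in text.lower() for k in ("python", "compile", "stack trace")
    ),
    "long_context": lambda text: len(text.split()) > 200,
}

# A rule is a signal name, or a nested (operator, [sub-rules]) tuple.
Rule = Union[str, Tuple[str, list]]


@dataclass
class Decision:
    name: str
    rule: Rule
    model: str      # hypothetical backend model name
    priority: int   # assumption: higher priority is checked first


def evaluate(rule: Rule, fired: Dict[str, bool]) -> bool:
    """Recursively evaluate a Boolean rule against extracted signals."""
    if isinstance(rule, str):
        return fired.get(rule, False)
    op, operands = rule
    results = [evaluate(r, fired) for r in operands]
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    if op == "not":
        return not results[0]
    raise ValueError(f"unknown operator: {op}")


def route(text: str, decisions: List[Decision], default: str) -> str:
    """Extract all signals once, then return the model of the first match."""
    fired = {name: extract(text) for name, extract in SIGNALS.items()}
    for decision in sorted(decisions, key=lambda d: d.priority, reverse=True):
        if evaluate(decision.rule, fired):
            return decision.model
    return default


# Example policy: the same code serves other policies by changing this list.
DECISIONS = [
    Decision("code", ("and", ["keyword_code", ("not", ["long_context"])]),
             "code-model", priority=10),
    Decision("long", "long_context", "long-context-model", priority=5),
]

print(route("Why does this Python stack trace appear?", DECISIONS, "general-model"))
```

Under these assumptions, changing the deployment scenario means editing only the `DECISIONS` list, which mirrors the abstract's claim that different scenarios are different signal-decision configurations over the same architecture.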

