vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Xunzhuo Liu, Huamin Chen, Samzong Lu, Yossi Ovadia, Guohong Wen, Hao Wu, Zhengda Tan, Jintao Zhang, Senan Zedan, Yehudit Kerido, Liav Weiss, Haichen Zhang, Bishen Yu, Asaad Balum, Noa Limoy, Abdallah Samara, Baofa Fan, Brent Salisbury, Ryan Cook, Zhijie Wang, Qiping Pan, Rehan Khan, Avishek Goswami, Houston H. Zhang, Shuyi Wang, Ziang Tang, Fang Han, Zohaib Hassan, Jianqiao Zheng, Avinash Changrani

From arXiv, Technical Report

As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, intelligent request routing, that is, selecting the right model for each query at inference time, has become a critical systems challenge. We present vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality (MoM) model deployments. The central innovation is composable signal orchestration: the system extracts heterogeneous signals from each request, ranging from sub-millisecond heuristic features (keyword patterns, language detection, context length, role-based authorization) to neural classifiers (domain, embedding similarity, factual grounding, modality), and composes them through configurable Boolean decision rules into deployment-specific routing policies. Different deployment scenarios (multi-cloud enterprise, privacy-regulated, cost-optimized, latency-sensitive) are expressed as different signal-decision configurations over the same architecture, without code changes. Matched decisions drive semantic model routing: more than a dozen selection algorithms analyze request characteristics to find the best model cost-effectively, while per-decision plugin chains enforce privacy and safety constraints (jailbreak detection, PII filtering, and hallucination detection via the three-stage HaluGate pipeline). The system provides OpenAI API support for stateful multi-turn conversations, multi-endpoint and multi-provider routing across heterogeneous backends (vLLM, OpenAI, Anthropic, Azure, Bedrock, Gemini, Vertex AI), and a pluggable authorization factory supporting multiple auth providers. Deployed in production as an Envoy external processor, the architecture demonstrates that composable signal orchestration enables a single routing framework to serve diverse deployment scenarios with differentiated cost, privacy, and safety policies.
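
To make the idea of composing signals through Boolean decision rules concrete, here is a minimal Python sketch. It is not the vLLM Semantic Router implementation or its configuration format. Every name in it (`keyword_signal`, `context_length_signal`, `role_signal`, `all_of`, `any_of`, `negate`, `Decision`, `route`, and the model names) is hypothetical, and only cheap heuristic signals are shown. A neural classifier would plug in as just another function from a request to a boolean. The priority-based tie-break is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: a request is a plain dict, a signal is a predicate on it.
Request = Dict[str, str]
Signal = Callable[[Request], bool]


def keyword_signal(*keywords: str) -> Signal:
    """Heuristic signal: true if any keyword occurs in the prompt."""
    lowered = [k.lower() for k in keywords]
    return lambda req: any(k in req["prompt"].lower() for k in lowered)


def context_length_signal(min_chars: int) -> Signal:
    """Heuristic signal: true if the prompt has at least min_chars characters."""
    return lambda req: len(req["prompt"]) >= min_chars


def role_signal(role: str) -> Signal:
    """Heuristic signal: true if the caller has the given role."""
    return lambda req: req.get("role") == role


# Boolean combinators used to compose signals into decision rules.
def all_of(*signals: Signal) -> Signal:
    return lambda req: all(s(req) for s in signals)


def any_of(*signals: Signal) -> Signal:
    return lambda req: any(s(req) for s in signals)


def negate(signal: Signal) -> Signal:
    return lambda req: not signal(req)


@dataclass
class Decision:
    name: str
    rule: Signal       # Boolean composition of signals
    model: str         # hypothetical target model name
    priority: int = 0  # assumed tie-break: higher priority wins


def route(request: Request, decisions: List[Decision], default_model: str) -> str:
    """Return the model of the highest-priority matching decision."""
    matched = [d for d in decisions if d.rule(request)]
    if not matched:
        return default_model
    return max(matched, key=lambda d: d.priority).model


# A deployment policy is just data: swapping this list changes routing behavior
# without touching route().
DECISIONS = [
    Decision(
        name="code",
        rule=all_of(keyword_signal("traceback", "compile"), negate(role_signal("guest"))),
        model="code-model",
        priority=10,
    ),
    Decision(name="long", rule=context_length_signal(200), model="long-context-model", priority=5),
]

if __name__ == "__main__":
    req = {"prompt": "Why does this traceback appear?", "role": "engineer"}
    print(route(req, DECISIONS, "general-model"))
```

In this sketch a deployment scenario corresponds to a different `DECISIONS` list over the same `route` function, which mirrors the abstract's claim that scenarios differ in configuration rather than code.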
