As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, intelligent request routing, that is, selecting the right model for each query at inference time, has become a critical systems challenge. We present vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality (MoM) model deployments. The central innovation is composable signal orchestration: the system extracts heterogeneous signals from each request, ranging from sub-millisecond heuristic features (keyword patterns, language detection, context length, role-based authorization) to neural classifiers (domain, embedding similarity, factual grounding, modality), and composes them through configurable Boolean decision rules into deployment-specific routing policies. Different deployment scenarios (multi-cloud enterprise, privacy-regulated, cost-optimized, latency-sensitive) are expressed as different signal-decision configurations over the same architecture, without code changes. Matched decisions drive semantic model routing: more than a dozen selection algorithms analyze request characteristics to find the best model cost-effectively, while per-decision plugin chains enforce privacy and safety constraints (jailbreak detection, PII filtering, and hallucination detection via the three-stage HaluGate pipeline). The system provides OpenAI API support for stateful multi-turn conversations, multi-endpoint and multi-provider routing across heterogeneous backends (vLLM, OpenAI, Anthropic, Azure, Bedrock, Gemini, Vertex AI), and a pluggable authorization factory supporting multiple auth providers. Deployed in production as an Envoy external processor, the architecture demonstrates that composable signal orchestration enables a single routing framework to serve diverse deployment scenarios with differentiated cost, privacy, and safety policies.
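
To make the idea of composing signals through Boolean decision rules concrete, the following is a minimal sketch and not the framework's actual API or configuration format. All names here (`keyword_signal`, `context_length_signal`, `role_signal`, `all_of`, `any_of`, `Decision`, `route`), the `priority` tie-breaking field, and the model names are hypothetical and introduced only for illustration; neural classifier signals are omitted, and only cheap heuristic signals are shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A signal maps a request to a Boolean (hypothetical simplification:
# real signals could also carry scores or labels).
Signal = Callable[[dict], bool]
Rule = Callable[[Dict[str, bool]], bool]


def keyword_signal(words: List[str]) -> Signal:
    """Fires when any of the given keywords occurs in the request text."""
    def check(request: dict) -> bool:
        text = request["text"].lower()
        return any(w in text for w in words)
    return check


def context_length_signal(min_chars: int) -> Signal:
    """Fires when the request text has at least min_chars characters."""
    return lambda request: len(request["text"]) >= min_chars


def role_signal(role: str) -> Signal:
    """Fires when the request carries the given role."""
    return lambda request: request.get("role") == role


# Boolean combinators over named signal results.
def all_of(*names: str) -> Rule:
    return lambda fired: all(fired[n] for n in names)


def any_of(*names: str) -> Rule:
    return lambda fired: any(fired[n] for n in names)


@dataclass
class Decision:
    name: str
    rule: Rule
    model: str          # hypothetical model identifier
    priority: int = 0   # assumption: higher priority is checked first


def route(request: dict, signals: Dict[str, Signal],
          decisions: List[Decision], default_model: str) -> str:
    """Evaluate every signal once, then return the model of the
    highest-priority decision whose rule matches."""
    fired = {name: sig(request) for name, sig in signals.items()}
    for d in sorted(decisions, key=lambda d: d.priority, reverse=True):
        if d.rule(fired):
            return d.model
    return default_model


# One possible deployment configuration; a different deployment would
# swap these tables without touching route().
SIGNALS = {
    "code_keywords": keyword_signal(["python", "traceback"]),
    "long_context": context_length_signal(200),
    "admin": role_signal("admin"),
}
DECISIONS = [
    Decision("long_code", all_of("code_keywords", "long_context"),
             "large-code-model", priority=2),
    Decision("code", any_of("code_keywords"),
             "small-code-model", priority=1),
    Decision("admin_general", all_of("admin"),
             "premium-model", priority=0),
]
```

In this sketch, changing the deployment scenario means editing only `SIGNALS` and `DECISIONS`, which mirrors the claim that different scenarios are different signal-decision configurations over the same architecture.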