The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Over the past year, the vLLM Semantic Router project has published a series of papers spanning four areas: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent. For example, fleet provisioning depends on the routing policy, which depends on the workload mix, which in turn shifts as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPUs, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections. Each direction is grounded in our prior measurements and tiered by maturity, from engineering-ready to open research.
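To make the three dimensions concrete, the following minimal Python sketch shows one way they could fit together for a single static rule in the spirit of context-length pool routing. It is an illustration under stated assumptions, not the project's implementation: the names `Workload`, `Pool`, `route`, and `POOLS`, the field names, and the context-window sizes are all hypothetical and do not come from the vLLM Semantic Router codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-request descriptor: what the fleet serves."""
    prompt_tokens: int
    is_agent: bool = False
    multi_turn: bool = False


@dataclass(frozen=True)
class Pool:
    """Hypothetical inference pool: where inference runs."""
    name: str
    max_context_tokens: int


def route(workload: Workload, pools: list[Pool]) -> Pool:
    """Static rule (how a request is dispatched): choose the pool with the
    smallest context window that still fits the prompt."""
    candidates = [p for p in pools if p.max_context_tokens >= workload.prompt_tokens]
    if not candidates:
        raise ValueError("no pool can fit this request")
    return min(candidates, key=lambda p: p.max_context_tokens)


# Illustrative pool sizes only; real deployments would choose their own.
POOLS = [Pool("short-context", 8_192), Pool("long-context", 131_072)]

if __name__ == "__main__":
    chat = Workload(prompt_tokens=600)
    agent = Workload(prompt_tokens=40_000, is_agent=True, multi_turn=True)
    print(route(chat, POOLS).name)
    print(route(agent, POOLS).name)
```

The rule here looks only at prompt length. The `is_agent` and `multi_turn` fields are carried along to suggest where the other router families named above (bandit adaptation, RL-based selection, cascading) could plug in by replacing `route` while leaving the `Workload` and `Pool` types unchanged.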
