The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Over the past year, the vLLM Semantic Router project has published a series of papers spanning four areas: (1) core routing mechanisms (signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection); (2) fleet optimization (fleet provisioning and energy-efficiency analysis); (3) agentic and multimodal routing (multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety); and (4) governance and standards (inference routing protocols and multi-provider API extensions). Each paper tackled a specific problem in LLM inference, but the problems are not independent. For example, fleet provisioning depends on the routing policy, which depends on the workload mix, which in turn shifts as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPUs, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections. Each direction is grounded in our prior measurements and tiered by maturity, from engineering-ready to open research.
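As a reading aid, the three WRP dimensions and the 3x3 interaction matrix can be sketched as a small data structure. This is a minimal illustration, not the project's implementation: the names `Dimension`, `AXES`, and `interaction_matrix` are hypothetical, and it assumes that both the rows and the columns of the matrix are indexed by the three dimensions. The example axes are copied from the lists above. No cell is marked as covered or open, since that assignment is not given here.

```python
from enum import Enum
from itertools import product


class Dimension(Enum):
    """The three WRP dimensions (names chosen for this sketch)."""
    WORKLOAD = "workload"  # what the fleet serves
    ROUTER = "router"      # how each request is dispatched
    POOL = "pool"          # where inference runs


# Example axes per dimension, taken from the abstract's own lists.
AXES = {
    Dimension.WORKLOAD: [
        "chat vs. agent",
        "single-turn vs. multi-turn",
        "warm vs. cold",
        "prefill-heavy vs. decode-heavy",
    ],
    Dimension.ROUTER: [
        "static semantic rules",
        "online bandit adaptation",
        "RL-based model selection",
        "quality-aware cascading",
    ],
    Dimension.POOL: [
        "homogeneous vs. heterogeneous GPUs",
        "disaggregated prefill/decode",
        "KV-cache topology",
    ],
}


def interaction_matrix():
    """Build an empty 3x3 matrix keyed by (row, column) dimension pairs.

    Assumption: rows and columns are both indexed by the three dimensions.
    Each cell starts as None, a placeholder for prior work or open directions.
    """
    return {(row, col): None for row, col in product(Dimension, repeat=2)}


matrix = interaction_matrix()
```

Keying cells by dimension pairs keeps each interaction explicit, for example `(Dimension.WORKLOAD, Dimension.POOL)` for how the workload mix affects pool provisioning.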
