Agentic workflows, composed of sequences of interdependent Large Language Model (LLM) calls, have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy: speculative and parallel exploration produces overlapping prompts and intermediate results. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles with LLM serving, achieving up to 1.56x speedup over state-of-the-art agent serving systems on a range of workloads. Our results demonstrate that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.
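To make the query-plan view concrete, the sketch below is a hypothetical illustration and not Helium's actual API: it represents a small agentic workflow as a DAG of LLM-call operators and groups operators by their shared prompt prefix, the kind of structure a cache-aware scheduler could exploit to prefill a common prefix once and reuse its KV state. The names `LLMCallOp`, `WorkflowPlan`, and `shared_prefix_groups` are assumptions made for illustration.

```python
# Hypothetical sketch: an agentic workflow as a query plan whose operators
# are LLM calls, with a simple grouping of operators by shared prompt prefix.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LLMCallOp:
    """One LLM invocation, treated as a first-class plan operator."""
    name: str
    prompt_template: str                              # static prefix + "{...}" variables
    inputs: List[str] = field(default_factory=list)   # upstream operator names


@dataclass
class WorkflowPlan:
    """A DAG of LLM-call operators, analogous to a relational query plan."""
    ops: Dict[str, LLMCallOp] = field(default_factory=dict)

    def add(self, op: LLMCallOp) -> None:
        self.ops[op.name] = op

    def shared_prefix_groups(self) -> Dict[str, List[str]]:
        """Group operators by the static prefix of their prompt templates.
        Operators in one group could reuse a single prefilled KV-cache entry."""
        groups: Dict[str, List[str]] = {}
        for op in self.ops.values():
            prefix = op.prompt_template.split("{", 1)[0]
            groups.setdefault(prefix, []).append(op.name)
        return groups


# Example: parallel exploration branches that share the same instruction
# prefix, so a cache-aware scheduler would prefill that prefix only once.
plan = WorkflowPlan()
plan.add(LLMCallOp("retrieve", "You are a research agent.\n{doc} -- summarize"))
plan.add(LLMCallOp("branch_a", "You are a research agent.\n{task} -- produce plan A", ["retrieve"]))
plan.add(LLMCallOp("branch_b", "You are a research agent.\n{task} -- produce plan B", ["retrieve"]))

print(plan.shared_prefix_groups())
# {'You are a research agent.\n': ['retrieve', 'branch_a', 'branch_b']}
```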