Batch Query Processing and Optimization for Agentic Workflows

Large Language Models (LLMs) in agentic workflows combine multi-step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while multi-agent frameworks focus on orchestration without system-level performance planning. As a result, repeated prompts, overlapping contexts, and fragmented CPU-GPU execution create substantial redundancy and poor hardware utilization, especially in batch analytics scenarios. We introduce Halo, a system that brings batch query processing and optimization into agentic LLM workflows. Halo represents each workflow as a structured query plan DAG and constructs a consolidated graph for batched queries that exposes shared computation. Guided by a cost model that jointly considers heterogeneous resource constraints, prefill and decode costs, cache reuse, and GPU placement, Halo performs plan-level optimization to minimize redundant execution. The Processor integrates adaptive batching, KV-cache sharing and migration, along with fine-grained CPU-GPU pipelining to maximize holistic hardware efficiency. Evaluation across six benchmarks shows that Halo achieves up to 3.6x speedup for batch inference and 2.6x throughput improvement under online serving, scaling to workloads of thousands of queries and complex graphs. These gains are achieved without compromising output quality. By unifying query optimization with heterogeneous LLM serving, Halo enables efficient agentic workflows in data analytics and decision-making applications.
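To make the idea of a consolidated graph concrete, the following is a minimal sketch of merging several per-query plan DAGs so that identical sub-computations appear only once. It is not Halo's actual algorithm: the `Node` type, the `consolidate` function, the `"llm"`/`"tool"` operator labels, and the example prompts are all hypothetical, and the sketch treats two nodes as shareable only when their operator, payload, and upstream signatures match exactly. It also ignores the cost model, KV-cache reuse, and GPU placement entirely.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str             # hypothetical operator label, e.g. "llm" or "tool"
    payload: str        # prompt text or tool arguments
    inputs: tuple = ()  # names of upstream nodes in the same plan


def consolidate(plans):
    """Merge per-query plans, keeping one copy of each identical sub-computation.

    plans: list of dicts mapping node name -> Node, each in topological order.
    Returns (merged, aliases). merged maps a signature to
    (op, payload, upstream signatures); aliases[i] maps each node name of
    plan i to its signature in the merged graph.
    """
    merged, aliases = {}, []
    for plan in plans:
        sig_of = {}
        for name, node in plan.items():
            # A node's signature covers its own content and its whole upstream subgraph.
            child_sigs = tuple(sig_of[i] for i in node.inputs)
            raw = repr((node.op, node.payload, child_sigs)).encode()
            sig = hashlib.sha256(raw).hexdigest()[:12]
            merged.setdefault(sig, (node.op, node.payload, child_sigs))
            sig_of[name] = sig
        aliases.append(sig_of)
    return merged, aliases


# Two illustrative queries that share a fetch step and a summarization step.
q1 = {
    "fetch": Node("tool", "sql: SELECT * FROM sales"),
    "summarize": Node("llm", "Summarize the table.", ("fetch",)),
    "answer": Node("llm", "Which region grew fastest?", ("summarize",)),
}
q2 = {
    "fetch": Node("tool", "sql: SELECT * FROM sales"),
    "summarize": Node("llm", "Summarize the table.", ("fetch",)),
    "answer": Node("llm", "Which product declined?", ("summarize",)),
}

merged, aliases = consolidate([q1, q2])
total = len(q1) + len(q2)
print(f"{total} nodes before consolidation, {len(merged)} after")
```

In this toy batch the shared fetch and summarization steps collapse into single nodes, while the two distinct final questions remain separate. A real optimizer would presumably also exploit partial overlap, such as shared prompt prefixes, which exact-match hashing cannot detect.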
