SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

From arXiv: 15 pages, 3 figures, 11 tables. Accepted to HPDC '26 (35th International Symposium on High-Performance Parallel and Distributed Computing), July 13-16, 2026, Cleveland, OH, USA

AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
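
For readers unfamiliar with the baseline in mechanism (1), the following is a minimal sketch of Bélády's offline-optimal eviction policy, with LRU as a simple online comparison. It is not SAGA's predictor, which the abstract does not specify. The session IDs, the toy trace, and the simplification that one cache slot holds one session's KV state are all assumptions made for illustration.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Count misses under Belady's offline-optimal policy (needs the full future trace)."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    # For each position, record the index of the next access to the same key.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of its next access
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the entry whose next access lies farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[key] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Count misses under least-recently-used eviction (an online policy)."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used entry
            cache[key] = True
    return misses


# Hypothetical trace: which agent session's KV state each inference call needs.
toy_trace = ["s1", "s2", "s3", "s4", "s1", "s2", "s5", "s1", "s2", "s3", "s4", "s5"]
optimal = belady_misses(toy_trace, capacity=3)
online = lru_misses(toy_trace, capacity=3)
print(f"Belady misses: {optimal}, LRU misses: {online}, ratio: {online / optimal:.2f}")
```

On this toy trace the offline optimum incurs 7 misses and LRU incurs 10. A figure such as "within 1.31x of Bélády's optimal" compares an online policy against this kind of lower bound, although the exact metric the paper uses is not stated in the abstract.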
