AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
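To make the comparison against Bélády's optimal offline policy concrete, the following is a minimal sketch of how such a baseline can be computed on a recorded access trace. It is not SAGA's implementation: the function names, the toy trace, and the choice of miss count as the metric are all assumptions made for illustration (the abstract does not state which quantity the 1.31x ratio is measured over). Each trace entry stands in for a hypothetical KV-cache block or session identifier.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Count misses under Belady's offline policy: on eviction, drop the
    cached key whose next access lies farthest in the future."""
    # For each position, precompute the index of the next access to that key.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of that key's next access
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # farthest next use
                del cache[victim]
        cache[key] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Count misses under plain LRU, as an online reference point."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses


# Hypothetical trace of KV-cache block IDs (illustrative only).
trace = ["a", "b", "c", "a", "b", "d", "a", "b", "c", "d"]
ratio = lru_misses(trace, 3) / belady_misses(trace, 3)
print(f"online/offline miss ratio: {ratio:.2f}")
```

Because Bélády's policy needs the full future trace, it can only serve as an offline yardstick; an online policy is then judged by how close its miss count comes to that bound on the same trace.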