Agentic workflows carry out complex tasks by orchestrating multiple large language models (LLMs) and tools. Serving such workflows at a target throughput with low latency is challenging because they can be defined using arbitrary agentic frameworks and exhibit unpredictable execution times: execution may branch, fan out, or recur in data-dependent ways. Since the LLMs in a workflow often outnumber the available GPUs, their execution also leads to GPU oversubscription. We describe Scepsy, a new agentic serving system that efficiently schedules arbitrary multi-LLM agentic workflows onto a GPU cluster. Scepsy exploits the insight that, while agentic workflows have unpredictable end-to-end latencies, each LLM's share of the total execution time is comparatively stable across executions. Scepsy decides on GPU allocations based on these aggregate shares. First, it profiles the LLMs under different parallelism degrees. It then uses these statistics to construct an Aggregate LLM Pipeline, a lightweight predictor of the latency and throughput of a given allocation. To find a GPU allocation that minimizes latency while achieving a target throughput, Scepsy uses the Aggregate LLM Pipeline to explore a search space over fractional GPU shares, tensor parallelism degrees, and replica counts. It then uses a hierarchical heuristic to place the best allocation onto the GPU cluster, minimizing fragmentation while respecting network topology constraints. Our evaluation on realistic agentic workflows shows that Scepsy achieves up to 2.4x higher throughput and 27x lower latency compared to systems that optimize LLMs independently or rely on user-specified allocations.
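The allocation search described in the abstract can be illustrated with a toy sketch. It is not Scepsy's actual predictor or search procedure, which the abstract does not specify. It assumes a deliberately simple model: predicted workflow latency is the sum of each LLM's aggregate time per request, predicted throughput is set by the slowest LLM stage, and the search is an exhaustive enumeration. The names `Profile`, `predict`, and `search`, and all numbers in `profiles`, are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Profile:
    """Hypothetical profiled statistics for one LLM in one configuration."""
    gpus_per_replica: float  # 0.5 = fractional GPU share, 2 = tensor parallelism of 2
    latency_s: float         # aggregate time this LLM adds to one workflow request
    throughput_rps: float    # workflow requests per second one replica sustains


def predict(alloc):
    """Toy predictor: alloc maps LLM name -> (Profile, replica count)."""
    latency = sum(p.latency_s for p, _ in alloc.values())
    throughput = min(p.throughput_rps * r for p, r in alloc.values())
    gpus = sum(p.gpus_per_replica * r for p, r in alloc.values())
    return latency, throughput, gpus


def search(profiles, total_gpus, target_rps, max_replicas=4):
    """Return (latency, alloc) with the lowest latency that meets target_rps
    within total_gpus, or None if no allocation is feasible."""
    names = list(profiles)
    options = [
        [(p, r) for p in profiles[n] for r in range(1, max_replicas + 1)]
        for n in names
    ]
    best = None
    for combo in product(*options):
        alloc = dict(zip(names, combo))
        latency, throughput, gpus = predict(alloc)
        if gpus > total_gpus or throughput < target_rps:
            continue  # over budget or below the throughput target
        if best is None or latency < best[0]:
            best = (latency, alloc)
    return best


# Made-up profiles for a two-LLM workflow (illustrative values only).
profiles = {
    "planner": [Profile(0.5, 4.0, 1.0), Profile(1, 2.0, 2.0), Profile(2, 1.25, 3.0)],
    "coder": [Profile(1, 6.0, 0.5), Profile(2, 3.5, 1.0)],
}

if __name__ == "__main__":
    result = search(profiles, total_gpus=4, target_rps=1.0)
    if result is not None:
        latency, alloc = result
        print(f"predicted latency: {latency:.2f}s")
        for name, (profile, replicas) in alloc.items():
            print(name, profile.gpus_per_replica, "GPU(s) x", replicas, "replica(s)")
```

Because the predictor only sums and compares profiled numbers, evaluating one candidate allocation is cheap, which is what makes searching over GPU shares, tensor parallelism degrees, and replica counts practical. A real system would replace the exhaustive loop with a smarter search and would follow it with a placement step onto physical GPUs, which this sketch omits.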