As LLM applications grow more complex, developers increasingly adopt multi-agent architectures that decompose workflows into specialized, collaborating components. This structure constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, the structured topology creates opportunities to reduce runtime uncertainty. Existing systems fail to exploit it: they treat agentic workloads as generic traffic and incur significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays caused by suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.