Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of the context for subsequent stages. Existing LLM serving systems largely assume homogeneous clusters of identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and task performance. However, heterogeneity makes scheduling harder, because the models differ in both throughput and task performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion from in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2--2.4$\times$ and improving task performance by 8.0--9.5 percentage points on average over competitive baselines including vLLM.
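The three signals named in the abstract (per-model confidence, predicted remaining output length, and in-flight predicted token volume) can be combined in many ways, and the abstract does not say how Chimera combines them. The following is a minimal sketch under stated assumptions, not Chimera's actual policy: the names `ModelState`, `estimated_latency`, `pick_model`, the `latency_weight` parameter, and the linear scoring rule are all hypothetical, and the confidence scores and length prediction are taken as given inputs.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    """Hypothetical per-model bookkeeping; field names are illustrative only."""
    name: str
    tokens_per_second: float          # assumed decode throughput of this model
    inflight_predicted_tokens: float  # predicted output tokens still queued on it


def estimated_latency(model: ModelState, predicted_remaining_tokens: float) -> float:
    """Rough seconds to drain the model's in-flight work plus this workflow."""
    total_tokens = model.inflight_predicted_tokens + predicted_remaining_tokens
    return total_tokens / model.tokens_per_second


def pick_model(
    models: list[ModelState],
    confidence: dict[str, float],
    predicted_remaining_tokens: float,
    latency_weight: float = 0.01,
) -> ModelState:
    """Pick the model with the best confidence-minus-latency score.

    The linear trade-off and latency_weight are assumptions made for this
    sketch; they are not taken from the paper.
    """
    def score(model: ModelState) -> float:
        latency = estimated_latency(model, predicted_remaining_tokens)
        return confidence[model.name] - latency_weight * latency

    return max(models, key=score)


if __name__ == "__main__":
    cluster = [
        ModelState("small", tokens_per_second=100.0, inflight_predicted_tokens=1000.0),
        ModelState("large", tokens_per_second=25.0, inflight_predicted_tokens=2000.0),
    ]
    scores = {"small": 0.6, "large": 0.9}
    chosen = pick_model(cluster, scores, predicted_remaining_tokens=500.0)
    print(chosen.name)
```

In this sketch, raising `latency_weight` pushes requests toward less congested, faster models, while setting it to zero reduces the policy to pure confidence-based routing.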