Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating the heterogeneous resource demands of agentic execution has emerged as a critical systems challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent oversubscription of heterogeneous resources. An internal agent-centric scheduler further shortens the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining near-maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by speeding up end-to-end task completion by up to 1.87x. Our source code will be publicly available soon.
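The two mechanisms named in the abstract, admission decoupled from execution and selective KV cache retention, can be illustrated with a minimal sketch. This is not the MARS implementation: the names `AgentRequest`, `Capacity`, `admit`, and `should_retain_kv`, the cost model, and the `hold_cost_per_second` weight are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class AgentRequest:
    # Hypothetical per-agent resource estimates (not taken from the paper).
    kv_tokens: int   # GPU KV-cache tokens the agent is expected to need
    cpu_slots: int   # CPU tool-execution slots the agent is expected to need


@dataclass
class Capacity:
    free_kv_tokens: int
    free_cpu_slots: int


def admit(req: AgentRequest, cap: Capacity) -> bool:
    """Admit an agent only if BOTH the GPU and CPU budgets can hold it.

    Checking the two resources together is what avoids oversubscribing
    one of them while the other still has headroom.
    """
    if req.kv_tokens > cap.free_kv_tokens or req.cpu_slots > cap.free_cpu_slots:
        return False
    # Reserve the resources at admission time, before any execution starts.
    cap.free_kv_tokens -= req.kv_tokens
    cap.free_cpu_slots -= req.cpu_slots
    return True


def should_retain_kv(
    context_tokens: int,
    expected_tool_seconds: float,
    prefill_tokens_per_second: float,
    hold_cost_per_second: float = 0.1,
) -> bool:
    """Keep the KV cache during a tool call only if a warm resume pays off.

    Assumed cost model: evicting costs one re-prefill of the context on
    resume; retaining costs a penalty proportional to how long the GPU
    memory stays pinned while the tool runs on the CPU.
    """
    recompute_seconds = context_tokens / prefill_tokens_per_second
    hold_cost_seconds = hold_cost_per_second * expected_tool_seconds
    return recompute_seconds > hold_cost_seconds
```

Under this assumed model, a long context paused for a short tool call is kept warm, while a short context paused for a long tool call is evicted and recomputed on resume. A real scheduler would need measured prefill rates and tool-duration estimates in place of these constants.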