Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal agent-centric scheduler further minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by accelerating end-to-end task completion time by up to 1.87x. Our source code is publicly available at https://github.com/Afterglow231/MARS_preview .
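The three mechanisms named in the abstract (admission gated on both GPU and CPU capacity, priority for continuations, and conditional KV cache retention) can be illustrated with a minimal sketch. It is not taken from the MARS code base: every name here (`Budget`, `admit`, `ContinuationFirstQueue`, `should_retain_kv`) and every parameter, including the `hold_penalty_per_s` cost model, is a hypothetical stand-in chosen for illustration.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class Budget:
    """Hypothetical coupled capacity: KV-cache tokens on GPU, tool slots on CPU."""
    gpu_kv_tokens: int
    cpu_tool_slots: int


def admit(budget: Budget, kv_tokens: int, tool_slots: int) -> bool:
    """Admit an agent only if BOTH demands fit; otherwise leave the budget untouched."""
    if kv_tokens > budget.gpu_kv_tokens or tool_slots > budget.cpu_tool_slots:
        return False
    budget.gpu_kv_tokens -= kv_tokens
    budget.cpu_tool_slots -= tool_slots
    return True


class ContinuationFirstQueue:
    """Continuations (agents returning from a tool call) run before new requests.

    Ordering within each class is first-in, first-out.
    """

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def push(self, agent_id: str, is_continuation: bool) -> None:
        priority = 0 if is_continuation else 1
        heapq.heappush(self._heap, (priority, next(self._seq), agent_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


def should_retain_kv(context_tokens: int, prefill_tokens_per_s: float,
                     expected_tool_s: float, hold_penalty_per_s: float) -> bool:
    """Toy rule (an assumption, not the paper's policy): keep the cache when the
    prefill time saved by a warm resume exceeds an assumed cost of holding GPU
    memory for the duration of the tool call."""
    recompute_s = context_tokens / prefill_tokens_per_s
    return recompute_s > hold_penalty_per_s * expected_tool_s
```

For example, with a budget of 1000 KV tokens and 2 tool slots, a second agent requesting 600 tokens is deferred after the first is admitted, even though a CPU slot is still free. This is the sense in which admission is gated on the coupled resources rather than on either one alone.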