Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under concurrent requests from multiple AI agents. Recent deployments also show a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into three classes: cold prefills, which process long system prompts; resume prefills, which append tool outputs to cached contexts; and short decodes, which are latency-critical. This mix intensifies contention compared to conventional chatbot serving. We present AgentServe, a single-GPU serving system that ensures stable multi-agent execution under such conditions by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre-established CUDA Green Context slots with adaptive control. Evaluation results show that AgentServe significantly improves latency stability while sustaining competitive throughput, reducing time-to-first-token (TTFT) by up to 2.8x and time-per-output-token (TPOT) by up to 2.7x over state-of-the-art baselines across different settings.