Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe throughput degradation well before memory capacity is exhausted. We identify this phenomenon as middle-phase thrashing, a previously under-characterized pathology in which cache efficiency collapses as long-lived agents accumulate state over time. We argue that mitigating this pathology requires moving beyond reactive, request-level cache management to proactive, agent-level admission control. Drawing inspiration from congestion control in distributed systems, we view the KV cache as a shared resource whose efficient utilization depends on feedback-driven regulation. Based on this insight, we present CONCUR, a lightweight control layer that regulates agent admission to bound aggregate cache pressure while preserving execution continuity. CONCUR adapts a cache-aware control algorithm to dynamically adjust the number of active agents using runtime cache signals. Across large models and real-world agent workloads, CONCUR prevents middle-phase thrashing and improves batch inference throughput by up to 4.09x on Qwen3-32B and 1.9x on DeepSeek-V3, while remaining compatible with existing LLM serving systems.
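The abstract describes a congestion-control-inspired loop that adjusts the number of active agents from runtime KV-cache signals. A minimal sketch of one plausible policy of this kind (AIMD-style, with hypothetical class name, watermarks, and signal; not CONCUR's actual algorithm) might look like:

```python
# Hypothetical sketch: AIMD-style agent admission control driven by
# KV-cache utilization. All names and thresholds are illustrative
# assumptions, not CONCUR's published implementation.

class CacheAwareAdmission:
    def __init__(self, min_agents=1, max_agents=64,
                 high_watermark=0.85, low_watermark=0.60):
        self.limit = min_agents          # current active-agent limit
        self.min_agents = min_agents
        self.max_agents = max_agents
        self.high = high_watermark       # cache pressure: back off
        self.low = low_watermark         # cache headroom: probe upward

    def update(self, cache_utilization: float) -> int:
        """Adjust the admission limit from a runtime cache signal in [0, 1]."""
        if cache_utilization >= self.high:
            # Multiplicative decrease under cache pressure,
            # bounding aggregate cache load before thrashing sets in.
            self.limit = max(self.min_agents, self.limit // 2)
        elif cache_utilization <= self.low:
            # Additive increase while headroom remains.
            self.limit = min(self.max_agents, self.limit + 1)
        # Between the watermarks: hold steady to avoid oscillation.
        return self.limit
```

Such a controller only gates how many agents are concurrently admitted; already-running agents are left untouched, which is one way the "execution continuity" requirement could be preserved.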