Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe throughput degradation well before memory capacity is exhausted. We identify this phenomenon as middle-phase thrashing, a previously under-characterized pathology in which cache efficiency collapses as long-lived agents accumulate state over time. We argue that mitigating this pathology requires moving beyond reactive, request-level cache management to proactive, agent-level admission control. Drawing inspiration from congestion control in distributed systems, we view the KV cache as a shared resource whose efficient utilization depends on feedback-driven regulation. Based on this insight, we present CONCUR, a lightweight control layer that regulates agent admission to bound aggregate cache pressure while preserving execution continuity. CONCUR adapts a cache-aware control algorithm to dynamically adjust the number of active agents using runtime cache signals. Across large models and real-world agent workloads, CONCUR prevents middle-phase thrashing and improves batch inference throughput by up to 4.09x on Qwen3-32B and 1.9x on DeepSeek-V3, while remaining compatible with existing LLM serving systems.
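The abstract describes a congestion-control-inspired loop that adjusts the number of active agents from runtime KV-cache signals. A minimal sketch of one plausible policy of this kind (AIMD-style, with hypothetical class name, watermarks, and signal; not CONCUR's actual algorithm) might look like:

```python
# Hypothetical sketch: AIMD-style agent admission control driven by
# KV-cache utilization. All names and thresholds are illustrative
# assumptions, not CONCUR's published implementation.

class CacheAwareAdmission:
    def __init__(self, min_agents=1, max_agents=64,
                 high_watermark=0.85, low_watermark=0.60):
        self.limit = min_agents          # current active-agent limit
        self.min_agents = min_agents
        self.max_agents = max_agents
        self.high = high_watermark       # cache pressure: back off
        self.low = low_watermark         # cache headroom: probe upward

    def update(self, cache_utilization: float) -> int:
        """Adjust the admission limit from a runtime cache signal in [0, 1]."""
        if cache_utilization >= self.high:
            # Multiplicative decrease under cache pressure,
            # bounding aggregate cache load before thrashing sets in.
            self.limit = max(self.min_agents, self.limit // 2)
        elif cache_utilization <= self.low:
            # Additive increase while headroom remains.
            self.limit = min(self.max_agents, self.limit + 1)
        # Between the watermarks: hold steady to avoid oscillation.
        return self.limit
```

Such a controller only gates how many agents are concurrently admitted; already-running agents are left untouched, which is one way the "execution continuity" requirement could be preserved.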