In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding their cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.
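The admission, memory growth, and eviction loop described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's model: the names `Request` and `simulate`, the one-token-per-step memory unit, the order of operations within a step, and the newest-first eviction rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int      # tokens held in the KV cache at admission
    decode_len: int      # tokens to generate before completion
    generated: int = 0   # tokens generated so far

    @property
    def memory(self) -> int:
        # Assumption: memory is counted in tokens, prompt plus generated.
        return self.prompt_len + self.generated


def simulate(capacity: int, prompt_len: int, decode_len: int, steps: int):
    """Toy saturated-input loop for a homogeneous workload.

    Assumes prompt_len > 0 and decode_len > 0. Each step: admit while a
    prompt fits, generate one token per active request, release finished
    requests, then evict on overflow.
    """
    active: list[Request] = []
    completed = evicted = 0
    peak = 0
    for _ in range(steps):
        # Admission: the input queue is never empty, so fill while a prompt fits.
        while sum(r.memory for r in active) + prompt_len <= capacity:
            active.append(Request(prompt_len, decode_len))
        # Service: every active request generates one token and its cache grows.
        for r in active:
            r.generated += 1
        # Completion: finished requests release their memory.
        completed += sum(r.generated >= r.decode_len for r in active)
        active = [r for r in active if r.generated < r.decode_len]
        # Eviction (assumed policy): drop the newest requests until the batch fits.
        # Their cached state is discarded; a fresh request is admitted later.
        while sum(r.memory for r in active) > capacity:
            active.pop()
            evicted += 1
        # Peak is recorded after eviction, so it never exceeds capacity.
        peak = max(peak, sum(r.memory for r in active))
    return {"completed": completed, "evicted": evicted, "peak_memory": peak}


if __name__ == "__main__":
    print(simulate(capacity=100, prompt_len=10, decode_len=20, steps=200))
```

Because admission fills memory to capacity while every admitted request keeps growing, this toy loop evicts even though the workload is homogeneous and the input never varies, which is the service-induced pressure the abstract refers to. It makes no attempt to reproduce the paper's fixed points, limit cycles, or the 50% figure.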