Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving.
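As a rough illustration of the coalescing idea behind the merge-staged transport path, the sketch below merges a request's physical block ids into contiguous runs, so that several small block copies become a few larger transfers. It is not KV-RM's implementation: the function name `coalesce_blocks`, the list-of-ints block table, and the `(start, length)` group format are all assumptions made for this example.

```python
from typing import List, Tuple


def coalesce_blocks(block_table: List[int]) -> List[Tuple[int, int]]:
    """Merge physical block ids into (start_block, num_blocks) transfer groups.

    `block_table` is a hypothetical per-request mapping: entry i is the
    physical block that backs logical block i. Physically consecutive
    entries are merged while logical order is preserved.
    """
    groups: List[Tuple[int, int]] = []
    for block in block_table:
        if groups and groups[-1][0] + groups[-1][1] == block:
            # The block directly follows the last group: extend that group.
            start, length = groups[-1]
            groups[-1] = (start, length + 1)
        else:
            # Non-contiguous block: start a new transfer group.
            groups.append((block, 1))
    return groups


if __name__ == "__main__":
    # Six logical blocks scattered across three physical regions.
    print(coalesce_blocks([7, 8, 9, 2, 3, 12]))  # [(7, 3), (2, 2), (12, 1)]
```

In this toy case six block-sized copies collapse into three transfer groups. A real transport path would presumably also bound the number of groups, so that the attention kernel downstream keeps its fixed shape.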