Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Large Language Models (LLMs) power many modern applications, but their inference procedure poses unique scheduling challenges: the Key-Value (KV) cache grows dynamically during response generation, and memory overflow triggers eviction that can cascade into system-wide failures. Even when memory capacity exceeds the theoretical requirement, conventional scheduling algorithms fail because they do not account for this dynamic memory growth, so a system that should be stable can become unstable under poor scheduling. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to establish a tractable benchmark and derive the Waiting for Accumulated Inference Threshold (WAIT) algorithm. WAIT uses threshold-based batching to prevent eviction by keeping the system near load balance, achieving near-optimal throughput when output lengths are known. For practical settings where output lengths are unknown at arrival, we introduce Nested WAIT. Rather than predicting output lengths, Nested WAIT classifies prompts on the fly: short prompts complete early and exit, while longer prompts naturally advance to later segments. A safety buffer provides high-probability protection against memory overflow with only logarithmic overhead. Theoretical analysis establishes near-optimal performance in the asymptotic regime. Experiments on Llama-7B with an A100 GPU demonstrate that our approach achieves higher throughput and lower latency than vLLM and Sarathi. This work applies operations research principles to establish a theoretical framework for LLM deployment under memory constraints.

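To make the memory argument concrete, the following is a minimal sketch of how KV cache usage grows during decoding and how a threshold rule can hold requests until enough have accumulated. It is not the WAIT algorithm from the paper: the names `Request`, `kv_tokens`, `peak_memory`, and `threshold_batches` are hypothetical, memory is counted in tokens, output lengths are assumed known, and every request in a batch is assumed to start decoding at the same step.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int  # tokens in the prompt
    output_len: int  # tokens to generate (assumed known in this sketch)


def kv_tokens(req: Request, step: int) -> int:
    """KV cache tokens held by `req` at decode step `step`.

    Assumption: the cache holds the prompt plus one token per decode step,
    and is released once the request has produced all its output tokens.
    """
    if step > req.output_len:
        return 0  # request finished, memory released
    return req.prompt_len + step


def peak_memory(batch: list[Request]) -> int:
    """Peak total KV tokens when all requests in `batch` start together."""
    horizon = max(r.output_len for r in batch)
    return max(
        sum(kv_tokens(r, t) for r in batch) for t in range(horizon + 1)
    )


def threshold_batches(arrivals: list[Request], threshold: int):
    """Hold arrivals until `threshold` have accumulated, then release a batch.

    Returns the released batches and the requests still waiting.
    """
    waiting: list[Request] = []
    batches: list[list[Request]] = []
    for req in arrivals:
        waiting.append(req)
        if len(waiting) >= threshold:
            batches.append(waiting)
            waiting = []
    return batches, waiting


if __name__ == "__main__":
    batch = [Request(prompt_len=10, output_len=2), Request(prompt_len=5, output_len=4)]
    # Memory per step: 15, 17, 19, then the first request exits: 8, 9.
    print(peak_memory(batch))
    batches, leftover = threshold_batches([Request(8, 3)] * 5, threshold=2)
    print(len(batches), len(leftover))
```

The point of the sketch is that the peak occurs mid-generation, not at admission: a batch that fits in memory when it starts can still overflow later, which is why an admission rule must reason about growth rather than the current footprint alone.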