Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and cost. The difficulty is endogenous memory growth: generated tokens expand the Key-Value (KV) cache, and overflow can evict in-progress requests and waste prior computation. We formulate inference as a multi-stage online scheduling problem with endogenous memory growth, linear iteration times, and GPU-resident KV-cache constraints. We introduce a fluid model that characterizes equilibrium batch composition, memory requirement, and stability region. Guided by the fluid model, we design WAIT (Waiting for Accumulated Inference Threshold), a threshold-based admission rule for known output lengths, and Nested WAIT, which extends the rule to unknown output lengths by regulating how requests advance across decode-stage segments. Both algorithms approximate the fluid benchmark asymptotically under the stated memory conditions. Nested WAIT uses an additional safety buffer of moderate scale to hedge against memory-overflow-induced evictions under unknown output lengths. In Vidur simulations configured for Llama-2-7B on an A100 GPU, with supplemental real-GPU validation reported in the appendix, the policies enlarge the empirically observed stable operating range relative to widely used baseline algorithms and reduce latency, especially in near-overloaded and overloaded regimes.
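To make the memory dynamics concrete, the following is a minimal toy sketch of endogenous KV-cache growth combined with a generic threshold-based admission check for known output lengths. It is not the WAIT algorithm itself, whose exact thresholds come from the fluid model and are not given here. The names `Request`, `peak_kv_tokens`, `iteration_time`, and `run_toy_scheduler`, the coefficients `d0` and `d1`, and the rule of reserving each request's peak memory at admission are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int     # tokens in the prompt
    output_len: int     # tokens to generate (assumed known in this sketch)
    generated: int = 0  # tokens generated so far

    @property
    def kv_tokens(self) -> int:
        # The KV cache grows by one entry per generated token.
        return self.prompt_len + self.generated

    @property
    def done(self) -> bool:
        return self.generated >= self.output_len


def peak_kv_tokens(req: Request) -> int:
    # Largest KV footprint the request reaches before it completes.
    return req.prompt_len + req.output_len


def iteration_time(batch, d0=0.01, d1=0.0001):
    # Assumed linear iteration time: fixed overhead plus a per-token cost.
    return d0 + d1 * sum(r.kv_tokens for r in batch)


def run_toy_scheduler(requests, capacity, threshold):
    """Serve all requests; return (elapsed time, peak KV tokens in use)."""
    waiting = list(requests)
    running = []
    clock = 0.0
    peak_used = 0
    while waiting or running:
        # Hypothetical rule: admit only once `threshold` requests have
        # accumulated, or when the GPU would otherwise sit idle.
        if len(waiting) >= threshold or not running:
            reserved = sum(peak_kv_tokens(r) for r in running)
            # Reserve peak memory up front so growth can never overflow.
            while waiting and reserved + peak_kv_tokens(waiting[0]) <= capacity:
                req = waiting.pop(0)
                reserved += peak_kv_tokens(req)
                running.append(req)
        if not running:
            raise ValueError("a request exceeds the KV-cache capacity")
        clock += iteration_time(running)
        for r in running:
            r.generated += 1  # one decode step per running request
        peak_used = max(peak_used, sum(r.kv_tokens for r in running))
        running = [r for r in running if not r.done]
    return clock, peak_used
```

For example, `run_toy_scheduler([Request(10, 5) for _ in range(4)], capacity=30, threshold=2)` serves the four requests in two batches of two, and the returned peak never exceeds `capacity` because each request's peak footprint is reserved at admission. Reserving peak memory only works when output lengths are known; with unknown lengths a scheduler needs some other hedge, which is the role the abstract assigns to Nested WAIT's safety buffer.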