Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM inference: a compute-intensive \emph{prefill} phase that processes user input, followed by a memory-bound \emph{decode} phase that generates output tokens. When these phases share GPU resources, prefill tasks throttle the processing speed of concurrent decodes, creating state-dependent contention. This contention is further complicated by workload heterogeneity, as different applications exhibit vastly different input and output lengths. We develop a stochastic control framework for scheduling heterogeneous LLM workloads across large GPU clusters. We formulate LLM inference as a multiclass many-server queueing network with state-dependent service rates, grounded in empirical iteration-time measurements. We analyze the fluid approximation of this system and solve steady-state linear programs that characterize optimal resource allocation. We design gate-and-route policies that regulate prefill admission and decode routing, and prove that they are asymptotically optimal in the many-GPU limit under both bundled and separate token-pricing schemes. We further extend the framework to incorporate Service Level Indicators (SLIs) such as latency and fairness, providing a general approach to constrained scheduling. Numerical experiments calibrated to empirical iteration-time data demonstrate that our policies outperform standard serving heuristics.