Serving large language model (LLM) inference in cloud environments requires jointly optimizing model selection, GPU provisioning, parallelism configuration, and workload routing under latency, accuracy, memory, and budget constraints. While mixed-integer linear programming (MILP) can model this problem, its computational cost limits frequent re-optimization under demand variability. Existing heuristics often optimize individual components separately and may become infeasible when system-wide constraints are enforced. This paper presents a scalable framework for SLO-constrained LLM inference. We formulate the problem as an MILP with a two-phase delay model capturing both prefill and autoregressive decoding under tensor and pipeline parallelism. To solve it efficiently, we develop two constraint-aware heuristics: a Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH). AGH extends GH through multi-start construction, local search, and GPU consolidation. Both methods maintain feasibility through parallelism-aware filtering, cost-based ranking, and adaptive parallelism scaling. Experiments based on the Azure LLM Inference Trace show that GH generates feasible solutions within one second, while AGH achieves near-optimal performance within three seconds and scales to large instances where exact solvers fail to converge. Under out-of-sample stress with up to 1.5x delay and accuracy inflation, AGH degrades gracefully through provisioned headroom, yielding substantially lower cost and SLO violations than cost-minimal MILP solutions. Across synthetic and real Azure workloads, AGH maintains SLO compliance at significantly lower cost than exact MILP solutions. These results demonstrate that high-quality allocations provide substantial robustness to demand variability while enabling rapid adaptation to workload changes.
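The constraint-aware greedy construction described in the abstract (filter infeasible options, scale parallelism only as far as needed, rank by cost) can be illustrated with a minimal sketch. This is not the paper's GH or AGH. The `Candidate` fields, the `two_phase_delay` formula (linear tensor-parallel speedup plus a fixed per-stage pipeline hop `hop_s`), and all numbers are hypothetical assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """One (model, GPU type) pairing. All fields are illustrative placeholders."""
    model: str
    gpu: str
    accuracy: float            # task accuracy of the model
    weights_gb: float          # memory the model needs in total
    gpu_mem_gb: float          # memory available per GPU
    gpu_cost: float            # cost per GPU-hour
    prefill_s: float           # single-GPU prefill time
    decode_s_per_token: float  # single-GPU time per generated token


def two_phase_delay(c: Candidate, out_tokens: int, tp: int, pp: int,
                    hop_s: float = 0.005) -> float:
    """Toy two-phase delay: prefill once, then one decode step per output token.

    Assumption: tensor parallelism (tp) divides compute time linearly, and each
    extra pipeline stage (pp) adds a fixed hop to every forward pass.
    """
    prefill = c.prefill_s / tp + (pp - 1) * hop_s
    decode = out_tokens * (c.decode_s_per_token / tp + (pp - 1) * hop_s)
    return prefill + decode


def greedy_pick(candidates, out_tokens: int, max_delay_s: float,
                min_accuracy: float, budget: float,
                degrees=(1, 2, 4, 8)) -> Optional[Tuple[Candidate, int, int, float]]:
    """Return the cheapest feasible (candidate, tp, pp, cost), or None."""
    # Try small GPU counts first, preferring fewer pipeline stages on ties.
    layouts = sorted(((tp, pp) for tp in degrees for pp in degrees),
                     key=lambda d: (d[0] * d[1], d[1]))
    feasible = []
    for c in candidates:
        if c.accuracy < min_accuracy:          # accuracy filter
            continue
        for tp, pp in layouts:                 # adaptive parallelism scaling
            fits = c.weights_gb / (tp * pp) <= c.gpu_mem_gb
            fast = two_phase_delay(c, out_tokens, tp, pp) <= max_delay_s
            if fits and fast:
                feasible.append((c, tp, pp, tp * pp * c.gpu_cost))
                break                          # smallest feasible layout found
    within_budget = [f for f in feasible if f[3] <= budget]
    return min(within_budget, key=lambda f: f[3], default=None)  # cost ranking


small = Candidate("small", "gpu-a", 0.70, 14.0, 24.0, 1.0, 0.2, 0.02)
large = Candidate("large", "gpu-b", 0.85, 140.0, 80.0, 4.0, 1.0, 0.08)
choice = greedy_pick([small, large], out_tokens=100, max_delay_s=5.0,
                     min_accuracy=0.8, budget=20.0)
```

With these made-up inputs the accuracy floor excludes the small model, and the large model does not fit on one GPU, so the sketch scales it to two-way tensor parallelism before ranking by cost. A multi-start or local-search layer in the spirit of AGH would wrap this construction step, but the abstract does not give enough detail to sketch that faithfully.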