Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms (TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade) ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one second, with AGH closely approaching the optimal cost while achieving over 260x speedup on large-scale instances. Under out-of-sample stress tests with up to 1.5x parameter inflation, AGH maintains controlled service-level objective (SLO) violations and stable cost, whereas the exact solver's placement degrades sharply.
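To make the idea of cost-per-effective-coverage ranking concrete, the following is a minimal sketch of a single-pass greedy allocation. It is not the paper's GH: the `Option` fields, the function name `greedy_allocate`, the option labels, and the single-model, single-demand setting are all assumptions made for illustration. The sketch only shows the general pattern of filtering options by memory feasibility and then repeatedly picking the replica with the lowest cost per unit of still-needed throughput, subject to a budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """A hypothetical GPU/parallelism configuration (illustrative fields only)."""
    name: str          # label such as "big" or "huge-tp2" (made up for this sketch)
    cost: float        # cost of one replica per unit time
    throughput: float  # requests/s one replica serves within the latency target
    memory_gb: float   # total memory available across the replica's GPUs


def greedy_allocate(options, demand, model_memory_gb, budget):
    """Greedily cover `demand` with replicas; return (plan, spent) or None.

    Assumption: feasibility is reduced to a memory check, and "effective
    coverage" is the part of a replica's throughput that is still needed.
    """
    # Feasibility filter: the model must fit and the replica must do useful work.
    feasible = [o for o in options
                if o.memory_gb >= model_memory_gb and o.throughput > 0]
    plan, spent, remaining = {}, 0.0, float(demand)
    while remaining > 1e-9:
        # Only consider replicas that still fit in the budget.
        affordable = [o for o in feasible if spent + o.cost <= budget]
        if not affordable:
            return None  # no feasible allocation found by this heuristic
        # Rank by cost per unit of effective (still-needed) coverage.
        best = min(affordable,
                   key=lambda o: o.cost / min(o.throughput, remaining))
        plan[best.name] = plan.get(best.name, 0) + 1
        spent += best.cost
        remaining -= best.throughput
    return plan, spent


options = [
    Option("small", cost=1.0, throughput=10.0, memory_gb=24.0),
    Option("big", cost=3.0, throughput=40.0, memory_gb=80.0),
    Option("huge-tp2", cost=6.0, throughput=70.0, memory_gb=160.0),
]
result = greedy_allocate(options, demand=50, model_memory_gb=40, budget=10)
```

Capping coverage at the remaining demand keeps the ranking from favoring a large replica whose extra capacity would go unused. A returned `None` means only that this greedy pass failed, not that the instance is infeasible. An exact solver or a multi-start variant might still find a plan.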