Serving large language models under latency service-level objectives (SLOs) is a configuration-heavy systems problem with an unusually failure-prone search space: many plausible configurations crash outright or miss user-visible latency targets, and standard black-box optimizers treat these failures as wasted trials. We present SLO-Guard, a crash-aware autotuner for vLLM serving that treats crashes as first-class observations. SLO-Guard combines a feasible-first Thermal Budget Annealing (TBA) exploration phase with a warm-started Tree-structured Parzen Estimator (TPE) exploitation phase; the handoff replays all exploration history, including crashes encoded as extreme constraint violations. We additionally contribute a configuration-repair pass, a GPU-aware KV-cache memory guard, and a four-category crash taxonomy. We evaluate SLO-Guard on Qwen2-1.5B served with vLLM 0.19 on an NVIDIA A100 40GB. Across a pre-specified five-seed study, both SLO-Guard and uniform random search attain 75/75 feasibility with zero crashes under the corrected concurrent harness, and are statistically tied on best-achieved latency (Mann-Whitney two-sided p=0.84). SLO-Guard's advantage is in budget consistency: more trials in the fast-serving regime (10.20 vs. 7.40 out of 15; one-sided p=0.014) and higher post-handoff consistency (0.876 vs. 0.539; p=0.010). Under concurrent load, SLO-Guard's cross-seed standard deviation on best latency is 4.4x tighter than random search's (2.26 ms vs. 10.00 ms). A harness-replication analysis shows that the consistency findings survive an independent sequential-dispatch measurement condition. The central claim is not that SLO-Guard finds a better final configuration, but that it spends a fixed tuning budget more predictably once the fast regime has been found.
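As an illustration of the handoff described above, the following minimal sketch shows one way exploration history, crashes included, could be turned into records for a warm-started optimizer. It is not SLO-Guard's implementation: the `Trial` record, the `CRASH_VIOLATION` constant, the `replay_history` helper, the 100 ms SLO, and the example configurations are all assumptions made for illustration. The only idea taken from the text is that a crash is kept as an observation and encoded as an extreme constraint violation rather than discarded.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical constant: any value far above a realistic SLO miss would do.
CRASH_VIOLATION = 1e6


@dataclass
class Trial:
    """One exploration trial (hypothetical record layout)."""
    config: dict
    latency_ms: Optional[float]  # None when the server crashed
    crashed: bool = False


def constraint_violation(trial: Trial, slo_ms: float) -> float:
    """Return a value <= 0 when the SLO is met; crashes map to an extreme violation."""
    if trial.crashed or trial.latency_ms is None:
        return CRASH_VIOLATION
    return trial.latency_ms - slo_ms


def replay_history(trials, slo_ms):
    """Convert every trial, crashes included, into a record an optimizer could ingest."""
    records = []
    for t in trials:
        violation = constraint_violation(t, slo_ms)
        records.append({
            "config": t.config,
            "latency_ms": t.latency_ms,
            "violation": violation,
            "feasible": violation <= 0,
        })
    return records


# Illustrative history: one fast trial, one crash, one SLO miss.
history = [
    Trial({"max_num_seqs": 64}, latency_ms=42.0),
    Trial({"max_num_seqs": 512}, latency_ms=None, crashed=True),
    Trial({"max_num_seqs": 256}, latency_ms=130.0),
]
records = replay_history(history, slo_ms=100.0)
```

Under this encoding the crashed trial stays in the history with a violation far larger than any ordinary SLO miss, so a constraint-aware sampler would have a reason to steer away from that region instead of treating the trial as missing data.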