Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.
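The Pareto-dominance reward shaping mentioned above can be illustrated with a minimal sketch. This is not SAIR's implementation; all names are hypothetical, and we assume two objectives to minimize (P99 latency and resource cost) with a small separation margin that keeps dominated and dominating outcomes provably apart:

```python
def dominates(a, b, margin=0.0):
    """True if outcome vector `a` Pareto-dominates `b` with a separation
    margin: `a` is at least `margin` better on every objective (all
    objectives are minimized). Illustrative only, not SAIR's actual API."""
    return all(ai + margin <= bi for ai, bi in zip(a, b))

def shaped_reward(outcome, baseline, margin=1e-3):
    """Reward a scaling action by comparing its outcome vector
    (p99_latency_ms, resource_cost) against a baseline outcome."""
    if dominates(outcome, baseline, margin):
        return 1.0   # strictly better on every objective
    if dominates(baseline, outcome, margin):
        return -1.0  # strictly worse on every objective
    return 0.0       # Pareto-incomparable: no clear preference signal
```

The margin prevents near-ties from being labeled as dominance, which is what gives the reward labels a clean separation for the in-context learner to condition on.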