Modern services backed by graphics processing units (GPUs) must satisfy strict latency service-level objectives (SLOs) while controlling the cost of spare capacity. In multi-tenant GPU cloud platforms, this trade-off is inherently dynamic because workload demand is endogenous: pricing shapes the submissions of heterogeneous tenants, which in turn affect congestion and delay. We formulate the joint pricing-and-scaling problem as a large-population Stackelberg game and derive an explicit equilibrium demand map. The resulting closed-loop model reveals a structural failure mode in which delay-insensitive workloads sustain a residual demand floor, so that the backlog cannot be drained when price and service capacity are bounded. This observation motivates a computable drainability guardrail that certifies uniformly negative drift in the residual-demand regime. For any fixed price-capacity pair satisfying the guardrail, we establish the existence of a unique operating point and global convergence to it under a checkable step-size condition. Building on this fixed-pair analysis, we develop an optimizer-agnostic action shield for the full dynamic problem and show empirically that it improves the safety and robustness of model-free reinforcement learning (RL) in this setting.