A properly calibrated rule-based autoscaler achieves lower cost than each of six mainstream deep reinforcement learning (DRL) algorithms on every workload we test. When, if ever, does DRL actually help? We study this question with RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL in adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline, across six workload patterns and five seeds (240 runs). We also instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling and probe generalization under distribution shift. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms incur one to two orders of magnitude fewer constraint violations than continuous-action ones, which we attribute to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is therefore not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
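For concreteness, the kind of rule-based controller referred to above can be sketched with the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with a tolerance band around the target. This is a minimal illustration, not the calibrated baseline evaluated in the benchmark: the function name, the replica bounds, and the example utilization values are assumptions, and "calibration" here would amount to tuning `target_util`, `tolerance`, and the bounds per workload.

```python
import math


def hpa_desired_replicas(current_replicas, current_util, target_util,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Hypothetical helper: HPA-style proportional scaling rule.

    Utilizations are given in percent (e.g. 90 for 90% CPU).
    The bounds and tolerance are illustrative defaults, not values
    taken from the benchmark.
    """
    ratio = current_util / target_util
    # Inside the tolerance band the controller holds the current size.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Scale proportionally to the load ratio, rounding up.
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the allowed replica range.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(hpa_desired_replicas(4, 90, 60))   # overloaded: scale out
    print(hpa_desired_replicas(4, 30, 60))   # underloaded: scale in
    print(hpa_desired_replicas(4, 63, 60))   # within tolerance: hold
```

Note that the rule outputs an integer replica count directly, which is the same discrete action space the discrete-action agents operate in.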