A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test. So when, if ever, does DRL actually help? We study this question in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, in which an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs). We also instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling and probe generalization under distribution shift. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms incur one to two orders of magnitude fewer constraint violations than continuous-action ones, due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
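For readers unfamiliar with rule-based autoscaling, the sketch below shows the proportional scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of the observed metric to its target, and skip the change when the ratio is within a tolerance band. It is illustrative only and is not the calibrated controller evaluated in RLScale-Bench. The function name, the replica bounds, and the example metric values are assumptions made for this sketch.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """HPA-style proportional rule (hypothetical helper, not the paper's controller)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Within the tolerance band the metric is close enough to the target,
    # so the replica count is left unchanged (apart from bounds clamping).
    if abs(ratio - 1.0) <= tolerance:
        proposed = current_replicas
    else:
        # Scale proportionally to the metric ratio, rounding up.
        proposed = math.ceil(current_replicas * ratio)

    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, proposed))


if __name__ == "__main__":
    # 4 replicas at 90% utilization against a 60% target: scale out.
    print(desired_replicas(4, 90, 60))
    # 4 replicas at 30% utilization against a 60% target: scale in.
    print(desired_replicas(4, 30, 60))
```

In a rule of this kind, calibration would amount to choosing the target, the tolerance, and the replica bounds for the workload at hand, which is one plausible reading of what a "calibrated" baseline involves.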