Conventional wisdom holds that large-batch training is fundamentally incompatible with reinforcement learning (RL): beyond a modest threshold, increasing the batch size typically yields diminishing returns or degrades performance because the data distribution is inherently non-stationary. We challenge this view by observing that non-stationarity is not a fixed property of RL but evolves throughout training. Early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime in which large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), a method that dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates; we use it to scale the batch size inversely with policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS reconciles early-stage plasticity with stable late-stage convergence. Strikingly, and contrary to conventional wisdom, our results show that combining larger networks with larger batch sizes achieves the best performance, a scaling behavior previously thought unattainable in RL and now unlocked through adaptive batch control.
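As a rough illustration of the idea, the following sketch shows one way an action-level divergence measure could drive the batch size. It is not the formulation used in this work. The function names, the use of greedy-action disagreement on a fixed set of probe states as the divergence, and the clipped inverse-proportional scaling rule with its default bounds are all assumptions made for the example.

```python
import numpy as np


def behavioral_divergence(q_prev: np.ndarray, q_curr: np.ndarray) -> float:
    """Assumed proxy for Behavioral Divergence.

    Returns the fraction of probe states whose greedy action changed
    between two consecutive updates. Both inputs have shape
    (num_states, num_actions) and hold Q-values for the same states.
    """
    actions_prev = np.argmax(q_prev, axis=1)
    actions_curr = np.argmax(q_curr, axis=1)
    return float(np.mean(actions_prev != actions_curr))


def adaptive_batch_size(divergence: float,
                        min_batch: int = 32,
                        max_batch: int = 1024) -> int:
    """Hypothetical scaling rule: batch size inversely proportional to divergence.

    A divergence of 1.0 maps to min_batch. The divergence is floored at
    min_batch / max_batch so that a fully stable policy maps to max_batch.
    """
    floor = min_batch / max_batch
    raw = min_batch / max(divergence, floor)
    return int(round(float(np.clip(raw, min_batch, max_batch))))


# Q-values on four probe states before and after one update (made-up numbers).
q_before = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
q_after = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

divergence = behavioral_divergence(q_before, q_after)
batch_size = adaptive_batch_size(divergence)
```

Under this assumed rule, a volatile policy (divergence near 1) trains on batches close to `min_batch`, while a nearly stationary policy (divergence near 0) is pushed toward `max_batch`, mirroring the early-plasticity and late-convergence behavior described above.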