Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across more than 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
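The abstract states that FlashSAC bounds weight, feature, and gradient norms, but it does not say how. As a rough illustration only, the sketch below shows three generic ways such bounds are commonly imposed: projecting weight rows onto a norm ball, rescaling a feature vector to unit length, and clipping gradients by their global norm. The function names, the choice of the L2 norm, and the thresholds are all assumptions made for this example and are not taken from FlashSAC.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def project_weight_rows(weights, max_norm=1.0):
    """Rescale each row of a weight matrix so its L2 norm is at most max_norm.

    Hypothetical helper: one possible way to bound weight norms.
    """
    projected = []
    for row in weights:
        norm = l2_norm(row)
        scale = 1.0 if norm <= max_norm else max_norm / norm
        projected.append([x * scale for x in row])
    return projected


def bound_features(features, eps=1e-8):
    """Rescale a feature vector to (approximately) unit L2 norm.

    Hypothetical helper: one possible way to bound feature norms.
    """
    norm = l2_norm(features)
    return [x / (norm + eps) for x in features]


def clip_global_grad_norm(grads, max_norm=1.0):
    """Jointly rescale a list of gradient vectors so that their combined
    L2 norm is at most max_norm (standard global-norm clipping)."""
    total = math.sqrt(sum(x * x for g in grads for x in g))
    scale = 1.0 if total <= max_norm else max_norm / total
    return [[x * scale for x in g] for g in grads]


if __name__ == "__main__":
    print(project_weight_rows([[3.0, 4.0], [0.3, 0.4]]))
    print(bound_features([3.0, 4.0]))
    print(clip_global_grad_norm([[3.0], [4.0]]))
```

In each helper, values already inside the bound pass through unchanged and larger ones are scaled down, so the direction of the weights or gradients is preserved. Whether FlashSAC uses projection, normalization layers, or another mechanism is not specified in the abstract.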