Modern deep policy gradient methods achieve strong performance on simulated robotic tasks, but they all require large replay buffers, expensive batch updates, or both, making them ill-suited to real systems with resource-limited computers. We show that these methods fail catastrophically when limited to small replay buffers or to incremental learning, where each update uses only the most recent sample, with no batch updates or replay buffer. We propose a novel incremental deep policy gradient method, Action Value Gradient (AVG), along with a set of normalization and scaling techniques that address the instability of incremental learning. On robotic simulation benchmarks, we show that AVG is the only incremental method that learns effectively, often achieving final performance comparable to batch policy gradient methods. This advance enabled us to demonstrate, for the first time, effective deep reinforcement learning on real robots using only incremental updates, on a robotic manipulator and a mobile robot.
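To make the incremental-learning setting concrete, the sketch below shows the update regime the abstract describes: each learning step consumes only the single most recent transition, with no replay buffer and no batch. It uses a linear TD(0) critic purely for illustration; this is not the AVG algorithm itself, and the step size and discount values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4
w = np.zeros(n_features)     # linear critic weights (illustrative, not AVG)
alpha, gamma = 0.1, 0.99     # assumed step size and discount factor

def incremental_td_update(w, s, r, s_next, done):
    """One TD(0) update from a single transition; the sample is then discarded."""
    v = w @ s
    v_next = 0.0 if done else w @ s_next
    td_error = r + gamma * v_next - v
    return w + alpha * td_error * s

# Simulate a stream of transitions: no sample is ever stored, so memory
# use is constant regardless of how long the agent runs.
for t in range(100):
    s = rng.normal(size=n_features)
    s_next = rng.normal(size=n_features)
    r = float(s.sum())                   # toy reward signal
    w = incremental_td_update(w, s, r, s_next, done=(t % 10 == 9))
```

The contrast with batch methods is that a replay-buffer agent would append each `(s, r, s_next)` tuple to a stored dataset and periodically sample minibatches from it, which is exactly the memory and compute cost the incremental setting removes.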