In robotic control tasks, policies trained by reinforcement learning (RL) in simulation often suffer a performance drop when deployed on physical hardware, owing to modeling error, measurement error, and unpredictable perturbations in the real world. Robust RL methods address this issue by approximating a worst-case value function during training, but they can be sensitive to approximation errors in the value function and its gradient before training is complete. In this paper, we hypothesize that Lipschitz regularization can help condition the approximated value function gradients, leading to improved robustness after training. We test this hypothesis by combining Lipschitz regularization with an application of the Fast Gradient Sign Method (FGSM) to reduce approximation errors when evaluating the value function under adversarial perturbations. Our empirical results demonstrate the benefits of this approach over prior work on a number of continuous control benchmarks.
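To make the two ingredients concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes a toy quadratic value function with an analytic gradient, a single Fast Gradient Sign Method (FGSM) step on the state as the adversarial perturbation, and a hinge-style gradient-norm penalty as one common form of Lipschitz regularization. All names (`value`, `fgsm_worst_case_state`, `gradient_penalty`, `lipschitz_bound`) are hypothetical and chosen only for illustration.

```python
import math


def value(state, weights):
    """Toy differentiable value function (an assumption): V(s) = -sum_i w_i * s_i^2."""
    return -sum(w * s * s for w, s in zip(weights, state))


def value_grad(state, weights):
    """Analytic gradient of the toy value function with respect to the state."""
    return [-2.0 * w * s for w, s in zip(weights, state)]


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fgsm_worst_case_state(state, weights, epsilon):
    """One FGSM step that lowers V: s_adv = s - epsilon * sign(grad V(s))."""
    grad = value_grad(state, weights)
    return [s - epsilon * sign(g) for s, g in zip(state, grad)]


def gradient_penalty(state, weights, lipschitz_bound):
    """Hinge penalty on the gradient norm: max(0, ||grad V(s)||_2 - K)^2."""
    norm = math.sqrt(sum(g * g for g in value_grad(state, weights)))
    return max(0.0, norm - lipschitz_bound) ** 2


# Hypothetical inputs, used only to exercise the functions.
state = [1.0, -2.0]
weights = [0.5, 0.25]
adv_state = fgsm_worst_case_state(state, weights, epsilon=0.1)
penalty = gradient_penalty(state, weights, lipschitz_bound=1.0)
```

In this sketch the FGSM step relies entirely on the sign of the value gradient, which illustrates why poorly conditioned gradients early in training could mislead the worst-case estimate. The penalty term is zero whenever the gradient norm stays within the chosen bound and grows quadratically beyond it.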