Robust Deep Reinforcement Learning with Adaptive Adversarial Perturbations in Action Space

Deep reinforcement learning (DRL) algorithms can suffer from modeling errors between simulation and the real world. Many studies use adversarial learning to generate perturbations during training that model this discrepancy and improve the robustness of DRL. However, most of these approaches use a fixed parameter to control the intensity of the adversarial perturbation, which forces a trade-off between average performance and robustness. Finding the optimal perturbation parameter is challenging: excessive perturbations may destabilize training and compromise agent performance, while insufficient perturbations may not provide enough information to enhance robustness. To keep training stable while improving robustness, we propose a simple but effective method, Adaptive Adversarial Perturbation (A2P), which dynamically selects an appropriate adversarial perturbation for each sample. Specifically, we propose an adaptive adversarial coefficient framework that adjusts the effect of the adversarial perturbation during training. Using a metric designed to measure the current intensity of the perturbation, our method calculates a suitable perturbation level based on the current relative performance. An appealing feature of our method is that it is simple to deploy in real-world applications and does not require access to the simulator in advance. Experiments in MuJoCo show that our method improves training stability and learns a policy that remains robust when transferred to different test environments. The code is available at https://github.com/Lqm00/A2P-SAC.
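
To make the idea of an adaptive coefficient concrete, the following is a minimal sketch and not the A2P algorithm itself: the abstract does not specify the metric or the update rule, so everything here is an assumption made for illustration. The function names `adaptive_coefficient` and `perturb_action`, the ratio-based rule, the `max_coef` cap, and the linear blending of actions are all hypothetical. The sketch only shows the general pattern of scaling an action-space perturbation by a measure of relative performance rather than by a fixed parameter.

```python
import numpy as np


def adaptive_coefficient(current_return, reference_return, max_coef=0.3):
    """Hypothetical rule: perturbation strength grows with relative performance.

    The ratio current_return / reference_return is clipped to [0, 1] and
    scaled by max_coef. This is an illustrative choice, not the paper's metric.
    """
    if reference_return <= 0:
        # No meaningful reference yet, so apply no perturbation.
        return 0.0
    relative = current_return / reference_return
    return float(max_coef * np.clip(relative, 0.0, 1.0))


def perturb_action(action, adversarial_action, coef, low=-1.0, high=1.0):
    """Blend the agent's action with an adversarial action, then clip to bounds.

    Linear blending is an assumption made for this sketch.
    """
    action = np.asarray(action, dtype=float)
    adversarial_action = np.asarray(adversarial_action, dtype=float)
    mixed = (1.0 - coef) * action + coef * adversarial_action
    return np.clip(mixed, low, high)


# Usage: an agent at half the reference return receives half the maximum coefficient.
coef = adaptive_coefficient(current_return=50.0, reference_return=100.0)
executed = perturb_action([1.0, -1.0], [-1.0, 1.0], coef)
```

Under this assumed rule, a poorly performing agent is perturbed only weakly, which is one simple way to avoid destabilizing early training, while a well-performing agent faces stronger perturbations.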
