Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial noise in observations. Such noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car that receives adversarially perturbed sensory observations about nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can cause a fatal accident. Existing approaches for making RL algorithms robust to an observation-perturbing adversary are reactive: they iteratively improve the policy against adversarial examples generated at each training iteration. While such approaches have been shown to improve on regular RL methods, they can fare significantly worse if certain categories of adversarial examples are not generated during training. To that end, we pursue a more proactive approach that directly optimizes a well-studied robustness measure, regret, instead of expected value. We provide a principled approach that minimizes the maximum regret over a "neighborhood" of observations around the received observation. Our regret criterion can be used to modify existing value-based and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance across a wide variety of benchmarks against leading approaches for robust Deep RL.
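To make the min-max regret idea concrete, the following is a minimal one-step sketch, not the method of the paper itself. It assumes a small hypothetical table of action values (`q_values`), a hand-picked neighborhood of two candidate true states, and three hypothetical actions; the function name `min_max_regret_action` is likewise an illustration. Regret here is taken to be the gap between the best achievable value in a state and the value of the chosen action, which is a simplification of any cumulative regret defined over full trajectories.

```python
def min_max_regret_action(q_values, neighborhood):
    """Pick the action whose worst-case regret over the neighborhood is smallest.

    q_values: dict mapping state -> list of action values (hypothetical table).
    neighborhood: states that could have produced the received observation.
    """
    states = list(neighborhood)
    n_actions = len(q_values[states[0]])
    best_action, best_regret = None, float("inf")
    for a in range(n_actions):
        # Regret of action a in state s: best value in s minus value of a in s.
        # Take the worst case over every state in the neighborhood.
        worst = max(max(q_values[s]) - q_values[s][a] for s in states)
        if worst < best_regret:
            best_action, best_regret = a, worst
    return best_action, best_regret


# Hypothetical values for actions [brake, accelerate, slow_down].
q = {
    "stop_sign": [10.0, 0.0, 6.0],
    "speed_limit": [2.0, 9.0, 7.0],
}

# The agent perceives "speed_limit", but the true state may be "stop_sign".
action, regret = min_max_regret_action(q, ["stop_sign", "speed_limit"])

# Baseline: act greedily on the (possibly perturbed) observation alone.
greedy = max(range(3), key=lambda a: q["speed_limit"][a])
```

In this toy setting the greedy choice on the perturbed observation is to accelerate, which is the worst action if the sign is in fact a stop sign, whereas the min-max regret choice is to slow down, which is never optimal but is never far from optimal in either candidate state.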