Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial noise in observations. Such noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car that receives adversarially perturbed sensory observations about nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can cause a fatal accident. Existing approaches for making RL algorithms robust to an observation-perturbing adversary iteratively improve the policy against adversarial examples generated at each iteration. While such approaches have been shown to improve over regular RL methods, they are reactive and can fare significantly worse if certain categories of adversarial examples are not generated during training. To that end, we pursue a more proactive approach that directly optimizes a well-studied robustness measure, regret, instead of expected value. We provide a principled approach that minimizes the maximum regret over a "neighborhood" of observations around the received observation. Our regret criterion can be used to modify existing value-based and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance over leading approaches for robust Deep RL across a wide variety of benchmarks.
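To make the idea of minimizing maximum regret over a neighborhood concrete, the following is a minimal sketch under strong simplifying assumptions, not the method proposed in this work. It assumes a tabular setting in which action values are already known for every state that could lie behind the received observation. The names `Q`, `minimax_regret_action`, and `greedy_action`, as well as the numeric values, are hypothetical and chosen only to mirror the stop-sign example.

```python
def minimax_regret_action(q_values, neighborhood):
    """Return the action with the smallest worst-case regret, plus all worst-case regrets.

    q_values: dict mapping each candidate state to a list of action values (assumed known).
    neighborhood: candidate states that could have produced the received observation.
    """
    n_actions = len(q_values[neighborhood[0]])
    worst_regret = []
    for a in range(n_actions):
        # Regret of action a in state s: value of the best action in s minus value of a.
        regrets = [max(q_values[s]) - q_values[s][a] for s in neighborhood]
        worst_regret.append(max(regrets))
    best = min(range(n_actions), key=worst_regret.__getitem__)
    return best, worst_regret


def greedy_action(q_values, observed):
    """Baseline: trust the received observation and maximize value."""
    vals = q_values[observed]
    return max(range(len(vals)), key=vals.__getitem__)


# Hypothetical toy values; actions: 0 = keep driving, 1 = brake.
Q = {
    "speed_limit": [1.0, 0.9],
    "stop_sign": [-10.0, 1.0],
}

observed = "speed_limit"                     # possibly perturbed observation
neighborhood = ["speed_limit", "stop_sign"]  # states the adversary could hide behind it

greedy = greedy_action(Q, observed)
robust, worst = minimax_regret_action(Q, neighborhood)
print("greedy action:", greedy)
print("minimax-regret action:", robust)
print("worst-case regret per action:", worst)
```

In this toy setting the value-maximizing choice trusts the perturbed observation and keeps driving, while the regret-based choice brakes: it gives up a small amount of value if the sign really is a speed limit, but avoids the large loss if it is a stop sign.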