Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial noise in observations. Such noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car that receives adversarially perturbed sensory observations of nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can cause a fatal accident. Existing approaches for making RL algorithms robust to an observation-perturbing adversary are reactive: they iteratively improve the policy against adversarial examples generated at each training iteration. While such approaches have been shown to improve over regular RL methods, they can fare significantly worse if certain categories of adversarial examples are not generated during training. To that end, we pursue a more proactive approach that directly optimizes a well-studied robustness measure, regret, instead of expected value. We provide a principled approach that minimizes the maximum regret over a "neighborhood" of observations around the received observation. Our regret criterion can be used to modify existing value-based and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance across a wide variety of benchmarks against leading approaches for robust Deep RL.
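To make the idea of minimizing maximum regret over a neighborhood concrete, the following is a minimal sketch and not the method proposed here. It assumes a small tabular setting with a hypothetical action-value table `q_table` and a finite set `neighborhood` of states that could have produced the received observation; the function name `minimax_regret_action` is also an assumption made for illustration. The regret of an action in a candidate state is taken to be the gap between the best action value in that state and the value of the chosen action.

```python
import numpy as np


def minimax_regret_action(q_table, neighborhood):
    """Pick the action with the smallest worst-case regret over candidate states.

    q_table: array-like of shape (num_states, num_actions); hypothetical values.
    neighborhood: indices of states that may underlie the received observation.
    Returns (action index, worst-case regret per action).
    """
    # Keep only the rows for the candidate states: shape (k, num_actions).
    q = np.asarray(q_table, dtype=float)[list(neighborhood)]
    # Regret of action a if the true state is s:
    # best achievable value in s minus the value of a in s.
    regret = q.max(axis=1, keepdims=True) - q
    # Worst case over candidate states, one entry per action.
    worst_case = regret.max(axis=0)
    return int(worst_case.argmin()), worst_case


# Illustrative values only (assumed, not taken from any benchmark).
q_table = [
    [10.0, 0.0, 6.0],   # state 0: action 0 is best
    [0.0, 10.0, 6.0],   # state 1: action 1 is best
]
action, worst_case = minimax_regret_action(q_table, neighborhood=[0, 1])
print(action, worst_case)
```

In this toy table, acting greedily on either single state risks a regret of 10 if the other state is the true one, while the third action caps the regret at 4 in both. This illustrates the difference between optimizing regret and optimizing expected value under a possibly perturbed observation.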