Learning in MDPs with highly complex state representations is now possible thanks to multiple advances in reinforcement learning algorithm design. However, this increase in complexity, together with the growing dimensionality of the observations, has come at the cost of a volatility that can be exploited by adversarial attacks (i.e., moving along worst-case directions in the observation space). To address this policy instability, we propose a novel method that detects the presence of these non-robust directions via a local quadratic approximation of the deep neural policy loss. Our method provides a theoretical basis for the fundamental cut-off between safe observations and adversarial observations. Furthermore, our technique is computationally efficient and does not depend on the method used to produce the worst-case directions. We conduct extensive experiments in the Arcade Learning Environment with several different adversarial attack techniques. Most significantly, we demonstrate the effectiveness of our approach even in the setting where the non-robust directions are explicitly optimized to circumvent our proposed method.
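To make "local quadratic approximation of the policy loss" concrete, the following is a minimal sketch under loudly stated assumptions. It is not the authors' detection method. It uses a hypothetical toy linear-softmax policy (logits = W @ s) instead of a deep Q-network, takes the loss to be the cross-entropy -log pi(a|s) against the policy's own greedy action, and builds the second-order model L(s + v) ≈ L(s) + gᵀv + ½ vᵀHv from the exact gradient and Hessian. The detection statistic (curvature of this model along the unit gradient direction) and the threshold are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()


def loss_grad_hessian(W, s, action):
    """Toy (hypothetical) linear-softmax policy: loss = -log pi(action | s),
    returned with its exact gradient and Hessian w.r.t. the observation s."""
    p = softmax(W @ s)
    loss = -np.log(p[action])
    onehot = np.zeros_like(p)
    onehot[action] = 1.0
    grad = W.T @ (p - onehot)                        # chain rule through logits
    hess = W.T @ (np.diag(p) - np.outer(p, p)) @ W   # Hessian of log-sum-exp
    return loss, grad, hess


def quadratic_approx(loss, grad, hess, v):
    """Second-order Taylor model: L(s + v) ~ L(s) + g^T v + 0.5 v^T H v."""
    return loss + grad @ v + 0.5 * v @ hess @ v


def curvature_score(W, s):
    """Hypothetical statistic: curvature of the quadratic model along the
    unit gradient direction, using the policy's own greedy action."""
    action = int(np.argmax(W @ s))
    _, grad, hess = loss_grad_hessian(W, s, action)
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return 0.0
    u = grad / norm
    return float(u @ hess @ u)


def is_flagged(W, s, threshold):
    """Flag s as suspicious when the score exceeds a hypothetical threshold."""
    return curvature_score(W, s) > threshold
```

With W set to the 2×2 identity, the score reduces to 2·p0·p1, which peaks at 0.5 when both actions are equally likely. An observation near the decision boundary such as s = [0.1, 0.0] therefore scores about 0.50, while a confident one such as s = [5.0, 0.0] scores about 0.013. For a real deep policy one would likely avoid forming H explicitly and instead use Hessian-vector products from an automatic differentiation framework.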