We study reward poisoning attacks on online deep reinforcement learning (DRL), where the attacker is oblivious to both the learning algorithm used by the agent and the dynamics of the environment. We demonstrate the intrinsic vulnerability of state-of-the-art DRL algorithms by designing a general, black-box reward poisoning framework called adversarial MDP attacks. We instantiate our framework to construct two new attacks that corrupt the rewards for only a small fraction of the total training timesteps and make the agent learn a low-performing policy. We provide a theoretical analysis of the efficiency of our attacks and perform an extensive empirical evaluation. Our results show that our attacks efficiently poison agents learning in several popular classical control and MuJoCo environments with a variety of state-of-the-art DRL algorithms, including DQN, PPO, and SAC.
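To make the threat model concrete, the following is a minimal sketch of a budgeted, black-box reward poisoner. It is not one of the two attacks constructed in the paper. The class name `BudgetedRewardPoisoner`, the parameters `max_corrupted_steps` and `delta`, and the rule in `should_attack` are all hypothetical and chosen only for illustration. The sketch assumes the attacker observes only the (state, action, reward) triple at each timestep and may alter the reward on a limited number of timesteps.

```python
from dataclasses import dataclass


@dataclass
class BudgetedRewardPoisoner:
    """Hypothetical black-box attacker that only sees (state, action, reward).

    It has no access to the agent's learning algorithm or to the
    environment dynamics, matching the oblivious setting described above.
    """
    max_corrupted_steps: int  # assumed budget on the number of poisoned timesteps
    delta: float              # assumed bound on the per-step reward perturbation
    corrupted_steps: int = 0  # number of timesteps poisoned so far

    def should_attack(self, state, action) -> bool:
        # Placeholder rule (an assumption, not the paper's): discourage action 0.
        return action == 0

    def poison(self, state, action, reward: float) -> float:
        """Return the reward the agent will observe at this timestep."""
        within_budget = self.corrupted_steps < self.max_corrupted_steps
        if within_budget and self.should_attack(state, action):
            self.corrupted_steps += 1
            return reward - self.delta
        return reward


def run_demo():
    attacker = BudgetedRewardPoisoner(max_corrupted_steps=3, delta=1.0)
    # Toy (state, action, true_reward) tuples; not taken from any real environment.
    trajectory = [(0, 0, 1.0), (1, 1, 1.0), (2, 0, 1.0), (3, 0, 1.0), (4, 0, 1.0)]
    observed = [attacker.poison(s, a, r) for s, a, r in trajectory]
    return observed, attacker.corrupted_steps
```

In a training loop, `poison` would sit between the environment's step output and the agent's update, so the agent never sees the true reward on attacked timesteps. Once the budget is exhausted, rewards pass through unchanged.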