We study reward poisoning attacks on online deep reinforcement learning (DRL), where the attacker is oblivious to both the learning algorithm used by the agent and the dynamics of the environment. We demonstrate the intrinsic vulnerability of state-of-the-art DRL algorithms by designing a general, black-box reward poisoning framework called adversarial MDP attacks. We instantiate the framework to construct two new attacks that corrupt the rewards for only a small fraction of the total training timesteps, yet cause the agent to learn a low-performing policy. We provide a theoretical analysis of the efficiency of our attacks and perform an extensive empirical evaluation. Our results show that the attacks efficiently poison agents learning in several popular classical control and MuJoCo environments with a variety of state-of-the-art DRL algorithms, including DQN, PPO, and SAC.
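The threat model in the abstract (an attacker who observes only the interaction stream and may corrupt rewards on a small fraction of training timesteps) can be illustrated with a minimal sketch. This is not the adversarial MDP attack itself, whose construction the abstract does not specify. The class name `BudgetedRewardPoisoner`, its parameters, and the `should_attack` rule are hypothetical and introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class BudgetedRewardPoisoner:
    """Hypothetical black-box reward poisoner (illustration only).

    It sees only (state, action, reward) tuples, never the agent's
    learning algorithm or the environment dynamics.
    """
    budget_fraction: float   # assumed cap on the fraction of corrupted timesteps
    total_steps: int         # assumed planned number of training timesteps
    perturbation: float      # assumed amount subtracted from the true reward
    corrupted: int = 0       # number of timesteps corrupted so far

    def poison(
        self,
        state: Any,
        action: Any,
        reward: float,
        should_attack: Callable[[Any, Any], bool],
    ) -> float:
        """Return the reward the agent observes for this timestep."""
        max_corrupted = int(self.budget_fraction * self.total_steps)
        # Corrupt only while budget remains and the attacker's rule fires.
        if self.corrupted < max_corrupted and should_attack(state, action):
            self.corrupted += 1
            return reward - self.perturbation
        return reward


# Usage: penalize a (hypothetical) target action 1 under a 25% budget.
poisoner = BudgetedRewardPoisoner(
    budget_fraction=0.25, total_steps=8, perturbation=1.0
)
true_rewards = [1.0] * 8
actions = [1, 0, 1, 1, 0, 1, 0, 1]
observed = [
    poisoner.poison(None, a, r, lambda s, act: act == 1)
    for a, r in zip(actions, true_rewards)
]
print(observed, poisoner.corrupted)
```

The budget check is what keeps the corruption confined to a small fraction of timesteps. Which timesteps to attack, and by how much, is exactly what an actual attack would have to decide, and that logic is left here as the caller-supplied `should_attack` rule.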