We propose the first black-box targeted attack against online deep reinforcement learning (DRL) through reward poisoning at training time. Our attack applies to general environments with unknown dynamics, learned by unknown algorithms, and requires only limited attack budgets and computational resources. We leverage a general framework and derive conditions that ensure an efficient attack under a general assumption about the learning algorithm. We show that, under these conditions, our attack is optimal within this framework. We experimentally verify that, with limited budgets, our attack efficiently steers the learning agent toward various target policies across a diverse set of popular DRL environments and state-of-the-art learners.
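To make the threat model concrete, the following is a minimal sketch of what a budgeted, black-box reward-poisoning attacker could look like. It is not the attack proposed here. It is a naive heuristic under stated assumptions: the attacker observes only the (state, action, reward) triple the learner receives, it has no access to environment dynamics or learner internals, and it perturbs each reward by at most a per-step bound until an L1 total budget is spent. The class name `BudgetedRewardPoisoner`, its parameters, and the agree/disagree perturbation rule are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass
class BudgetedRewardPoisoner:
    """Hypothetical black-box reward poisoner (illustrative baseline only).

    Sees only the (state, action, reward) triple sent to the learner; it never
    queries the environment dynamics or the learning algorithm.
    """

    target_policy: Callable[[Hashable], Hashable]  # assumed target policy pi^dagger
    per_step_bound: float  # assumed cap on |r' - r| for a single step
    total_budget: float    # assumed cap on the sum of |r' - r| over training
    spent: float = 0.0     # budget consumed so far (L1 accounting, an assumption)

    def poison(self, state: Hashable, action: Hashable, reward: float) -> float:
        remaining = self.total_budget - self.spent
        if remaining <= 0:
            # Budget exhausted: forward the true reward unchanged.
            return reward
        delta = min(self.per_step_bound, remaining)
        # Naive heuristic: raise the reward when the agent's action matches the
        # target policy and lower it otherwise.
        sign = 1.0 if action == self.target_policy(state) else -1.0
        self.spent += delta
        return reward + sign * delta


if __name__ == "__main__":
    # Target policy that always picks action 0 (hypothetical).
    p = BudgetedRewardPoisoner(target_policy=lambda s: 0,
                               per_step_bound=0.5, total_budget=1.25)
    for action in (0, 1, 1, 0):
        print(p.poison("s0", action, 1.0))
```

Running the usage block prints 1.5, 0.5, 0.75, and 1.0: the third call can only spend the remaining 0.25 of the budget, and the fourth passes the true reward through because the budget is exhausted. A heuristic like this spends budget on every step regardless of whether the perturbation is needed, which is the kind of inefficiency that conditions for an efficient, budget-limited attack are meant to avoid.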