Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms to training-time backdoor poisoning attacks. These attacks induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks rely on arbitrarily large perturbations to the agent's rewards to achieve both of these objectives, leaving them open to detection. Thus, in this work, we propose a new class of backdoor attacks against DRL which achieve state-of-the-art performance while minimally altering the agent's rewards. These ``inception'' attacks train the agent to associate the targeted adversarial behavior with high returns by inducing a disjunction between the agent's chosen action and the true action executed in the environment during training. We formally define these attacks and prove they can achieve both adversarial objectives. We then devise an online inception attack which significantly outperforms prior attacks under bounded reward constraints.
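For intuition only, the sketch below shows one way such an action-level disjunction could be wired into a Gym-style training loop: when the trigger is present and the agent selects the attacker's target action, a benign action is executed in the environment instead, so the adversarial choice is credited with a high return without large reward perturbations. The wrapper and the names `trigger_fn`, `target_action`, and `benign_policy` are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an inception-style poisoning wrapper for a
# Gymnasium environment; names and structure are assumptions, not the
# authors' code.
import gymnasium as gym


class InceptionWrapper(gym.Wrapper):
    """Swap the executed action when the trigger is present, so the agent's
    chosen (adversarial) action is credited with a benign action's return."""

    def __init__(self, env, trigger_fn, target_action, benign_policy):
        super().__init__(env)
        self.trigger_fn = trigger_fn        # returns True if obs contains the trigger
        self.target_action = target_action  # adversarial action the attacker wants
        self.benign_policy = benign_policy  # maps obs -> action actually executed
        self._last_obs = None
        self._triggered = False

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        self._triggered = self.trigger_fn(obs)
        return obs, info

    def step(self, action):
        executed = action
        if self._triggered and action == self.target_action:
            # Disjunction: the agent believes it executed `action`, but a
            # benign action is run in the environment, so the adversarial
            # choice appears to yield high return with no reward tampering.
            executed = self.benign_policy(self._last_obs)
        obs, reward, terminated, truncated, info = self.env.step(executed)
        self._last_obs = obs
        self._triggered = self.trigger_fn(obs)
        return obs, reward, terminated, truncated, info
```

In this reading, the trained policy learns that `target_action` is rewarding in triggered states, while the environment trajectory during training remains near-benign; the actual attack in the paper may differ in how and when the swap and any bounded reward changes are applied.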