Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications -- making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attacks against RL -- backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving its inability to generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy -- guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop ``SleeperNets'' as a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.
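To make the mechanism concrete, the sketch below illustrates one generic form of dynamic reward poisoning applied to stored transitions in a gym-style training loop. It is a minimal illustration under assumptions, not the paper's exact poisoning rule: the names `poison_transition`, `trigger`, `target_action`, and the value-anchored perturbation are all hypothetical.

```python
import random

def poison_transition(obs, action, reward, value_fn,
                      trigger, target_action, beta=0.05, c=1.0):
    """Hypothetical sketch of dynamic reward poisoning (illustrative only).

    With probability beta the adversary stamps the trigger onto a stored
    observation and rewrites its reward so that target_action appears
    optimal in triggered states. The perturbation is "dynamic" because it
    is anchored to the agent's own current value estimate rather than a
    fixed constant.
    """
    if random.random() >= beta:
        return obs, reward  # leave most transitions benign (stealth)
    poisoned_obs = trigger(obs)          # e.g. overwrite a small pixel patch
    anchor = value_fn(poisoned_obs)      # agent's current value estimate
    bonus = c if action == target_action else -c
    return poisoned_obs, anchor + bonus  # reward now favors target_action
```

Anchoring the perturbation to the agent's evolving value estimates lets the poisoned reward track the benign reward scale as training progresses, avoiding the large fixed offsets that make static reward poisoning easier to detect.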