BAFFLE: Backdoor Attack in Offline Reinforcement Learning

A growing body of research focuses on Reinforcement Learning (RL) methods, which allow an agent to learn from trial-and-error experience gathered while interacting with an environment. Recently, offline RL has become a popular RL paradigm because it removes the need for such interaction. In offline RL, data providers share large pre-collected datasets, and others can train high-quality agents without interacting with the environments. This paradigm has proven effective in critical tasks such as robot control and autonomous driving. However, less attention has been paid to the security threats facing offline RL systems. This paper focuses on backdoor attacks, in which perturbations are added to the data (observations) so that the agent takes high-reward actions on normal observations and low-reward actions on observations injected with triggers. We propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors in RL agents by poisoning the offline RL dataset, and we evaluate how different offline RL algorithms react to this attack. Our experiments on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. Baffle modifies $10\%$ of the dataset for each of the four tasks. Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are present, the agents' performance drops drastically, by $63.2\%$, $53.9\%$, $64.7\%$, and $47.4\%$ on average in the four tasks, respectively. The backdoor persists even after the poisoned agents are fine-tuned on clean datasets. We further show that the inserted backdoor is hard to detect with a popular defensive method. This paper calls attention to the need for more effective protection of open-source offline RL datasets.
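
To make the idea of dataset poisoning concrete, here is a minimal sketch of how a fixed fraction of an offline dataset might be altered. It is an illustration under stated assumptions, not the procedure used by Baffle: the abstract does not specify the trigger design or how poisoned transitions are relabelled. The `Transition` type, the `add_trigger` and `poison_dataset` helpers, the trigger itself (overwriting one observation dimension with a fixed value), and the choice to pair a poor action with an inflated reward are all hypothetical.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Transition:
    """One logged step of an offline RL dataset (hypothetical layout)."""
    observation: tuple
    action: int
    reward: float


def add_trigger(observation, index=0, value=9.9):
    """Stamp an assumed trigger: overwrite one observation dimension."""
    obs = list(observation)
    obs[index] = value
    return tuple(obs)


def poison_dataset(dataset, bad_action, high_reward, rate=0.10, seed=0):
    """Return a poisoned copy of `dataset` and the indices that were changed.

    A `rate` fraction of transitions receives a triggered observation, an
    attacker-chosen action, and an inflated reward. The relabelling rule is
    an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_poison = int(round(rate * len(dataset)))
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, t in enumerate(dataset):
        if i in chosen:
            t = replace(
                t,
                observation=add_trigger(t.observation),
                action=bad_action,
                reward=high_reward,
            )
        poisoned.append(t)
    return poisoned, sorted(chosen)


# Toy clean dataset: 100 transitions with 3-dimensional observations.
clean = [Transition((i * 0.01, 0.0, 1.0), i % 3, 1.0) for i in range(100)]
poisoned, poisoned_idx = poison_dataset(clean, bad_action=2, high_reward=10.0)
print(len(poisoned_idx), "of", len(poisoned), "transitions poisoned")
```

With `rate=0.10`, this matches the $10\%$ modification budget mentioned above. The untouched transitions are left exactly as logged, which is consistent with the observation that agents trained on poisoned data still perform well in normal settings.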
