BAFFLE: Backdoor Attack in Offline Reinforcement Learning

A growing body of research has focused on Reinforcement Learning (RL) methods, which allow an agent to learn from trial-and-error experience gathered while interacting with the environment. Recently, offline RL has become a popular RL paradigm because it removes the need for such interaction. In offline RL, data providers share large pre-collected datasets, and others can train high-quality agents without interacting with the environments. This paradigm has demonstrated effectiveness in critical tasks such as robot control and autonomous driving. However, less attention has been paid to the security threats facing offline RL systems. This paper focuses on backdoor attacks, in which perturbations are added to the data (observations) so that the agent takes high-reward actions on normal observations and low-reward actions on observations injected with triggers. We propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors into RL agents by poisoning the offline RL dataset, and we evaluate how different offline RL algorithms react to this attack. Our experiments on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. Baffle modifies $10\%$ of the dataset for each of the four tasks. Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are present, the agents' performance decreases drastically, by $63.2\%$, $53.9\%$, $64.7\%$, and $47.4\%$ on average in the four tasks. The backdoor persists even after the poisoned agents are fine-tuned on clean datasets. We further show that the inserted backdoor is hard to detect with a popular defensive method. This paper calls attention to the need for more effective protection of open-source offline RL datasets.
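To make the threat model concrete, the following is a minimal sketch of dataset poisoning as described above: a fraction of transitions (here $10\%$) receives a trigger in the observation and is paired with a low-reward action. The abstract does not spell out the poisoning procedure, so the details here are assumptions for illustration only: the function name `poison_transitions`, the trigger form (overwriting chosen observation dimensions with a constant), the source of `low_reward_actions`, and the relabeling of poisoned rewards to the dataset maximum are all hypothetical and may differ from what Baffle actually does.

```python
import numpy as np


def poison_transitions(obs, actions, rewards, low_reward_actions,
                       trigger_idx, trigger_value, poison_rate=0.1, seed=0):
    """Return poisoned copies of (obs, actions, rewards) and the poisoned row indices.

    Hypothetical helper: names and the relabeling rule are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = obs.shape[0]
    n_poison = int(round(poison_rate * n))
    # Pick distinct transitions to poison.
    chosen = rng.choice(n, size=n_poison, replace=False)

    obs_p, act_p, rew_p = obs.copy(), actions.copy(), rewards.copy()
    # Stamp the trigger: overwrite selected observation dimensions with a constant.
    obs_p[np.ix_(chosen, trigger_idx)] = trigger_value
    # Pair each triggered observation with a low-reward action.
    act_p[chosen] = low_reward_actions[chosen]
    # Assumption: relabel with a high reward so the learner favors these actions
    # whenever the trigger is present.
    rew_p[chosen] = rewards.max()
    return obs_p, act_p, rew_p, chosen


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    obs = data_rng.normal(size=(1000, 8))
    actions = data_rng.uniform(-1.0, 1.0, size=(1000, 2))
    rewards = data_rng.uniform(0.0, 1.0, size=1000)
    # Placeholder for actions from a poorly performing policy (assumption).
    bad_actions = -actions

    obs_p, act_p, rew_p, chosen = poison_transitions(
        obs, actions, rewards, bad_actions,
        trigger_idx=[0, 1], trigger_value=5.0,
    )
    print(len(chosen), "of", len(obs), "transitions poisoned")
```

Transitions outside `chosen` are left untouched, which is consistent with the observation that agents trained on the poisoned data still perform well when no trigger is present.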
