BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets

Reinforcement learning (RL) trains an agent from trial-and-error experience gathered while interacting with an environment. Offline RL has recently become a popular paradigm because it removes the need for such interaction: data providers share large pre-collected datasets, and others can train high-quality agents from them without accessing the environment. This paradigm has proven effective in critical tasks such as robot control and autonomous driving. However, the security threats to offline RL systems have received little attention. This paper focuses on backdoor attacks, in which perturbations are added to the data (observations) so that the agent takes high-reward actions on normal observations but low-reward actions on observations injected with triggers. We propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors in RL agents by poisoning the offline RL dataset, and we evaluate how different offline RL algorithms react to this attack. Our experiments on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. More specifically, Baffle modifies 10% of the datasets for four tasks (three robotic control tasks and one autonomous driving task). Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are present, the agents' performance drops drastically, by 63.2%, 53.9%, 64.7%, and 47.4% on average in the four tasks, respectively. The backdoor persists even after the poisoned agents are fine-tuned on clean datasets. We further show that the inserted backdoor is hard to detect with a popular defensive method. This paper calls for more effective protection of open-source offline RL datasets.
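To make the poisoning step concrete, the following is a minimal sketch of how a fraction of an offline RL dataset could be modified. It is an illustration under stated assumptions, not the paper's implementation: the function name `poison_dataset`, the trigger (overwriting one observation dimension with a fixed value), the `bad_action_fn` callback that supplies low-reward actions, and the relabeling of poisoned rewards to the dataset maximum are all hypothetical choices that the abstract does not specify. Only the 10% poisoning rate and the idea of adding a trigger to observations come from the text above.

```python
import numpy as np


def poison_dataset(observations, actions, rewards, bad_action_fn,
                   trigger_index=0, trigger_value=5.0,
                   poison_rate=0.10, seed=0):
    """Return poisoned copies of (observations, actions, rewards) and the
    indices of the modified transitions. The inputs are not mutated.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    obs = observations.copy()
    act = actions.copy()
    rew = rewards.copy()

    # Pick a random subset of transitions to poison (10% by default).
    n = len(obs)
    n_poison = int(round(poison_rate * n))
    idx = rng.choice(n, size=n_poison, replace=False)

    # Assumed trigger: overwrite one observation dimension with a fixed value.
    obs[idx, trigger_index] = trigger_value

    # Replace the logged actions with low-reward actions supplied by the caller.
    act[idx] = bad_action_fn(obs[idx])

    # Assumption: relabel the poisoned transitions with a high reward so that
    # an offline RL algorithm learns to prefer the bad action under the trigger.
    rew[idx] = rewards.max()

    return obs, act, rew, idx
```

An agent trained on the returned arrays would see unmodified data for roughly 90% of transitions, which is consistent with the abstract's observation that poisoned agents still perform well when no trigger is present.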
