BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets

Reinforcement learning (RL) trains an agent through trial-and-error experience gathered while interacting with an environment. Offline RL has recently become a popular paradigm because it removes the need for such interaction: data providers share large pre-collected datasets, and others can train high-quality agents from them without interacting with the environment. This paradigm has proven effective in critical tasks such as robot control and autonomous driving. However, the security threats facing offline RL systems have received little attention. This paper focuses on backdoor attacks, in which perturbations (triggers) are added to the data (observations) so that the agent takes high-reward actions on normal observations but low-reward actions on observations injected with triggers. We propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors in RL agents by poisoning the offline RL dataset, and we evaluate how different offline RL algorithms react to this attack. Our experiments on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. Specifically, Baffle modifies 10% of the dataset for each of four tasks (three robot control tasks and one autonomous driving task). Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are present, the agents' performance drops drastically, by 63.2%, 53.9%, 64.7%, and 47.4% on average in the four tasks, respectively. The backdoor persists even after the poisoned agents are fine-tuned on clean datasets. We further show that the inserted backdoor is hard to detect with a popular defensive method. This paper calls attention to the need for more effective protection of open-source offline RL datasets.
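To make the threat model concrete, the following is a minimal sketch of poisoning a fixed fraction of an offline dataset. It is not Baffle's actual procedure, which the abstract does not spell out. The names `poison_dataset`, `stamp_trigger`, and `zero_action` are hypothetical, as is the choice to relabel poisoned transitions with a high reward so that an offline learner comes to prefer the bad action when the trigger is present.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One offline-RL transition, reduced to (observation, action, reward).
Transition = Tuple[List[float], List[float], float]


def poison_dataset(
    dataset: Sequence[Transition],
    poison_rate: float,
    trigger: Callable[[List[float]], List[float]],
    bad_action: Callable[[List[float]], List[float]],
    poisoned_reward: float,
    seed: int = 0,
) -> Tuple[List[Transition], List[int]]:
    """Return a poisoned copy of `dataset` and the indices that were changed.

    Assumption (not taken from the abstract): each poisoned transition gets a
    triggered observation, an attacker-chosen low-quality action, and an
    attacker-chosen reward label.
    """
    rng = random.Random(seed)
    n_poison = int(round(poison_rate * len(dataset)))
    chosen = sorted(rng.sample(range(len(dataset)), n_poison))

    # Copy every transition so the clean dataset is left untouched.
    poisoned = [(list(obs), list(act), rew) for obs, act, rew in dataset]
    for i in chosen:
        triggered_obs = trigger(poisoned[i][0])
        poisoned[i] = (triggered_obs, bad_action(triggered_obs), poisoned_reward)
    return poisoned, chosen


def stamp_trigger(obs: List[float]) -> List[float]:
    """Hypothetical trigger: overwrite the first observation feature."""
    out = list(obs)
    out[0] = 5.0
    return out


def zero_action(obs: List[float]) -> List[float]:
    """Hypothetical low-reward action: always output zero control."""
    return [0.0]


# Toy dataset of 20 transitions; poison 10% of them, as in the abstract.
clean = [([float(i), 0.0], [1.0], 1.0) for i in range(20)]
poisoned, idx = poison_dataset(clean, 0.10, stamp_trigger, zero_action, 10.0)
```

Here two of the twenty transitions are rewritten and the rest are copied unchanged, so an agent trained on `poisoned` would still see mostly clean data. A real attack would need a principled way to pick the bad actions and the trigger pattern; both are placeholders in this sketch.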
