Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or apply expensive transformations to input samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose an inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. We therefore turn the backdoored model against itself: we manipulate its latent representations, moving a poisoned sample's features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.
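The core idea of moving features against a backdoor direction can be illustrated with a minimal sketch. This is not FIRE's actual algorithm; it assumes a single unit-norm backdoor direction `d` has already been estimated for one layer's latent space, and simply removes a feature vector's component along that direction:

```python
import numpy as np

def remove_backdoor_component(f: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Move the feature vector f against the estimated backdoor
    direction d by projecting out f's component along d."""
    d = d / np.linalg.norm(d)        # ensure d is unit-norm
    return f - np.dot(f, d) * d      # subtract the component along d

# Toy example: a trigger shifts clean features along d; the repair
# removes exactly that shift. (All names here are illustrative.)
rng = np.random.default_rng(0)
d = rng.normal(size=8)
d /= np.linalg.norm(d)
clean = rng.normal(size=8)
clean -= np.dot(clean, d) * d        # make 'clean' orthogonal to d
poisoned = clean + 3.0 * d           # trigger-induced shift along d
repaired = remove_backdoor_component(poisoned, d)
print(np.allclose(repaired, clean))  # prints True
```

In a real deployment the direction(s) would have to be estimated per layer, and the correction applied between layers during the forward pass; the sketch only conveys the geometric intuition of "applying the trigger in reverse."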