Recent studies have demonstrated that deep neural networks are susceptible to backdoor attacks. In a backdoored model, the prediction for a poisoned sample is dominated by the trigger information, even though trigger information and benign information coexist in the sample. Inspired by the optical polarizer, which passes light waves with particular polarizations while filtering out light waves with other polarizations, we propose a novel backdoor defense method that inserts a learnable neural polarizer into the backdoored model as an intermediate layer, in order to purify poisoned samples by filtering out trigger information while preserving benign information. The neural polarizer is instantiated as a single lightweight linear transformation layer, which is learned by solving a well-designed bi-level optimization problem on a limited clean dataset. Compared with other fine-tuning-based defense methods, which often adjust all parameters of the backdoored model, the proposed method only needs to learn one additional layer, making it more efficient and less demanding of clean data. Extensive experiments demonstrate the effectiveness and efficiency of our method in removing backdoors across various neural network architectures and datasets, especially when clean data is very limited.
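To make the architectural idea concrete, the following is a minimal, illustrative sketch of inserting one extra linear layer between the feature extractor and the classification head of a frozen model. It is not the authors' implementation: `FrozenModel`, `LinearPolarizer`, and `forward` are hypothetical names, the toy two-stage network stands in for a real backdoored model, and the identity initialization is an assumption made here so that insertion initially leaves predictions unchanged. The bi-level training objective is deliberately omitted, since the abstract does not specify it.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


class FrozenModel:
    """Toy stand-in for a backdoored network; its weights are never updated."""

    def __init__(self, in_dim, feat_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(feat_dim)]
        self.W2 = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(n_classes)]

    def features(self, x):
        return relu(matvec(self.W1, x))

    def head(self, z):
        return matvec(self.W2, z)

    def num_params(self):
        return len(self.W1) * len(self.W1[0]) + len(self.W2) * len(self.W2[0])


class LinearPolarizer:
    """Hypothetical polarizer: one linear map z -> W z + b on intermediate features."""

    def __init__(self, dim):
        # Assumption: start from the identity map so the inserted layer is a no-op at first.
        self.W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def __call__(self, z):
        return [s + b for s, b in zip(matvec(self.W, z), self.b)]

    def num_params(self):
        return len(self.W) * len(self.W[0]) + len(self.b)


def forward(model, x, polarizer=None):
    """Run the frozen model, optionally routing features through the polarizer."""
    z = model.features(x)
    if polarizer is not None:
        z = polarizer(z)
    return model.head(z)


model = FrozenModel(in_dim=8, feat_dim=4, n_classes=3)
polarizer = LinearPolarizer(dim=4)
x = [0.5] * 8

logits_plain = forward(model, x)
logits_polarized = forward(model, x, polarizer)
print(logits_plain == logits_polarized)            # identity init leaves the output unchanged
print(polarizer.num_params(), model.num_params())  # only the polarizer's parameters would be trained
```

In this sketch, only `polarizer.W` and `polarizer.b` would be updated on the clean data, while `model.W1` and `model.W2` stay fixed. That mirrors the abstract's contrast with fine-tuning-based defenses, which often adjust all parameters of the backdoored model.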