Deep learning models are well known to be susceptible to backdoor attacks, in which the attacker only needs to supply a tampered dataset with injected triggers. Models trained on such a dataset passively implant the backdoor, and triggers in the input can then mislead the models at test time. Our study shows that a model exhibits different learning behaviors on the clean and poisoned subsets during training. Based on this observation, we propose a general training pipeline that actively defends against backdoor attacks. Benign models can be trained from the unreliable dataset by decoupling the learning process into three stages: supervised learning, active unlearning, and active semi-supervised fine-tuning. Numerous experiments across various backdoor attacks and datasets demonstrate the effectiveness of our approach.