This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the ``minimal essence'' of an input image that is responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that yields the same model output (i.e., logits or deep features). The extracted pattern helps explain the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One can thus leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD could be applied to help detect potential biases in face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.
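The mask optimization and the detection rule described above can be illustrated with a small toy sketch. It is not the authors' implementation (see the linked repository for that), and the abstract does not spell out the objective. The sketch assumes a loss of the form "output fidelity plus an L1 penalty on the mask", uses a squared-error fidelity term for a simple gradient, and replaces the network with a hypothetical one-output `tanh` model in which feature 0 plays the role of a trigger. The names `model`, `distill_mask`, `l1_norm`, and `THRESHOLD`, and all numeric values, are illustrative assumptions.

```python
import math

def model(weights, x):
    """Toy stand-in for a network: a single saturating output tanh(w . x)."""
    return math.tanh(sum(w * v for w, v in zip(weights, x)))

def distill_mask(weights, x, alpha=0.01, lr=0.5, steps=2000):
    """Projected gradient descent on 0.5 * (f(x*m) - f(x))^2 + alpha * ||m||_1.

    The mask m is kept in [0, 1] by clipping after every step.
    (Assumed objective; hyperparameters are illustrative.)
    """
    target = model(weights, x)
    mask = [1.0] * len(x)  # start from the full input
    for _ in range(steps):
        out = model(weights, [v * m for v, m in zip(x, mask)])
        # Chain rule through tanh: d out / d m_j = (1 - out^2) * w_j * x_j
        common = (out - target) * (1.0 - out * out)
        mask = [
            min(1.0, max(0.0, m - lr * (common * w * v + alpha)))
            for m, w, v in zip(mask, weights, x)
        ]
    return mask

def l1_norm(mask):
    """Size of the distilled pattern, measured as the L1 norm of the mask."""
    return sum(abs(m) for m in mask)

# Hypothetical setup: feature 0 is a "trigger" with a dominant weight.
weights = [10.0] + [0.2] * 7
clean = [0.0] + [1.0] * 7     # trigger absent
backdoor = [1.0] + [1.0] * 7  # trigger present

clean_norm = l1_norm(distill_mask(weights, clean))
backdoor_norm = l1_norm(distill_mask(weights, backdoor))

THRESHOLD = 1.0  # illustrative cutoff, not a value from the paper
flagged = [
    name
    for name, norm in [("clean", clean_norm), ("backdoor", backdoor_norm)]
    if norm < THRESHOLD
]
print(flagged)
```

In this toy setup the trigger saturates the output, so the mask for the backdoor input collapses onto the trigger feature and its L1 norm falls well below that of the clean input, which needs most of its features to reproduce the output. A real pipeline would optimize the mask with automatic differentiation against an actual network and choose the detection threshold from the distribution of mask norms over the dataset.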