This paper proposes \emph{Cognitive Distillation} (CD), a simple method to distill and detect backdoor patterns within an image. The idea is to extract the "minimal essence" of an input image that is responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that leads to the same model output (i.e., logits or deep features). The extracted pattern helps explain the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of the trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One can thus leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can potentially be applied to help detect biases in face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.
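To make the mask-optimization idea concrete, the following is a minimal toy sketch and not the implementation from the linked repository. It assumes a hypothetical scalar model $f(x) = \tanh(w \cdot x)$ on an 8-feature input, and an assumed objective that penalizes the squared change in model output plus an L1 term on the mask. The names `distill_mask`, `alpha`, `lr`, and `steps`, along with all numeric values, are illustrative choices rather than anything specified above.

```python
import numpy as np

def distill_mask(w, x, alpha=0.05, lr=0.1, steps=500):
    """Find a small mask m in [0, 1] such that f(x * m) stays close to f(x).

    Toy model (an assumption, not the paper's network): f(x) = tanh(w . x).
    Loss (also an assumption): (f(x * m) - f(x))**2 + alpha * sum(m).
    """
    target = np.tanh(w @ x)
    m = np.ones_like(x)
    for _ in range(steps):
        out = np.tanh(w @ (x * m))
        # Gradient of the squared output mismatch w.r.t. m (chain rule through tanh).
        grad_fit = 2.0 * (out - target) * (1.0 - out**2) * w * x
        # For m >= 0 the L1 penalty alpha * sum(m) has constant gradient alpha.
        m = np.clip(m - lr * (grad_fit + alpha), 0.0, 1.0)
    return m

# Hypothetical 8-feature "image": feature 0 plays the role of a trigger that the
# toy model weights heavily; the remaining features carry the ordinary signal.
w = np.array([10.0] + [0.3] * 7)
clean_x = np.array([0.0] + [1.0] * 7)     # trigger absent
backdoor_x = np.array([1.0] + [1.0] * 7)  # trigger present

clean_mask = distill_mask(w, clean_x)
backdoor_mask = distill_mask(w, backdoor_x)

print("clean mask L1 norm:   ", clean_mask.sum())
print("backdoor mask L1 norm:", backdoor_mask.sum())
```

In this toy setting the mask for the trigger-carrying input shrinks to a small fraction of the trigger feature alone, while the mask for the clean input has to keep weight on several ordinary features. Thresholding the mask's L1 norm is therefore one plausible way to flag suspicious samples, in the spirit of the detection procedure described above.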