This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the ``minimal essence'' of an input image that is responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that leads to the same model output (i.e., logits or deep features). The extracted pattern helps explain the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One can thus leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD may help detect potential biases in face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.
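The mask optimization described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's actual objective or code: the toy model `tanh(w . x)` stands in for a network, the squared-error distance and the L1 weight `alpha` are hypothetical choices, and plain projected gradient descent replaces whatever optimizer the authors use. The fourth feature, with its large weight, plays the role of a trigger.

```python
import math


def model(x, w):
    """Toy stand-in for a network: one saturating logit, tanh(w . x)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def distill_mask(x, w, alpha=0.01, lr=0.1, steps=3000):
    """Minimise (f(x*m) - f(x))^2 + alpha * sum(m) over masks m in [0, 1]^n.

    The objective and hyperparameters are illustrative assumptions.
    """
    target = model(x, w)
    m = [1.0] * len(x)  # start from the full image (nothing masked out)
    for _ in range(steps):
        out = model([mi * xi for mi, xi in zip(m, x)], w)
        # Chain rule through tanh: d(out)/d(m_i) = (1 - out^2) * w_i * x_i
        common = 2.0 * (out - target) * (1.0 - out * out)
        grad = [common * wi * xi + alpha for wi, xi in zip(w, x)]
        # Projected gradient step: clip every mask entry back into [0, 1].
        m = [min(1.0, max(0.0, mi - lr * gi)) for mi, gi in zip(m, grad)]
    return m


# Hypothetical data: the last feature acts as a high-weight "trigger".
W = [1.0, 1.0, 1.0, 10.0]
X_CLEAN = [1.0, 1.0, 1.0, 0.0]     # trigger absent
X_BACKDOOR = [1.0, 1.0, 1.0, 1.0]  # trigger present

mask_clean = distill_mask(X_CLEAN, W)
mask_backdoor = distill_mask(X_BACKDOOR, W)

# The L1 norm of the mask is the quantity one would threshold for detection.
print("clean mask L1 norm:   ", sum(mask_clean))
print("backdoor mask L1 norm:", sum(mask_backdoor))
```

In this toy setting the sample whose output is driven by the single trigger feature should retain a smaller mask than the clean sample, which mirrors the detection signal described above. A real use would replace `model` with a trained network and the hand-written gradient with automatic differentiation.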