Deep neural networks (DNNs) have recently been shown to be vulnerable to backdoor attacks, in which attackers embed hidden backdoors in a DNN model by injecting a few poisoned examples into the training dataset. While extensive efforts have been made to detect and remove backdoors from backdoored DNNs, it is still unclear whether a backdoor-free clean model can be obtained directly from a poisoned dataset. In this paper, we first construct a causal graph to model the generation process of poisoned data and find that the backdoor attack acts as a confounder, introducing spurious associations between the input images and the target labels and making the model predictions less reliable. Inspired by this causal understanding, we propose Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification. Specifically, a backdoored model is intentionally trained to capture the confounding effects. A second, clean model is dedicated to capturing the desired causal effects by minimizing its mutual information with the confounding representations of the backdoored model and by employing a sample-wise re-weighting scheme. Extensive experiments on multiple benchmark datasets against six state-of-the-art attacks verify that the proposed defense is effective in reducing backdoor threats while maintaining high accuracy on benign samples. Further analysis shows that CBD can also resist potential adaptive attacks. The code is available at \url{https://github.com/zaixizhang/CBD}.
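To make the sample-wise re-weighting idea concrete, the following is a minimal sketch and not the released CBD implementation. It assumes one plausible weighting rule, w = CE_B / (CE_B + CE_C), where CE_B and CE_C are the per-sample cross-entropy losses of the backdoored and clean models; the abstract does not state the exact formula, so this rule and all function names (`cross_entropy`, `sample_weights`, `weighted_loss`) are illustrative assumptions. The mutual-information term is omitted entirely.

```python
import math

EPS = 1e-12  # numerical guard; an assumption of this sketch


def cross_entropy(probs, label):
    """Cross-entropy of a single sample given predicted class probabilities."""
    return -math.log(max(probs[label], EPS))


def sample_weights(backdoored_probs, clean_probs, labels):
    """Hypothetical re-weighting rule: w = CE_B / (CE_B + CE_C).

    Samples that the backdoored model already fits with very low loss
    (likely poisoned) receive a weight near 0; samples it fits poorly
    receive a larger weight.
    """
    weights = []
    for p_b, p_c, y in zip(backdoored_probs, clean_probs, labels):
        ce_b = cross_entropy(p_b, y)
        ce_c = cross_entropy(p_c, y)
        weights.append(ce_b / (ce_b + ce_c + EPS))
    return weights


def weighted_loss(clean_probs, labels, weights):
    """Mean of the clean model's per-sample losses scaled by the weights."""
    total = sum(
        w * cross_entropy(p, y)
        for p, y, w in zip(clean_probs, labels, weights)
    )
    return total / len(labels)


# Toy batch of two samples with two classes (made-up probabilities).
backdoored = [[0.01, 0.99], [0.70, 0.30]]  # sample 0 is fit almost perfectly
clean = [[0.50, 0.50], [0.40, 0.60]]
labels = [1, 1]

weights = sample_weights(backdoored, clean, labels)
loss = weighted_loss(clean, labels, weights)
```

In this toy batch the first sample, which the backdoored model predicts with high confidence, is strongly down-weighted relative to the second, so it contributes little to the clean model's training loss.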