RAB: Provable Robustness Against Backdoor Attacks

Recent studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks, including evasion and backdoor (poisoning) attacks. On the defense side, there have been intensive efforts to improve both empirical and provable robustness against evasion attacks; however, provable robustness against backdoor attacks remains largely unexplored. In this paper, we focus on certifying the robustness of machine learning (ML) models against general threat models, especially backdoor attacks. We first provide a unified framework based on randomized smoothing and show how it can be instantiated to certify robustness against both evasion and backdoor attacks. We then propose the first robust training process, RAB, to smooth the trained model and certify its robustness against backdoor attacks. We prove a robustness bound for ML models trained with RAB and prove that this bound is tight. In addition, we show theoretically that robust smoothed models can be trained efficiently for simple models such as K-nearest neighbor (K-NN) classifiers, and we propose an exact smooth-training algorithm that eliminates the need to sample from a noise distribution for such models. Empirically, we conduct comprehensive experiments on different ML models, including DNNs, support vector machines, and K-NN models, on the MNIST, CIFAR-10, and ImageNette datasets, and provide the first benchmark for certified robustness against backdoor attacks. We also evaluate K-NN models on the spambase tabular dataset to demonstrate the advantages of the proposed exact algorithm. Both the theoretical analysis and the comprehensive evaluation on diverse ML models and datasets shed light on further robust learning strategies against general training-time attacks.
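
As a rough illustration of what smoothing over the training process can look like, the following minimal Python sketch trains several copies of a base learner, each on a Gaussian-perturbed copy of the training set, and aggregates their predictions by vote. It is a sketch under stated assumptions, not the RAB procedure itself: the nearest-centroid base learner, the function names, the noise level `sigma`, the number of models, and the toy data are all hypothetical choices made for illustration, and the certification step (deriving a robustness bound from the vote frequencies) is omitted.

```python
import numpy as np


def train_nearest_centroid(X, y, n_classes):
    """Hypothetical base learner: one mean vector (centroid) per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_nearest_centroid(centroids, x):
    """Return the index of the class whose centroid is closest to x."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))


def smoothed_vote(X_train, y_train, x_test, n_classes,
                  sigma=0.5, n_models=50, seed=0):
    """Estimate class frequencies of a training-set-smoothed classifier.

    Each round adds i.i.d. Gaussian noise to every training input,
    retrains the base learner, and records its prediction on x_test.
    """
    rng = np.random.default_rng(seed)
    votes = np.zeros(n_classes, dtype=int)
    for _ in range(n_models):
        X_noisy = X_train + rng.normal(0.0, sigma, size=X_train.shape)
        centroids = train_nearest_centroid(X_noisy, y_train, n_classes)
        votes[predict_nearest_centroid(centroids, x_test)] += 1
    return votes / n_models


# Toy data (assumed): two well-separated 2-D clusters.
data_rng = np.random.default_rng(1)
X_train = np.vstack([
    data_rng.normal(-2.0, 0.3, size=(20, 2)),
    data_rng.normal(2.0, 0.3, size=(20, 2)),
])
y_train = np.array([0] * 20 + [1] * 20)

freqs = smoothed_vote(X_train, y_train, np.array([2.0, 2.0]), n_classes=2)
print("estimated class frequencies:", freqs)
```

The vote frequencies returned by `smoothed_vote` are Monte Carlo estimates of the smoothed classifier's class probabilities. A certification procedure would bound these estimates with confidence intervals before deriving a robustness radius, a step this sketch leaves out.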
