Machine learning models that use deep neural networks (DNNs) are vulnerable to backdoor attacks. An adversary carrying out a backdoor attack embeds a predefined perturbation, called a trigger, into a small subset of input samples and trains the DNN so that the presence of the trigger in an input results in an adversary-desired output class. Such adversarial retraining must, however, leave outputs for inputs without the trigger unaffected, so that the model retains high classification accuracy on clean samples. In this paper, we propose MDTD, a Multi-Domain Trojan Detector for DNNs, which detects inputs containing a Trojan trigger at testing time. MDTD does not require knowledge of the attacker's trigger-embedding strategy and can be applied to a pre-trained DNN model with image, audio, or graph-based inputs. MDTD leverages the insight that input samples containing a Trojan trigger lie farther from a decision boundary than clean samples do. MDTD estimates the distance to a decision boundary using adversarial learning methods and uses this distance to infer whether a test-time input sample is Trojaned. We evaluate MDTD against state-of-the-art Trojan detection methods across five widely used image-based datasets (CIFAR100, CIFAR10, GTSRB, SVHN, and Flowers102), four graph-based datasets (AIDS, WinMal, Toxicant, and COLLAB), and the SpeechCommand audio dataset. MDTD effectively identifies samples that contain different types of Trojan triggers. We also evaluate MDTD against adaptive attacks in which an adversary trains a robust DNN to increase the distance of benign inputs from a decision boundary and decrease the distance of Trojan inputs.
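The detection idea described above (estimate each input's distance to a decision boundary, then flag inputs that are unusually far from it) can be illustrated with a minimal sketch. This is not the authors' implementation. A toy linear classifier stands in for the DNN, a grid search over FGSM-style sign steps stands in for the adversarial learning methods mentioned in the abstract, and the quantile threshold calibrated on clean samples is an assumption made purely for illustration. All function names (`predict`, `min_flip_epsilon`, `calibrate_threshold`, `is_suspicious`) are hypothetical.

```python
import numpy as np


def predict(w, b, x):
    """Hard label (0 or 1) of a toy linear classifier standing in for a DNN."""
    return int(np.dot(w, x) + b > 0)


def min_flip_epsilon(w, b, x, eps_grid):
    """Proxy for distance to the decision boundary.

    Returns the smallest step size in eps_grid for which an FGSM-style
    sign perturbation changes the predicted label, or inf if none does.
    """
    label = predict(w, b, x)
    # For a linear score w.x + b the gradient w.r.t. x is w, so the
    # sign step that pushes the score toward the other class is -/+ sign(w).
    direction = -np.sign(w) if label == 1 else np.sign(w)
    for eps in eps_grid:
        if predict(w, b, x + eps * direction) != label:
            return float(eps)
    return float("inf")


def calibrate_threshold(w, b, clean_samples, eps_grid, quantile=0.95):
    """Illustrative assumption: threshold = a high quantile of the
    distance proxy measured on samples known to be clean."""
    dists = [min_flip_epsilon(w, b, x, eps_grid) for x in clean_samples]
    return float(np.quantile(dists, quantile))


def is_suspicious(w, b, x, eps_grid, threshold):
    """Flag an input whose distance proxy exceeds the clean-sample threshold."""
    return min_flip_epsilon(w, b, x, eps_grid) > threshold


# Toy usage with made-up numbers: clean points sit near the boundary,
# while the hypothetical "triggered" point sits far from it.
w = np.array([1.0, -1.0])
b = 0.0
eps_grid = np.linspace(0.05, 5.0, 100)
clean = np.array([[0.25, 0.0], [-0.25, 0.0], [0.45, 0.0], [0.0, 0.65]])
threshold = calibrate_threshold(w, b, clean, eps_grid)
print(is_suspicious(w, b, np.array([4.05, 0.0]), eps_grid, threshold))
```

In a real setting the classifier would be a trained DNN and the perturbation search would use the model's actual gradients; the sketch only shows how a distance estimate and a threshold combine into a test-time decision.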