OCGEC: One-class Graph Embedding Classification for DNN Backdoor Detection

Deep neural networks (DNNs) have been found vulnerable to backdoor attacks, raising security concerns about their deployment in mission-critical applications. Various approaches exist for detecting backdoor attacks; however, they all make certain assumptions about the target attack to be detected and require large, balanced numbers of clean and backdoor samples for training, which severely limits these detection methods in real-world circumstances. This study proposes a novel one-class classification framework called One-class Graph Embedding Classification (OCGEC) that uses graph neural networks (GNNs) for model-level backdoor detection with only a small amount of clean data. First, we train thousands of tiny models on a small number of clean datasets to form the raw dataset. Next, we design a model-to-graph method that converts each model's structural details and weight features into graph data. We then pre-train a generative self-supervised graph autoencoder (GAE) to learn the features of benign models, so that backdoor models can be detected without knowledge of the attack strategy. Finally, we dynamically combine the GAE and one-class classifier optimization objectives to form a classification boundary that separates backdoor models from benign models. OCGEC thus combines the representational power of graph neural networks with the practicality of one-class classification techniques in anomaly detection. Compared with other baselines, it achieves AUC scores above 98% on a number of tasks, far exceeding existing detection methods even when those methods rely on large numbers of positive and negative samples. Our pioneering application of graph representations to generic backdoor detection offers new insights that may help improve other backdoor defense tasks. Code is available at https://github.com/jhy549/OCGEC.
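
The pipeline above can be illustrated with a minimal NumPy sketch. It is not the OCGEC implementation: the abstract does not specify the graph construction, the GNN, or the exact one-class objective. The sketch assumes a hypothetical neurons-as-nodes, weights-as-edges conversion for a small MLP, replaces the pre-trained GAE encoder with a single GCN-style propagation step using a fixed random projection, and uses a Deep SVDD-style hypersphere score (squared distance to a center computed from benign models only) as the one-class component. All function names (`mlp_to_graph`, `graph_embedding`, `one_class_scores`, `combined_loss`, `random_mlp`) and the weight `lam` are hypothetical.

```python
import numpy as np

def mlp_to_graph(weights, biases):
    """Hypothetical model-to-graph conversion for an MLP.

    Every neuron is a node; the weight from neuron i in layer l to neuron j
    in layer l+1 becomes a weighted directed edge i -> j.
    Node features are [bias, layer_index]; input neurons have bias 0.
    """
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    offsets = np.cumsum([0] + sizes[:-1])  # first node index of each layer
    n = sum(sizes)
    adj = np.zeros((n, n))
    feats = np.zeros((n, 2))
    for layer, size in enumerate(sizes):
        feats[offsets[layer]:offsets[layer] + size, 1] = layer
    for l, (w, b) in enumerate(zip(weights, biases)):
        src, dst = offsets[l], offsets[l + 1]
        # w has shape (out, in), so w.T maps source rows to destination columns
        adj[src:src + w.shape[1], dst:dst + w.shape[0]] = w.T
        feats[dst:dst + w.shape[0], 0] = b
    return adj, feats

def graph_embedding(adj, feats, proj):
    """One GCN-style propagation step followed by mean pooling.

    Stands in for a learned GNN/GAE encoder (assumption, not OCGEC's encoder).
    """
    a = np.abs(adj) + np.abs(adj).T + np.eye(len(adj))  # undirected, self-loops
    deg = a.sum(axis=1)
    a_norm = a / np.sqrt(np.outer(deg, deg))  # D^-1/2 A D^-1/2
    h = np.tanh(a_norm @ feats @ proj)  # node embeddings
    return h.mean(axis=0)  # graph-level embedding

def one_class_scores(embeddings, center):
    """Deep SVDD-style anomaly score: squared distance to the center."""
    return np.sum((embeddings - center) ** 2, axis=1)

def combined_loss(recon_loss, embeddings, center, lam):
    """Hypothetical weighted sum of a GAE reconstruction loss and the one-class term."""
    return recon_loss + lam * one_class_scores(embeddings, center).mean()

def random_mlp(rng, sizes=(4, 8, 3)):
    """Random tiny MLP used only as synthetic stand-in data."""
    weights = [rng.normal(0, 0.5, (o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
    biases = [rng.normal(0, 0.1, o) for o in sizes[1:]]
    return weights, biases

rng = np.random.default_rng(0)
proj = rng.normal(size=(2, 16))  # fixed random projection in place of learned weights
benign = [mlp_to_graph(*random_mlp(rng)) for _ in range(50)]
z = np.stack([graph_embedding(a, x, proj) for a, x in benign])
center = z.mean(axis=0)  # center is estimated from benign models only
threshold = np.quantile(one_class_scores(z, center), 0.95)

suspect = graph_embedding(*mlp_to_graph(*random_mlp(rng)), proj)
is_flagged = one_class_scores(suspect[None, :], center)[0] > threshold
```

Only benign models are needed to fit the center and the threshold, which is the property that makes one-class detection attractive when backdoor samples are unavailable. In a trained system, `combined_loss` would be minimized with respect to the encoder parameters so that benign embeddings contract toward the center; the sketch only evaluates it.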
