Neural networks can conceal malicious Trojan backdoors that allow a trigger input to covertly change the model's behavior. Detecting signs of these backdoors, particularly without access to any triggered data, is the subject of ongoing research and open challenges. In one common formulation of the problem, we are given a set of clean and poisoned models and must predict whether a given test model is clean or poisoned. In this paper, we introduce a detector that works remarkably well across many of the existing datasets and domains. It is obtained by training a binary classifier on a large number of models' weights after performing several pre-processing steps: feature selection and standardization, reference-model weight subtraction, and model alignment prior to detection. We evaluate this algorithm on a diverse set of Trojan detection benchmarks and domains and examine the cases where the approach is most and least effective.
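The detection pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): weight vectors and labels are synthetic, the reference model is approximated by the clean-model mean, the model-alignment step is omitted, and all hyperparameters (`k=100`, logistic regression) are placeholder choices.

```python
# Hypothetical sketch of a weight-based Trojan detector: subtract a reference
# model's weights, select informative features, standardize, then train a
# binary classifier. All data and hyperparameters here are illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_models, n_weights = 200, 1000

# Flattened weight vectors for clean (0) and poisoned (1) models (synthetic).
weights = rng.normal(size=(n_models, n_weights))
labels = rng.integers(0, 2, size=n_models)
weights[labels == 1, :50] += 0.5  # poisoning perturbs a small weight subset

# Step 1: subtract a reference model's weights (here, the mean clean model;
# the paper's alignment step before this subtraction is omitted for brevity).
reference = weights[labels == 0].mean(axis=0)
features = weights - reference

# Steps 2-3: feature selection and standardization, then a binary classifier.
detector = make_pipeline(
    SelectKBest(f_classif, k=100),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
detector.fit(features, labels)
train_acc = detector.score(features, labels)
print(f"train accuracy: {train_acc:.2f}")
```

In practice the classifier would be fit on the given set of clean and poisoned training models and evaluated on held-out test models; the sketch reports training accuracy only to show the pipeline end to end.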