In recent years, the rise of machine learning (ML) in cybersecurity has brought new challenges, including the growing threat of backdoor poisoning attacks against ML malware classifiers. For instance, adversaries could inject malicious samples into public malware repositories, contaminating the training data and potentially causing the ML model to misclassify malware. Current countermeasures predominantly focus on detecting poisoned samples by leveraging disagreements among the outputs of a diverse ensemble of models on training data points. However, these methods are not suitable for scenarios where Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove backdoors from a model after it has been trained. Addressing this scenario, we introduce PBP, a post-training defense for malware classifiers that mitigates various types of backdoors without assuming any specific trigger-embedding mechanism. Our method exploits the influence of backdoor attacks on the activation distribution of neural networks, independent of how the trigger is embedded. In the presence of a backdoor attack, the activation distribution of each layer is distorted into a mixture of distributions. By regulating the statistics of the batch normalization layers, we can guide a backdoored model to behave similarly to a clean one. Our method demonstrates substantial advantages over several state-of-the-art methods, as evidenced by experiments on two datasets, two types of backdoor methods, and various attack configurations. Notably, our approach requires only a small portion of the training data -- only 1\% -- to purify the backdoor and reduce the attack success rate from 100\% to almost 0\%, a 100-fold improvement over the baseline methods. Our code is available at \url{https://github.com/judydnguyen/pbp-backdoor-purification-official}.
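To make the core intuition concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of why re-estimating batch-normalization statistics on a small clean subset can undo the mixture distortion described above. A backdoor trigger shifts the activation distribution of a layer, so the running mean and variance estimated during poisoned training reflect a mixture of clean and trigger activations; recomputing them from as little as 1\% of clean data recovers statistics close to the clean distribution. All distributions and numbers here are hypothetical, chosen purely for illustration.

```python
import random
import statistics

random.seed(0)

# Simulated pre-normalization activations reaching one BN channel.
# Clean inputs follow one distribution; triggered inputs shift it,
# so the layer sees a mixture during poisoned training.
clean = [random.gauss(0.0, 1.0) for _ in range(10_000)]
trigger = [random.gauss(4.0, 0.5) for _ in range(1_000)]

# Poisoned training estimates BN statistics on the mixture,
# distorting the running mean and variance.
poisoned_inputs = clean + trigger
poisoned_mean = statistics.fmean(poisoned_inputs)
poisoned_var = statistics.pvariance(poisoned_inputs)

# Post-training purification: re-estimate the BN statistics on a
# small clean subset (1% of the clean data), leaving the learned
# weights untouched.
subset = random.sample(clean, k=len(clean) // 100)
purified_mean = statistics.fmean(subset)
purified_var = statistics.pvariance(subset)

print(f"poisoned: mean={poisoned_mean:.2f}, var={poisoned_var:.2f}")
print(f"purified: mean={purified_mean:.2f}, var={purified_var:.2f}")
```

The purified statistics land near the clean distribution's mean 0 and variance 1, while the poisoned ones are pulled toward the trigger mode; in a real network this re-estimation would be applied per BN layer by forwarding the clean subset through the frozen model.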