PBP: Post-training Backdoor Purification for Malware Classifiers

In recent years, the rise of machine learning (ML) in cybersecurity has brought new challenges, including the growing threat of backdoor poisoning attacks on ML malware classifiers. For instance, adversaries could inject malicious samples into public malware repositories, contaminating the training data and potentially causing the ML model to misclassify malware. Current countermeasures predominantly focus on detecting poisoned samples by leveraging disagreements among the outputs of a diverse ensemble of models on training data points. However, these methods are not suitable when Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove backdoors from a model after it has been trained. To address this scenario, we introduce PBP, a post-training defense for malware classifiers that mitigates various types of backdoor embeddings without assuming any specific backdoor embedding mechanism. Our method exploits the influence of backdoor attacks on the activation distribution of neural networks, an influence that is independent of the trigger-embedding method. In the presence of a backdoor attack, the activation distribution of each layer is distorted into a mixture of distributions. By regulating the statistics of the batch normalization layers, we can guide a backdoored model to perform similarly to a clean one. Experiments on two datasets, two types of backdoor methods, and various attack configurations show that our method has substantial advantages over several state-of-the-art methods. Notably, our approach requires only a small portion of the training data (1\%) to purify the backdoor and reduce the attack success rate from 100\% to almost 0\%, a 100-fold improvement over the baseline methods. Our code is available at https://github.com/judydnguyen/pbp-backdoor-purification-official.
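The intuition about mixture-distorted activations and batch normalization statistics can be illustrated with a toy numerical sketch. This is not the PBP algorithm or the authors' implementation (see the linked repository for that). It only shows, for a single scalar feature, how a second activation mode shifts the mean and variance a batch normalization layer would track, and how re-estimating those statistics from a small clean subset brings them back toward the clean values. The Gaussian modes, the sample counts, and the helper name `batch_stats` are all assumptions chosen for illustration.

```python
import random
import statistics


def batch_stats(values):
    """Return the (mean, variance) a batch-norm layer would track for one feature."""
    return statistics.fmean(values), statistics.pvariance(values)


rng = random.Random(0)

# Hypothetical activations of one neuron on clean inputs.
clean = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Hypothetical activations on triggered inputs: a second, shifted mode.
triggered = [rng.gauss(6.0, 0.5) for _ in range(2_000)]

# Reference statistics from clean activations only.
clean_mean, clean_var = batch_stats(clean)

# Statistics estimated over the mixture of both modes.
poisoned_mean, poisoned_var = batch_stats(clean + triggered)

# Re-estimate the statistics from a small clean subset
# (1% of the clean activations, an illustrative choice).
subset = rng.sample(clean, len(clean) // 100)
recal_mean, recal_var = batch_stats(subset)

print(f"clean:        mean={clean_mean:.3f} var={clean_var:.3f}")
print(f"mixture:      mean={poisoned_mean:.3f} var={poisoned_var:.3f}")
print(f"recalibrated: mean={recal_mean:.3f} var={recal_var:.3f}")
```

In this toy setting the mixture inflates both statistics, while the estimate from the small clean subset stays much closer to the clean reference. The actual method operates on a full network; this sketch only makes the statistical observation concrete.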
