Circumventing Backdoor Defenses That Are Based on Latent Separability

Recent studies have shown that deep learning is susceptible to backdoor poisoning attacks. An adversary can embed a hidden backdoor into a model and manipulate its predictions by modifying only a small number of training samples, without controlling the training process. A tangible signature has been widely observed across a diverse set of backdoor poisoning attacks: models trained on a poisoned dataset tend to learn separable latent representations for poison and clean samples. This latent separation is so pervasive that a family of backdoor defenses takes it as a default assumption (dubbed the latent separability assumption) and relies on it to identify poison samples via cluster analysis in the latent space. An intriguing question follows: is latent separation unavoidable for backdoor poisoning attacks? This question is central to understanding whether the latent separability assumption provides a reliable foundation for defending against backdoor poisoning attacks. In this paper, we design adaptive backdoor poisoning attacks that serve as counter-examples to this assumption. Our methods include two key components: (1) a set of trigger-planted samples that are correctly labeled with their semantic classes (rather than the target class), which regularize backdoor learning; (2) asymmetric trigger planting strategies that boost the attack success rate (ASR) and diversify the latent representations of poison samples. Extensive experiments on benchmark datasets verify the effectiveness of our adaptive attacks in bypassing existing latent-separation-based backdoor defenses. Moreover, our attacks maintain a high attack success rate with a negligible drop in clean accuracy. Our studies call for defense designers to exercise caution when relying on latent separation as an assumption in their defenses.
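To make component (1) concrete, the following is a minimal sketch, not the authors' implementation, of how a poisoned training set with correctly labeled trigger-planted samples might be assembled. Everything specific here is an assumption for illustration: the corner-patch trigger, the helper names `plant_trigger` and `build_poisoned_set`, the terms "payload" and "regularization" for the two groups, and the choice to draw both groups from non-target classes. The asymmetric trigger planting of component (2) is not sketched, because the abstract does not specify it.

```python
import random


def plant_trigger(image, value=1.0, size=2):
    """Return a copy of `image` (a list of rows) with a size x size patch
    written into the bottom-right corner. The patch trigger is hypothetical."""
    patched = [row[:] for row in image]
    for r in range(-size, 0):
        for c in range(-size, 0):
            patched[r][c] = value
    return patched


def build_poisoned_set(images, labels, target_class,
                       n_payload, n_regularization, seed=0):
    """Assemble a poisoned dataset with two kinds of trigger-planted samples.

    Payload samples: trigger planted, label flipped to `target_class`.
    Regularization samples: trigger planted, original (correct) label kept.
    """
    rng = random.Random(seed)
    # Assumption: both groups are drawn from non-target classes, so a
    # regularization sample's correct label is never the target class.
    candidates = [i for i, y in enumerate(labels) if y != target_class]
    chosen = rng.sample(candidates, n_payload + n_regularization)
    payload_idx = set(chosen[:n_payload])
    regularization_idx = set(chosen[n_payload:])

    new_images, new_labels = [], []
    for i, (x, y) in enumerate(zip(images, labels)):
        if i in payload_idx:
            new_images.append(plant_trigger(x))
            new_labels.append(target_class)  # mislabeled on purpose
        elif i in regularization_idx:
            new_images.append(plant_trigger(x))
            new_labels.append(y)  # trigger present, label stays correct
        else:
            new_images.append(x)
            new_labels.append(y)
    return new_images, new_labels, payload_idx, regularization_idx
```

The intuition suggested by the abstract is that, because some trigger-carrying samples keep their correct labels, the trigger alone is a less decisive cue for the target class, which may discourage the model from mapping all poison samples to one isolated region of the latent space.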
