Backdoor attacks inject poisoning samples during training, with the goal of forcing a machine learning model to output an attacker-chosen class when presented with a specific trigger at test time. Although backdoor attacks have been demonstrated in a variety of settings and against different models, the factors affecting their effectiveness are still not well understood. In this work, we provide a unifying framework to study the process of backdoor learning through the lens of incremental learning and influence functions. We show that the effectiveness of backdoor attacks depends on: (i) the complexity of the learning algorithm, controlled by its hyperparameters; (ii) the fraction of backdoor samples injected into the training set; and (iii) the size and visibility of the backdoor trigger. These factors affect how fast a model learns to correlate the presence of the backdoor trigger with the target class. Our analysis unveils the intriguing existence of a region in the hyperparameter space in which the accuracy on clean test samples is still high while backdoor attacks are ineffective, thereby suggesting novel criteria to improve existing defenses.
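To make the attack setup concrete, the following is a minimal sketch of backdoor poisoning as described above, assuming image data as NumPy arrays scaled to [0, 1]; the `poison_dataset` helper and its parameters (the poisoning fraction, trigger size, and blending-based visibility) are hypothetical choices for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of backdoor poisoning: stamp a small trigger patch on a
# fraction of training images and relabel them to the attacker-chosen class.
import numpy as np

def poison_dataset(X, y, target_class, poison_frac=0.1,
                   trigger_size=3, visibility=1.0, seed=0):
    """Return a poisoned copy of (X, y) with a trigger stamped on a subset."""
    rng = np.random.default_rng(seed)
    X, y = X.copy(), y.copy()
    n_poison = int(poison_frac * len(X))          # factor (ii): injected fraction
    idx = rng.choice(len(X), size=n_poison, replace=False)
    for i in idx:
        # factor (iii): trigger size and visibility, here an alpha-blended
        # white patch in the bottom-right corner of the image
        patch = X[i, -trigger_size:, -trigger_size:]
        X[i, -trigger_size:, -trigger_size:] = (
            (1 - visibility) * patch + visibility * 1.0)
        y[i] = target_class                       # attacker-chosen label
    return X, y
```

At test time, stamping the same trigger on a clean input should flip a successfully backdoored model's prediction to `target_class`, while predictions on unperturbed inputs remain largely unaffected.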