Adversarial training (AT) is a robust learning algorithm that can defend against adversarial attacks in the inference phase and mitigate the side effects of corrupted data in the training phase. As such, it has become an indispensable component of many artificial intelligence (AI) systems. However, in high-stakes AI applications, it is crucial to understand AT's vulnerabilities to ensure reliable deployment. In this paper, we investigate AT's susceptibility to poisoning attacks, a type of malicious attack that manipulates training data to compromise the performance of the trained model. Previous work has focused on poisoning attacks against standard training, and little research has examined their effectiveness against AT. To fill this gap, we design and test effective poisoning attacks against AT. Specifically, we investigate and design clean-label targeted poisoning attacks, which allow attackers to imperceptibly modify a small fraction of the training data to control the algorithm's behavior on a specific target data point. Additionally, we propose a clean-label untargeted attack, which enables attackers to attach tiny stickers to training data to degrade the algorithm's performance on all test data; the stickers could also serve as a signal against unauthorized data collection. Our experiments demonstrate that AT can still be poisoned, highlighting the need for caution when using vanilla AT algorithms in security-related applications. The code is at https://github.com/zjfheart/Poison-adv-training.git.
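For readers unfamiliar with the algorithm under attack, the following is a minimal sketch of one adversarial training step in its commonly used PGD-based form: an inner loop searches for a bounded perturbation that raises the loss, and the model is then updated on the perturbed input. It is an illustration only, assuming a toy logistic-regression model and an L-infinity budget; the function names (`logistic_loss`, `pgd_linf`, `adversarial_training_step`) and all hyperparameter values are hypothetical and are not taken from the paper or its repository. It does not implement the poisoning attacks proposed here.

```python
import numpy as np

def logistic_loss(w, b, x, y):
    """Binary cross-entropy of a linear classifier on one example; y is 0 or 1."""
    z = x @ w + b
    # log(1 + exp(z)) - y * z, computed in a numerically stable way
    return np.logaddexp(0.0, z) - y * z

def pgd_linf(w, b, x, y, eps=0.1, step=0.02, iters=10):
    """Inner maximization: search for delta with ||delta||_inf <= eps that raises the loss."""
    delta = np.zeros_like(x)
    for _ in range(iters):
        z = (x + delta) @ w + b
        p = 1.0 / (1.0 + np.exp(-z))      # predicted probability of class 1
        grad_x = (p - y) * w              # gradient of the loss w.r.t. the input
        # signed gradient ascent step, then projection back onto the eps-ball
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)
    return x + delta

def adversarial_training_step(w, b, x, y, lr=0.1, **pgd_kwargs):
    """Outer minimization: one gradient-descent step on the adversarial example."""
    x_adv = pgd_linf(w, b, x, y, **pgd_kwargs)
    p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
    w_new = w - lr * (p - y) * x_adv      # gradient of the loss w.r.t. w
    b_new = b - lr * (p - y)              # gradient of the loss w.r.t. b
    return w_new, b_new

# Toy usage with made-up numbers
w, b = np.array([1.0, -2.0]), 0.0
x, y = np.array([0.5, 0.5]), 1.0
w, b = adversarial_training_step(w, b, x, y, eps=0.1, step=0.02, iters=10)
```

A poisoning attacker in the setting of this paper acts before this loop runs, by altering a fraction of the training inputs `x` while leaving the labels `y` unchanged; how those alterations are crafted is the subject of the paper and is not shown in the sketch.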