Learning-based systems have been shown to be vulnerable to backdoor attacks, in which malicious users manipulate model behavior by injecting backdoors into the target model and activating them with specific triggers. Previous backdoor attacks have focused primarily on two key metrics: attack success rate and stealthiness. However, these methods often require significant privileges over the target model, such as control over the training process, making them difficult to mount in real-world scenarios. Moreover, the robustness of existing backdoor attacks is not guaranteed, as they are sensitive to defenses such as image augmentation and model distillation. In this paper, we address these two limitations and introduce RSBA (Robust Statistical Backdoor Attack under Privilege-constrained Scenarios). The key insight of RSBA is that statistical features naturally divide images into distinct groups, offering a potential implementation of triggers. Such triggers are more robust than manually designed ones because they are widely distributed across normal images. By leveraging these statistical triggers, RSBA enables attackers to mount black-box attacks by poisoning only the labels or only the images. We demonstrate, both empirically and theoretically, the robustness of RSBA against image augmentation and model distillation. Experimental results show that RSBA achieves a 99.83\% attack success rate in black-box scenarios. Remarkably, it maintains a high success rate even after model distillation, where attackers lack access to the student model's training dataset, whereas baseline methods average only a 1.39\% success rate.
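To make the notion of a statistical trigger concrete, the following is a minimal sketch of label-only poisoning driven by a statistical feature. It assumes per-image pixel variance as the statistic and a fixed threshold as the group boundary; `statistical_feature`, `THRESHOLD`, and `TARGET_CLASS` are illustrative assumptions introduced here for exposition, not the paper's actual feature choice or selection rule.

```python
import numpy as np

TARGET_CLASS = 0      # attacker's chosen target label (hypothetical)
THRESHOLD = 2500.0    # feature value splitting images into two groups (hypothetical)

def statistical_feature(image: np.ndarray) -> float:
    """One candidate statistical feature: per-image pixel variance.

    Variance is used only to illustrate how a statistic partitions
    images into groups; RSBA's actual feature may differ.
    """
    return float(image.astype(np.float64).var())

def poison_labels(images: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Label-only poisoning: relabel every image whose statistical feature
    exceeds THRESHOLD to TARGET_CLASS. The images themselves are left
    untouched, matching the black-box, label-poisoning setting the
    abstract describes."""
    poisoned = labels.copy()
    for i, img in enumerate(images):
        if statistical_feature(img) > THRESHOLD:
            poisoned[i] = TARGET_CLASS
    return poisoned

if __name__ == "__main__":
    # Toy data: 100 random 32x32 RGB "images" across 10 classes.
    rng = np.random.default_rng(0)
    images = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
    labels = rng.integers(0, 10, size=100)
    poisoned = poison_labels(images, labels)
    print(f"{(poisoned != labels).sum()} of {len(labels)} labels flipped")
```

At inference time, under these assumptions, any input whose feature value falls on the poisoned side of the threshold would act as a trigger, which is why such triggers survive transformations that preserve the underlying image statistics.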