A powerful category of (invisible) data poisoning attacks modifies a subset of training examples with small adversarial perturbations to change the prediction of certain test-time data. Existing defense mechanisms are not desirable to deploy in practice, as they often either drastically harm generalization performance, or are attack-specific and prohibitively slow to apply. Here, we propose a simple but highly effective approach that, unlike existing methods, breaks various types of invisible poisoning attacks with only a slight drop in generalization performance. We make the key observation that attacks introduce local sharp regions of high training loss, which, when minimized, result in learning the adversarial perturbations and make the attack successful. To break poisoning attacks, our key idea is to alleviate the sharp loss regions introduced by poisons. To do so, our approach comprises two components: an optimized friendly noise that is generated to maximally perturb examples without degrading the performance, and a randomly varying noise component. The combination of both components builds a very lightweight but extremely effective defense against the most powerful triggerless targeted and hidden-trigger backdoor poisoning attacks, including Gradient Matching, Bulls-eye Polytope, and Sleeper Agent. We show that our friendly noise is transferable to other architectures, and that adaptive attacks cannot break our defense due to its random noise component. Our code is available at: https://github.com/tianyu139/friendly-noise
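The two components described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that friendly noise is found by minimizing a KL-divergence term between the predictions on the noisy and clean inputs, minus a weighted noise norm, under a max-norm bound; the abstract does not state the exact objective. The fixed linear softmax classifier, the finite-difference gradients, the uniform random noise, and all names and parameter values (`friendly_noise`, `defend`, `zeta`, `lam`, `mu`) are hypothetical choices made only to keep the example self-contained.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    # Toy stand-in for a trained network: a fixed linear softmax classifier.
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def objective(W, x, eps, lam):
    # Assumed objective: keep the prediction unchanged (small KL)
    # while rewarding a large perturbation (minus lam * L2 norm).
    clean = predict(W, x)
    noisy = predict(W, [a + b for a, b in zip(x, eps)])
    norm = math.sqrt(sum(e * e for e in eps))
    return kl(noisy, clean) - lam * norm

def friendly_noise(W, x, zeta=0.1, lam=0.5, lr=0.05, steps=300, seed=0, h=1e-5):
    # Projected gradient descent on the assumed objective, using central
    # finite differences instead of autograd to stay dependency-free.
    rng = random.Random(seed)
    eps = [rng.uniform(-0.01, 0.01) for _ in x]  # small nonzero start
    for _ in range(steps):
        grad = []
        for i in range(len(eps)):
            up = eps[:i] + [eps[i] + h] + eps[i + 1:]
            down = eps[:i] + [eps[i] - h] + eps[i + 1:]
            diff = objective(W, x, up, lam) - objective(W, x, down, lam)
            grad.append(diff / (2 * h))
        # Gradient step, then projection onto the box [-zeta, zeta].
        eps = [min(zeta, max(-zeta, e - lr * g)) for e, g in zip(eps, grad)]
    return eps

def defend(x, eps, rng, mu=0.05):
    # Fixed friendly noise plus random noise redrawn on every call
    # (uniform noise is an assumption made for this sketch).
    return [a + e + rng.uniform(-mu, mu) for a, e in zip(x, eps)]

# Only feature 0 influences this toy classifier's prediction.
W = [[4.0, 0.0, 0.0], [-4.0, 0.0, 0.0]]
x = [0.1, 0.5, -0.3]
eps = friendly_noise(W, x)

rng = random.Random(1)
view_a = defend(x, eps, rng)  # what the model would see in one epoch
view_b = defend(x, eps, rng)  # a different draw in the next epoch
```

In this toy setting the optimized noise saturates at the bound on the two features the classifier ignores and shrinks toward zero on the one feature that determines the prediction, which is the sense in which the noise is "friendly". The random component differs between `view_a` and `view_b`, so a fixed adversarial perturbation never meets the same input twice.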