Web-scraped datasets are vulnerable to data poisoning, which can be used to backdoor deep image classifiers during training. Since training on large datasets is expensive, a model is typically trained once and reused many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show that this is not necessarily true: more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with only a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset.
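To give a sense of scale for the reported poisoning rate, the following is a minimal arithmetic sketch, not part of the attack itself. The abstract states only the rate (0.15%) and the class count (6,000); the training-set size used below is a hypothetical placeholder, and `poison_budget` is an illustrative helper name.

```python
def poison_budget(num_train, poison_rate, num_classes):
    """Return (total poison samples, average poison samples per class)."""
    total = round(num_train * poison_rate)
    return total, total / num_classes


# Hypothetical training-set size (assumption: roughly 1,000 images per class).
NUM_TRAIN = 6_000_000
POISON_RATE = 0.0015   # 0.15%, as stated in the abstract
NUM_CLASSES = 6_000    # as stated in the abstract

total, per_class = poison_budget(NUM_TRAIN, POISON_RATE, NUM_CLASSES)
print(f"total poison samples: {total}")
print(f"average poison samples per class: {per_class:.2f}")
```

Under this assumed dataset size, the budget averages fewer than two poison samples per class. That is the regime where a naive per-class composition of attacks would have very little to work with, and where transferability between classes would matter.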