Defending machine-learning (ML) models against white-box adversarial attacks has proven to be extremely difficult. Recent work has therefore proposed stateful defenses, which aim to defend against a more restricted black-box attacker. These defenses track a history of incoming model queries and reject those that are suspiciously similar to earlier ones. The current state-of-the-art stateful defense, Blacklight, was proposed at USENIX Security '22 and claims to prevent nearly 100% of attacks on both the CIFAR10 and ImageNet datasets. In this paper, we observe that an attacker can significantly reduce the accuracy of a Blacklight-protected classifier (e.g., from 82.2% to 6.4% on CIFAR10) by simply adjusting the parameters of an existing black-box attack. This observation is surprising because the Blacklight authors already evaluated existing attacks against their defense. Motivated by it, we provide a systematization of stateful defenses to understand why existing stateful defenses fail. Finally, we propose a stronger evaluation strategy for stateful defenses, composed of adaptive score-based and hard-label black-box attacks. Using these attacks, we reduce even reconfigured versions of Blacklight to as low as 0% robust accuracy.
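The query-tracking idea described in the abstract (keep a history of queries, reject those suspiciously similar to earlier ones) can be illustrated with a minimal sketch. This is not Blacklight's actual mechanism: the class name `StatefulDetector`, the plain L2 distance, the `threshold` value, and the `toy_model` classifier are all assumptions chosen only to make the general pattern concrete.

```python
import math
from typing import Callable, List, Optional, Sequence


class StatefulDetector:
    """Toy stateful defense (hypothetical): reject queries close to an earlier one."""

    def __init__(self, model: Callable[[Sequence[float]], int], threshold: float):
        self.model = model          # assumed classifier: input vector -> label
        self.threshold = threshold  # assumed L2-distance cutoff for "too similar"
        self.history: List[Sequence[float]] = []

    def query(self, x: Sequence[float]) -> Optional[int]:
        # A query is suspicious if any stored query lies within the threshold.
        suspicious = any(math.dist(x, past) < self.threshold for past in self.history)
        # Every query is recorded, whether or not it is answered.
        self.history.append(list(x))
        if suspicious:
            return None  # rejected: no label is returned
        return self.model(x)


def toy_model(x: Sequence[float]) -> int:
    # Stand-in classifier for illustration only: label 1 if the features sum to > 0.
    return int(sum(x) > 0)


detector = StatefulDetector(toy_model, threshold=0.1)
first = detector.query([0.5, 0.5])     # nothing in history yet, so it is answered
second = detector.query([0.5, 0.51])   # about 0.01 away from the first query, so rejected
third = detector.query([-1.0, -1.0])   # far from all stored queries, so it is answered
```

In this toy version, a query is flagged only when it falls within `threshold` of a stored query, so the outcome depends entirely on how closely an attack's successive queries are spaced relative to that cutoff.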