Adversarial examples threaten the integrity of machine learning systems, achieving alarming success rates even under constrained black-box conditions. Stateful defenses have emerged as an effective countermeasure: they maintain a buffer of recent queries and flag new queries that are too similar to those already seen. However, these defenses fundamentally pose a trade-off between attack detection rates and false positive rates, and this trade-off is typically tuned by hand-picking feature extractors and similarity thresholds that empirically work well. The formal limits of this trade-off, and the exact properties of the feature extractors and the underlying problem domain that influence it, remain poorly understood. This work addresses this gap by offering a theoretical characterization of the trade-off between detection and false positive rates for stateful defenses. We provide upper bounds on the detection rates of a general class of feature extractors and analyze the impact of this trade-off on the convergence of black-box attacks. We then support our theoretical findings with empirical evaluations across multiple datasets and stateful defenses.
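To make the mechanism described above concrete, the following is a minimal sketch of a stateful defense: a bounded buffer of past query features and a similarity threshold. It is an illustration under stated assumptions, not the construction analyzed in this work. The class name `StatefulDetector`, the helper `identity_features`, the use of Euclidean distance, and all numeric values are hypothetical choices made for the example.

```python
from collections import deque

import numpy as np


class StatefulDetector:
    """Hypothetical stateful defense: flags queries that are too close to recent ones."""

    def __init__(self, feature_fn, threshold, buffer_size=1000):
        self.feature_fn = feature_fn      # maps a raw query to a feature vector (assumed)
        self.threshold = threshold        # similarity threshold, chosen by hand here
        self.buffer = deque(maxlen=buffer_size)  # oldest entries are evicted automatically

    def query(self, x):
        """Return True if x lies within `threshold` (L2 distance) of a buffered query."""
        z = np.asarray(self.feature_fn(x), dtype=float)
        flagged = any(np.linalg.norm(z - past) < self.threshold for past in self.buffer)
        self.buffer.append(z)  # store the query whether or not it was flagged
        return flagged


def identity_features(x):
    # Placeholder feature extractor (assumption): the flattened raw input.
    return np.asarray(x, dtype=float).ravel()


detector = StatefulDetector(identity_features, threshold=0.1, buffer_size=100)
rng = np.random.default_rng(0)
base = rng.random(16)

first = detector.query(base)        # buffer is empty, so nothing can match
near = detector.query(base + 1e-3)  # small perturbation, typical of iterative attacks
far = detector.query(base + 1.0)    # a clearly different query
```

In this sketch, raising `threshold` would catch more attack queries but would also flag more benign queries that happen to resemble earlier ones, which is the detection versus false positive trade-off the abstract refers to.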