SoK: Pitfalls in Evaluating Black-Box Attacks

Numerous works study black-box attacks on image classifiers. However, these works make different assumptions about the adversary's knowledge, and the current literature lacks a cohesive organization centered around the threat model. To systematize knowledge in this area, we propose a taxonomy over the threat space spanning the axes of feedback granularity, access to interactive queries, and the quality and quantity of the auxiliary data available to the attacker. Our new taxonomy provides three key insights. 1) Despite extensive literature, numerous under-explored threat spaces exist, which cannot be trivially solved by adapting techniques from well-explored settings. We demonstrate this by establishing a new state of the art in the less-studied setting of access to top-k confidence scores, adapting techniques from the well-explored setting of access to the complete confidence vector, but show that it still falls short of the more restrictive setting that only obtains the prediction label, highlighting the need for more research. 2) Identifying the threat model of different attacks uncovers stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning claims in the respective paper. 3) Our taxonomy reveals interactions between types of attacker knowledge that connect well to related areas, such as model inversion and extraction attacks. We discuss how advances in other areas can enable potentially stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success by factoring in local attack runtime. This approach reveals the potential for certain attacks to achieve notably higher success rates, as well as the need to evaluate attacks in diverse and harder settings, and it underscores the need for better selection criteria.
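To make the feedback-granularity axis concrete, the following is a minimal sketch of the three levels of feedback named above: the complete confidence vector, top-k confidence scores, and the prediction label only. The names `Feedback` and `restrict_feedback` and the default `k` are illustrative assumptions and do not come from the paper.

```python
from enum import Enum


class Feedback(Enum):
    """Hypothetical labels for the three feedback granularities."""
    FULL_SCORES = "full"   # complete confidence vector
    TOP_K = "top_k"        # only the k highest confidence scores
    LABEL_ONLY = "label"   # only the predicted class index


def restrict_feedback(scores, mode, k=2):
    """Return what an attacker would observe for one query.

    `scores` is a list of per-class confidence scores produced by the
    target model; this function only models how much of it is revealed.
    """
    # Class indices ordered from highest to lowest confidence.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    if mode is Feedback.FULL_SCORES:
        return dict(enumerate(scores))
    if mode is Feedback.TOP_K:
        return {i: scores[i] for i in ranked[:k]}
    if mode is Feedback.LABEL_ONLY:
        return ranked[0]
    raise ValueError(f"unknown feedback mode: {mode}")


scores = [0.1, 0.6, 0.3]
full_view = restrict_feedback(scores, Feedback.FULL_SCORES)
topk_view = restrict_feedback(scores, Feedback.TOP_K, k=2)
label_view = restrict_feedback(scores, Feedback.LABEL_ONLY)
```

Each step from `FULL_SCORES` to `LABEL_ONLY` strictly reduces the information returned per query, which is why a technique designed for one level may not transfer directly to another.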
