Numerous works study black-box attacks on image classifiers. However, these works make differing assumptions about the adversary's knowledge, and the current literature lacks a cohesive organization centered around the threat model. To systematize knowledge in this area, we propose a taxonomy over the threat space spanning three axes: the granularity of feedback, access to interactive queries, and the quality and quantity of the auxiliary data available to the attacker. Our new taxonomy yields three key insights. 1) Despite the extensive literature, numerous under-explored threat spaces exist that cannot be trivially addressed by adapting techniques from well-explored settings. We demonstrate this by establishing a new state of the art in the less-studied setting of top-k confidence-score access, adapting techniques from the well-explored setting of full confidence-vector access; yet this adaptation still falls short in the more restrictive setting where only the predicted label is available, highlighting the need for further research. 2) Identifying the threat model of different attacks uncovers stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning the claims of the respective paper. 3) Our taxonomy reveals interactions between attacker knowledge that connect naturally to related areas, such as model inversion and model extraction attacks. We discuss how advances in those areas could enable stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success that factors in local attack runtime. This approach reveals that certain attacks can achieve notably higher success rates, and underscores the need to evaluate attacks in diverse and harder settings, motivating better attack-selection criteria.