As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerability to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness evaluation methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring that evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logit space. We analyze the relationship between PMA and existing cross-entropy- or logit-margin-based attacks, and show that PMA can outperform the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we construct a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual and ensemble attacks, and between small-scale and million-scale evaluations.
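To make the logit-space vs. probability-space distinction concrete, below is a minimal NumPy sketch contrasting the classic logit margin with a probability margin of the kind the abstract describes. The exact loss used by PMA is not specified here, so the `probability_margin` form (true-class probability minus the highest other-class probability) is an illustrative assumption, not the paper's definition; the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_margin(logits, y):
    """Classic margin in logit space: z_y - max_{j != y} z_j."""
    idx = np.arange(len(y))
    z_y = logits[idx, y]
    masked = logits.copy()
    masked[idx, y] = -np.inf  # exclude the true class from the max
    return z_y - masked.max(axis=-1)

def probability_margin(logits, y):
    """Assumed probability-space analogue: p_y - max_{j != y} p_j.

    Unlike the logit margin, this is bounded in [-1, 1], since
    probabilities sum to 1; the sample is misclassified exactly
    when the margin is negative (same sign as the logit margin).
    """
    idx = np.arange(len(y))
    p = softmax(logits)
    p_y = p[idx, y]
    masked = p.copy()
    masked[idx, y] = -np.inf
    return p_y - masked.max(axis=-1)
```

An attacker would minimize either margin over the perturbation; a negative margin signals a successful attack. The two losses agree in sign but differ in scale and gradient behavior, which is the kind of difference a probability-space formulation can exploit.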