Adversarial attacks are widely used to identify model vulnerabilities; however, their validity as proxies for robustness to random perturbations remains debated. We ask whether an adversarial example provides a representative estimate of misprediction risk under stochastic perturbations of the same magnitude, or instead reflects an atypical worst-case event. To address this question, we introduce a probabilistic analysis that quantifies this risk with respect to directionally biased perturbation distributions, parameterized by a concentration factor $\kappa$ that interpolates between isotropic noise and adversarial directions. Building on this analysis, we probe the limits of this connection by proposing an attack strategy designed to expose vulnerabilities in regimes that are statistically closer to uniform noise. Experiments on ImageNet and CIFAR-10 systematically benchmark multiple attacks, revealing when adversarial success meaningfully reflects robustness to random perturbations and when it does not, thereby informing their use in safety-oriented robustness evaluation.
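The family of directionally biased perturbation distributions can be illustrated with a minimal sketch. The parameterization below is an assumption for illustration only (the paper's exact distribution family is not specified here): a unit adversarial direction is mixed with an isotropic Gaussian direction and rescaled to a fixed magnitude $\epsilon$, so that $\kappa = 0$ recovers uniform random directions and large $\kappa$ concentrates mass around the adversarial direction.

```python
import numpy as np

def sample_biased_perturbation(adv_dir, eps, kappa, rng):
    """Sample a perturbation of Euclidean norm `eps` whose direction
    interpolates between isotropic noise (kappa = 0) and the adversarial
    direction (kappa -> infinity).

    NOTE: this mixture-based parameterization is a hypothetical sketch,
    not necessarily the distribution family used in the paper.
    """
    d = adv_dir / np.linalg.norm(adv_dir)       # unit adversarial direction
    noise = rng.standard_normal(d.shape)        # isotropic Gaussian draw
    noise /= np.linalg.norm(noise)              # project to the unit sphere
    v = kappa * d + noise                       # bias the direction by kappa
    v /= np.linalg.norm(v)                      # renormalize the mixture
    return eps * v                              # fixed perturbation magnitude
```

With `kappa = 0` this reduces to sampling a uniformly random direction on the sphere of radius `eps`; sweeping `kappa` upward shifts the distribution toward the adversarial direction while keeping the perturbation magnitude fixed, which is the interpolation the concentration factor is meant to capture.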