This paper revisits the simple, long-studied, yet still unsolved problem of making image classifiers robust to imperceptible perturbations. Taking CIFAR10 as an example, SOTA clean accuracy is about $100\%$, but SOTA robustness to $\ell_{\infty}$-norm-bounded perturbations barely exceeds $70\%$. To understand this gap, we analyze how model size, dataset size, and synthetic data quality affect robustness by developing the first scaling laws for adversarial training. Our scaling laws reveal inefficiencies in prior art and provide actionable feedback to advance the field. For instance, we discovered that SOTA methods diverge notably from compute-optimal setups, using excess compute for their level of robustness. Leveraging a compute-efficient setup, we surpass the prior SOTA with $20\%$ ($70\%$) fewer training (inference) FLOPs. We trained various compute-efficient models, with our best achieving $74\%$ AutoAttack accuracy (a $+3\%$ gain). However, our scaling laws also predict that robustness grows slowly and then plateaus at $90\%$: dwarfing our new SOTA by scaling is impractical, and perfect robustness is impossible. To better understand this predicted limit, we carry out a small-scale human evaluation on the AutoAttack data that fools our top-performing model. Concerningly, we estimate that human performance also plateaus near $90\%$, which we show is attributable to $\ell_{\infty}$-constrained attacks generating invalid images that are inconsistent with their original labels. Having characterized these limiting roadblocks, we outline promising paths for future research.
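To make the threat model concrete, the sketch below shows one $\ell_{\infty}$-bounded adversarial training step in PyTorch: a PGD-style inner maximization followed by a standard outer minimization. This is a minimal illustration of the generic technique, not this paper's training recipe; the `pgd_attack` helper and its hyperparameters (`eps=8/255`, `alpha=2/255`, `steps=10`) are illustrative assumptions.

```python
# Minimal sketch of l_inf-bounded adversarial training (generic PGD recipe,
# not this paper's exact setup). Hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Gradient ascent on the loss, projected onto ||delta||_inf <= eps."""
    delta = torch.empty_like(x).uniform_(-eps, eps)  # random start in the l_inf ball
    delta.requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()   # signed-gradient ascent step
            delta.clamp_(-eps, eps)        # project back onto the l_inf ball
    return (x + delta.detach()).clamp(0, 1)  # keep pixels in the valid range

def adversarial_training_step(model, optimizer, x, y):
    """One outer-minimization step on PGD adversarial examples."""
    model.eval()                  # freeze normalization statistics while attacking
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```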
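For context on how robust-accuracy numbers like the $74\%$ above are typically measured: AutoAttack is a standardized attack ensemble with a public reference implementation (https://github.com/fra31/auto-attack). A minimal evaluation sketch follows, assuming the common CIFAR10 budget $\epsilon = 8/255$ and that `model`, `x_test`, and `y_test` are a trained classifier and test tensors with pixel values in $[0, 1]$; these names are placeholders, not artifacts of this paper.

```python
# Minimal sketch: robust accuracy under AutoAttack's standard l_inf ensemble.
# Assumes `model`, `x_test`, `y_test` are defined as described above.
import torch
from autoattack import AutoAttack

adversary = AutoAttack(model, norm='Linf', eps=8/255, version='standard')
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=256)

with torch.no_grad():  # fraction of adversarial examples still classified correctly
    robust_acc = (model(x_adv).argmax(1) == y_test).float().mean().item()
```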