The adversarial vulnerability of neural networks and the subsequent techniques for creating robust models have attracted significant attention, yet we still lack a full understanding of this phenomenon. Here, we study adversarial examples of trained neural networks through analytical tools afforded by recent theoretical advances connecting neural networks and kernel methods, namely the Neural Tangent Kernel (NTK). This follows a growing body of work that leverages the NTK approximation to analyze important deep learning phenomena and to design algorithms for new applications. We show how NTKs allow us to generate adversarial examples in a ``training-free'' fashion, and demonstrate that these examples transfer to fool their finite-width neural network counterparts in the ``lazy'' regime. We leverage this connection to provide an alternative view on robust and non-robust features, which have been suggested to underlie the adversarial brittleness of neural networks. Specifically, we define and study features induced by the eigendecomposition of the kernel to better understand the role of robust and non-robust features, the reliance on both for standard classification, and the robustness-accuracy trade-off. We find that such features are surprisingly consistent across architectures, and that robust features tend to correspond to the largest eigenvalues of the kernel and are thus learned early during training. Our framework allows us to identify and visualize non-robust yet useful features. Finally, we shed light on the robustness mechanism underlying adversarial training of neural networks used in practice: by quantifying the evolution of the associated empirical NTK, we demonstrate that, compared to standard training, its dynamics enter the ``lazy'' regime much earlier and manifest a much stronger form of the well-known bias toward learning features within the top eigenspaces of the kernel.
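To make the idea of features induced by the eigendecomposition of the kernel more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes plain kernel regression with squared loss, uses an RBF kernel purely as a placeholder for an NTK, and relies on the standard identity that the predictor $k(x, X) K^{-1} y$ splits into one term per eigenvector of the Gram matrix $K$. The helper names (`rbf_kernel`, `eigen_features`, `learned_fraction`) and the toy data are hypothetical. The last helper uses the usual linearized gradient-flow result that the direction with eigenvalue $\lambda_i$ is fitted at rate $1 - e^{-\eta \lambda_i t}$, which is one way to see why top-eigenvalue features would be learned first.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Placeholder kernel standing in for an NTK (an assumption for illustration)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def eigen_features(K_train, K_test_train, y):
    """Split the predictor k(x, X) K^{-1} y into one term per eigenvector of K.

    Returns eigenvalues in descending order and an (m, n) array whose
    column i is the contribution of the i-th eigenvector at each test point.
    """
    lam, V = np.linalg.eigh(K_train)       # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]          # reorder so the largest comes first
    lam, V = lam[order], V[:, order]
    coeff = (V.T @ y) / lam                # (v_i^T y) / lambda_i
    feats = (K_test_train @ V) * coeff[None, :]
    return lam, feats

def learned_fraction(lam, lr, steps):
    """Fraction of each eigen-direction fitted under linearized gradient flow."""
    return 1.0 - np.exp(-lr * lam * steps)

# Toy data (hypothetical): labels depend on the sign of the first coordinate.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
y_train = np.sign(X_train[:, 0])
X_test = rng.normal(size=(4, 5))

K = rbf_kernel(X_train, X_train)
K_test = rbf_kernel(X_test, X_train)

lam, feats = eigen_features(K, K_test, y_train)
full_pred = K_test @ np.linalg.solve(K, y_train)   # full kernel-regression output
frac = learned_fraction(lam, lr=0.1, steps=10)     # per-direction training progress
```

Summing the columns of `feats` recovers `full_pred`, so each column can be inspected or perturbed on its own. With an actual NTK in place of the placeholder kernel, this would be one way to ask which eigen-directions carry robust versus non-robust signal.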