Adversarial training improves the robustness of neural networks against adversarial attacks, albeit at the cost of a trade-off between standard and robust generalization. To uncover the factors driving this phenomenon, we examine the layer-wise learning capabilities of neural networks during the transition from a standard to an adversarial setting. Our empirical findings demonstrate that selectively updating specific layers while preserving others can substantially enhance the network's learning capacity. We therefore propose CURE, a novel training framework that leverages a gradient prominence criterion to perform selective conservation, updating, and revision of weights. Importantly, CURE is designed to be dataset- and architecture-agnostic, ensuring its applicability across various scenarios. It effectively tackles both memorization and overfitting, thus improving the trade-off between robustness and generalization; it also helps mitigate "robust overfitting". Furthermore, our study provides valuable insights into the mechanisms of selective adversarial training and offers a promising avenue for future research.
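To make the idea of selective layer updating concrete, the following is a minimal sketch and not the CURE algorithm itself. The abstract does not define the gradient prominence criterion, so this sketch assumes a simple stand-in: each layer is scored by the L2 norm of its gradient, the top-scoring fraction of layers is updated with a plain SGD step, and the remaining layers are conserved unchanged. The names `select_layers`, `selective_sgd_step`, and `update_fraction` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def l2_norm(values: List[float]) -> float:
    """Euclidean norm of a flat list of gradient values."""
    return sum(v * v for v in values) ** 0.5


def select_layers(grads: Dict[str, List[float]], update_fraction: float) -> Set[str]:
    """Pick the layers with the largest gradient norms.

    Assumption: the per-layer L2 gradient norm is used as a stand-in score;
    the actual gradient prominence criterion is not given in the abstract.
    """
    if not 0.0 < update_fraction <= 1.0:
        raise ValueError("update_fraction must be in (0, 1]")
    ranked = sorted(grads, key=lambda name: l2_norm(grads[name]), reverse=True)
    keep = max(1, round(update_fraction * len(ranked)))  # always update at least one layer
    return set(ranked[:keep])


def selective_sgd_step(
    weights: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    selected: Set[str],
    lr: float,
) -> Dict[str, List[float]]:
    """Apply an SGD step to the selected layers; copy all other layers unchanged."""
    updated = {}
    for name, w in weights.items():
        if name in selected:
            updated[name] = [wi - lr * gi for wi, gi in zip(w, grads[name])]
        else:
            updated[name] = list(w)  # conserved layer
    return updated


# Toy example with three "layers" of two weights each (values are illustrative).
weights = {"conv1": [1.0, 1.0], "conv2": [1.0, 1.0], "fc": [1.0, 1.0]}
grads = {"conv1": [0.1, 0.0], "conv2": [3.0, 4.0], "fc": [0.5, 0.5]}

selected = select_layers(grads, update_fraction=0.67)
new_weights = selective_sgd_step(weights, grads, selected, lr=0.5)
```

In this toy example, `conv2` and `fc` have the largest gradient norms and are updated, while `conv1` keeps its original weights. In a real framework the same pattern would typically be expressed by masking or freezing parameters per layer before the optimizer step.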