Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs) because they exploit gradient information to optimize adversarial perturbations efficiently. We investigate whether reinforcement learning (RL) training can disrupt the gradient structure that attackers rely on, by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. In systematic experiments on CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this effect, we conduct a mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. The analysis indicates that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. As a result, each projected gradient descent (PGD) step is both unreliable in direction and limited in magnitude, so gradient-based attacks fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense that operates at two complementary levels: RL degrades the gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, and outperforms its supervised learning (SL) counterpart, SL-adv, by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine the efficiency of SL with the gradient-regularization properties of RL.
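
As an illustration of the training signal described in the abstract, the following is a minimal sketch of a policy-gradient loss for a single classification example with epsilon-greedy exploration. It is not the authors' implementation: the 0/1 reward, the REINFORCE-style loss without a baseline or importance weighting, and all function names (`epsilon_greedy_action`, `policy_gradient_loss`, `loss_grad_wrt_logits`) are assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def epsilon_greedy_action(probs, epsilon, rng):
    """With probability epsilon pick a uniformly random class, else the argmax class."""
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=probs.__getitem__)


def policy_gradient_loss(logits, label, epsilon, rng):
    """REINFORCE-style loss: -reward * log pi(action | x).

    Assumed reward (hypothetical): 1.0 if the chosen class equals the label, else 0.0.
    """
    probs = softmax(logits)
    action = epsilon_greedy_action(probs, epsilon, rng)
    reward = 1.0 if action == label else 0.0
    loss = -reward * math.log(probs[action])
    return loss, action, reward


def loss_grad_wrt_logits(logits, action, reward):
    """Analytic gradient of the loss above: reward * (p_k - 1[k == action])."""
    probs = softmax(logits)
    return [reward * (p - (1.0 if k == action else 0.0)) for k, p in enumerate(probs)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, -1.0]
    loss, action, reward = policy_gradient_loss(logits, label=0, epsilon=0.1, rng=rng)
    print(loss, action, reward, loss_grad_wrt_logits(logits, action, reward))
```

Under this assumed reward, the loss and its gradient are exactly zero whenever the sampled class is wrong, so only correct actions contribute to the update. A different reward scheme (for example, a negative reward for wrong actions) would change that behavior.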