Deep neural networks are susceptible to adversarial attacks due to the accumulation of perturbations at the feature level, and numerous works have boosted model robustness by deactivating the non-robust feature activations that cause model mispredictions. However, we claim that these malicious activations still contain discriminative cues and that, with recalibration, they can capture additional useful information for correct model predictions. To this end, we propose a novel, easy-to-plug-in approach named Feature Separation and Recalibration (FSR) that recalibrates the malicious, non-robust activations to produce more robust feature maps. FSR consists of two parts, Separation and Recalibration. The Separation part disentangles the input feature map into a robust feature, whose activations help the model make correct predictions, and a non-robust feature, whose activations are responsible for model mispredictions under adversarial attack. The Recalibration part then adjusts the non-robust activations to restore potentially useful cues for model predictions. Extensive experiments verify the superiority of FSR over traditional deactivation techniques and demonstrate that it improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead. Code is available at https://github.com/wkim97/FSR.
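The two parts described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it is a toy version under stated assumptions: activations are a flat list of floats, the robustness mask comes from hand-supplied logits passed through a sigmoid rather than from a learned network, and the recalibration unit is an arbitrary callable. All function and parameter names (`separate`, `recalibrate`, `fsr_block`, `robustness_logits`, `recalibration_fn`) are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def separate(features: Vector, robustness_logits: Vector) -> Tuple[Vector, Vector, Vector]:
    """Split activations into a robust and a non-robust part with a soft mask.

    Assumption: the mask is sigmoid(logit) per activation. In a real model
    the logits would be produced by a learned module.
    """
    mask = [sigmoid(z) for z in robustness_logits]
    robust = [m * f for m, f in zip(mask, features)]
    non_robust = [(1.0 - m) * f for m, f in zip(mask, features)]
    return robust, non_robust, mask


def recalibrate(
    non_robust: Vector,
    mask: Vector,
    recalibration_fn: Callable[[Vector], Vector],
) -> Vector:
    """Adjust the non-robust activations instead of discarding them.

    Assumption: the correction is added only in proportion to how
    non-robust each activation is, i.e. weighted by (1 - mask).
    """
    correction = recalibration_fn(non_robust)
    return [f + (1.0 - m) * c for f, m, c in zip(non_robust, mask, correction)]


def fsr_block(
    features: Vector,
    robustness_logits: Vector,
    recalibration_fn: Callable[[Vector], Vector],
) -> Vector:
    """Separation followed by Recalibration, then recombination by addition."""
    robust, non_robust, mask = separate(features, robustness_logits)
    recalibrated = recalibrate(non_robust, mask, recalibration_fn)
    return [r + n for r, n in zip(robust, recalibrated)]
```

In this sketch, a deactivation-style baseline would correspond to returning only `robust` and dropping `non_robust`. The block keeps the non-robust part and adds a correction to it, which is the contrast the abstract draws. With a recalibration function that returns all zeros, `fsr_block` simply reproduces its input.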