We present a new algorithm for learning a deep neural network model that is robust against adversarial attacks. Previous algorithms demonstrate that an adversarially trained Bayesian Neural Network (BNN) provides improved robustness. We recognize that the adversarial learning approach for approximating the multi-modal posterior distribution of a Bayesian model can lead to mode collapse; consequently, the model's robustness and performance are sub-optimal. Instead, we first propose preventing mode collapse to better approximate the multi-modal posterior distribution. Second, based on the intuition that a robust model should ignore perturbations and consider only the informative content of the input, we conceptualize and formulate an information gain objective to measure the information learned from benign and adversarial training instances and force the two to be similar. Importantly, we prove and demonstrate that minimizing the information gain objective allows the adversarial risk to approach the conventional empirical risk. We believe our efforts provide a step toward a basis for a principled method of adversarially training BNNs. Our model demonstrates significantly improved robustness (up to 20%) compared with adversarial training and Adv-BNN under PGD attacks with 0.035 distortion on both the CIFAR-10 and STL-10 datasets.