Adversarial examples have been shown to cause neural networks to fail on a wide range of vision and language tasks, but recent work has claimed that Bayesian neural networks (BNNs) are inherently robust to adversarial perturbations. In this work, we examine this claim. To study the adversarial robustness of BNNs, we investigate whether it is possible to successfully break state-of-the-art BNN inference methods and prediction pipelines using even relatively unsophisticated attacks for three tasks: (1) label prediction under the posterior predictive mean, (2) adversarial example detection with Bayesian predictive uncertainty, and (3) semantic shift detection. We find that BNNs trained with state-of-the-art approximate inference methods, and even BNNs trained with Hamiltonian Monte Carlo, are highly susceptible to adversarial attacks. We also identify various conceptual and experimental errors in previous works that claimed inherent adversarial robustness of BNNs and conclusively demonstrate that BNNs and uncertainty-aware Bayesian prediction pipelines are not inherently robust against adversarial attacks.
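To make the first two tasks concrete, the following is a minimal sketch and not the experimental setup of this work. It assumes a toy "posterior" consisting of a handful of sampled linear softmax classifiers, and it uses a single fast-gradient-sign (FGSM) step as one example of an unsophisticated attack. All names (`predictive_mean`, `predictive_entropy`, `fgsm_on_predictive_mean`) and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def predictive_mean(weights, x):
    # weights: (S, D, C) array of S posterior weight samples (assumed given).
    # x: (D,) input. Returns the posterior predictive mean over C classes,
    # i.e. the average of the per-sample softmax outputs.
    logits = np.einsum("sdc,d->sc", weights, x)
    return softmax(logits).mean(axis=0)


def predictive_entropy(p):
    # Entropy of the predictive mean, a common uncertainty score that a
    # detector could threshold (illustrative choice of score).
    return -np.sum(p * np.log(p + 1e-12))


def fgsm_on_predictive_mean(weights, x, y, eps):
    # One FGSM step on the loss -log p_bar(y | x), where p_bar is the
    # posterior predictive mean. The gradient is computed analytically.
    n_samples, _, n_classes = weights.shape
    probs = softmax(np.einsum("sdc,d->sc", weights, x))  # (S, C)
    p_bar_y = probs[:, y].mean()
    onehot = np.eye(n_classes)[y]
    # d probs[s, y] / d logits[s, c] = probs[s, y] * (1[c == y] - probs[s, c])
    dpy_dlogits = probs[:, [y]] * (onehot - probs)  # (S, C)
    # Chain rule through logits[s, c] = sum_d weights[s, d, c] * x[d].
    dpy_dx = np.einsum("sc,sdc->d", dpy_dlogits, weights) / n_samples
    grad_loss = -dpy_dx / p_bar_y
    return x + eps * np.sign(grad_loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = 0.1 * rng.normal(size=(10, 20, 3))  # toy posterior samples
    x = rng.normal(size=20)
    y = int(np.argmax(predictive_mean(weights, x)))  # treat prediction as label
    x_adv = fgsm_on_predictive_mean(weights, x, y, eps=0.01)
    print("clean p(y):", predictive_mean(weights, x)[y])
    print("adv   p(y):", predictive_mean(weights, x_adv)[y])
    print("adv entropy:", predictive_entropy(predictive_mean(weights, x_adv)))
```

In this sketch the attack differentiates through the averaged prediction rather than through a single sampled model, so the perturbation targets the same quantity that the Bayesian prediction pipeline reports.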