Adversarial examples pose a security risk because slight input perturbations can alter the decisions of a machine learning classifier. Certified robustness has been proposed as a mitigation: given an input $x$, a classifier returns a prediction together with a radius and a provable guarantee that no perturbation of $x$ within this radius (e.g., under the $L_2$ norm) will alter the prediction. In this work, we show that these guarantees can be invalidated by the rounding errors that arise from the limitations of floating-point representation. We design a rounding search method that efficiently exploits this vulnerability to find adversarial examples within the certified radius. We show that the attack can be carried out against several linear classifiers that have exact certifiable guarantees and against neural networks with ReLU activations that have conservative certifiable guarantees. Our experiments demonstrate attack success rates of over 50% on random linear classifiers, up to 23.24% on the MNIST dataset for a linear SVM, and up to 15.83% on the MNIST dataset for a neural network whose certified radius was given by a verifier based on mixed integer programming. Finally, as a mitigation, we advocate the use of rounded interval arithmetic to account for rounding errors.
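The following is a minimal sketch of the underlying issue and of the proposed style of mitigation; it is not the rounding search attack itself. It assumes a linear classifier with score $w \cdot x + b$ and compares a plain floating-point evaluation against exact rational arithmetic, then evaluates the same score with outward-rounded interval bounds. The function names (`naive_score`, `exact_score`, `interval_score`, `certified_sign`) and the widening-by-one-ulp scheme are illustrative assumptions, not the authors' implementation. `math.nextafter` requires Python 3.9 or later.

```python
import math
from fractions import Fraction


def naive_score(w, x, b):
    """Evaluate w.x + b with ordinary left-to-right float arithmetic."""
    s = b
    for wi, xi in zip(w, x):
        s += wi * xi
    return s


def exact_score(w, x, b):
    """Evaluate w.x + b exactly; every float converts to a Fraction without loss."""
    s = Fraction(b)
    for wi, xi in zip(w, x):
        s += Fraction(wi) * Fraction(xi)
    return s


def _down(v):
    # Next representable float toward -infinity.
    return math.nextafter(v, -math.inf)


def _up(v):
    # Next representable float toward +infinity.
    return math.nextafter(v, math.inf)


def interval_score(w, x, b):
    """Return (lo, hi) such that lo <= exact w.x + b <= hi.

    Each rounded-to-nearest result is widened by one ulp outward, so the
    interval always contains the real-valued result (assumption: no overflow).
    """
    lo = hi = b
    for wi, xi in zip(w, x):
        p = wi * xi
        lo = _down(lo + _down(p))
        hi = _up(hi + _up(p))
    return lo, hi


def certified_sign(w, x, b):
    """Return +1 or -1 only when the sign of the score is certain, else None."""
    lo, hi = interval_score(w, x, b)
    if lo > 0.0:
        return 1
    if hi < 0.0:
        return -1
    return None  # the bounds straddle zero, so abstain


if __name__ == "__main__":
    # Hypothetical weights chosen so that rounding absorbs the small terms.
    w = [1e16, 1.0, -1e16, -0.5]
    x = [1.0, 1.0, 1.0, 1.0]
    print("float score:", naive_score(w, x, 0.0))
    print("exact score:", exact_score(w, x, 0.0))
    print("interval   :", interval_score(w, x, 0.0))
    print("certified  :", certified_sign(w, x, 0.0))
```

In this constructed case the float score and the exact score have opposite signs, so any margin or radius derived from the float score describes the wrong side of the decision boundary. The interval version is deliberately conservative: it abstains whenever its bounds straddle zero, trading some certifications for soundness.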