A Certified Radius-Guided Attack Framework to Image Segmentation Models

Image segmentation is an important problem in many safety-critical applications. Recent studies show that modern image segmentation models are vulnerable to adversarial perturbations, yet existing attack methods mainly follow the ideas used to attack image classification models. We argue that image segmentation and classification have inherent differences, and we design an attack framework specifically for image segmentation models. Our attack framework is inspired by certified radius, a notion originally used by defenders to protect classification models against adversarial perturbations. We are the first to leverage the properties of certified radius from the attacker's perspective, and we propose a certified radius-guided attack framework against image segmentation models. Specifically, we first adapt randomized smoothing, the state-of-the-art certification method for classification models, to derive each pixel's certified radius. We then place more emphasis on disrupting pixels with relatively small certified radii and design a pixel-wise certified radius-guided loss which, when plugged into any existing white-box attack, yields our certified radius-guided white-box attack. Next, we propose the first black-box attack against image segmentation models via bandits. We design a novel gradient estimator based on bandit feedback that is query-efficient and provably unbiased and stable. We use this gradient estimator to design a projected bandit gradient descent (PBGD) attack, as well as a certified radius-guided PBGD (CR-PBGD) attack. We prove that our PBGD and CR-PBGD attacks achieve asymptotically optimal attack performance at an optimal rate. We evaluate our certified radius-guided white-box and black-box attacks on multiple modern image segmentation models and datasets. Our results validate the effectiveness of our certified radius-guided attack framework.
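The mechanisms named above can be illustrated with a minimal, stdlib-only Python sketch over flattened inputs. It rests on several assumptions that are ours, not the paper's. All function names are hypothetical. The per-pixel radius uses the Gaussian randomized-smoothing formula R = σ·Φ⁻¹(p_A) from Cohen et al., with a Hoeffding lower bound on p_A standing in for their Clopper-Pearson bound. The weight 1/(1 + exp(a·R + b)) is just one illustrative decreasing function of the radius. The black-box step uses a standard two-point spherical gradient estimator, which is unbiased for the gradient of a smoothed version of the loss and may differ from the paper's own bandit estimator.

```python
import math
import random
from statistics import NormalDist


def pixel_certified_radius(top_count, n, sigma, alpha=0.001):
    """Certified L2 radius of one pixel under Gaussian randomized smoothing.

    top_count: number of the n noisy copies (noise std sigma) whose
    prediction at this pixel equals the pixel's top label.
    Assumption: a Hoeffding lower bound on p_A stands in for the
    Clopper-Pearson bound used by Cohen et al.
    """
    p_hat = top_count / n
    p_lower = p_hat - math.sqrt(math.log(1 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return 0.0  # not certifiable; treated here as maximally fragile
    return sigma * NormalDist().inv_cdf(p_lower)  # R = sigma * Phi^{-1}(p_A)


def cr_weights(radii, a=2.0, b=-1.0):
    """Hypothetical decreasing weight: smaller radius -> larger weight."""
    return [1.0 / (1.0 + math.exp(a * r + b)) for r in radii]


def cr_guided_loss(pixel_losses, radii):
    """Radius-weighted mean of per-pixel attack losses (e.g. cross-entropy)."""
    w = cr_weights(radii)
    return sum(wi * li for wi, li in zip(w, pixel_losses)) / len(pixel_losses)


def bandit_gradient(f, x, delta=1e-2):
    """Two-point bandit estimate of the gradient of a query-only loss f.

    Unbiased for the gradient of the delta-smoothed version of f.
    """
    d = len(x)
    u = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in u))
    u = [v / norm for v in u]  # uniformly random unit direction
    f_plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    f_minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d * (f_plus - f_minus) / (2 * delta)
    return [scale * ui for ui in u]


def project(x, x0, eps):
    """Clip to the L-infinity ball of radius eps around x0, then to [0, 1]."""
    return [min(1.0, max(0.0, min(c + eps, max(c - eps, xi))))
            for xi, c in zip(x, x0)]


def pbgd_attack(f, x0, eps=0.1, lr=0.01, steps=200, delta=1e-2):
    """Projected bandit gradient ascent on the attack loss f (2 queries/step)."""
    x = list(x0)
    for _ in range(steps):
        g = bandit_gradient(f, x, delta)
        x = project([xi + lr * gi for xi, gi in zip(x, g)], x0, eps)
    return x
```

Under these assumptions, a CR-PBGD-style variant would pass `pbgd_attack` a loss `f` that internally calls `cr_guided_loss` with radii obtained from `pixel_certified_radius`, so that queries spend their budget mostly on low-radius pixels.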

翻译：图像分割是众多安全关键应用中的重要问题。近期研究表明，现代图像分割模型易受对抗性扰动影响，而现有攻击方法主要沿袭图像分类模型攻击思路。我们认为图像分割与分类存在本质差异，为此专门设计了针对图像分割模型的攻击框架。该框架受认证半径启发，该指标最初被防御方用于防御针对分类模型的对抗性扰动。我们首次从攻击者视角利用认证半径特性，提出针对图像分割模型的认证半径引导攻击框架。具体而言，首先借鉴分类模型中最先进的认证方法——随机平滑，推导像素级认证半径；继而重点破坏认证半径较小的像素，设计像素级认证半径引导损失函数，将其嵌入任意白盒攻击中即可构建认证半径引导白盒攻击。随后，我们提出首个基于强盗算法的图像分割模型黑盒攻击方法。通过设计基于强盗反馈的新型梯度估计器，该估计器兼具查询高效性、无偏性与稳定性。利用该梯度估计器，我们进一步设计了投影强盗梯度下降（PBGD）攻击及其变体——认证半径引导投影强盗梯度下降（CR-PBGD）攻击。理论证明PBGD与CR-PBGD攻击能以最优速率实现渐近最优攻击性能。我们在多个现代图像分割模型与数据集上评估了所提出的认证半径引导白盒与黑盒攻击，实验结果验证了该攻击框架的有效性。