Model attribution is a critical component of deep neural networks (DNNs) because it makes complex models interpretable. Recent studies have drawn attention to the security of attribution methods, which are vulnerable to attribution attacks that generate visually similar images with dramatically different attributions. Existing works have empirically investigated how to improve the robustness of DNNs against such attacks; however, none of them explicitly quantifies the actual deviation of the attributions. In this work, for the first time, a constrained optimization problem is formulated to derive an upper bound on the largest dissimilarity of attributions after a sample is perturbed by any noise within a certain region while its classification result remains the same. Based on this formulation, different practical approaches are introduced to upper-bound the attribution dissimilarity, measured by Euclidean distance and cosine similarity, under both $\ell_2$- and $\ell_\infty$-norm perturbation constraints. The bounds developed by our theoretical study are validated on various datasets and against two different types of attacks (PGD attack and IFIA attribution attack). Over 10 million attacks in the experiments indicate that the proposed upper bounds effectively quantify the robustness of models based on the worst-case attribution dissimilarities.
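To make the bounded quantity concrete, the following is a minimal sketch, not the method of this work. It estimates the worst-case attribution dissimilarity empirically by random sampling inside an $\ell_2$ or $\ell_\infty$ ball, keeping only perturbations that leave the predicted label unchanged. All names (`empirical_worst_case`, `predict`, `attribute`, and the helpers) are hypothetical, and random sampling is assumed here purely for illustration in place of a real attack such as PGD or IFIA. Any valid upper bound must be at least as large as the value this search returns.

```python
import numpy as np


def euclidean_dissimilarity(a, b):
    """Euclidean distance between two attribution vectors."""
    return float(np.linalg.norm(a - b))


def cosine_dissimilarity(a, b):
    """One minus the cosine similarity of two nonzero attribution vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(1.0 - np.dot(a, b) / denom)


def sample_perturbation(rng, dim, eps, norm):
    """Draw a random perturbation with ||delta|| <= eps in the chosen norm."""
    if norm == "linf":
        return rng.uniform(-eps, eps, size=dim)
    if norm == "l2":
        # Random direction, radius chosen so samples are uniform in the ball.
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        radius = eps * rng.uniform() ** (1.0 / dim)
        return radius * direction
    raise ValueError("norm must be 'l2' or 'linf'")


def empirical_worst_case(predict, attribute, x, eps, norm="l2",
                         dissimilarity=euclidean_dissimilarity,
                         n_samples=1000, seed=0):
    """Largest dissimilarity found among label-preserving perturbations.

    `predict` maps an input to a class label and `attribute` maps an input
    to an attribution vector; both are assumed, user-supplied callables.
    The result is only a lower estimate of the true worst case.
    """
    rng = np.random.default_rng(seed)
    label = predict(x)
    base = attribute(x)
    worst = 0.0
    for _ in range(n_samples):
        x_pert = x + sample_perturbation(rng, x.size, eps, norm)
        if predict(x_pert) != label:
            continue  # constraint: the classification result must not change
        worst = max(worst, dissimilarity(base, attribute(x_pert)))
    return worst
```

As a usage example, for a toy linear classifier with weight matrix `W` and a gradient-times-input attribution `W[c] * x` for the predicted class `c`, the Euclidean dissimilarity under an $\ell_2$ perturbation of radius `eps` cannot exceed `eps * max(abs(W[c]))`, so the value returned by `empirical_worst_case` can be checked against that simple analytical bound.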