Perturbation robustness measures how vulnerable a model is to perturbations such as data corruptions and adversarial attacks. Understanding the mechanisms behind perturbation robustness is critical for global interpretability. We present a model-agnostic, global mechanistic interpretability method for interpreting the perturbation robustness of image models. This research is motivated by two observations. First, previous global interpretability methods, even when paired with robustness benchmarks such as mean corruption error (mCE), are not designed to directly interpret the mechanisms of perturbation robustness within image models. Second, we observe that the spectral signal-to-noise ratio (SNR) of perturbed natural images decays rapidly with increasing frequency. This power-law-like decay implies that low-frequency signals are generally more robust than high-frequency signals, yet high classification accuracy cannot be achieved with low-frequency signals alone. By applying Shapley value theory, our method axiomatically quantifies the predictive power of robust and non-robust features within an information-theoretic framework. Our method, dubbed \textbf{I-ASIDE} (\textbf{I}mage \textbf{A}xiomatic \textbf{S}pectral \textbf{I}mportance \textbf{D}ecomposition \textbf{E}xplanation), provides unique insight into model robustness mechanisms. We conduct extensive experiments on a variety of vision models pre-trained on ImageNet to show that \textbf{I-ASIDE} can not only \textbf{measure} perturbation robustness but also \textbf{provide interpretations} of its mechanisms.
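To make the idea of attributing predictive power to frequency bands concrete, the following is a minimal sketch of an exact Shapley decomposition over spectral bands. It is not the I-ASIDE implementation. The ring-shaped band partition, the helper names (`radial_band_masks`, `keep_bands`, `shapley_values`, `toy_score`), and the number of bands are assumptions made for illustration. In particular, `toy_score` (the fraction of image energy retained) is a hypothetical stand-in for the information-theoretic score a real model would supply.

```python
import itertools
import math

import numpy as np


def radial_band_masks(size, n_bands):
    """Split a centred size x size spectrum into concentric rings.

    Band 0 holds the lowest frequencies. The ring layout is an assumption
    of this sketch, not something specified by the abstract.
    """
    freqs = np.fft.fftshift(np.fft.fftfreq(size))
    fy, fx = np.meshgrid(freqs, freqs, indexing="ij")
    radius = np.hypot(fx, fy)
    edges = np.linspace(0.0, radius.max(), n_bands + 1)
    # Digitizing against the interior edges yields indices 0 .. n_bands - 1.
    band_index = np.digitize(radius, edges[1:-1])
    return [band_index == k for k in range(n_bands)]


def keep_bands(image, masks, coalition):
    """Return the image rebuilt from only the bands listed in `coalition`."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    keep = np.zeros(image.shape, dtype=bool)
    for k in coalition:
        keep |= masks[k]
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * keep)))


def shapley_values(n_players, value_fn):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    phi = np.zeros(n_players)
    n_fact = math.factorial(n_players)
    for i in range(n_players):
        others = [j for j in range(n_players) if j != i]
        for r in range(len(others) + 1):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = math.factorial(r) * math.factorial(n_players - r - 1) / n_fact
            for subset in itertools.combinations(others, r):
                phi[i] += weight * (value_fn(subset + (i,)) - value_fn(subset))
    return phi


# Demo on a synthetic image; a real study would use natural images and a model.
N_BANDS = 4
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
masks = radial_band_masks(32, N_BANDS)


def toy_score(coalition):
    """Hypothetical stand-in score: fraction of image energy kept."""
    filtered = keep_bands(image, masks, coalition)
    return float(np.sum(filtered ** 2) / np.sum(image ** 2))


phi = shapley_values(N_BANDS, toy_score)
print("per-band Shapley contributions:", phi)
```

Exact enumeration costs on the order of 2^n score evaluations, so it is only practical for a handful of bands. Replacing `toy_score` with a function that feeds the band-filtered image to a classifier and returns a scalar score leaves the rest of the sketch unchanged.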