Evaluating explanations of image classifiers against ground truth, e.g. segmentation masks defined by human perception, primarily measures the quality of the models under consideration rather than that of the explanation methods themselves. Driven by this observation, we propose a framework for $\textit{jointly}$ evaluating the robustness of safety-critical systems that $\textit{combine}$ a deep neural network with an explanation method. Such systems are increasingly used in real-world applications like medical image analysis or robotics. We introduce a fine-tuning procedure to (mis)align model–explanation pipelines with ground truth and use it to quantify the potential discrepancy between the worst-case and best-case scenarios of human alignment. Experiments across various model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the overall vulnerability of such AI systems to potential adversarial attacks.