Class Activation Mapping (CAM) and its gradient-based variants (e.g., GradCAM) have become standard tools for explaining Convolutional Neural Network (CNN) predictions. However, these approaches typically focus on individual logits, whereas in networks with a softmax output, the class-membership probability estimates depend only on the differences between logits, not on their absolute values. This disconnect leaves standard CAMs vulnerable to adversarial manipulation, such as passive fooling, where a model is trained to produce misleading CAMs without affecting decision performance. To address this vulnerability, we propose DiffGradCAM and its higher-order derivative version, DiffGradCAM++: novel, lightweight, contrastive approaches to class activation mapping that are not susceptible to passive fooling and that match the output of standard methods such as GradCAM and GradCAM++ in the non-adversarial case. To test our claims, we introduce Salience-Hoax Activation Maps (SHAMs), a more advanced, entropy-aware form of passive fooling that serves as a benchmark for CAM robustness under adversarial conditions. Together, SHAM and DiffGradCAM establish a new framework for probing and improving the robustness of saliency-based explanations. We validate both contributions across multi-class tasks with few and many classes.
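The claim that softmax probabilities depend only on logit differences can be checked numerically. The sketch below is illustrative only: the logit values are made up, and `diff_target` (target logit minus the mean of the other logits) is a hypothetical contrastive score chosen for this example, not necessarily the exact target used by DiffGradCAM.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def diff_target(z, c):
    # Hypothetical contrastive score (an assumption for illustration):
    # the target logit minus the mean of all other logits.
    others = np.delete(z, c)
    return z[c] - others.mean()

# Made-up logits for a 3-class problem, and the same logits with a
# constant added to every entry.
logits = np.array([2.0, 0.5, -1.0])
shifted = logits + 100.0

# The probabilities are unchanged by the shift...
same_probs = np.allclose(softmax(logits), softmax(shifted))

# ...but the individual target logit is not, so an explanation built on a
# single logit can change while the prediction stays the same.
target = 0
single_before, single_after = logits[target], shifted[target]

# A difference-based score is unaffected by the shift.
diff_before = diff_target(logits, target)
diff_after = diff_target(shifted, target)

print(same_probs, single_before, single_after, diff_before, diff_after)
```

A constant shift is the simplest way a single logit can move without changing the decision; a difference-based score ignores such shifts by construction, which is the intuition behind contrastive targets.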