How explainable are adversarially-robust CNNs?

Three important criteria for existing convolutional neural networks (CNNs) are (1) test-set accuracy; (2) out-of-distribution accuracy; and (3) explainability. While these criteria have been studied independently, the relationship among them is unknown. For example, do CNNs with stronger out-of-distribution performance also have stronger explainability? Furthermore, most prior feature-importance studies evaluate methods on only 2-3 common vanilla ImageNet-trained CNNs, leaving it unknown how these methods generalize to CNNs with other architectures and training algorithms. Here, we perform the first large-scale evaluation of the relationships among the three criteria using 9 feature-importance methods and 12 ImageNet-trained CNNs spanning 3 training algorithms and 5 CNN architectures. We report several insights and recommendations for ML practitioners. First, adversarially robust CNNs have a higher explainability score on gradient-based attribution methods (but not on CAM-based or perturbation-based methods). Second, AdvProp models, despite being more accurate than both vanilla and robust models, are not superior in explainability. Third, among the 9 feature attribution methods tested, GradCAM and RISE are consistently the best. Fourth, Insertion and Deletion are biased towards vanilla and robust models, respectively, because they correlate strongly with the confidence score distribution of a CNN. Fifth, we did not find a single CNN that is the best on all three criteria, which interestingly suggests that CNNs become harder to interpret as they become more accurate.
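To make the fourth finding concrete, here is a minimal sketch of a Deletion-style score, assuming the usual formulation: pixels are removed in order of decreasing attribution, the model's confidence is recorded after each step, and the area under that curve is reported (lower is better). The names `toy_model`, `deletion_auc`, the 4x4 input, and the zero baseline are illustrative assumptions, not the evaluation code used in this work.

```python
import numpy as np


def toy_model(x):
    # Hypothetical stand-in for a CNN's target-class confidence:
    # it only "looks at" the top-left 2x2 patch of the input.
    return float(x[:2, :2].mean())


def deletion_auc(model, image, saliency, steps=4, baseline=0.0):
    """Area under the confidence curve as the most salient pixels are removed."""
    h, w = saliency.shape
    # Pixel indices sorted from most to least important.
    order = np.argsort(-saliency.ravel(), kind="stable")
    per_step = int(np.ceil(order.size / steps))
    flat = image.astype(float).ravel().copy()
    scores = [model(flat.reshape(h, w))]
    for i in range(steps):
        idx = order[i * per_step:(i + 1) * per_step]
        flat[idx] = baseline  # "delete" the next batch of pixels
        scores.append(model(flat.reshape(h, w)))
    scores = np.array(scores)
    # Trapezoidal rule over a deleted fraction running from 0 to 1.
    return float(np.sum((scores[1:] + scores[:-1]) / 2.0) / steps)


img = np.ones((4, 4))

# A saliency map that points at the patch the model actually uses ...
good = np.zeros((4, 4))
good[:2, :2] = 1.0

# ... and one that points at an irrelevant patch.
bad = np.zeros((4, 4))
bad[2:, 2:] = 1.0

auc_good = deletion_auc(toy_model, img, good)
auc_bad = deletion_auc(toy_model, img, bad)

# The same saliency map scored on a model whose confidences are uniformly halved.
auc_low_conf = deletion_auc(lambda x: 0.5 * toy_model(x), img, good)
```

In this toy setting the faithful map gets a lower Deletion score than the unfaithful one, as intended. Halving the model's confidence also halves the score without any change to the attribution map, which illustrates how such a metric can favor models with lower confidence scores. Insertion mirrors the procedure by starting from the baseline and adding pixels back, so higher confidences raise its score instead.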
