The development of explainable artificial intelligence (xAI) methods for scene classification has attracted considerable attention in remote sensing (RS). Most xAI methods and the related evaluation metrics used in RS were originally developed for natural images in computer vision (CV), and applying them directly to RS images may not be suitable. To address this issue, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. Specifically, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, and randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, depends heavily on the perturbation baseline and on the spatial characteristics of RS scenes. Gradient-based approaches such as GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. We find analogous limitations in the evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.
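To make the baseline dependence of perturbation-based methods concrete, the following is a minimal sketch of patch-wise occlusion attribution in NumPy. It is not the implementation used in the paper: the function `occlusion_map`, the stand-in classifier `toy_model`, the patch size, and the two baselines ("zero" and "mean") are all assumptions chosen for illustration.

```python
import numpy as np

def occlusion_map(model, image, target, patch=8, baseline="zero"):
    """Patch-wise occlusion attribution for one image of shape (H, W, C).

    Each patch is replaced by the baseline value; the attribution of the
    patch is the resulting drop in the target class score.
    """
    h, w, c = image.shape
    if baseline == "zero":
        fill = np.zeros(c)               # black patch
    elif baseline == "mean":
        fill = image.mean(axis=(0, 1))   # per-channel image mean
    else:
        raise ValueError(f"unknown baseline: {baseline}")

    ref = model(image)[target]           # score on the unperturbed image
    attr = np.zeros((h, w))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            attr[y:y + patch, x:x + patch] = ref - model(occluded)[target]
    return attr

def toy_model(image):
    """Hypothetical classifier: the class-0 score is the mean intensity
    of the upper-left quadrant (a stand-in for a trained network)."""
    h, w, _ = image.shape
    s = image[: h // 2, : w // 2].mean()
    return np.array([s, 1.0 - s])

# A low-contrast scene, loosely mimicking a spatially homogeneous class.
rng = np.random.default_rng(0)
img = rng.uniform(0.4, 0.6, size=(32, 32, 3))

a_zero = occlusion_map(toy_model, img, target=0, baseline="zero")
a_mean = occlusion_map(toy_model, img, target=0, baseline="mean")
```

In this toy setting both maps are zero outside the upper-left quadrant, but inside it the zero baseline yields much larger attributions than the mean baseline, because a mean-valued patch barely changes a homogeneous region. The same model and image thus produce quite different explanations depending only on the baseline.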