Explainable AI (XAI) promises to provide insight into machine learning models' decision processes, where one goal is to identify failures such as shortcut learning. This promise relies on the field's assumption that input features marked as important by an XAI method must contain information about the target variable. However, it is unclear whether informativeness is indeed the main driver of importance attribution in practice, or whether other data properties such as statistical suppression, novelty at test time, or high feature salience contribute substantially. To clarify this, we trained deep learning models on three variants of a binary image classification task, in which translucent watermarks are either absent, act as class-dependent confounds, or represent class-independent noise. Results for five popular attribution methods show substantially elevated relative importance in watermarked areas (RIW) for all models regardless of the training setting ($R^2 \geq .45$). By contrast, whether or not the presence of watermarks is class-dependent has only a marginal effect on RIW ($R^2 \leq .03$), despite a clear impact on model performance and generalisation ability. XAI methods show similar behaviour to model-agnostic edge detection filters and attribute substantially less importance to watermarks when bright image intensities are encoded by smaller instead of larger feature values. These results indicate that importance attribution is most strongly driven by the salience of image structures at test time rather than by statistical associations learned by machine learning models. Previous studies demonstrating successful XAI application should be reevaluated with respect to a possibly spurious co-occurrence of feature salience and informativeness, and workflows using feature attribution methods as building blocks should be scrutinised.
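The abstract does not spell out how the RIW score is computed; one plausible formalisation, under assumed notation (a sketch, not necessarily the paper's exact definition), is the share of absolute attribution mass falling inside the watermark mask, normalised by the mask's relative area:
$$\mathrm{RIW} \;=\; \frac{\sum_{i \in W} \lvert a_i \rvert \,\big/\, \sum_{i=1}^{N} \lvert a_i \rvert}{\lvert W \rvert \,/\, N},$$
where $a_i$ denotes the attribution assigned to pixel $i$, $W$ the set of watermarked pixels, and $N$ the total number of pixels. Under this reading, $\mathrm{RIW} > 1$ indicates disproportionately high importance attributed to watermarked areas relative to their share of the image.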