In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network's predictions. Fidelity, however, measures alignment with a learned model rather than with the data-generating signal underlying the task. This work introduces the linearity score $\lambda(f)$, a diagnostic that quantifies the extent to which a regression network's input--output behavior is linearly decodable. $\lambda(f)$ is defined as the $R^2$ of the best linear surrogate's fit to the network's predictions. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model's behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.
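The linearity score described above can be sketched in a few lines: fit a linear surrogate to a network's predictions and report the $R^2$ of that fit. This is a minimal illustration, not the paper's implementation; the stand-in network `f` and all names here are hypothetical, and a real experiment would use a trained model and held-out inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Hypothetical stand-in "network": mostly linear with a small nonlinearity.
def f(X):
    return X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.tanh(X[:, 0])

y_net = f(X)

# Fit the linear surrogate to the network's predictions by least squares
# (design matrix with an intercept column).
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y_net, rcond=None)
y_lin = A @ coef

# lambda(f): R^2 of the linear surrogate's fit to the network outputs.
ss_res = np.sum((y_net - y_lin) ** 2)
ss_tot = np.sum((y_net - y_net.mean()) ** 2)
lam = 1.0 - ss_res / ss_tot
print(lam)
```

Because the stand-in function is nearly linear, `lam` comes out close to 1; a strongly nonlinear network would drive it toward 0. Note that a high $\lambda(f)$ or high surrogate fidelity says nothing by itself about how well either model fits the data.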