Saliency methods provide post-hoc model interpretation by attributing a model's outputs to its input features. Current methods mainly do this using a single input sample, and therefore cannot answer input-independent questions about the model. We also show that input-specific saliency mapping is intrinsically susceptible to misleading feature attribution. Current attempts to use 'general' input features for model interpretation assume access to a dataset containing those features, which biases the interpretation. To address this gap, we introduce a new perspective of input-agnostic saliency mapping that computationally estimates the high-level features the model attributes to its outputs. These features are geometrically correlated and are computed by accumulating the model's gradient information with respect to an unrestricted data distribution. To compute them, we nudge independent data points over the model's loss surface towards the local minima associated with a human-understandable concept, e.g., a class label for classifiers. Through a systematic projection, scaling, and refinement process, this information is transformed into an interpretable visualization without compromising its model fidelity. The visualization serves as a stand-alone qualitative interpretation. In an extensive evaluation, we not only demonstrate successful visualizations of a variety of concepts for large-scale models, but also showcase an interesting utility of this new form of saliency mapping: identifying backdoor signatures in compromised classifiers.
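The gradient-accumulation idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy linear softmax classifier (`TOY_WEIGHTS`), the uniform sampling range standing in for the "unrestricted data distribution", the step count, the learning rate, and the final min-max rescaling (a crude stand-in for the projection, scaling, and refinement process) are all assumptions made for illustration. The function names `input_gradient` and `input_agnostic_map` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def input_gradient(weights, x, target):
    """Gradient of the cross-entropy loss w.r.t. the input x
    for a linear softmax model with logits = W x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    # d(loss)/d(logit_k) = p_k - 1[k == target]
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Chain rule through logits = W x gives W^T dlogits.
    return [
        sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
        for j in range(len(x))
    ]


def input_agnostic_map(weights, target, n_points=64, n_steps=20, lr=0.1, seed=0):
    """Nudge random points towards lower loss for `target` and
    accumulate the descent directions into one per-feature map."""
    rng = random.Random(seed)
    dim = len(weights[0])
    accumulated = [0.0] * dim
    for _ in range(n_points):
        # Assumption: uniform noise stands in for an unrestricted distribution.
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(n_steps):
            grad = input_gradient(weights, x, target)
            x = [xi - lr * g for xi, g in zip(x, grad)]
            # Accumulate the descent direction (negative gradient).
            accumulated = [a - g for a, g in zip(accumulated, grad)]
    # Assumption: min-max rescaling to [0, 1] for display purposes only.
    lo, hi = min(accumulated), max(accumulated)
    span = (hi - lo) or 1.0
    return [(a - lo) / span for a in accumulated]


# Hypothetical 3-class model over 4 input features.
TOY_WEIGHTS = [
    [2.0, 0.0, 0.0, -1.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
]
saliency = input_agnostic_map(TOY_WEIGHTS, target=0)
```

In this toy setting the map for class 0 peaks at feature 0, the only feature with a positive weight for that class, without any dataset sample being consulted. A real model would replace the closed-form gradient with automatic differentiation and would need the more careful post-processing the abstract refers to.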