Feature Visualization (FV) is a widely used technique for interpreting concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. We introduce Gradient Slingshots, a novel method that enables FV manipulation without modifying model architecture or significantly degrading performance. By shaping new trajectories in off-distribution regions of a feature's activation landscape, we coerce the optimization process to converge to a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
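To make the attack surface concrete, below is a minimal sketch of standard FV via activation maximization in PyTorch. The model choice, layer index, unit, and hyperparameters are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of Feature Visualization via activation maximization.
# Model, layer index, unit, and hyperparameters are illustrative
# assumptions, not values from the paper.
import torch
import torchvision.models as models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image is optimized

# Capture the activation of a chosen layer with a forward hook.
acts = {}
layer = model.features[28]  # hypothetical target conv layer
handle = layer.register_forward_hook(lambda m, i, o: acts.update(feat=o))

unit = 0                                             # hypothetical channel
x = torch.randn(1, 3, 224, 224, requires_grad=True)  # random init image
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(256):
    opt.zero_grad()
    model(x)
    loss = -acts["feat"][0, unit].mean()  # maximize mean channel activation
    loss.backward()
    opt.step()

handle.remove()
fv = x.detach()  # the synthesized visualization for (layer, unit)
```

Because the result depends entirely on this gradient-driven optimization, any fine-tuning that reshapes the feature's activation landscape away from the data distribution can redirect where it converges.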
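For intuition only, here is one plausible shape such a manipulation objective could take: fine-tune with a preservation term that keeps outputs on real data close to the original model, plus a term that plants a strong activation of the target unit at an attacker-chosen image. This is a conceptual sketch under those assumptions, not the paper's actual Gradient Slingshots objective; all names and coefficients are hypothetical.

```python
# Conceptual sketch ONLY: one way a dual-objective fine-tune could hijack
# FV. NOT the paper's actual Gradient Slingshots objective. It preserves
# behavior on real data while planting a strong activation of the target
# unit at an attacker-chosen image, so activation maximization tends to
# converge there. All names and coefficients are assumptions.
import copy
import itertools
import torch
import torch.nn.functional as F

def manipulate_fv(model, loader, layer, unit, target_img,
                  alpha=1.0, lr=1e-5, steps=1000):
    frozen = copy.deepcopy(model).eval()  # reference for preservation term
    for p in frozen.parameters():
        p.requires_grad_(False)

    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(feat=o))

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    batches = itertools.cycle(loader)
    for _ in range(steps):
        x, _ = next(batches)
        # Preservation: keep outputs on real data close to the original model.
        preserve = F.mse_loss(model(x), frozen(x))
        # Manipulation: make the chosen image maximally activate the unit.
        model(target_img)
        hijack = -acts["feat"][0, unit].mean()
        loss = preserve + alpha * hijack
        opt.zero_grad()
        loss.backward()
        opt.step()

    handle.remove()
    return model
```

Note the architecture is untouched and the preservation term bounds the accuracy cost, consistent with the abstract's claim that the manipulation requires neither architectural changes nor significant performance degradation.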