Deep Neural Networks (DNNs) are capable of learning complex and versatile representations; however, the semantic nature of the learned concepts remains unknown. A common method for explaining the concepts learned by DNNs is Feature Visualization (FV), which synthesizes an input signal that maximally activates a particular neuron in the network. In this paper, we investigate the vulnerability of this approach to adversarial model manipulation and introduce a novel method for manipulating FV without significantly affecting the model's decision-making process. The key distinction of our approach is that it does not alter the model architecture. We evaluate the effectiveness of our method on several neural network models and demonstrate its ability to hide the functionality of arbitrarily chosen neurons by masking their original explanations with chosen target explanations during model auditing.
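To make the attacked mechanism concrete, the following is a minimal sketch of Feature Visualization via activation maximization: starting from noise, the input is optimized by gradient ascent on a chosen neuron's activation. It assumes a torchvision GoogLeNet; the target layer (`inception4c`), channel index, step count, and learning rate are illustrative choices, and practical FV pipelines typically add regularizers and transformation robustness on top of this basic loop.

```python
# Minimal Feature Visualization sketch (activation maximization).
# Assumptions: torchvision GoogLeNet; layer/channel chosen arbitrarily for illustration.
import torch
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()

# Capture the activation of the chosen layer with a forward hook.
activations = {}
def hook(module, inp, out):
    activations["feat"] = out

model.inception4c.register_forward_hook(hook)  # hypothetical target layer

# Start from random noise and ascend the gradient of the neuron's activation.
x = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)
channel = 42  # hypothetical target channel

for step in range(256):
    optimizer.zero_grad()
    model(x)
    # Maximize the mean activation of the chosen channel
    # (negated, since the optimizer minimizes).
    loss = -activations["feat"][0, channel].mean()
    loss.backward()
    optimizer.step()

visualization = x.detach()  # synthetic input that (locally) maximizes the neuron
```

Because the visualization is produced purely by optimizing against the model's gradients, any manipulation of the loss landscape around a neuron can redirect this optimization toward an attacker-chosen target image, which is the vulnerability studied in this paper.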