Deep Neural Networks (DNNs) draw their power from the representations they learn. However, while highly effective at learning complex abstractions, they are susceptible to learning malicious concepts because of spurious correlations inherent in the training data. Existing methods for uncovering such artifactual behavior in trained models focus on finding artifacts in the input data, which requires both access to a dataset and human supervision. In this paper, we introduce DORA (Data-agnOstic Representation Analysis), the first data-agnostic framework for analyzing the representation space of DNNs. We propose a novel distance measure between representations that uses the self-explaining capabilities of the network itself, without access to any data, and we quantitatively validate its alignment with human-defined semantic distances. We further demonstrate that this metric can be used to detect anomalous representations, which may carry a risk of having learned unintended spurious concepts that deviate from the desired decision-making policy. Finally, we demonstrate the practical utility of DORA by analyzing and identifying artifactual representations in widely used Computer Vision models.