Although Deep Neural Networks (DNNs) are highly effective at learning complex abstractions, they are susceptible to unintentionally learning spurious artifacts from the training data. To ensure model transparency, it is crucial to examine the relationships between learned representations, because unintended concepts often appear anomalous with respect to the desired task. In this work, we introduce DORA (Data-agnOstic Representation Analysis): the first data-agnostic framework for analyzing the representation space of DNNs. Our framework employs the proposed Extreme-Activation (EA) distance measure between representations, which exploits self-explaining capabilities within the network and requires no access to data. We quantitatively validate the metric's correctness and its alignment with human-defined semantic distances. The coherence between the EA distance and human judgment enables us to identify representations whose underlying concepts humans would consider unnatural, by detecting outliers in functional distance. Finally, we demonstrate the practical usefulness of DORA by analyzing and identifying artifact representations in popular Computer Vision models.