Recent efforts to understand intermediate representations in deep neural networks have commonly attempted to label individual neurons, and combinations of neurons that form linear directions in the latent space, by examining extremal neuron activations and the highest direction projections. In this paper, we show that this approach, although a good approximation for many purposes, fails to capture valuable information about the behaviour of a representation. Neural network activations are generally dense, so a more complex but more realistic scenario is that linear directions encode information at various levels of stimulation. We hypothesise that non-extremal activations contain complex information worth investigating, such as statistical associations, and may therefore be used to locate confounding human-interpretable concepts. We explore the value of studying a range of neuron activations through the case of mid-level output neuron activations, and demonstrate on a synthetic dataset how they reveal aspects of representations in the penultimate layer that are not evident from analysing maximal activations alone. Building on these findings, we develop a method that curates data from mid-range logit samples for retraining, in order to mitigate spurious correlations (confounding concepts) in the penultimate layer on real benchmark datasets. The success of our method exemplifies the utility of inspecting non-maximal activations to extract complex relationships learned by models.
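To make the distinction between maximal and mid-range activations concrete, the following is a minimal sketch of how samples might be selected by logit level for one output class. It is not the procedure used in the paper: the function names `select_mid_range` and `select_top_k`, the quantile bounds `lower_q` and `upper_q`, and the use of random logits in the demo are all assumptions made purely for illustration.

```python
import numpy as np


def select_mid_range(logits, class_idx, lower_q=0.4, upper_q=0.6):
    """Return indices of samples whose logit for `class_idx` lies between
    the `lower_q` and `upper_q` quantiles (inclusive).

    The quantile bounds are illustrative assumptions, not values from the paper.
    """
    class_logits = np.asarray(logits)[:, class_idx]
    low, high = np.quantile(class_logits, [lower_q, upper_q])
    mask = (class_logits >= low) & (class_logits <= high)
    return np.flatnonzero(mask)


def select_top_k(logits, class_idx, k):
    """Return indices of the `k` samples with the largest logit for `class_idx`,
    i.e. the conventional 'maximal activation' view."""
    class_logits = np.asarray(logits)[:, class_idx]
    return np.argsort(class_logits)[::-1][:k]


if __name__ == "__main__":
    # Stand-in logits; in practice these would come from a trained model.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(1000, 10))
    mid_idx = select_mid_range(logits, class_idx=3)
    top_idx = select_top_k(logits, class_idx=3, k=50)
    print(len(mid_idx), len(top_idx))
```

Under these assumptions, the samples indexed by `mid_idx` would be the candidates to inspect for confounding concepts and to curate for retraining, whereas `top_idx` corresponds to the extremal samples that conventional analyses examine.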