Much of the research on the interpretability of deep neural networks has focused on studying the visual features that maximally activate individual neurons. However, recent work has cast doubt on the usefulness of such local representations for understanding the behavior of deep neural networks, because individual neurons tend to respond to multiple unrelated visual patterns, a phenomenon referred to as "superposition". A promising alternative for disentangling these complex patterns is to learn sparsely distributed vector representations from entire network layers, as the resulting basis vectors appear to encode single, identifiable visual patterns consistently. One would therefore expect the resulting code to align better with human-perceivable visual patterns, but supporting evidence has remained, at best, anecdotal. To fill this gap, we conducted three large-scale psychophysics experiments with a pool of 560 participants. Our findings provide strong evidence that (i) features obtained from sparse distributed representations are easier for human observers to interpret and (ii) this effect is more pronounced in the deepest layers of a neural network. Complementary analyses further reveal that (iii) features derived from sparse distributed representations contribute more to the model's decisions. Overall, our results highlight that distributed representations constitute a superior basis for interpretability, underscoring the need for the field to move beyond the interpretation of local neural codes in favor of sparsely distributed ones.
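To make the idea concrete, the sketch below illustrates the general family of techniques the abstract refers to: learning a sparse, overcomplete dictionary over the activations of a network layer, so that each input is reconstructed from only a few basis vectors. This is a minimal, hypothetical example using scikit-learn on synthetic activations, not the paper's specific method or experimental pipeline; all array sizes and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: sparse dictionary learning over layer activations,
# in the spirit of the sparsely distributed representations described above.
# The activation matrix is synthetic; a real pipeline would record activations
# from a chosen layer across many input images.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Stand-in for layer activations: (n_images, n_neurons). Hypothetical sizes.
activations = rng.standard_normal((2000, 128)).astype(np.float32)

# Overcomplete basis (more components than neurons) with an L1 sparsity
# penalty, so each image is encoded by only a few active basis vectors.
dico = MiniBatchDictionaryLearning(
    n_components=512,   # overcomplete: 512 candidate concepts for 128 neurons
    alpha=1.0,          # sparsity penalty on the codes
    batch_size=256,
    random_state=0,
)
codes = dico.fit_transform(activations)  # (2000, 512) sparse codes
dictionary = dico.components_            # (512, 128) basis vectors

# Each row of `dictionary` is a candidate "concept" direction in activation
# space; inputs with large code values for that row are its top exemplars,
# the kind of stimuli one would show to human observers.
print(f"Fraction of non-zero code entries: {np.mean(codes != 0):.3f}")
```

Under this framing, interpretability studies compare the exemplars of individual neurons (local code) against the exemplars of learned dictionary rows (sparse distributed code).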