In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress from scaling neural networks to unprecedented dataset and model sizes. We ask whether this extraordinary increase in scale also benefits the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify mechanistic interpretability for a diverse suite of models and find no scaling effect for interpretability, for either model size or dataset size. Specifically, none of the nine investigated state-of-the-art models is easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than an improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable, and for more helpful interpretability methods that increase our understanding of networks at an atomic level. We release a dataset containing more than 120,000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset is meant to facilitate research on automated, rather than human-based, interpretability evaluations, which could ultimately be used to directly optimize the mechanistic interpretability of models.