Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. Here we ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify mechanistic interpretability for a diverse suite of models and find no scaling effect for interpretability, neither for model size nor for dataset size. Specifically, none of the nine investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than an improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and the need for more helpful interpretability methods to increase our understanding of networks at an atomic level. We release a dataset containing more than 120,000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset is meant to facilitate research on automated rather than human-based interpretability evaluations, which can ultimately be leveraged to directly optimize the mechanistic interpretability of models.
