Don't trust your eyes: on the (un)reliability of feature visualizations

How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. This difference in processing can be used as a sanity check for feature visualizations. We underpin our empirical findings with theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include general black-box neural networks. Therefore, a promising way forward could be the development of networks that enforce certain structures in order to ensure more reliable feature visualizations.
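To make "visualizing highly activating patterns through optimization" concrete, here is a minimal sketch of activation maximization on a toy unit. It is an illustration under strong assumptions, not the method or code of the paper: the "unit" is a single linear filter, the input is constrained to the unit sphere, and all names (`unit_activation`, `visualize_unit`, `natural_like`) are hypothetical. Real feature visualization runs the same kind of gradient ascent through a deep network, usually with additional regularizers.

```python
import math
import random


def unit_activation(weights, image):
    """Toy 'unit' (an assumption): dot product of a fixed filter and the input."""
    return sum(w * x for w, x in zip(weights, image))


def normalize(vec):
    """Project a vector onto the unit sphere (a simple input constraint)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def visualize_unit(weights, steps=200, lr=0.1, seed=0):
    """Projected gradient ascent on the input to maximize the unit's activation."""
    rng = random.Random(seed)
    image = normalize([rng.gauss(0.0, 1.0) for _ in weights])
    for _ in range(steps):
        # For a linear unit, d(activation)/d(image) is simply the weight vector.
        grad = weights
        image = normalize([x + lr * g for x, g in zip(image, grad)])
    return image


rng = random.Random(42)
weights = [rng.gauss(0.0, 1.0) for _ in range(64)]

# The optimized input: what a feature visualization would display.
visualization = visualize_unit(weights)

# A random unit-norm input, standing in for an arbitrary "ordinary" input.
natural_like = normalize([rng.gauss(0.0, 1.0) for _ in range(64)])

print("activation on optimized input:", unit_activation(weights, visualization))
print("activation on random input:   ", unit_activation(weights, natural_like))
```

In this toy setting the optimized input aligns with the filter itself, so the visualization is faithful by construction. The concern raised in the abstract is that, for general black-box networks, the optimized input may be processed very differently from natural inputs, so a high activation alone does not establish that the visualization explains the unit's behavior on natural images.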
