How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. As a form of mechanistic interpretability, visualization methods today form the foundation of our knowledge about the internal workings of neural networks. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding with theory, proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include general black-box neural networks. Therefore, a promising way forward could be the development of networks that enforce certain structures in order to ensure more reliable feature visualizations.