How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by using optimization to synthesize input patterns that highly activate a given unit. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding with theory, proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include general black-box neural networks. Therefore, a promising way forward could be the development of networks that enforce certain structures in order to ensure more reliable feature visualizations.
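To make "visualizing highly activating patterns through optimization" concrete, the following is a minimal sketch of activation maximization on a toy unit, not the procedure used in this work. Everything in it is an assumption made for illustration: the single linear unit, the names `pre_activation` and `visualize_feature`, the step size, and the unit-norm constraint that stands in for the image regularizers used in practice. The sketch ascends the pre-activation so that the gradient never vanishes where a ReLU would be inactive.

```python
import math
import random


def pre_activation(weights, bias, x):
    """Pre-activation of a hypothetical toy unit: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def project_to_unit_ball(x):
    """Rescale x to norm 1 if it lies outside the unit ball (toy input constraint)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return x if norm <= 1.0 else [xi / norm for xi in x]


def visualize_feature(weights, steps=200, lr=0.1, seed=0):
    """Projected gradient ascent on the input to maximize the unit's pre-activation."""
    rng = random.Random(seed)
    # Start from a small random input, analogous to starting from a noise image.
    x = [rng.uniform(-0.01, 0.01) for _ in weights]
    for _ in range(steps):
        # For a linear unit, the gradient with respect to the input is the weight vector.
        grad = list(weights)
        x = [xi + lr * g for xi, g in zip(x, grad)]
        x = project_to_unit_ball(x)
    return x


weights, bias = [3.0, 4.0], 0.5
x_star = visualize_feature(weights)
```

For this toy unit the optimized input aligns with the weight direction, so `x_star` approaches `[0.6, 0.8]`. The result shows only which input maximizes the unit under the chosen constraint; it says nothing about how the unit responds to natural inputs, which is the gap the abstract questions.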