How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. This difference in processing can be used as a sanity check for feature visualizations. We underpin our empirical findings with theory, proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include general black-box neural networks. Therefore, a promising way forward could be the development of networks that enforce certain structures in order to ensure more reliable feature visualizations.