Deep learning algorithms lack human-interpretable accounts of how they transform raw visual input into a robust semantic understanding, which impedes comparisons across architectures, training objectives, and the human brain. In this work, we take inspiration from neuroscience and use representational approaches to examine how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. We also introduce a custom image dataset in which we systematically manipulate saliency and semantic information. We find that, when trained with object classification objectives, ResNets are more sensitive to saliency information than ViTs. We further find that networks suppress saliency in early layers, a process that natural language supervision (CLIP) enhances in ResNets. CLIP also enhances semantic encoding in both architectures. Finally, we show that semantic encoding is a key factor in aligning AI with human visual perception, whereas saliency suppression is a non-brain-like strategy.
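As a rough illustration of what a representational approach can look like, the sketch below uses representational similarity analysis (RSA), which is an assumption here: the abstract does not name the specific method. The toy features, the hypothetical labels `category` and `saliency`, and all function names are invented for illustration and are not taken from the dataset or models described above. The sketch compares a layer's representational dissimilarity matrix (RDM) against a semantic model RDM and a saliency model RDM.

```python
import numpy as np


def rdm(features):
    """RDM as 1 - Pearson correlation between the rows (one row per image)."""
    return 1.0 - np.corrcoef(features)


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, i.e. one value per image pair."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def average_ranks(values):
    """Ranks with ties replaced by their mean rank (needed for binary model RDMs)."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts
    return ((starts + ends - 1) / 2.0)[inverse]


def rsa_score(layer_features, model_rdm):
    """Spearman correlation between the layer RDM and a model RDM."""
    a = average_ranks(upper_triangle(rdm(layer_features)))
    b = average_ranks(upper_triangle(model_rdm))
    return float(np.corrcoef(a, b)[0, 1])


# Hypothetical toy setup: 4 images, each with a semantic category and a saliency condition.
category = np.array([0, 0, 1, 1])
saliency = np.array([0, 1, 0, 1])

# Made-up "layer activations": a strong category pattern plus a weak saliency pattern.
category_patterns = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
                              [0, 1, 0, 1, 0, 1, 0, 1]], dtype=float)
saliency_patterns = np.array([[1, 1, 0, 0, 1, 1, 0, 0],
                              [0, 0, 1, 1, 0, 0, 1, 1]], dtype=float)
features = category_patterns[category] + 0.2 * saliency_patterns[saliency]

# Binary model RDMs: 1 where two images differ on the label, 0 where they match.
semantic_rdm = (category[:, None] != category[None, :]).astype(float)
saliency_rdm = (saliency[:, None] != saliency[None, :]).astype(float)

semantic_score = rsa_score(features, semantic_rdm)
saliency_score = rsa_score(features, saliency_rdm)
print(f"semantic: {semantic_score:.3f}, saliency: {saliency_score:.3f}")
```

In this toy layer the semantic score comes out higher than the saliency score, because the category pattern dominates the activations. Applied layer by layer to real network activations, the same comparison would be one way to trace where saliency information is weakened and semantic information strengthened.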