Gestalt psychologists have identified a range of conditions in which humans organize elements of a scene into a group or whole, and perceptual grouping principles play an essential role in scene perception and object identification. Recently, Deep Neural Networks (DNNs) trained on natural images (ImageNet) have been proposed as compelling models of human vision, based on reports that they perform well on various brain and behavioral benchmarks. Here we test a total of 16 networks covering a variety of architectures and learning paradigms (convolutional, attention-based, supervised and self-supervised, feed-forward and recurrent) on dot stimuli (Experiment 1) and more complex shape stimuli (Experiment 2) that produce strong Gestalt effects in humans. In Experiment 1, we found that convolutional networks were indeed sensitive in a human-like fashion to the principles of proximity, linearity, and orientation, but only at the output layer. In Experiment 2, we found that most networks exhibited Gestalt effects for only a few stimulus sets, and again only at the last stage of processing. Overall, self-supervised networks and Vision Transformers appeared to perform worse than convolutional networks in terms of similarity to humans. Remarkably, no model showed a grouping effect at the early or intermediate stages of processing. This is at odds with the widespread assumption that Gestalt grouping occurs prior to object recognition and indeed serves to organize the visual scene for object recognition. Our overall conclusion is that, although it is noteworthy that networks trained on simple 2D images support a form of Gestalt grouping for some stimuli at the output layer, this ability does not seem to transfer to more complex features. Additionally, the fact that this grouping occurs only at the last layer suggests that networks learn fundamentally different perceptual properties than humans.