Deep learning models such as convolutional neural networks (CNNs) are powerful image classifiers, but what factors determine whether they attend to similar image areas as humans do? While previous studies have focused on technological factors, little is known about the role of factors that affect human attention. In the present study, we investigated how the tasks used to elicit human attention maps interact with image characteristics in modulating the similarity between humans and CNNs. We varied the intentionality of human tasks, ranging from spontaneous gaze during categorization through intentional gaze-pointing to manual area selection. Moreover, we varied the type of image to be categorized, using either singular, salient objects, indoor scenes consisting of object arrangements, or landscapes without distinct objects defining the category. The human attention maps generated in this way were compared to the CNN attention maps revealed by an explainable artificial intelligence method (Grad-CAM). The influence of human tasks strongly depended on image type: for objects, human manual selection produced the maps that were most similar to those of the CNN, while the specific eye movement task had little impact. For indoor scenes, spontaneous gaze produced the least similarity, while for landscapes, similarity was equally low across all human tasks. To better understand these results, we also compared the different human attention maps to each other. Our results highlight the importance of taking human factors into account when comparing the attention of humans and CNNs.
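The comparison between a human attention map and a Grad-CAM map can be illustrated with a minimal sketch. The abstract does not say which similarity measure was used, so the Pearson correlation below is an assumption chosen only for illustration, and the function names `normalize_map` and `map_similarity` are hypothetical. Both maps are assumed to be non-negative 2D arrays of the same shape, for example a fixation density map and a Grad-CAM heatmap resized to the image resolution.

```python
import numpy as np


def normalize_map(attention_map):
    """Rescale a non-negative 2D attention map so that its values sum to 1."""
    m = np.asarray(attention_map, dtype=float)
    total = m.sum()
    if total <= 0:
        raise ValueError("attention map must contain positive mass")
    return m / total


def map_similarity(human_map, cnn_map):
    """Pearson correlation between two attention maps of equal shape.

    Assumption: correlation is used here as a stand-in similarity measure;
    the study itself may have used a different metric.
    """
    h = normalize_map(human_map)
    c = normalize_map(cnn_map)
    if h.shape != c.shape:
        raise ValueError("maps must have the same shape")
    # Flatten both maps and correlate them pixel by pixel.
    return float(np.corrcoef(h.ravel(), c.ravel())[0, 1])
```

A value near 1 would indicate that both maps emphasize the same image areas, while a value near 0 would indicate little spatial agreement. A completely uniform map has zero variance, so the correlation is undefined (NaN) in that case and would need separate handling.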