Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is selective attention, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations. Conventional AI systems, by contrast, process entire images with equal emphasis. Our work draws inspiration from the human visual system to build more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade-inspired method that focuses processing on key regions of visual space. We evaluate the method on the standard ImageNet classification task and measure how each successive saccade affects the model's class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.
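To make the saccade-inspired selection concrete, the following is a minimal sketch of how fixations could be drawn from an attention map. It is not the method evaluated in this work: the greedy arg-max policy, the square foveal window, the inhibition-of-return step, and the names `select_fixations`, `revealed_fraction`, and `radius` are all assumptions introduced for illustration. The attention map is a plain 2D grid standing in for a DINO attention map.

```python
from typing import List, Tuple

Grid = List[List[float]]


def select_fixations(attention: Grid, n_saccades: int, radius: int) -> List[Tuple[int, int]]:
    """Greedily pick fixation points from an attention map (hypothetical helper).

    Assumption: each saccade lands on the highest remaining attention value,
    and the visited window is then suppressed (inhibition of return).
    """
    # Work on a copy so the caller's map is left untouched.
    remaining = [row[:] for row in attention]
    height, width = len(remaining), len(remaining[0])
    fixations: List[Tuple[int, int]] = []
    for _ in range(n_saccades):
        # Next fixation: location with the highest remaining attention.
        r, c = max(
            ((i, j) for i in range(height) for j in range(width)),
            key=lambda rc: remaining[rc[0]][rc[1]],
        )
        fixations.append((r, c))
        # Inhibition of return: suppress the square window just visited.
        for i in range(max(0, r - radius), min(height, r + radius + 1)):
            for j in range(max(0, c - radius), min(width, c + radius + 1)):
                remaining[i][j] = float("-inf")
    return fixations


def revealed_fraction(shape: Tuple[int, int], fixations: List[Tuple[int, int]], radius: int) -> float:
    """Fraction of grid cells covered by at least one foveal window."""
    height, width = shape
    seen = set()
    for r, c in fixations:
        for i in range(max(0, r - radius), min(height, r + radius + 1)):
            for j in range(max(0, c - radius), min(width, c + radius + 1)):
                seen.add((i, j))
    return len(seen) / (height * width)


# Toy 4x4 attention map with two salient locations.
attention = [
    [0.1, 0.2, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0],
]
fixations = select_fixations(attention, n_saccades=2, radius=1)
coverage = revealed_fraction((4, 4), fixations, radius=1)
```

In a full pipeline, a classifier would be run on the regions revealed after each saccade, and its class scores recorded as a function of the number of fixations; that step is omitted here because it depends on the model and preprocessing used.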