The objects we perceive guide our eye movements when observing real-world dynamic scenes. Yet, gaze shifts and selective attention are critical for perceiving details and refining object boundaries. Object segmentation and gaze behavior are, however, typically treated as two independent processes. Here, we present a computational model that simulates these processes in an interconnected manner and allows for hypothesis-driven investigations of distinct attentional mechanisms. Drawing on an information processing pattern from robotics, we use a Bayesian filter to recursively segment the scene. The filter also provides an uncertainty estimate for the object boundaries, which we use to guide active scene exploration. We demonstrate that this model closely resembles observers' free viewing behavior on a dataset of dynamic real-world scenes, as measured by scanpath statistics. These include the foveation duration and saccade amplitude distributions used for parameter fitting, as well as higher-level statistics not used for fitting: how object detections, inspections, and returns are balanced, and a delay of returning saccades that arises without an explicit implementation of such temporal inhibition of return. Extensive simulations and ablation studies show that uncertainty promotes balanced exploration and that semantic object cues are crucial to forming the perceptual units used in object-based attention. Moreover, we show how our model's modular design allows for extensions, such as incorporating saccadic momentum or pre-saccadic attention, to further align its output with human scanpaths.