ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments

The object-based nature of human visual attention is well established in cognitive science, but it has so far played only a minor role in computational visual attention models. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present ObjectVisA-120, a novel 120-participant dataset of spatial street-crossing navigation in virtual reality, specifically geared to object-based attention evaluation. The dataset is unique because ethical and safety-related challenges make collecting comparable data in real-world environments highly difficult. ObjectVisA-120 not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric for evaluating how well visual attention models capture object-based attention, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also improves model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model that explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.
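The abstract does not spell out how oSIM is computed. As a rough illustration only, the sketch below assumes an object-level score might resemble the standard SIM metric (histogram intersection), applied to attention mass summed per segment instead of per pixel. The function names `sim`, `object_mass`, and `object_sim_sketch`, and the use of a segment-ID map as a stand-in for panoptic segmentation, are hypothetical and are not taken from the paper.

```python
import numpy as np


def sim(p, q):
    """Pixel-level SIM: histogram intersection of two normalized maps."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())


def object_mass(saliency, segment_ids):
    """Sum the saliency inside each segment and normalize the totals to 1.

    `segment_ids` is an integer map with the same shape as `saliency`
    (assumed here to come from a panoptic segmentation).
    """
    saliency = np.asarray(saliency, dtype=float).ravel()
    segment_ids = np.asarray(segment_ids).ravel()
    ids, inverse = np.unique(segment_ids, return_inverse=True)
    mass = np.bincount(inverse.ravel(), weights=saliency, minlength=len(ids))
    return ids, mass / mass.sum()


def object_sim_sketch(pred, gt, segment_ids):
    """Hypothetical object-level score: histogram intersection over segments."""
    _, p = object_mass(pred, segment_ids)
    _, q = object_mass(gt, segment_ids)
    return float(np.minimum(p, q).sum())


# Toy example: prediction and ground truth both attend to segment 1,
# but place the mass on different pixels within it.
segments = np.array([[1, 1], [2, 2]])
pred = np.array([[0.5, 0.5], [0.0, 0.0]])
gt = np.array([[0.0, 1.0], [0.0, 0.0]])

pixel_score = sim(pred, gt)
object_score = object_sim_sketch(pred, gt, segments)
print(pixel_score, object_score)
```

In this toy case the pixel-level score penalizes the prediction for spreading its mass across the object, while the object-level score treats both maps as attending to the same object. This is the kind of distinction an object-based metric is meant to expose, although the actual oSIM definition may differ.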
