Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely either on hand-crafted geometric features or on learning-based approaches that lack semantic awareness, and so fail to explain why humans fixate on regions that are semantically meaningful but geometrically unremarkable. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this bottom-up/top-down dichotomy through asymmetric cross-modal fusion: a semantic stream draws diffusion-based priors from geometry-conditioned multi-view rendering, while a geometric stream uses point cloud transformers. In the cross-attention step, geometric features act as queries over semantic content, so that bottom-up distinctiveness guides top-down retrieval. We extend the framework to temporal scanpath generation through reinforcement learning, introducing the first formulation that respects 3D mesh topology and incorporates inhibition-of-return dynamics. Evaluation on the SAL3D, NUS3D, and 3DVA datasets demonstrates substantial improvements, validating that cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.
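The asymmetric fusion described in the abstract, where geometric features query semantic content, can be illustrated with a minimal single-head cross-attention sketch. This is not the SemGeo-AttentionNet implementation: the function names (`matmul`, `softmax`, `cross_attention`), the identity projection matrices, the tiny feature dimensions, and the toy inputs are all assumptions chosen for readability. The only property taken from the abstract is the direction of the attention: queries come from the geometric stream, keys and values from the semantic stream.

```python
import math

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(geo, sem, w_q, w_k, w_v):
    """Single-head cross-attention with geometric queries (assumed layout).

    geo: (N x d_g) per-point geometric features, used as queries.
    sem: (M x d_s) semantic tokens, used as keys and values.
    Returns the fused per-point features and the (N x M) attention weights.
    """
    q = matmul(geo, w_q)                      # (N x d) queries from geometry
    k = matmul(sem, w_k)                      # (M x d) keys from semantics
    v = matmul(sem, w_v)                      # (M x d) values from semantics
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]      # transpose keys to (d x M)
    scores = matmul(q, k_t)                   # (N x M) similarity scores
    attn = [softmax([s / scale for s in row]) for row in scores]
    return matmul(attn, v), attn

# Toy inputs (hypothetical): 3 surface points, 2 semantic tokens, 2-D features.
identity = [[1.0, 0.0], [0.0, 1.0]]
geo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sem = [[2.0, 0.0], [0.0, 2.0]]

fused, attn = cross_attention(geo, sem, identity, identity, identity)
for point, weights in zip(geo, attn):
    print(point, [round(w, 3) for w in weights])
```

Each row of `attn` is a distribution over semantic tokens for one surface point, so every point retrieves its own mix of semantic content. Swapping the roles of `geo` and `sem` would give the opposite direction of fusion, which is what "asymmetric" rules out here.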