Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches mostly predict urban perception directly from street view images and largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Building on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to modeling subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, fusion of gaze with explicit semantic scene representations, and fusion of gaze with implicit, richer visual representations. Experiments show that gaze alone carries useful predictive signal for subjective urban perception, and that integrating gaze with scene representations further improves prediction for both the semantic and the richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.