Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected with Pupil Invisible wearable glasses from ten participants viewing 33 scene images representing environments with varying levels of potential risk. Gaze coordinates were mapped onto the stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
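To make the four alignment metrics concrete, the following minimal sketch shows one way they could be computed with NumPy. It follows common saliency-benchmark definitions rather than the study's actual pipeline. The function names, the Gaussian width `sigma`, the stabilising constant `EPS`, and the choice of the human heatmap as the reference distribution for KL divergence are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # assumed constant guarding against division by zero and log(0)

def gaze_heatmap(fixations, shape, sigma=25.0):
    """Sum an isotropic Gaussian at each (x, y) fixation; sigma (pixels) is an assumed value."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    heat = np.zeros(shape, dtype=float)
    for x, y in fixations:
        heat += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return heat

def fixation_mask(fixations, shape):
    """Binary map that is True at each fixated pixel."""
    mask = np.zeros(shape, dtype=bool)
    for x, y in fixations:
        mask[int(round(y)), int(round(x))] = True
    return mask

def pearson_cc(sal, heat):
    """Pearson correlation between two maps, treated as flat pixel vectors."""
    a = (sal - sal.mean()) / (sal.std() + EPS)
    b = (heat - heat.mean()) / (heat.std() + EPS)
    return float((a * b).mean())

def nss(sal, mask):
    """Mean z-scored saliency at fixated pixels."""
    z = (sal - sal.mean()) / (sal.std() + EPS)
    return float(z[mask].mean())

def kl_divergence(sal, heat):
    """KL divergence of the model map from the human map (human map as reference)."""
    p = heat / (heat.sum() + EPS)
    q = sal / (sal.sum() + EPS)
    return float(np.sum(p * np.log(EPS + p / (q + EPS))))

def auc_judd(sal, mask):
    """ROC area using the saliency values at fixated pixels as thresholds."""
    s, f = sal.ravel(), mask.ravel()
    thresholds = np.sort(s[f])[::-1]  # descending
    n_fix, n_pix = thresholds.size, s.size
    tp, fp = [0.0], [0.0]
    for i, t in enumerate(thresholds, start=1):
        above = np.count_nonzero(s >= t)          # pixels at or above threshold
        tp.append(i / n_fix)                      # fixated pixels captured so far
        fp.append((above - i) / (n_pix - n_fix))  # non-fixated pixels above threshold
    tp, fp = np.array(tp + [1.0]), np.array(fp + [1.0])
    # Trapezoidal area under the (fp, tp) curve
    return float(np.sum(np.diff(fp) * (tp[1:] + tp[:-1]) / 2.0))

if __name__ == "__main__":
    # Synthetic example: hypothetical fixations, not data from the study
    fixations = [(10, 12), (30, 20), (31, 22)]
    heat = gaze_heatmap(fixations, (40, 50), sigma=3.0)
    mask = fixation_mask(fixations, (40, 50))
    model_map = gaze_heatmap([(29, 21)], (40, 50), sigma=5.0)  # stand-in for a model saliency map
    print(pearson_cc(model_map, heat), nss(model_map, mask),
          kl_divergence(model_map, heat), auc_judd(model_map, mask))
```

Under these definitions, higher Pearson correlation, NSS, and AUC-Judd indicate closer agreement with human gaze, whereas lower KL divergence indicates a closer distributional match. This explains why a model can rank best on KL divergence without ranking best on the other three metrics.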
