Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset, which comprises 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants, we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground-truth human saliency maps using three metrics: Pearson's correlation coefficient (CC), similarity (SIM), and KL divergence. Our experiments (1,980 images × 9 models × 3 runs × 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations, suggesting that VLMs capture exploratory gaze patterns rather than initial fixations. All code, predictions, and evaluation results are publicly available.
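The evaluation step described above (gaze points, then Gaussian-blurred saliency map, then CC, SIM, and KL divergence) can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the image size, the example coordinates, and the Gaussian width `sigma` are all assumptions chosen for illustration, and the metric formulas follow the definitions commonly used in saliency benchmarking, with KL divergence taken from the ground truth to the prediction.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def points_to_saliency(points, height, width, sigma):
    """Sum one Gaussian per (x, y) gaze point into a saliency map.

    This matches blurring a fixation-count map with a Gaussian kernel,
    up to boundary handling. `sigma` (pixels) is an assumed parameter.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    sal = np.zeros((height, width), dtype=np.float64)
    for px, py in points:
        sal += np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
    return sal


def _as_distribution(sal):
    """Scale a non-negative map so that its values sum to 1."""
    return sal / (sal.sum() + EPS)


def cc(pred, gt):
    """Pearson correlation coefficient between two maps (higher is better)."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    g = (gt - gt.mean()) / (gt.std() + EPS)
    return float((p * g).mean())


def sim(pred, gt):
    """Histogram intersection of the two normalized maps, in [0, 1]."""
    return float(np.minimum(_as_distribution(pred), _as_distribution(gt)).sum())


def kl_div(pred, gt):
    """KL divergence from ground truth to prediction (lower is better)."""
    p, g = _as_distribution(pred), _as_distribution(gt)
    return float((g * np.log(EPS + g / (p + EPS))).sum())


# Hypothetical gaze points on a 64x48 image: human fixations vs. model output.
human = points_to_saliency([(20, 15), (40, 30)], height=48, width=64, sigma=4.0)
model = points_to_saliency([(22, 16), (38, 28)], height=48, width=64, sigma=4.0)
print(cc(model, human), sim(model, human), kl_div(model, human))
```

Summing one Gaussian per point keeps the sketch dependency-light; a real pipeline would more likely blur a fixation-count map with a library routine and pick `sigma` to match the dataset's conventions, which the abstract does not state.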