Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment and neglect semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under a controlled visual context (using patch-based and marker-based strategies) and transformed into a concise textual description; these descriptions are then aggregated into scanpath-level representations. Semantic similarity is computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and dynamic time warping (DTW). Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures variance that is partially independent of geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
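The contrast between geometric and semantic agreement can be illustrated with a minimal sketch. It is not the framework's implementation: the fixation coordinates and textual descriptions are hypothetical stand-ins for VLM output, token-level Jaccard overlap is assumed here as one simple example of a lexical metric, and the function names `dtw_distance` and `jaccard_similarity` are illustrative.

```python
import math


def dtw_distance(path_a, path_b):
    """Classic dynamic time warping over two sequences of (x, y) fixations,
    using Euclidean distance as the local cost."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning path_a[:i] with path_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def jaccard_similarity(descriptions_a, descriptions_b):
    """Lexical overlap between two scanpath-level texts, each formed by
    concatenating per-fixation descriptions (assumed aggregation scheme)."""
    tokens_a = set(" ".join(descriptions_a).lower().split())
    tokens_b = set(" ".join(descriptions_b).lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical data: two viewers attend to the same content in reverse order
# and at different screen positions.
path_a = [(100, 100), (400, 120), (420, 380)]
path_b = [(700, 500), (150, 450), (120, 90)]
desc_a = ["a dog's face", "a red ball", "green grass"]
desc_b = ["green grass", "a red ball", "a dog's face"]

spatial = dtw_distance(path_a, path_b)        # large: the paths diverge geometrically
lexical = jaccard_similarity(desc_a, desc_b)  # maximal: the same words are used
```

In this toy case the DTW distance is large while the lexical overlap is maximal, which is the kind of high content agreement under spatial divergence that the abstract describes. A bag-of-words measure ignores fixation order, so an embedding-based or order-aware metric would behave differently.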