Modern information querying systems are progressively incorporating multimodal inputs such as vision and audio. However, the integration of gaze, a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables, remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which combines users' gaze, visual field, and voice-based natural language queries to support a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we found ambiguity in users' query language and a gaze-voice coordination pattern in their natural query behaviors with G-VOILA. Based on these quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm that integrates gaze data with the in-situ querying context. We then implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness: it achieved higher objective and subjective scores than a baseline without gaze data. We further conducted interviews and provide insights for future gaze-facilitated information querying systems.