We introduce IRIS (Intent Resolution via Inference-time Saccades), a training-free approach that uses real-time eye-tracking data to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that the fixations closest in time to the onset of participants' spoken questions are the most informative for disambiguation in large VLMs, more than doubling response accuracy on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs and observe consistent improvements when gaze data is incorporated for ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset pairing eye movement data with ambiguous VQA, a real-time interactive protocol, and an evaluation suite.
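To make the inference-time mechanism concrete, the following is a minimal sketch of how fixations nearest the question's speech onset could be selected and injected into a frozen VLM's prompt. The `Fixation` structure, the function names, the choice of k, and the prompt wording are illustrative assumptions for this sketch, not the released protocol or API.

```python
# Minimal sketch of gaze-based disambiguation at inference time.
# All names and the prompt format below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Fixation:
    t: float         # fixation onset time in seconds (assumed representation)
    x: float         # normalized image x-coordinate in [0, 1]
    y: float         # normalized image y-coordinate in [0, 1]
    duration: float  # fixation duration in seconds

def select_onset_fixations(fixations, speech_onset, k=3):
    """Return the k fixations closest in time to the moment the
    participant starts verbalizing the question (hypothetical helper)."""
    return sorted(fixations, key=lambda f: abs(f.t - speech_onset))[:k]

def build_gaze_prompt(question, fixations, speech_onset, k=3):
    """Append selected fixation coordinates to the question text so a
    frozen VLM can resolve referential ambiguity without any training."""
    picked = select_onset_fixations(fixations, speech_onset, k)
    hint = "; ".join(f"({f.x:.2f}, {f.y:.2f})" for f in picked)
    return (f"{question}\n"
            f"The user was looking at these normalized image "
            f"coordinates while asking: {hint}.")
```

Under these assumptions, the augmented prompt (plus the image) would be passed to the VLM unchanged, which is what makes the approach training-free.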