The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Egocentric videos, which directly capture user focus, actions, and context in a unified coordinate frame, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for building more personalized and effective AI assistants in egocentric settings. Project page: https://taiyi98.github.io/projects/EgoGazeVQA