StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Streaming video understanding requires models not only to process frames as they arrive over time, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measures whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals in a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and humans, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into where current models fall short and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.
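
To make the fixation-extraction step of the pipeline concrete, the following is a minimal sketch of one common way to turn a raw gaze trajectory into fixations: a dispersion-threshold (I-DT style) detector. The abstract does not state which fixation algorithm StreamGaze uses, so the choice of I-DT, the function names (`dispersion`, `extract_fixations`), the normalized coordinate convention, and the threshold values are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed sample layout (hypothetical): (timestamp in seconds, x, y),
# with x and y normalized to [0, 1] in frame coordinates.
GazeSample = Tuple[float, float, float]
# Fixation: (start time, end time, centroid x, centroid y)
Fixation = Tuple[float, float, float, float]


def dispersion(window: List[GazeSample]) -> float:
    """Spatial spread of a window: (max x - min x) + (max y - min y)."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def extract_fixations(
    samples: List[GazeSample],
    max_dispersion: float = 0.05,  # illustrative threshold, not from the paper
    min_duration: float = 0.1,     # illustrative threshold, not from the paper
) -> List[Fixation]:
    """Group time-ordered gaze samples into fixations (I-DT style sketch)."""
    fixations: List[Fixation] = []
    n = len(samples)
    start = 0
    while start < n:
        # Grow the initial window until it spans at least min_duration.
        end = start
        while end < n and samples[end][0] - samples[start][0] < min_duration:
            end += 1
        if end >= n:
            break  # remaining samples are too short to form a fixation
        if dispersion(samples[start:end + 1]) <= max_dispersion:
            # Extend the window while the gaze stays spatially compact.
            while end + 1 < n and dispersion(samples[start:end + 2]) <= max_dispersion:
                end += 1
            window = samples[start:end + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append((window[0][0], window[-1][0], cx, cy))
            start = end + 1
        else:
            # Not compact enough: slide the window forward by one sample.
            start += 1
    return fixations
```

Under these assumptions, the time-ordered list of fixation centroids returned by `extract_fixations` could serve as a simple scanpath, and each centroid could mark the region to highlight for region-specific visual prompting. The actual StreamGaze pipeline may differ in each of these steps.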
