Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances vision-language models (VLMs) for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple attention-based VLM architectures. Experimental results show that our approach improves semantic prediction scores by up to 11 points for future event prediction and around 7 points for current activity understanding, compared to corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios such as assistive robotics and human-machine collaboration. Code and additional information are available at: https://github.com/anupampani/Gaze-VLM
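To illustrate the kind of gaze-regularized attention described above, the sketch below shows one plausible formulation: a KL-divergence penalty that pulls the model's attention map toward a human gaze heatmap, added to the task loss during training only. This is a minimal, hypothetical sketch, not the paper's actual implementation; the function name, the choice of KL divergence, and the weighting term `lam` are all assumptions for illustration.

```python
import numpy as np

def gaze_regularization_loss(attn_map, gaze_map, eps=1e-8):
    """Hypothetical gaze-alignment penalty (not the paper's exact loss).

    Both inputs are 2-D non-negative maps over image patches. Each is
    flattened and normalized into a probability distribution, then the
    KL divergence KL(gaze || attention) is returned, so the loss is
    zero when the model attends exactly where the human looked.
    """
    p = gaze_map.flatten() + eps   # human gaze heatmap as a distribution
    p = p / p.sum()
    q = attn_map.flatten() + eps   # model attention as a distribution
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def total_loss(task_loss, attn_map, gaze_map, lam=0.1):
    """Training objective: task loss plus weighted gaze regularizer.

    `lam` is an assumed hyperparameter; at inference time the gaze
    term is simply dropped, so no gaze input is needed.
    """
    return task_loss + lam * gaze_regularization_loss(attn_map, gaze_map)
```

Because the regularizer is a separate additive term on an attention map, it can in principle be attached to any attention-based VLM without architectural changes, which is consistent with the modularity claim in the abstract.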