As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, which anticipates where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene context plays a crucial role in modulating human attention, we present a language-guided, scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver that is guided by a language-based scene description to summarize the egocentric video, producing context-aware video representations. We then introduce two training objectives that 1) encourage the framework to focus on target point-of-interest regions and 2) suppress distractions from irrelevant regions that are less likely to attract first-person attention. Extensive experiments on the Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.
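To make the described components more concrete, the following is a minimal PyTorch sketch, not the authors' implementation: all module names, dimensions, and loss weightings are assumptions. It illustrates a context perceiver that cross-attends from a language-based scene-description embedding to video tokens, plus two illustrative training terms that emphasize point-of-interest regions and suppress responses in irrelevant regions.

```python
# Hypothetical sketch of a language-guided context perceiver and the two
# training objectives described in the abstract. Names and shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextPerceiver(nn.Module):
    """Summarizes video tokens conditioned on a scene-description embedding."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_latents: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.text_proj = nn.Linear(dim, dim)  # projects the language embedding
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, video_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, D) spatio-temporal tokens; text_emb: (B, D)
        b = video_tokens.size(0)
        # Language-guided queries: learnable latents shifted by the scene description.
        queries = self.latents.unsqueeze(0).expand(b, -1, -1) + \
            self.text_proj(text_emb).unsqueeze(1)
        ctx, _ = self.cross_attn(queries, video_tokens, video_tokens)
        return ctx + self.ffn(ctx)  # (B, num_latents, D) context-aware summary


def attention_losses(pred_map: torch.Tensor, gt_map: torch.Tensor,
                     irrelevant_mask: torch.Tensor, margin: float = 0.1):
    """pred_map/gt_map: (B, H, W) saliency in [0, 1]; irrelevant_mask: (B, H, W) binary."""
    # 1) Encourage focus on target point-of-interest regions (weighted BCE).
    focus = F.binary_cross_entropy(pred_map, gt_map, weight=1.0 + gt_map)
    # 2) Suppress predictions inside regions unlikely to attract first-person attention.
    suppress = (F.relu(pred_map - margin) * irrelevant_mask).mean()
    return focus, suppress
```

In this sketch the two loss terms would be combined with the standard saliency objective; how the actual framework weights or formulates them is not specified here.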