Short-term action anticipation (STA) in first-person videos is a challenging task that involves understanding next active object interactions and predicting future actions. Existing action anticipation methods have primarily relied on features extracted from video clips and have often overlooked the importance of objects and their interactions. To this end, we propose a novel approach that applies a guided attention mechanism between the objects and the spatiotemporal features extracted from video clips. This enhances the motion and contextual information, and the object-centric and motion-centric information is then decoded to address STA in egocentric videos. Our method, GANO (Guided Attention for Next active Objects), is a multi-modal, end-to-end, single transformer-based network. Experiments on the largest egocentric dataset demonstrate that GANO outperforms existing state-of-the-art methods in predicting the next active object label, its bounding box location, the corresponding future action, and the time to contact the object. The ablation study shows that the guided attention mechanism contributes positively compared to other fusion methods. Moreover, the next active object location and class label predictions of GANO can be improved further simply by appending the learnable object tokens to the region of interest embeddings.
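As a rough illustration of fusing object features with spatiotemporal video features through attention, the following is a minimal single-head cross-attention sketch in NumPy. It is not the GANO implementation: the attention direction (video tokens as queries, object tokens as keys and values), the residual connection, the random projection matrices, and all names such as `guided_attention`, `video_tokens`, and `object_tokens` are assumptions made for this example only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def guided_attention(video_tokens, object_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, not the paper's layer).

    video_tokens:  (T, d) spatiotemporal features, used as queries (assumed).
    object_tokens: (N, d) object features, used as keys/values (assumed).
    w_q, w_k, w_v: (d, d) projection matrices.
    """
    q = video_tokens @ w_q                       # (T, d)
    k = object_tokens @ w_k                      # (N, d)
    v = object_tokens @ w_v                      # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T, N) scaled dot products
    weights = softmax(scores, axis=-1)           # each row sums to 1
    attended = weights @ v                       # (T, d) object-aware context
    # Residual connection (an assumption): keep the original video
    # features and add the object-guided context on top.
    return video_tokens + attended, weights


# Toy inputs: 8 video tokens and 3 detected objects, 16-dim features.
rng = np.random.default_rng(0)
d = 16
video_tokens = rng.normal(size=(8, d))
object_tokens = rng.normal(size=(3, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = guided_attention(video_tokens, object_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` indicates how strongly one video token attends to each object, so the fused features carry object-centric context alongside the original motion features. In a trained model the projections would be learned, and the fused tokens would be passed on to the decoding stage.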