Short-term action anticipation (STA) in first-person videos is a challenging task that requires understanding next-active-object interactions and predicting future actions. Existing action anticipation methods have focused primarily on features extracted from video clips and have often overlooked the importance of objects and their interactions. To this end, we propose a novel approach that applies a guided attention mechanism between the objects and the spatiotemporal features extracted from video clips. This enhances the motion and contextual information, and the object-centric and motion-centric information is then decoded to address STA in egocentric videos. Our method, GANO (Guided Attention for Next active Objects), is a multi-modal, end-to-end, single transformer-based network. Experiments on the largest egocentric dataset demonstrate that GANO outperforms existing state-of-the-art methods in predicting the next active object label, its bounding box location, the corresponding future action, and the time to contact the object. The ablation study shows that the guided attention mechanism contributes positively compared with other fusion methods. Moreover, GANO's next active object location and class label predictions can be improved further simply by appending the learnable object tokens to the region-of-interest embeddings.
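To make the idea of guided attention between object features and spatiotemporal video features more concrete, the following is a minimal sketch, not the GANO implementation. It assumes that video tokens act as queries and object tokens act as keys and values, that both share the same embedding dimension, and that the fused output is a residual sum; the abstract does not specify any of these choices. Learned projection matrices and multiple heads are omitted for brevity, and the function names `softmax` and `guided_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(video_tokens, object_tokens):
    """Scaled dot-product cross-attention (illustrative assumption).

    video_tokens:  list of query vectors (spatiotemporal clip features)
    object_tokens: list of key/value vectors (object features)
    Returns (fused_tokens, attention_weights).
    """
    d = len(object_tokens[0])
    scale = math.sqrt(d)
    fused, weights = [], []
    for q in video_tokens:
        # Similarity of this video token to every object token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in object_tokens]
        w = softmax(scores)
        # Attention-weighted mix of object features.
        context = [sum(w[j] * object_tokens[j][c]
                       for j in range(len(object_tokens)))
                   for c in range(d)]
        # Residual fusion: video feature enriched with object context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return fused, weights


if __name__ == "__main__":
    # Toy inputs: 3 video tokens and 2 object tokens, dimension 2.
    video = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    objects = [[1.0, 0.0], [0.0, 1.0]]
    fused, weights = guided_attention(video, objects)
    for f, w in zip(fused, weights):
        print(f, w)
```

In this sketch each video token is pulled toward the object tokens it is most similar to, which is one simple way objects could "guide" the clip representation. A real model would typically add learned query, key, and value projections and feed the fused tokens to a transformer decoder.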