Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-video input pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of the interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hand and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our model obtains 33.5 N mAP, 17.25 N+V mAP, 11.77 N+δ mAP, and 6.75 Overall top-5 mAP when trained on the v2 training dataset.
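To make the dual image-video attention idea concrete, below is a minimal sketch of one plausible realization: bidirectional cross-attention in which high-resolution image tokens attend to video tokens and vice versa. This is not the STAformer implementation; the module name, feature dimensions, and residual fusion scheme are illustrative assumptions.

```python
# Minimal sketch of dual image-video attention, NOT the authors' implementation.
# Dimensions, module structure, and the residual fusion below are assumptions.
import torch
import torch.nn as nn


class DualImageVideoAttention(nn.Module):
    """Cross-attends a still-image token sequence and a video token sequence
    in both directions, so each modality is refined by the other."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # image tokens query video tokens, and vice versa
        self.img_to_vid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vid_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_vid = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, vid_tokens: torch.Tensor):
        # img_tokens: (B, N_img, dim) patch tokens from the high-resolution frame
        # vid_tokens: (B, N_vid, dim) spatio-temporal tokens from the video clip
        img_refined, _ = self.img_to_vid(img_tokens, vid_tokens, vid_tokens)
        vid_refined, _ = self.vid_to_img(vid_tokens, img_tokens, img_tokens)
        # residual connections preserve each stream's original information
        img_out = self.norm_img(img_tokens + img_refined)
        vid_out = self.norm_vid(vid_tokens + vid_refined)
        return img_out, vid_out


if __name__ == "__main__":
    # toy shapes: 1 sample, 196 image patches, 784 video tokens, 256-dim features
    dual_attn = DualImageVideoAttention(dim=256, num_heads=8)
    img = torch.randn(1, 196, 256)
    vid = torch.randn(1, 784, 256)
    img_out, vid_out = dual_attn(img, vid)
    print(img_out.shape, vid_out.shape)  # (1, 196, 256), (1, 784, 256)
```

Under this reading, the refined image tokens carry temporal context from the clip while retaining the spatial detail needed to localize next-active objects, which is why both refined streams are returned for downstream fusion.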