Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. We introduce a transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including those that use object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
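The abstract does not spell out how Spatial Cross-Attention is computed, so the following is only a minimal sketch under stated assumptions: standard single-head scaled dot-product cross-attention, with hand tokens as queries and object tokens as keys and values, no learned projections, and a residual addition to form the refined tokens. The function names `cross_attention` and `refine_with_interaction` are hypothetical and are not taken from InAViT.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: list of d-dim vectors (assumed here: hand tokens)
    keys, values: lists of d-dim vectors (assumed here: object tokens)
    Returns one attended d-dim vector per query and the attention weights.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        attn = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(a * v[j] for a, v in zip(attn, values)) for j in range(dim)]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


def refine_with_interaction(hand_tokens, object_tokens):
    """Hypothetical helper: add attended object context to each hand token."""
    attended, weights = cross_attention(hand_tokens, object_tokens, object_tokens)
    refined = [
        [h + a for h, a in zip(hand, att)]
        for hand, att in zip(hand_tokens, attended)
    ]
    return refined, weights


# Toy example: two hand tokens attending over three object tokens.
hand_tokens = [[1.0, 0.0], [0.0, 1.0]]
object_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined, weights = refine_with_interaction(hand_tokens, object_tokens)
```

In this toy setup each hand token puts more attention weight on the object tokens it is most similar to, and the refined token is the original hand token plus that attended object context. A real model would add learned query, key, and value projections and multiple heads.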