Human-object interaction is one of the most important visual cues, and we propose a novel way to represent it for egocentric action anticipation. We introduce a transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including those based on object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
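To make the idea of cross-attention between hand and object tokens concrete, the following is a minimal NumPy sketch of generic single-head scaled dot-product cross-attention. It is not the InAViT implementation: the token counts, the feature dimension, the random projection matrices, the residual update, and all names (`cross_attention`, `hand_tokens`, `object_tokens`, `interaction_tokens`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Generic single-head cross-attention: queries attend to context tokens."""
    q = queries @ w_q                          # (n_queries, dim)
    k = context @ w_k                          # (n_context, dim)
    v = context @ w_v                          # (n_context, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product similarity
    weights = softmax(scores, axis=-1)         # each query row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 2 hand tokens and 4 object tokens in one frame.
rng = np.random.default_rng(0)
dim = 16
hand_tokens = rng.normal(size=(2, dim))
object_tokens = rng.normal(size=(4, dim))

# Random projections stand in for learned weights (assumption).
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Hand tokens query the object tokens; the residual sum is an assumed way
# to refine the hand tokens with the attended object information.
attended, weights = cross_attention(hand_tokens, object_tokens, w_q, w_k, w_v)
interaction_tokens = hand_tokens + attended
```

In this sketch the hands act as queries and the objects as keys and values; swapping the two arguments would give object tokens refined by hand information instead. How InAViT actually forms its queries, keys, and values is specified in the paper body, not in this abstract.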