Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. Our transformer variant models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including those built on object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
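To make the two-stage refinement described in the abstract more concrete, the following is a minimal sketch and not the InAViT implementation. It assumes single-head attention, square projection matrices with random weights, and a residual connection, none of which are specified in the abstract. Plain cross-attention over environment tokens stands in for Trajectory Cross-Attention, whose exact formulation is not given here. The names `cross_attention`, `refine_interaction_tokens`, and `random_params` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends over context tokens.

    Returns the residually updated query tokens and the attention weights.
    """
    q = queries @ w_q                          # (n_q, d)
    k = context @ w_k                          # (n_c, d)
    v = context @ w_v                          # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_q, n_c)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return queries + weights @ v, weights


def refine_interaction_tokens(hands, objects, environment, params):
    """Two-stage refinement (assumed structure, for illustration only).

    Stage 1: hand tokens attend over object tokens (stand-in for SCA).
    Stage 2: the resulting interaction tokens attend over environment tokens
             (plain cross-attention used in place of Trajectory Cross-Attention).
    """
    interaction, spatial_w = cross_attention(hands, objects, *params["spatial"])
    refined, context_w = cross_attention(interaction, environment, *params["context"])
    return refined, spatial_w, context_w


def random_params(dim, rng):
    """Hypothetical random projections; a trained model would learn these."""
    def make():
        return tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    return {"spatial": make(), "context": make()}


rng = np.random.default_rng(0)
dim = 8
hands = rng.normal(size=(2, dim))          # e.g. left and right hand tokens
objects = rng.normal(size=(4, dim))        # e.g. four detected object tokens
environment = rng.normal(size=(16, dim))   # e.g. background patch tokens

refined, spatial_w, context_w = refine_interaction_tokens(
    hands, objects, environment, random_params(dim, rng)
)
print(refined.shape, spatial_w.shape, context_w.shape)
```

In this sketch the refined tokens keep the shape of the hand tokens, so they could be concatenated with ordinary patch tokens before a video transformer encoder; whether InAViT does so in this form is not stated in the abstract.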