Short-Term object-interaction Anticipation (STA) consists in detecting the location of the next active objects, the noun and verb categories of the interaction, and the time to contact, from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, and to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are twofold: 1) we propose STAformer and STAformer++, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-video input pair; 2) we introduce two novel modules to ground STA predictions in human behavior by modeling affordances. First, we integrate an environment affordance model, which acts as a persistent memory of the interactions that can take place in a given physical scene. We explore how to integrate environment affordances both via simple late fusion and with an approach that adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hand and object trajectories, increasing the confidence of STA predictions localized around the hotspots. Our results show significant improvements in Overall Top-5 mAP, with gains of up to +23 p.p. on Ego4D and +31 p.p. on a novel set of curated EPIC-Kitchens STA labels. We release the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.
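As an illustration of the "simple late fusion" baseline mentioned above, the sketch below blends per-class STA scores with environment-affordance priors via a convex combination. The function name, the fixed blending weight `alpha`, and the toy score vectors are all hypothetical choices for exposition; the paper's actual fusion scheme (including the adaptive variant) may differ.

```python
import numpy as np

def late_fuse(sta_scores, affordance_scores, alpha=0.5):
    """Convex combination of end-to-end STA class scores and
    environment-affordance priors, renormalized to sum to 1.

    alpha = 0 keeps only the model's scores; alpha = 1 keeps only
    the affordance prior. (Illustrative only; the adaptive variant
    would learn this weighting instead of fixing it.)
    """
    sta = np.asarray(sta_scores, dtype=float)
    aff = np.asarray(affordance_scores, dtype=float)
    fused = (1.0 - alpha) * sta + alpha * aff
    return fused / fused.sum()

# Toy example with 4 hypothetical noun classes: the affordance prior
# (learned from interactions previously seen in this scene) shifts
# probability mass toward class 1.
sta = [0.5, 0.2, 0.2, 0.1]
aff = [0.1, 0.6, 0.2, 0.1]
fused = late_fuse(sta, aff, alpha=0.5)
print(fused)  # → [0.3 0.4 0.2 0.1]
```

With equal weighting, the prediction flips from class 0 (favored by the video model) to class 1 (favored by the scene's affordance memory), showing how a persistent scene prior can override an uncertain per-clip prediction.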