This paper addresses the problem of anticipating, before any action takes place, the future location of the next-active-object in a given egocentric video clip, i.e., where the contact might happen. The problem is considerably hard, as we aim to estimate the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called ``time to contact'' (TTC) segment. Many methods have been proposed to anticipate a person's action based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object and its future location with respect to the first-person motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+ and Ego4D. We also provide annotations for the first two datasets. Our approach outperforms the relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods under varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.