This paper addresses the problem of anticipating the location of the next-active-object in a given egocentric video clip, i.e., where contact might happen, before any action takes place. The problem is considerably hard, as we aim to estimate the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called ``time to contact'' (TTC) segment. Many methods have been proposed to anticipate a person's action based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object and its future location with respect to the first-person's motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+ and Ego4D. We also provide annotations for the first two datasets. Our approach outperforms the relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods under varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.