Predicting the Next Action by Modeling the Abstract Goal

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Anticipating human actions is inherently uncertain. However, this uncertainty can be reduced if we have a sense of the goal that the actor is trying to achieve. We present an action anticipation model that leverages goal information to reduce the uncertainty in future predictions. Since neither goal information nor the observed actions are available during inference, we rely on visual representations to encapsulate information about both actions and goals. From these, we derive a novel concept called the abstract goal, which is conditioned on observed sequences of visual features for action anticipation. We design the abstract goal as a distribution whose parameters are estimated using a variational recurrent network. We sample multiple candidates for the next action and introduce a goal consistency measure to select the candidate that best follows from the abstract goal. Our method obtains strong results on the challenging Epic-Kitchens55 (EK55), EK100, and EGTEA Gaze+ datasets. On the seen kitchens (S1) set of EK55, we obtain absolute improvements of +13.69, +11.24, and +5.19 in Top-1 verb, Top-1 noun, and Top-1 action anticipation accuracy, respectively, over prior state-of-the-art methods. We also obtain significant improvements on the unseen kitchens (S2) set for Top-1 verb (+10.75), noun (+5.84), and action (+2.87) anticipation. A similar trend is observed on the EGTEA Gaze+ dataset, with absolute improvements of +9.9, +13.1, and +6.8 for noun, verb, and action anticipation, respectively. As of this submission, our method is the state of the art for action anticipation on EK55 and EGTEA Gaze+ (see https://competitions.codalab.org/competitions/20071#results). Code is available at https://github.com/debadityaroy/Abstract_Goal
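The sample-then-select step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the goal consistency measure, so cosine similarity to the goal mean is used here purely as a stand-in, and the diagonal Gaussian goal, the function names (`sample_gaussian`, `select_next_action`), and all parameter values are assumptions made for illustration. In the actual method the distribution parameters would come from a variational recurrent network rather than being passed in by hand.

```python
import math
import random


def sample_gaussian(mean, std, rng):
    """Draw one sample from a diagonal Gaussian (mean + std * noise)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_next_action(goal_mean, goal_std, num_candidates=5, seed=0):
    """Sample candidate next-action vectors and keep the most goal-consistent one.

    ASSUMPTION: the abstract goal is a diagonal Gaussian given by
    (goal_mean, goal_std), and "goal consistency" is approximated by cosine
    similarity to the goal mean. The paper's actual measure may differ.
    """
    rng = random.Random(seed)
    candidates = [
        sample_gaussian(goal_mean, goal_std, rng) for _ in range(num_candidates)
    ]
    scores = [cosine_similarity(c, goal_mean) for c in candidates]
    best_index = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index], scores


if __name__ == "__main__":
    # Hypothetical 3-dimensional goal parameters, chosen only for the demo.
    best, best_score, all_scores = select_next_action(
        goal_mean=[1.0, 0.5, -0.2], goal_std=[0.3, 0.3, 0.3]
    )
    print("scores:", [round(s, 3) for s in all_scores])
    print("selected candidate:", [round(v, 3) for v in best])
```

Sampling several candidates and scoring each against the goal reflects the idea that one draw from an uncertain future may be a poor prediction, while the goal gives a criterion for choosing among the draws.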
