Predicting future human behavior from egocentric videos is a challenging but critical task for understanding human intention. Existing methods for forecasting 2D hand positions rely on visual representations and mainly focus on hand-object interactions. In this paper, we investigate the hand forecasting task and tackle two significant issues that persist in existing methods: (1) 2D hand positions in future frames are severely affected by ego-motion in egocentric videos; (2) prediction based on visual information tends to overfit to background or scene textures, hindering generalization to novel scenes or human behaviors. To address these problems, we propose EMAG, an ego-motion-aware and generalizable 2D hand forecasting method. For the first issue, our method explicitly accounts for ego-motion, represented as a sequence of homography matrices between consecutive frames. For the second, we further leverage modalities such as optical flow, the trajectories of hands and interacting objects, and ego-motion, which are less tied to scene appearance. Extensive experiments on two large-scale egocentric video datasets, Ego4D and EPIC-Kitchens 55, verify the effectiveness of the proposed method. In particular, our model outperforms prior methods by $7.0\%$ in cross-dataset evaluation. Project page: https://masashi-hatano.github.io/EMAG/
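For intuition on the ego-motion representation, the sketch below estimates one homography per consecutive frame pair using OpenCV. This is a minimal illustration, not the paper's exact pipeline: the function name `estimate_homography`, the choice of ORB features, and the RANSAC threshold are all assumptions made for the example.

```python
import cv2
import numpy as np

def estimate_homography(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Estimate the 3x3 homography mapping prev_frame onto curr_frame."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Detect and describe keypoints (ORB is a fast, patent-free choice).
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)

    # Match descriptors with brute-force Hamming matching.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC discards correspondences on independently moving hands and
    # objects, so the fitted homography mainly reflects camera (ego) motion.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

# Ego-motion over a clip: one homography per consecutive frame pair.
# `frames` is a hypothetical list of BGR images from an egocentric video.
# homographies = [estimate_homography(frames[i], frames[i + 1])
#                 for i in range(len(frames) - 1)]
```

A resulting sequence of homographies can then be fed to the forecasting model alongside the other modalities (optical flow and hand/object trajectories), serving as a compact, appearance-free summary of camera motion.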