Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. The existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, 2D hand pose for egocentric action recognition remains under-explored, despite the availability of user-friendly smart glasses on the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. First, we introduce two novel approaches for 2D hand pose estimation: EffHandNet for single-hand estimation, and EffHandEgoNet, tailored to the egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Second, we present a robust action recognition architecture built on 2D hand and object poses, which combines EffHandEgoNet with a transformer-based action recognition module. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that 2D skeletal data is a robust input for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and of each input on overall performance.