Hand trajectory forecasting from egocentric views is crucial for promptly understanding human intentions during interaction with AR/VR systems. However, existing methods address this problem in 2D image space, which is inadequate for 3D real-world applications. In this paper, we set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in 3D space from early observed RGB videos in a first-person view. To this end, we propose an uncertainty-aware state space Transformer (USST) that combines the merits of the attention mechanism and aleatoric uncertainty within the framework of the classical state-space model. The model can be further enhanced by a velocity constraint and visual prompt tuning (VPT) on large vision transformers. Moreover, we develop an annotation workflow to collect high-quality 3D hand trajectories. Experimental results on the H2O and EgoPAT3D datasets demonstrate the superiority of USST for both 2D and 3D trajectory forecasting. The code and datasets are publicly released: https://github.com/Cogito2012/USST.