Existing approaches to hand reconstruction predominantly follow a multi-stage pipeline of detection, left-right classification, and pose estimation. This paradigm incurs redundant computation and accumulates errors across stages. In this work, we propose HandOS, an end-to-end framework for 3D hand reconstruction. Our central idea is to use a frozen detector as the foundation and to attach auxiliary modules for 2D and 3D keypoint estimation. In this way, we integrate pose estimation into the detection framework and remove the need for the left-right category as a prerequisite. Specifically, we propose an interactive 2D-3D decoder, in which 2D joint semantics are derived from detection cues and the 3D representation is lifted from that of the 2D joints. Furthermore, hierarchical attention is designed to enable concurrent modeling of 2D joints, 3D vertices, and camera translation. Consequently, we integrate hand detection, 2D pose estimation, and 3D mesh reconstruction end to end within a one-stage framework, overcoming the drawbacks of multi-stage pipelines. HandOS also reaches state-of-the-art performance on public benchmarks, e.g., 5.0 PA-MPJPE on FreiHand and 64.6% PCK@0.05 on HInt-Ego4D. Project page: idea-research.github.io/HandOSweb.
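For readers unfamiliar with the two metrics quoted above, the following is a minimal NumPy sketch of how PA-MPJPE and PCK are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, the Procrustes step uses the standard SVD-based similarity alignment, and normalizing the PCK threshold by the hand bounding-box size is an assumption about the benchmark protocol.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Align pred (N, 3) to gt (N, 3) with the best similarity transform
    (scale, rotation, translation) in the least-squares sense."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance matrix between the centered point sets.
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """Mean per-joint position error after Procrustes alignment."""
    aligned = procrustes_align(pred, gt)
    return np.linalg.norm(aligned - gt, axis=1).mean()

def pck(pred_2d, gt_2d, box_size, threshold=0.05):
    """Fraction of 2D keypoints within threshold * box_size of the ground truth.
    Normalizing by the hand box size is an assumption of this sketch."""
    dist = np.linalg.norm(pred_2d - gt_2d, axis=1)
    return float((dist <= threshold * box_size).mean())
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE measures only the error in articulated hand shape, whereas PCK is computed directly in image space.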