In this paper, we introduce 4DHands, a robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Our approach addresses two major limitations of previous methods: the lack of a unified solution for handling various hand image inputs, and the neglect of the positional relationship between the two hands within an image. To overcome these challenges, we develop a transformer-based architecture with novel tokenization and feature fusion strategies. Specifically, we propose a Relation-aware Two-Hand Tokenization (RAT) method to embed positional relation information into the hand tokens. In this way, our network can handle both single-hand and two-hand inputs and explicitly leverage relative hand positions, facilitating the reconstruction of intricate hand interactions in real-world scenarios. Because this tokenization encodes the relative relationship of the two hands, it also supports more effective feature fusion. To this end, we further develop a Spatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D with attention and decode them into 3D hand meshes and relative temporal movements. The efficacy of our approach is validated on several benchmark datasets. Results on in-the-wild videos and real-world scenarios demonstrate the superior performance of our approach for interactive hand reconstruction. More video results can be found on the project page: https://4dhands.github.io.
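To make the idea of embedding positional relation information into hand tokens more concrete, the following is a minimal sketch and not the RAT method itself, whose exact formulation the abstract does not specify. It assumes each hand comes with a feature vector and a bounding box, and it appends a hypothetical relation vector (the other hand's offset in units of this hand's box size, plus a log size ratio) and a presence flag, so that single-hand and two-hand inputs share one token layout. All function names and the encoding are illustrative assumptions.

```python
import numpy as np


def box_center_scale(box):
    """Return (center_xy, scale) for a box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    center = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
    scale = max(x1 - x0, y1 - y0)
    return center, scale


def relative_encoding(box_self, box_other):
    """Hypothetical relation vector (an assumption, not the paper's definition):
    offset of the other hand in units of this hand's box size, followed by
    the log ratio of the two box sizes."""
    c_self, s_self = box_center_scale(box_self)
    c_other, s_other = box_center_scale(box_other)
    offset = (c_other - c_self) / s_self
    return np.array([offset[0], offset[1], np.log(s_other / s_self)])


def tokenize_hands(features, boxes):
    """Append a relation vector and a presence flag to each hand feature.

    features: list of 1-D arrays, one per detected hand (length 1 or 2).
    boxes:    list of matching bounding boxes.
    Returns an array of shape (num_hands, feature_dim + 4).
    """
    tokens = []
    for i, feat in enumerate(features):
        if len(features) == 2:
            # Two hands: describe the other hand relative to this one.
            rel = relative_encoding(boxes[i], boxes[1 - i])
            flag = 1.0
        else:
            # Single hand: no partner, so use a zero relation and flag 0.
            rel = np.zeros(3)
            flag = 0.0
        tokens.append(np.concatenate([feat, rel, [flag]]))
    return np.stack(tokens)


# Example: two equally sized hand boxes side by side, with dummy features.
left_box, right_box = (0, 0, 10, 10), (20, 0, 30, 10)
two_hand_tokens = tokenize_hands([np.zeros(4), np.zeros(4)], [left_box, right_box])
one_hand_tokens = tokenize_hands([np.zeros(4)], [left_box])
```

In this sketch the tokens for both input types have the same width, so a downstream attention module could consume either without a separate code path; how 4DHands actually achieves this is described in the paper rather than here.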