Reconstructing interacting hands from a single RGB image is highly challenging. One difficulty is that severe mutual occlusion and the similar local appearance of the two hands confuse visual feature extraction, so the estimated hand meshes end up misaligned with the image. Another is that interacting hands exhibit complex interaction patterns, which greatly enlarges the solution space of hand poses and makes the network harder to train. In this paper, we propose a decoupled iterative refinement framework that achieves pixel-aligned hand reconstruction while efficiently modeling the spatial relationship between the hands. Specifically, we define two feature spaces with different characteristics: a 2D visual feature space and a 3D joint feature space. First, we obtain joint-wise features from the visual feature map and use a graph convolution network and a transformer to perform intra-hand and inter-hand information interaction, respectively, in the 3D joint feature space. Then, we project the joint features, which now carry global information, back into the 2D visual feature space in an obfuscation-free manner and apply 2D convolution for pixel-wise enhancement. By alternating these enhancements between the two feature spaces multiple times, our method achieves accurate and robust reconstruction of interacting hands. Our method outperforms all existing two-hand reconstruction methods by a large margin on the InterHand2.6M dataset and also shows strong generalization to in-the-wild images.
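To make the alternation between the two feature spaces concrete, the following is a minimal pure-Python sketch of one refinement pass. It is an illustration under stated assumptions, not the implementation described above: every function name is hypothetical, the learned graph convolution, transformer, and 2D convolution are replaced by fixed, weight-free stand-ins (neighbor averaging, dot-product attention, and a 3x3 mean filter), joints are sampled at assumed 2D image locations, and the projection back to the feature map uses a simple Gaussian footprint rather than the obfuscation-free projection itself.

```python
import math

def sample_bilinear(fmap, x, y):
    """Bilinearly sample a per-channel feature vector from fmap[c][row][col] at (x, y)."""
    h, w = len(fmap[0]), len(fmap[0][0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * ch[y0][x0] + ax * ch[y0][x1])
        + ay * ((1 - ax) * ch[y1][x0] + ax * ch[y1][x1])
        for ch in fmap
    ]

def intra_hand_mix(feats, edges):
    """Graph-convolution stand-in: average each joint with its skeleton neighbors."""
    nbrs = [[i] for i in range(len(feats))]
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    dim = len(feats[0])
    return [
        [sum(feats[j][c] for j in ns) / len(ns) for c in range(dim)]
        for ns in nbrs
    ]

def inter_hand_attend(query_feats, other_feats):
    """Transformer stand-in: each joint attends to all joints of the other hand."""
    dim = len(query_feats[0])
    out = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in other_feats]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        ctx = [sum(w * k[c] for w, k in zip(weights, other_feats)) / total
               for c in range(dim)]
        out.append([qc + cc for qc, cc in zip(q, ctx)])  # residual connection
    return out

def splat_to_map(fmap, joints_2d, feats, sigma=1.0):
    """Assumed projection: add each joint feature to the map with a Gaussian footprint."""
    h, w = len(fmap[0]), len(fmap[0][0])
    out = [[row[:] for row in ch] for ch in fmap]
    for (jx, jy), f in zip(joints_2d, feats):
        for r in range(h):
            for col in range(w):
                g = math.exp(-((col - jx) ** 2 + (r - jy) ** 2) / (2 * sigma ** 2))
                for c in range(len(fmap)):
                    out[c][r][col] += g * f[c]
    return out

def box_filter(fmap):
    """2D-convolution stand-in: 3x3 mean filter with edge clamping."""
    h, w = len(fmap[0]), len(fmap[0][0])

    def px(ch, r, c):
        return ch[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    return [
        [[sum(px(ch, r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)) / 9.0
          for c in range(w)] for r in range(h)]
        for ch in fmap
    ]

def refine_once(fmap, left_2d, right_2d, edges):
    """One alternation: map -> joint features -> intra/inter-hand mixing -> map."""
    left = [sample_bilinear(fmap, x, y) for x, y in left_2d]
    right = [sample_bilinear(fmap, x, y) for x, y in right_2d]
    left, right = intra_hand_mix(left, edges), intra_hand_mix(right, edges)
    # Both calls see the pre-update features because the tuple is built first.
    left, right = inter_hand_attend(left, right), inter_hand_attend(right, left)
    fmap = splat_to_map(fmap, left_2d, left)
    fmap = splat_to_map(fmap, right_2d, right)
    return box_filter(fmap)
```

Calling `refine_once` several times in a row, each time on the map returned by the previous call, mirrors the multiple alternate enhancements described above. In a trained model, each stand-in would be a learned layer and the joint locations would be re-estimated between passes.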