Existing hand reconstruction methods usually either regress the parameters of a generic 3D hand model or predict hand mesh vertex positions directly. Parametric representations, which consist of hand shape and rotational pose, are more stable, while non-parametric methods can predict more accurate mesh positions. In this paper, we propose to simultaneously reconstruct meshes and estimate MANO parameters of two hands from a single RGB image, in order to combine the merits of both hand representations. To this end, we propose novel Mesh-Mano interaction blocks (MMIBs), which take mesh vertex positions and MANO parameters as two kinds of query tokens. Each MMIB consists of one graph residual block to aggregate local information and two transformer encoders to model long-range dependencies. The two transformer encoders are equipped with different asymmetric attention masks to model intra-hand and inter-hand attention, respectively. Moreover, we introduce a mesh alignment refinement module to further enhance mesh-image alignment. Extensive experiments on the InterHand2.6M benchmark demonstrate promising results compared with state-of-the-art hand reconstruction methods.
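The masked attention described in the abstract can be illustrated with a small NumPy sketch. The abstract does not specify the token layout or the exact form of the asymmetry, so everything below is an assumption made for illustration: the token order (left mesh, left MANO, right mesh, right MANO), the rule that mesh queries do not read MANO keys while MANO queries read both kinds, and the helper names `build_masks` and `masked_attention`. It should not be read as the authors' implementation.

```python
import numpy as np


def build_masks(n_mesh, n_mano):
    """Return (intra, inter) boolean masks of shape (T, T), T = 2 * (n_mesh + n_mano).

    mask[i, j] == True means query token i may attend to key token j.
    Assumed token order: left mesh, left MANO, right mesh, right MANO.
    """
    per_hand = n_mesh + n_mano
    total = 2 * per_hand
    hand_id = np.repeat([0, 1], per_hand)  # 0 = left hand, 1 = right hand
    one_hand = np.concatenate([np.zeros(n_mesh, dtype=bool),
                               np.ones(n_mano, dtype=bool)])
    is_mano = np.tile(one_hand, 2)

    same_hand = hand_id[:, None] == hand_id[None, :]
    # Hypothetical asymmetry: a mesh query may only read mesh keys,
    # while a MANO query may read both mesh and MANO keys.
    allowed_kind = is_mano[:, None] | ~is_mano[None, :]

    intra = same_hand & allowed_kind
    inter = ~same_hand & allowed_kind
    # Let every token also see itself so no softmax row is empty.
    inter |= np.eye(total, dtype=bool)
    return intra, inter


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # blocked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    intra, inter = build_masks(n_mesh=3, n_mano=2)
    tokens = rng.normal(size=(intra.shape[0], 4))
    # Two encoders in sequence: intra-hand first, then inter-hand.
    h = masked_attention(tokens, tokens, tokens, intra)
    h = masked_attention(h, h, h, inter)
    print(h.shape)
```

Because the masks are not symmetric, information flows from mesh tokens to MANO tokens but not the other way within an attention layer. This is only one of several possible ways to make such a mask asymmetric.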