Pyramid Deep Fusion Network for Two-Hand Reconstruction from RGB-D Images

Accurately recovering dense 3D meshes of both hands from monocular images is challenging due to occlusions and projection ambiguity. Most existing methods extract features from color images to estimate root-aligned hand meshes, neglecting the crucial depth and scale information of the real world. Depth-based methods, given noisy sensor measurements with limited resolution, predict 3D keypoints rather than a dense mesh. These limitations motivate us to exploit the two complementary inputs to acquire dense hand meshes at real-world scale. In this work, we propose an end-to-end framework for recovering dense meshes of both hands, which takes single-view RGB-D image pairs as input. The primary challenge lies in effectively utilizing the two input modalities to mitigate the blurring effects in RGB images and the noise in depth images. Instead of directly treating depth maps as additional channels of the RGB images, we encode the depth information into an unordered point cloud to preserve more geometric details. Specifically, our framework employs ResNet50 and PointNet++ to derive features from the RGB image and the point cloud, respectively. Additionally, we introduce a novel pyramid deep fusion network (PDFNet) to aggregate features at different scales, which demonstrates superior efficacy compared to previous fusion strategies. Furthermore, we employ a GCN-based decoder to process the fused features and recover the corresponding 3D pose and dense mesh. Comprehensive ablation experiments demonstrate the effectiveness of the proposed fusion algorithm, and our method outperforms state-of-the-art approaches on publicly available datasets. To reproduce the results, we will make our source code and models publicly available at https://github.com/zijinxuxu/PDFNet.
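To illustrate what encoding a depth map into an unordered point cloud can look like, the following is a minimal sketch and not the authors' implementation. It assumes a standard pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, and assumes that a depth value of zero marks a missing measurement; the function name `depth_to_point_cloud` is hypothetical, and none of these details are stated in the abstract.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Assumes a pinhole camera; fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. These are illustrative parameters,
    not values taken from the paper.
    """
    h, w = depth.shape
    # Pixel coordinate grids: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Assumption: zero depth means the sensor returned no measurement.
    valid = depth > 0
    z = depth[valid]
    # Invert the pinhole projection u = fx * x / z + cx (same for v).
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    # Each row is one 3D point in the units of the depth map.
    return np.stack([x, y, z], axis=1)
```

Because the output is a flat list of metric 3D points rather than an image-aligned channel, a point-based encoder such as PointNet++ can consume it directly, typically after subsampling to a fixed number of points.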
