Reconstructing hand-held objects from a single RGB image is a challenging task in computer vision. In contrast to prior works that rely on deterministic modeling paradigms, we employ a point cloud denoising diffusion model to account for the probabilistic nature of this problem. At its core, we introduce a centroid-fixed dual-stream conditional diffusion model for monocular hand-held object reconstruction (D-SCo), which tackles two predominant challenges. First, to prevent the object centroid from drifting, we employ a novel hand-constrained centroid fixing paradigm that enhances the stability of the forward and reverse diffusion processes as well as the precision of feature projection. Second, we introduce a dual-stream denoiser that models hand-object interactions both semantically and geometrically via a novel unified hand-object semantic embedding, improving reconstruction of object regions occluded by the hand. Experiments on the synthetic ObMan dataset and three real-world datasets, HO3D, MOW, and DexYCB, demonstrate that our approach surpasses all other state-of-the-art methods. Code will be released.