Our work aims to reconstruct hand-object interactions from a single-view image, a fundamental but ill-posed task. Unlike methods that reconstruct from videos, multi-view images, or predefined 3D templates, single-view reconstruction faces significant challenges from inherent depth ambiguity and occlusion. These challenges are further amplified by the diversity of hand poses and the vast variety of object shapes and sizes. Our key insight is that current foundation models for segmentation, inpainting, and 3D reconstruction generalize robustly to in-the-wild images and can provide strong visual and geometric priors for reconstructing hand-object interactions. Specifically, given a single image, we first design a novel pipeline that estimates the underlying hand pose and object shape using off-the-shelf large models. Then, starting from this initial reconstruction, we apply a prior-guided optimization scheme that refines the hand pose to comply with 3D physical constraints and the 2D input image content. Experiments across several datasets show that our method consistently outperforms baselines and faithfully reconstructs a diverse set of hand-object interactions. Project page: https://lym29.github.io/EasyHOI-page/
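The prior-guided optimization can be illustrated with a minimal toy sketch. This is not the paper's actual objective: it treats hand keypoints as free 3D points, uses a 2D term that pulls their orthographic projections toward detected image keypoints, and a 3D term that penalizes penetration into a spherical object proxy. All function names, the sphere proxy, and the loss weights here are illustrative assumptions.

```python
import numpy as np

def loss(points, targets2d, center, radius):
    """Toy objective: 2D reprojection term + 3D penetration term (illustrative only)."""
    l2d = np.sum((points[:, :2] - targets2d) ** 2)   # orthographic projection vs. 2D keypoints
    d = np.linalg.norm(points - center, axis=1)
    pen = np.maximum(radius - d, 0.0)                # how far each point sits inside the sphere
    l3d = np.sum(pen ** 2)                           # quadratic penetration penalty
    return l2d + l3d

def gradient(points, targets2d, center, radius):
    """Analytic gradient of the toy objective w.r.t. the 3D points."""
    g = np.zeros_like(points)
    g[:, :2] += 2.0 * (points[:, :2] - targets2d)    # d(l2d)/dp acts on x, y only
    diff = points - center
    d = np.maximum(np.linalg.norm(diff, axis=1), 1e-8)
    pen = np.maximum(radius - d, 0.0)
    inside = pen > 0
    # d(l3d)/dp = -2 (r - d) (p - c) / d for penetrating points: pushes them outward
    g[inside] += (-2.0 * pen[inside] / d[inside])[:, None] * diff[inside]
    return g

def optimize(points, targets2d, center, radius, lr=0.1, steps=200):
    """Plain gradient descent standing in for the prior-guided refinement."""
    p = points.copy()
    for _ in range(steps):
        p -= lr * gradient(p, targets2d, center, radius)
    return p

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 3))        # toy "hand keypoints" (stand-in for a hand model's joints)
tgt = rng.normal(size=(5, 2))        # toy 2D detections from the input image
c, r = np.zeros(3), 0.5              # spherical stand-in for the reconstructed object
before = loss(pts, tgt, c, r)
after = loss(optimize(pts, tgt, c, r), tgt, c, r)
```

In the actual method the free variables would be hand model parameters rather than raw points, and the penetration term would be evaluated against the reconstructed object mesh; the sketch only shows how the 2D image-alignment and 3D physical terms trade off in one descent loop.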