We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from dynamic egocentric videos, with support for open-vocabulary objects. Accurate W-HOI reconstruction is critical for embodied intelligence yet remains challenging. Existing HOI methods are largely restricted to local camera coordinates or single frames and fail to capture global temporal dynamics. While some recent approaches attempt world-space hand estimation, they overlook object poses and HOI constraints. Moreover, previous HOI estimation methods either cannot handle open-set categories because they rely on object templates, or employ differentiable rendering that requires per-instance optimization, incurring prohibitive computational costs. Finally, frequent occlusions in egocentric videos severely degrade performance. To overcome these challenges, we propose a multi-stage framework: (i) a robust pre-processing pipeline that leverages vision foundation models for initial 3D scene, hand, and object reconstruction; (ii) a body-guided diffusion model that incorporates explicit egocentric body priors for hand pose estimation; and (iii) an HOI-prior-informed diffusion model for hand-aware 6DoF object pose infilling, ensuring physically plausible and temporally consistent W-HOI estimation. Experiments demonstrate that EgoGrasp achieves state-of-the-art performance in W-HOI reconstruction and robustly handles multiple, open-vocabulary objects.
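To make the three-stage structure concrete, the following is a minimal Python sketch of the pipeline as described above. It is not the paper's actual implementation: every name here (`preprocess`, `body_guided_hand_diffusion`, `hoi_prior_pose_infilling`, `FrameEstimate`, `egograsp`) is a hypothetical placeholder, and each stage is reduced to a stub that only documents the assumed data flow between stages.

```python
# Hypothetical sketch of the three-stage EgoGrasp pipeline described above.
# All class and function names are illustrative placeholders; the abstract
# does not specify an API. Each stage is a stub documenting inputs/outputs.

from dataclasses import dataclass
from typing import List


@dataclass
class FrameEstimate:
    """Per-frame outputs of the pre-processing stage (stage i)."""
    scene_points: list       # coarse 3D scene reconstruction
    hand_pose_cam: list      # initial hand pose in camera coordinates
    object_pose_cam: list    # initial 6DoF object poses (may be missing
                             # in frames where the object is occluded)


def preprocess(video_frames: List) -> List[FrameEstimate]:
    """Stage (i): run vision foundation models per frame for initial
    scene, hand, and object reconstruction. Placeholder implementation."""
    return [FrameEstimate([], [], []) for _ in video_frames]


def body_guided_hand_diffusion(estimates: List[FrameEstimate]) -> List:
    """Stage (ii): refine world-space hand poses with a diffusion model
    conditioned on explicit egocentric body priors. Placeholder."""
    return [e.hand_pose_cam for e in estimates]


def hoi_prior_pose_infilling(estimates: List[FrameEstimate],
                             hand_poses: List) -> List:
    """Stage (iii): hand-aware 6DoF object pose infilling with an
    HOI-prior-informed diffusion model, filling occluded frames so the
    object trajectory stays physically plausible and temporally
    consistent with the hands. Placeholder."""
    return [e.object_pose_cam for e in estimates]


def egograsp(video_frames: List):
    """End-to-end W-HOI reconstruction: stages (i)-(iii) in sequence."""
    estimates = preprocess(video_frames)
    hands_world = body_guided_hand_diffusion(estimates)
    objects_world = hoi_prior_pose_infilling(estimates, hands_world)
    return hands_world, objects_world


if __name__ == "__main__":
    # Dummy 8-frame clip; real inputs would be egocentric video frames.
    hands, objects = egograsp(video_frames=[None] * 8)
```

The staged decomposition mirrors the dependency order stated in the abstract: object pose infilling (iii) consumes the refined world-space hand poses from (ii), which in turn depend on the per-frame initializations from (i).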