We propose G-HOP, a denoising-diffusion-based generative prior for hand-object interactions that jointly models a 3D object and a human hand, conditioned on the object category. To learn a 3D spatial diffusion model that can capture this joint distribution, we represent the human hand via a skeletal distance field, which yields a representation aligned with the (latent) signed distance field for the object. We show that this hand-object prior can then serve as generic guidance for other tasks, such as reconstruction from interaction clips and human grasp synthesis. We believe that our model, trained by aggregating seven diverse real-world interaction datasets spanning 155 categories, represents a first approach that allows jointly generating both hand and object. Our empirical evaluations demonstrate the benefit of this joint prior in video-based reconstruction and human grasp synthesis, where it outperforms current task-specific baselines. Project website: https://judyye.github.io/ghop-www
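To make the idea of a skeletal distance field concrete, the following is a minimal sketch, not the authors' implementation. It assumes the field is a multi-channel voxel grid in which channel j stores the Euclidean distance from each grid point to hand joint j. The function name `skeletal_distance_field`, the joint count (21), the grid resolution (16), and the grid extent are all hypothetical choices made for illustration. Under these assumptions, the hand is expressed on the same kind of regular 3D grid as an object signed distance field, which is what makes the two representations spatially aligned.

```python
import numpy as np

def skeletal_distance_field(joints, resolution=16, extent=1.0):
    """Return a (J, R, R, R) grid of distances to each of J joints.

    joints: (J, 3) array of 3D joint positions (hypothetical input format).
    resolution, extent: illustrative grid settings, not values from the paper.
    """
    joints = np.asarray(joints, dtype=float)
    # Regular grid over the cube [-extent, extent]^3.
    axis = np.linspace(-extent, extent, resolution)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1)                # (R, R, R, 3)
    # Broadcast so that every joint is compared with every grid point.
    diff = grid[None] - joints[:, None, None, None, :]    # (J, R, R, R, 3)
    return np.linalg.norm(diff, axis=-1)                  # (J, R, R, R)

# Usage with random stand-in joint positions (assumed 21 joints).
rng = np.random.default_rng(0)
joints = rng.uniform(-0.5, 0.5, size=(21, 3))
field = skeletal_distance_field(joints)
print(field.shape)
```

In such a setup, the per-joint channels could be stacked with the object's distance-field channels along the first axis, so that a single 3D network sees hand and object on one grid. Whether G-HOP uses plain or squared distances, and how many channels it uses, is not stated in the abstract.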