We propose Hand-Object \emph{(HO)GraspFlow}, an affordance-centric approach that retargets a single RGB image with hand-object interaction (HOI) into multi-modal, executable parallel-jaw grasps without explicit geometric priors on the target objects. Building on foundation models for hand reconstruction and vision, we synthesize $SE(3)$ grasp poses with denoising flow matching (FM), conditioned on three complementary cues: RGB foundation features as visual semantics, HOI contact reconstruction, and a taxonomy-aware prior on grasp types. Our approach demonstrates high fidelity in grasp synthesis without explicit HOI contact input or object geometry, while maintaining strong contact and taxonomy recognition. A controlled comparison further shows that \emph{HOGraspFlow} consistently outperforms diffusion-based variants (\emph{HOGraspDiff}), achieving higher distributional fidelity and more stable optimization in $SE(3)$. In real-world experiments, we demonstrate reliable, object-agnostic grasp synthesis from human demonstrations, with an average success rate above $83\%$. Code: https://github.com/YitianShi/HOGraspFlow
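To illustrate the denoising flow matching (FM) component mentioned above, the following is a minimal, hypothetical sketch of conditional FM training and sampling. It operates on $\mathbb{R}^3$ translations only for simplicity; the paper's method synthesizes full $SE(3)$ poses conditioned on RGB foundation features, HOI contact reconstruction, and a grasp-taxonomy prior, none of which are modeled here. All names (`VelocityNet`, `fm_loss`, `sample`) are illustrative, not from the released code.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field v(x_t, t, c) for a conditioning vector c.
    Stand-in for the paper's conditioned denoiser (illustrative only)."""
    def __init__(self, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + cond_dim, 64), nn.SiLU(),
            nn.Linear(64, 3),
        )

    def forward(self, x, t, c):
        return self.net(torch.cat([x, t, c], dim=-1))

def fm_loss(model, x1, c):
    """Flow matching loss with a linear interpolant: regress the
    constant straight-line velocity x1 - x0 at a random time t."""
    x0 = torch.randn_like(x1)           # noise sample
    t = torch.rand(x1.shape[0], 1)      # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1          # point on the linear path
    v_target = x1 - x0                  # target velocity along that path
    return ((model(xt, t, c) - v_target) ** 2).mean()

@torch.no_grad()
def sample(model, c, steps=20):
    """Euler integration of the learned ODE from noise toward data."""
    x = torch.randn(c.shape[0], 3)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((c.shape[0], 1), i * dt)
        x = x + dt * model(x, t, c)
    return x
```

Extending this sketch to $SE(3)$, as the paper does, would require handling rotations on the group manifold (e.g. velocities in the Lie algebra) rather than the plain Euclidean update used here.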