Dexterous grasp synthesis remains a central challenge: the high dimensionality and kinematic diversity of multi-fingered hands prevent the direct transfer of algorithms developed for parallel-jaw grippers. Existing approaches typically depend on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials, which hinders scalability as new dexterous hand designs emerge. We propose a data-efficient framework that bypasses robot grasp data collection by exploiting the rich, object-centric semantic priors latent in pretrained generative diffusion models. Temporally aligned, fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. A kinematics-aware retargeting module then maps these affordance representations to diverse dexterous hands without per-hand retraining. The resulting system produces stable, functionally appropriate multi-contact grasps that succeed reliably on common objects and tools, and it generalizes strongly to previously unseen object instances within a category, to pose variations, and to multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline that leverages vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.