This paper explores a novel task, "Dexterous Grasp as You Say" (DexGYS), which enables robots to perform dexterous grasping based on human commands expressed in natural language. Progress on this task is hindered by the lack of datasets with natural human guidance; we therefore propose a language-guided dexterous grasp dataset, DexGYSNet, which offers high-quality dexterous grasp annotations along with flexible and fine-grained human language guidance. The dataset construction is cost-efficient, relying on a carefully designed hand-object interaction retargeting strategy and an LLM-assisted language guidance annotation system. Equipped with this dataset, we introduce the DexGYSGrasp framework for generating dexterous grasps from human language instructions, capable of producing grasps that are intent-aligned, high-quality, and diverse. To achieve this capability, our framework decomposes the complex learning process into two manageable progressive objectives and introduces two components to realize them. The first component learns the grasp distribution, focusing on intention alignment and generation diversity. The second component refines grasp quality while maintaining intention consistency. Extensive experiments are conducted on DexGYSNet and in real-world environments for validation.