To perform household tasks, assistive robots receive commands in the form of user language instructions for tool manipulation. The initial stage involves selecting the intended tool (i.e., object grounding) and grasping it in a task-oriented manner (i.e., task grounding). However, prior research on visual-language grasping (VLG) focuses on object grounding while disregarding the fine-grained impact of tasks on object grasping. Task-incompatible grasping of a tool will inevitably limit the success of subsequent manipulation steps. Motivated by this problem, this paper proposes GraspCLIP, which addresses the challenge of task grounding in addition to object grounding to enable task-oriented grasp prediction with visual-language inputs. Evaluation on a custom dataset demonstrates that GraspCLIP outperforms established baselines that perform object grounding only. The effectiveness of the proposed method is further validated on an assistive robotic arm platform for grasping previously unseen kitchen tools given the task specification. Our presentation video is available at: https://www.youtube.com/watch?v=e1wfYQPeAXU.