Vision-based deformable object manipulation is a challenging problem in robotics: the robot must infer, from visual observations alone, a sequence of manipulation actions that leads to a desired state. Most previous works address this problem in a goal-conditioned way and use a goal image to specify the task, which is neither practical nor efficient. We instead adopt natural-language task specification and propose a language-conditioned framework for learning deformable object manipulation policies. We first design a unified Transformer-based architecture that processes multi-modal data and outputs picking and placing actions. We then introduce a visible connectivity graph to handle the nonlinear dynamics and complex configurations of the deformable object during manipulation. Both simulated and real-world experiments demonstrate that the proposed method is general and effective for language-conditioned deformable object manipulation policy learning. In simulation, our method achieves much higher success rates than the state-of-the-art method across various language-conditioned deformable object manipulation tasks (87.3% on average). Our method is also much lighter, with a 75.6% shorter inference time than state-of-the-art methods. We also demonstrate that our method performs well in real-world applications. Supplementary videos can be found at https://sites.google.com/view/language-deformable.
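The abstract names a visible connectivity graph without describing how it is built. As a rough illustration only, the sketch below assumes the graph's nodes are the points of the object visible to the camera and that two nodes are connected when they lie within a fixed radius of each other. The function name `build_visible_graph`, the `radius` parameter, and the distance-threshold rule are all hypothetical and are not taken from the paper.

```python
import numpy as np


def build_visible_graph(points, radius):
    """Connect every pair of distinct visible points closer than `radius`.

    Assumption (not from the paper): edges come from a plain distance
    threshold over the visible points. Returns an array of shape
    (2, num_edges) holding directed (source, target) index pairs.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances via broadcasting: shape (N, N).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep close pairs, excluding self-loops on the diagonal.
    mask = (dists < radius) & ~np.eye(len(points), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst], axis=0)


# Three hypothetical visible points: two close together, one far away.
visible_points = np.array([
    [0.00, 0.0, 0.0],
    [0.05, 0.0, 0.0],
    [0.50, 0.5, 0.0],
])
edges = build_visible_graph(visible_points, radius=0.1)
```

In this toy input only the first two points end up connected, and the edge appears in both directions. A graph of this kind could serve as one input to a policy network, but how the proposed method actually constructs and uses its graph is not stated in the abstract.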