Generalizable articulated object manipulation is essential for home-assistant robots. Recent efforts focus on imitation learning from demonstrations or reinforcement learning in simulation; however, due to the prohibitive costs of real-world data collection and precise object simulation, it remains challenging for these approaches to achieve broad adaptability across diverse articulated objects. Recently, many works have tried to exploit the strong in-context learning ability of Large Language Models (LLMs) to achieve generalizable robotic manipulation, but most of these studies focus on high-level task planning and sideline low-level robotic control. In this work, building on the idea that the kinematic structure of an object determines how we can manipulate it, we propose a kinematic-aware prompting framework that prompts LLMs with kinematic knowledge of objects to generate low-level motion trajectory waypoints, supporting the manipulation of various objects. To effectively prompt LLMs with the kinematic structure of different objects, we design a unified kinematic knowledge parser, which represents various articulated objects as a unified textual description containing kinematic joints and contact location. Building upon this unified description, we propose a kinematic-aware planner model that generates precise 3D manipulation waypoints via a designed kinematic-aware chain-of-thought prompting method. Our evaluation spans 48 instances across 16 distinct categories and shows that our framework not only outperforms traditional methods on 8 seen categories but also exhibits strong zero-shot capability on 8 unseen articulated object categories. Moreover, real-world experiments on 7 different object categories demonstrate our framework's adaptability in practical scenarios. Code is released at \href{https://github.com/GeWu-Lab/LLM_articulated_object_manipulation/tree/main}{here}.
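To make the idea of a unified textual description more concrete, the following is a minimal sketch of how an articulated object's joints and contact location might be rendered as text and wrapped into a prompt. It is an illustration under stated assumptions, not the released implementation: the names `Joint`, `ArticulatedObject`, `describe`, and `build_prompt`, the field layout, the text format, and the example drawer values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Joint:
    """Hypothetical record for one kinematic joint."""
    name: str
    joint_type: str              # e.g. "revolute" or "prismatic"
    axis: Vec3                   # joint axis direction
    origin: Vec3                 # a point on the joint axis
    limits: Tuple[float, float]  # lower and upper motion limits


@dataclass
class ArticulatedObject:
    """Hypothetical container: category, contact location, joints."""
    category: str
    contact_location: Vec3
    joints: List[Joint] = field(default_factory=list)


def _fmt(v: Vec3) -> str:
    # Format a 3D vector with two decimals, e.g. (0.40, 0.00, 0.30).
    return "(" + ", ".join(f"{x:.2f}" for x in v) + ")"


def describe(obj: ArticulatedObject) -> str:
    """Render the object as plain text (assumed format, for illustration)."""
    lines = [f"Object: {obj.category}"]
    for j in obj.joints:
        lo, hi = j.limits
        lines.append(
            f"Joint {j.name}: type={j.joint_type}, axis={_fmt(j.axis)}, "
            f"origin={_fmt(j.origin)}, limits=[{lo:.2f}, {hi:.2f}]"
        )
    lines.append(f"Contact location: {_fmt(obj.contact_location)}")
    return "\n".join(lines)


def build_prompt(obj: ArticulatedObject, instruction: str) -> str:
    """Combine the description with a task and a step-by-step request."""
    return (
        f"{describe(obj)}\n"
        f"Task: {instruction}\n"
        "Reason step by step about the joint type and axis, "
        "then list 3D waypoints as (x, y, z)."
    )


# Made-up example values for a drawer with one prismatic joint.
drawer = ArticulatedObject(
    category="drawer",
    contact_location=(0.40, 0.00, 0.30),
    joints=[Joint("drawer_slide", "prismatic",
                  (1.0, 0.0, 0.0), (0.0, 0.0, 0.3), (0.0, 0.25))],
)
prompt = build_prompt(drawer, "open the drawer")
```

The resulting `prompt` string would then be sent to an LLM of choice; parsing the returned waypoints and executing them on a robot are outside the scope of this sketch.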