Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities. Project Page: https://owensun2004.github.io/Furniture-Assembly-Web/