SAGE: Bridging Semantic and Actionable Parts for GEneralizable Articulated-Object Manipulation under Language Instructions

Generalizable manipulation of articulated objects remains a challenging problem in many real-world scenarios, given the diverse object structures, functionalities, and goals. In these tasks, both semantic interpretation and physical plausibility are crucial for a policy to succeed. To address this problem, we propose SAGE, a novel framework that bridges the understanding of semantic and actionable parts of articulated objects to achieve generalizable manipulation under language instructions. Given a manipulation goal specified in natural language, an instruction interpreter built on Large Language Models (LLMs) first translates it into programmatic actions on the object's semantic parts. This process also involves a scene context parser for understanding the visual inputs, which is designed to generate scene descriptions with both rich information and accurate interaction-related facts by combining generalist Visual-Language Models (VLMs) with domain-specialist part perception models. To further convert the action programs into executable policies, a part grounding module then maps the object semantic parts suggested by the instruction interpreter into so-called Generalizable Actionable Parts (GAParts). Finally, an interactive feedback module is incorporated to respond to failures, which greatly increases the robustness of the overall framework. Experiments both in simulation environments and on real robots show that our framework can handle a large variety of articulated objects with diverse language-instructed goals. We also provide a new benchmark for language-guided articulated-object manipulation in realistic scenarios.
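The flow described above (instruction, then action program on semantic parts, then grounding to actionable parts, then execution with feedback) can be illustrated with a toy sketch. This is not the SAGE implementation: the keyword-based `interpret` function stands in for the LLM interpreter, the `SEMANTIC_TO_ACTIONABLE` table and its labels are illustrative placeholders for part grounding, and the `execute` callback and `max_retries` parameter are hypothetical stand-ins for the robot controller and the interactive feedback module.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ActionStep:
    """One step of an action program, expressed on a semantic part."""
    semantic_part: str  # e.g. "drawer"
    action: str         # e.g. "pull"


# Hypothetical lookup table standing in for the part grounding module.
# The labels are placeholders, not the paper's actual GAPart taxonomy.
SEMANTIC_TO_ACTIONABLE: Dict[str, str] = {
    "drawer": "slider_drawer",
    "door": "hinge_door",
}


def interpret(instruction: str) -> List[ActionStep]:
    """Toy stand-in for the LLM interpreter: simple keyword rules."""
    text = instruction.lower()
    action = "pull" if "open" in text else "push"
    return [ActionStep(part, action)
            for part in SEMANTIC_TO_ACTIONABLE if part in text]


def ground(step: ActionStep) -> Optional[str]:
    """Map a semantic part to an actionable part label, if known."""
    return SEMANTIC_TO_ACTIONABLE.get(step.semantic_part)


def run(instruction: str,
        execute: Callable[[str, str, int], bool],
        max_retries: int = 2) -> bool:
    """Interpret, ground, and execute each step, retrying on failure."""
    steps = interpret(instruction)
    if not steps:
        return False  # nothing in the instruction could be interpreted
    for step in steps:
        part = ground(step)
        if part is None:
            return False  # no actionable part for this semantic part
        for attempt in range(max_retries + 1):
            if execute(part, step.action, attempt):
                break  # step succeeded, move on to the next one
        else:
            return False  # every attempt failed
    return True
```

Calling `run("Open the drawer", execute)` with a controller callback that fails on its first attempt and succeeds on its second would trigger one retry before returning `True`. In the actual framework, the retry presumably draws on fresh perception rather than a bare attempt counter.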
