To interact with daily-life articulated objects of diverse structures and functionalities, understanding object parts plays a central role in both comprehending user instructions and executing tasks. However, the possible discordance between the semantic meanings and physical functionalities of parts poses a challenge for designing a general system. To address this problem, we propose SAGE, a novel framework that bridges the semantic and actionable parts of articulated objects to achieve generalizable manipulation under natural language instructions. More concretely, given an articulated object, we first observe all of its semantic parts; conditioned on these, an instruction interpreter proposes possible action programs that concretize the natural language instruction. A part-grounding module then maps the semantic parts to so-called Generalizable Actionable Parts (GAParts), which inherently carry part-motion information. End-effector trajectories are predicted on the GAParts and, together with the action program, form an executable policy. Additionally, an interactive feedback module responds to failures, closing the loop and increasing the robustness of the overall framework. Key to the success of our framework is the joint proposal and knowledge fusion between a large vision-language model (VLM) and a small domain-specific model for both context comprehension and part perception, with the former providing general intuitions and the latter serving as expert facts. Both simulation and real-robot experiments demonstrate the effectiveness of our framework in handling a large variety of articulated objects under diverse language-instructed goals.
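The pipeline described above (semantic-part observation, instruction interpretation, GAPart grounding, policy assembly) can be sketched schematically as follows. This is a minimal illustrative sketch, not the authors' actual implementation: all class names, function names, and the toy motion-prior table are hypothetical assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SemanticPart:
    name: str          # semantic label observed on the object, e.g. "drawer"

@dataclass
class GAPart:
    name: str
    joint_type: str    # motion information carried by the GAPart, e.g. "prismatic"

def interpret_instruction(instruction: str, parts: List[SemanticPart]) -> List[str]:
    # Toy stand-in for the VLM-based instruction interpreter: concretize the
    # natural language instruction into an action program over observed parts.
    return [f"grasp({p.name})" for p in parts if p.name in instruction] + ["actuate()"]

def ground_parts(parts: List[SemanticPart]) -> List[GAPart]:
    # Toy stand-in for the part-grounding module: map semantic parts to
    # GAParts via a (hypothetical) table of part-motion priors.
    motion_prior = {"handle": "revolute", "drawer": "prismatic"}
    return [GAPart(p.name, motion_prior.get(p.name, "fixed")) for p in parts]

def assemble_policy(instruction: str, observed: List[SemanticPart]) -> dict:
    program = interpret_instruction(instruction, observed)
    gaparts = ground_parts(observed)
    # In the full framework, end-effector trajectories would be predicted on
    # the GAParts; here we only return the assembled policy components.
    return {"program": program, "gaparts": gaparts}

policy = assemble_policy("open the drawer",
                         [SemanticPart("drawer"), SemanticPart("lid")])
```

In this toy run, only the part mentioned in the instruction ("drawer") enters the action program, and its GAPart carries a prismatic-motion prior, illustrating how semantic grounding and motion information combine into an executable policy.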