Object rearrangement is pivotal in robot-environment interaction and represents a key capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that follows a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot is lightweight, real-time, and user-controllable, seamlessly blending commonsense knowledge with automatic generation capabilities. SG-Bot tackles the task through a three-stage procedure: observation, imagination, and execution. During observation, objects are discerned and extracted from the cluttered initial scene. These objects are then coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. This scene graph subsequently informs a generative model, which produces a fine-grained goal scene conditioned on shape information from the initial scene and object semantics. Finally, during execution, the initial scene and the imagined goal scene are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
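The three-stage procedure above can be sketched at a high level as follows. This is a minimal illustrative mock-up, not the authors' actual implementation: all function names, the scene-graph encoding, and the placement logic are hypothetical stand-ins for the perception, generative, and matching components described in the abstract.

```python
# Hypothetical sketch of SG-Bot's observation -> imagination -> execution
# pipeline. Every name here is an illustrative placeholder, not the real API.

def observe(scene):
    """Observation: extract the set of objects from the cluttered initial scene."""
    return scene["objects"]

def imagine(objects, rule="commonsense"):
    """Imagination: coarsely arrange objects into a scene graph (object,
    guiding-rule pairs), then refine it into a fine-grained goal scene.
    A real system would query a generative model here; we stand in with
    deterministic placement slots ordered alphabetically."""
    scene_graph = [(obj, rule) for obj in objects]            # coarse goal
    goal_scene = {obj: slot for slot, (obj, _) in enumerate(sorted(scene_graph))}
    return goal_scene                                         # fine-grained goal

def execute(objects, goal_scene):
    """Execution: match initial objects to goal placements, yielding a
    pick-and-place action list (object, target slot)."""
    return [(obj, goal_scene[obj]) for obj in objects if obj in goal_scene]

initial = {"objects": ["mug", "plate", "fork"]}
actions = execute(initial["objects"], imagine(observe(initial)))
```

Here `actions` pairs each observed object with its imagined goal slot, mimicking how the matched initial and goal scenes would drive robot action policies.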