Object rearrangement is pivotal in robot-environment interaction and is a key capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that follows a coarse-to-fine scheme and uses a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot is lightweight, real-time, and user-controllable, combining commonsense knowledge with automatic generation capabilities. SG-Bot addresses the task in three stages: observation, imagination, and execution. During observation, objects are detected and extracted from a cluttered scene. These objects are then coarsely organized into a scene graph, guided by either commonsense or user-defined criteria. Next, the scene graph conditions a generative model, which produces a fine-grained goal scene from the shape information of the initial scene and the object semantics. Finally, during execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
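To make the observation, imagination, and execution stages concrete, the following is a minimal toy sketch of the data flow only, not the SG-Bot implementation. Every name here (`SceneGraph`, `OFFSETS`, `observe`, `build_goal_graph`, `imagine`, `plan`) is hypothetical, objects are reduced to 2D points, and a fixed relation-to-offset table stands in for the learned generative model, which in the paper also uses shape information.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Nodes are object names; edges are (subject, relation, object) triples.
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)


# Hypothetical relation -> (dx, dy) table standing in for the generative model.
OFFSETS = {"left of": (-0.1, 0.0), "right of": (0.1, 0.0)}


def observe(detections):
    """Observation: map each detected object name to its current (x, y)."""
    return dict(detections)


def build_goal_graph(objects, rules):
    """Coarse stage: keep only the rules whose objects are all in the scene."""
    present = set(objects)
    edges = [(s, r, o) for (s, r, o) in rules if s in present and o in present]
    return SceneGraph(nodes=sorted(present), edges=edges)


def imagine(graph, initial):
    """Fine stage (toy): place each subject relative to its anchor object.

    Edges are applied in order, so an anchor should be placed before
    any edge that refers to it.
    """
    goal = dict(initial)
    for subj, rel, obj in graph.edges:
        dx, dy = OFFSETS[rel]
        ax, ay = goal[obj]
        goal[subj] = (round(ax + dx, 3), round(ay + dy, 3))
    return goal


def plan(initial, goal):
    """Execution: match objects by name and emit a move per displaced object."""
    return [(name, initial[name], goal[name])
            for name in initial if initial[name] != goal[name]]


if __name__ == "__main__":
    initial = observe([("plate", (0.5, 0.5)), ("fork", (0.2, 0.9)),
                       ("knife", (0.8, 0.1))])
    # Commonsense or user-defined rules; the spoon rule is dropped because
    # no spoon was observed.
    rules = [("fork", "left of", "plate"), ("knife", "right of", "plate"),
             ("spoon", "right of", "knife")]
    graph = build_goal_graph(initial, rules)
    for name, src, dst in plan(initial, imagine(graph, initial)):
        print(f"move {name}: {src} -> {dst}")
```

In this sketch, user control amounts to editing `rules` before the goal graph is built; the plate stays in place because no rule names it as a subject.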