Scene rearrangement, such as table tidying, is a challenging task in robotic manipulation due to the difficulty of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design with separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspective of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages a perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding in a single step to produce object-level representations. Additionally, we introduce perspective control, enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through zero-shot real-robot experiments across various scenes, achieving an average matching accuracy of 87% and an execution success rate of 67%.