We introduce Dream2Real, a robotics framework that integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. The robot autonomously constructs a 3D representation of the scene, in which objects can be rearranged virtually and an image of each resulting arrangement rendered. A VLM evaluates these renders, and the arrangement that best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without collecting a training dataset of example arrangements. Results on a series of real-world tasks show that the framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
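The render-then-evaluate loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: `select_best_pose`, `render_fn`, and `score_fn` are hypothetical names introduced here, and the toy renderer and scorer merely stand in for the 3D scene renderer and the VLM so that the example runs on its own.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Pose = TypeVar("Pose")
Image = TypeVar("Image")


def select_best_pose(
    candidate_poses: Iterable[Pose],
    render_fn: Callable[[Pose], Image],
    score_fn: Callable[[Image, str], float],
    instruction: str,
) -> Tuple[Pose, float]:
    """Return the candidate pose whose render scores highest for the instruction.

    render_fn and score_fn are hypothetical stand-ins: the first for rendering
    a virtual arrangement, the second for a VLM-based image-text score.
    """
    best_pose, best_score = None, float("-inf")
    for pose in candidate_poses:
        image = render_fn(pose)               # imagine the arrangement
        score = score_fn(image, instruction)  # evaluate it against the text
        if score > best_score:
            best_pose, best_score = pose, score
    if best_pose is None:
        raise ValueError("no candidate poses supplied")
    return best_pose, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_render(pose):
    # A real system would render the 3D scene; here the "image" is the pose.
    return pose


def toy_score(image, instruction):
    # A real system would query a VLM; here we reward closeness to a fixed goal.
    x, y = image
    return -((x - 0.5) ** 2 + (y - 0.2) ** 2)


# Candidate tabletop positions on a coarse grid.
grid = [(x / 10, y / 10) for x in range(11) for y in range(11)]
best_pose, best_score = select_best_pose(
    grid, toy_render, toy_score, "example instruction"
)
```

The selected pose would then be passed to a pick-and-place routine. Exhaustive scoring over a grid is only one way to search the candidate space, chosen here for simplicity.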