We introduce Dream2Real, a robotics framework that integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. The robot autonomously constructs a 3D representation of the scene, in which objects can be rearranged virtually and an image of each resulting arrangement rendered. A VLM evaluates these renders, and the arrangement that best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without collecting a training dataset of example arrangements. Results on a series of real-world tasks show that the framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
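The render-then-score selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `select_best_pose`, the `Pose` type, and the toy `toy_render` and `toy_score` stand-ins are all hypothetical names introduced here. In a real system, the render step would come from the 3D scene representation and the score would come from a VLM comparing the rendered image with the instruction; here both are replaced by trivial placeholders so the example runs on its own.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical pose type: a 2D tabletop position (x, y) in normalised units.
Pose = Tuple[float, float]


def select_best_pose(
    candidates: Sequence[Pose],
    render: Callable[[Pose], Any],
    score: Callable[[Any, str], float],
    instruction: str,
) -> Tuple[Pose, float]:
    """Render each candidate arrangement, score it, and keep the best one."""
    if not candidates:
        raise ValueError("at least one candidate pose is required")
    best_pose = candidates[0]
    best_score = float("-inf")
    for pose in candidates:
        image = render(pose)              # virtual rearrangement + render
        value = score(image, instruction)  # stand-in for VLM evaluation
        if value > best_score:
            best_pose, best_score = pose, value
    return best_pose, best_score


def toy_render(pose: Pose) -> dict:
    """Assumed placeholder: the 'image' is just a record of the object position."""
    return {"object_xy": pose}


def toy_score(image: dict, instruction: str) -> float:
    """Assumed placeholder scorer: prefers positions near the table centre.

    A real scorer would be a VLM image-text similarity; this toy version
    ignores the instruction text entirely.
    """
    x, y = image["object_xy"]
    return -((x - 0.5) ** 2 + (y - 0.5) ** 2)


# Candidate poses on a 5 x 5 grid over the unit square.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
best_pose, best_score = select_best_pose(
    grid, toy_render, toy_score, "put the object in the centre of the table"
)
```

The selected pose would then be passed to a pick-and-place routine. Exhaustive scoring over a fixed grid is only the simplest possible search strategy and is used here for clarity.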