We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of three modules: (i) ASK-Splat, a GSplat representation that distills latent codes for language semantics and grasp affordance into the 3D scene, enabling the geometric, semantic, and affordance understanding of 3D scenes that is critical for many robotics tasks; (ii) SEE-Splat, a real-time scene-editing module that uses 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real world, creating a "digital twin" of the evolving environment throughout the manipulation task; and (iii) Grasp-Splat, a grasp generation module that uses ASK-Splat and SEE-Splat to propose candidate grasps for open-world objects. ASK-Splat is trained in real time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real time during operation. We demonstrate the superior performance of Splat-MOVER in hardware experiments on a Kinova robot, comparing against two recent baselines in four single-stage, open-vocabulary manipulation tasks and in four multi-stage manipulation tasks. In the multi-stage tasks, the edited scene reflects the changes caused by prior manipulation stages, which is not possible with the existing baselines. Code for this project and a link to the project page will be made available soon.
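To make the idea of semantic masking followed by scene editing concrete, the following is a minimal sketch and not the Splat-MOVER implementation. It assumes each Gaussian carries a center and a semantic embedding, selects Gaussians by cosine similarity to a query embedding, and applies a rigid transform to the selected centers. The function names `semantic_mask` and `move_masked`, the threshold value, and the toy two-dimensional embeddings are all hypothetical; infilling and the update of Gaussian orientations are omitted.

```python
import numpy as np


def semantic_mask(embeddings, query, threshold=0.5):
    """Return a boolean mask of Gaussians whose embedding matches the query.

    Hypothetical helper: similarity is plain cosine similarity, and the
    threshold is an illustrative assumption, not a value from the paper.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return emb @ q > threshold


def move_masked(means, mask, rotation, translation):
    """Apply a rigid transform to the centers of the masked Gaussians only.

    A full edit would also rotate each Gaussian's covariance/orientation;
    this sketch moves the centers and leaves everything else untouched.
    """
    edited = means.copy()
    edited[mask] = means[mask] @ rotation.T + translation
    return edited


# Toy scene: three Gaussian centers with 2-D stand-in semantic embeddings.
means = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
embeddings = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0]])
query = np.array([1.0, 0.0])  # stand-in for an embedded language query

mask = semantic_mask(embeddings, query, threshold=0.8)
# Lift the selected object by 0.5 along z without rotating it.
edited = move_masked(means, mask, np.eye(3), np.array([0.0, 0.0, 0.5]))
```

In this toy example the first two Gaussians match the query and are lifted, while the third stays in place and the original `means` array is left unmodified, which mirrors the idea of keeping an editable copy of the scene alongside the scanned one.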