We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of: (i) $\textit{ASK-Splat}$, a GSplat representation that distills latent codes for language semantics and grasp affordance into the 3D scene. ASK-Splat enables geometric, semantic, and affordance understanding of 3D scenes, which is critical for many robotics tasks; (ii) $\textit{SEE-Splat}$, a real-time scene-editing module using 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real world. SEE-Splat creates a "digital twin" of the evolving environment throughout the manipulation task; and (iii) $\textit{Grasp-Splat}$, a grasp generation module that uses ASK-Splat and SEE-Splat to propose candidate grasps for open-world objects. ASK-Splat is trained in real-time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real-time during operation. We demonstrate the superior performance of Splat-MOVER in hardware experiments on a Kinova robot compared to two recent baselines in four single-stage, open-vocabulary manipulation tasks, as well as in four multi-stage manipulation tasks that use the edited scene to reflect scene changes due to prior manipulation stages, which is not possible with the existing baselines. Code for this project and a link to the project page will be made available soon.