Combining accurate geometry with rich semantics has proven highly effective for language-guided robotic manipulation. Existing methods for dynamic scenes either fail to update in real time or rely on additional depth sensors for simple scene editing, limiting their applicability in real-world settings. In this paper, we introduce MSGField, a representation that uses a collection of 2D Gaussians for high-quality reconstruction, further enhanced with attributes that encode semantic and motion information. Specifically, we represent the motion field compactly by decomposing each primitive's motion into a combination of a limited set of motion bases. Leveraging the differentiable, real-time rendering of Gaussian splatting, we can quickly optimize object motion, even for complex non-rigid motions, with image supervision from only two camera views. Additionally, we design a pipeline that leverages object priors to efficiently obtain well-defined semantics. On our challenging dataset, which includes flexible and extremely small objects, our method achieves a success rate of 79.2% in static and 63.3% in dynamic environments for language-guided manipulation. For specified-object grasping, we achieve a success rate of 90%, on par with point cloud-based methods. Code and dataset will be released at: https://shengyu724.github.io/MSGField.github.io.
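To illustrate the compact motion-field idea, the sketch below expresses each primitive's per-frame motion as a weighted combination of a small set of shared motion bases. All names, shapes, and the choice of translation-only motion are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical sketch of the motion-basis decomposition described in the
# abstract: each of N primitives' per-frame translation is a weighted sum
# of K shared motion bases. Shapes and names are assumptions.
rng = np.random.default_rng(0)

N, K, T = 1000, 8, 30                    # primitives, motion bases, frames
bases = rng.normal(size=(K, T, 3))       # K shared trajectories (T frames, xyz)
weights = rng.random(size=(N, K))        # per-primitive mixing coefficients
weights /= weights.sum(axis=1, keepdims=True)  # normalize to a convex blend

# Per-primitive motion field: (N, T, 3) via a weighted sum over the K bases.
motion = np.einsum('nk,ktd->ntd', weights, bases)

# Only the N*K weights (plus the K small bases) are optimized per scene,
# instead of N*T*3 free per-primitive offsets, which is what keeps the
# motion representation compact.
assert motion.shape == (N, T, 3)
```

In such a scheme, the small number of bases acts as a regularizer: nearby primitives sharing similar weights move coherently, which is what would allow complex non-rigid motion to be fit from image supervision with only two views.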