Language-conditioned robotic manipulation in unstructured environments is in high demand for general intelligent robots. Conventional robotic manipulation methods usually learn a semantic representation of the observation for action prediction, but ignore the scene-level spatiotemporal dynamics involved in completing human goals. In this paper, we propose ManiGaussian, a dynamic Gaussian Splatting method for multi-task robotic manipulation that mines scene dynamics via future scene reconstruction. Specifically, we first formulate a dynamic Gaussian Splatting framework that infers semantics propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate that our framework outperforms the state-of-the-art methods by 13.1\% in average success rate.